WLAN: towards a solution for the roaming freeze
Werner Almesberger
werner at openmoko.org
Thu Jan 22 05:47:01 CET 2009
I had some success in trying to reproduce the roaming problems of
our WLAN. I used an automated experiment in which the access point's
channel is changed and the Neo then tries to ping the access point.
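The pass/fail test applied after each channel change (at least 100
good pings within 120 seconds, see the statistics below) can be
sketched roughly as follows. This is an illustration, not the code
from the scripts: the function name and the representation of pings
as reply timestamps are assumptions, and "good ping" is taken to mean
simply "a reply was received".

```python
from typing import Iterable


def link_recovered(reply_times: Iterable[float], change_time: float,
                   needed: int = 100, window: float = 120.0) -> bool:
    """Return True if at least `needed` ping replies arrived within
    `window` seconds after the channel change at `change_time`.

    All times are in seconds on the same clock. The defaults mirror
    the 100 pings / 120 seconds criterion; everything else here is a
    hypothetical simplification.
    """
    good = sum(1 for t in reply_times
               if change_time <= t <= change_time + window)
    return good >= needed


if __name__ == "__main__":
    # One reply per second, starting one second after the change.
    replies = [float(t) for t in range(1, 101)]
    print(link_recovered(replies, change_time=0.0))
```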
It seems that the number one culprit is a bug that causes the
firmware to crash.
Scripts to run the experiment and to analyze the data are here:
http://svn.openmoko.org/developers/werner/wlan/freeze/
An example log is here:
http://people.openmoko.org/werner/wlan-freeze/4.bz2
Some statistics of what's in this file:
- 308 frequency changes, of which
  - 244 (79%) happened smoothly, most taking only a few seconds,
    including the connectivity test. A few took much longer, but
    that's mainly because some packet got delayed and ping
    overestimated the RTT.
  - in 53 cases (17%), the module returned to the previous frequency,
    and either happily continued there (thanks to frequency leakage,
    it seems) or noticed the error after a while and then
    re-associated, this time getting it right.
  - once, it got completely confused and picked a frequency that was
    neither the old nor the new one, but somehow managed to struggle on
  - in ten cases (3%), something more sinister happened: it failed to
    return at least 100 good pings within 120 seconds after the
    frequency change. In these ten cases,
    - two could be recovered (*) by issuing an iwconfig to force
      re-association
    - one was resolved (*) by doing an "iwlist scan", perhaps reminding
      the module that it was sitting at the wrong frequency
    - in one mystery case, nothing looked out of place, yet communication
      stayed dead and only a module reset could fix it. Perhaps waiting
      a bit longer could have helped. (I didn't analyze timing and ping
      performance yet.)
    - in six cases, 2% of all frequency changes, it seems that the
      firmware crashed :-( This caused the familiar station list with
      just one station. I sent a register dump to Atheros.
    (*) Perhaps the problem would also have disappeared just by waiting
        and the recovery action we took was in fact irrelevant. My
        script also uses "wait and see" as a recovery strategy, but I
        need a lot more data points before I can tell whether waiting
        alone is - at least technically - good enough.
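As a sanity check on the percentages above, the counts can be tallied
in a few lines of Python. Only the numbers come from the log summary;
the category names are illustrative labels and do not appear in the
analysis scripts.

```python
# Counts copied from the statistics above; the keys are illustrative labels.
outcomes = {
    "smooth": 244,
    "returned_to_old_frequency": 53,
    "unrelated_frequency": 1,
    "failed": 10,
}

# Breakdown of the ten failed cases.
failed = {
    "recovered_by_iwconfig": 2,
    "recovered_by_scan": 1,
    "needed_module_reset": 1,
    "firmware_crash": 6,
}

total = sum(outcomes.values())


def percent(count: int, of: int = total) -> int:
    """Share of `count` in `of`, rounded to a whole percent."""
    return round(100 * count / of)


if __name__ == "__main__":
    for name, count in outcomes.items():
        print(f"{name:28s} {count:4d} {percent(count):3d}%")
    crashes = failed["firmware_crash"]
    print(f"{'firmware_crash':28s} {crashes:4d} {percent(crashes):3d}%")
```

The four outcome categories add up to the 308 frequency changes, and
the four failure sub-cases add up to the ten failures.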
Conclusions so far:
- we end up at the wrong frequency a lot more often than I would have
  expected, and even when we are two channels away from what the
  access point is using (*), things may not look bad at first sight.
  This needs more analysis.
  (*) At least I hope it is using the right frequencies. I don't have
      a spectrum analyzer, so I can't verify what really happens.
- some roaming problems may be caused by frequency leakage making the
  WLAN module think all is well while, in fact, it barely manages to
  get enough communication done to maintain this illusion. This, too,
  still needs a quantitative analysis.
- there's a firmware bug that may cause about half of all roaming
problems. We could work around this by resetting the module when it
trips, but I hope we can find a better solution.
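A workaround along these lines could try the recovery actions seen in
the experiment in order of increasing disruption and stop at the first
one that restores connectivity. The following is only a sketch: the
ordering, the function names, and the idea of passing the actions in
as callables are assumptions made for illustration, and the actual
commands (iwconfig, "iwlist scan", module reset) are left to the
caller.

```python
from typing import Callable, Optional, Sequence, Tuple

# (label, action) pairs; the actions themselves are supplied by the caller.
Action = Tuple[str, Callable[[], None]]


def recover(link_ok: Callable[[], bool],
            actions: Sequence[Action]) -> Optional[str]:
    """Run the actions in order until link_ok() reports success.

    Returns the label of the action after which the link worked again,
    or None if none of them helped.
    """
    for label, action in actions:
        action()
        if link_ok():
            return label
    return None


if __name__ == "__main__":
    # Simulated module that only comes back after a reset.
    state = {"up": False}
    steps = [
        ("wait", lambda: None),
        ("reassociate", lambda: None),
        ("scan", lambda: None),
        ("reset", lambda: state.update(up=True)),
    ]
    print(recover(lambda: state["up"], steps))
```

Recording which step ended up helping would also provide the extra
data points needed to judge whether waiting alone is enough.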
I'm now running a longer experiment that should turn up more data.
- Werner