WLAN: towards a solution for the roaming freeze

Werner Almesberger werner at openmoko.org
Thu Jan 22 05:47:01 CET 2009


I had some success in trying to reproduce the roaming problems of
our WLAN. I used an automated experiment in which the access point's
channel is changed and the Neo then tries to ping the access
point.

It seems that the number one culprit is a bug that causes the
firmware to crash.

Scripts to run the experiment and to analyze the data are here:
http://svn.openmoko.org/developers/werner/wlan/freeze/

An example log is here:
http://people.openmoko.org/werner/wlan-freeze/4.bz2

Some statistics of what's in this file:

- 308 frequency changes, of which

- 244 (79%) happened smoothly, most taking only a few seconds,
  including the connectivity test. A few took much longer, but
  that's mainly because some packet got delayed and ping
  overestimated the RTT.

- 53 times (17%), the module returned to the previous frequency,
  and either happily continued there (thanks to frequency leakage,
  it seems) or noticed the error after a while and then
  re-associated, this time getting it right.

- once it got completely confused and picked a frequency that was
  neither the old nor the new one, but somehow managed to struggle on

- in ten cases (3%) something more sinister happened. It failed to
  return at least 100 good pings within 120 seconds after the
  frequency change. In these ten cases,

  - two could be recovered (*) by issuing an iwconfig to force
    re-association

  - one was resolved (*) by doing an "iwlist scan", perhaps reminding
    the module that it was sitting at the wrong frequency

  - in one mystery case, nothing looked out of place, yet communication
    stayed dead and only a module reset could fix it. Perhaps waiting
    a bit longer could have helped. (I didn't analyze timing and ping
    performance yet.)

  - in six cases, 2% of all frequency changes, it seems that the
    firmware crashed :-( This caused the familiar station list with
    just one station. I sent a register dump to Atheros.

(*) Perhaps the problem would also have disappeared just by waiting
    and the recovery action we took was in fact irrelevant. My
    script also uses "wait and see" as a recovery strategy, but I
    need a lot more data points before I can tell whether waiting
    alone is - at least technically - good enough.
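
For illustration, the "at least 100 good pings within 120 seconds"
criterion used above could be sketched roughly like this in Python.
This is not the script from the SVN directory, just a minimal sketch:
the names connectivity_ok and ping_once are hypothetical, and ping_once
stands for whatever sends one echo request and reports whether a good
reply came back.

```python
import time


def connectivity_ok(ping_once, needed=100, deadline=120.0,
                    clock=time.monotonic):
    """Return True if `needed` good pings arrive within `deadline` seconds.

    ping_once is a hypothetical callable (an assumption of this sketch):
    it sends one echo request and returns True for a good reply,
    False otherwise.
    """
    start = clock()
    good = 0
    # Keep pinging until the deadline after the frequency change expires.
    while clock() - start < deadline:
        if ping_once():
            good += 1
            if good >= needed:
                return True
    # Deadline passed without enough good replies: count as a failure.
    return False
```

The clock is passed in as a parameter only so that the check can be
exercised without waiting two real minutes. In a real run, ping_once
would presumably block for about one ping interval per call, for
example by wrapping a single invocation of ping.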

Conclusions so far:

- we end up at the wrong frequency a lot more often than I would have
  expected, and even when we are two channels away from what the
  access point is using (*), things may not look bad at first sight.
  This needs more analysis.

  (*) At least I hope it is using the right frequencies. I don't have
      a spectrum analyzer, so I can't verify what really happens.

- some roaming problems may be caused by frequency leakage making the
  WLAN module think all is well when in fact it barely manages to
  get enough communication done to maintain this illusion. This
  also still needs a quantitative analysis.

- there's a firmware bug that may cause about half of all roaming
  problems. We could work around this by resetting the module when it
  trips, but I hope we can find a better solution.
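
The figures behind these conclusions can be re-checked with a few lines
of Python. The counts below are copied from the statistics earlier in
this mail; the dictionary keys and the helper name percent are just
labels made up for this sketch. In this log, the firmware crash
accounts for six of the ten serious failures.

```python
# Counts taken from the statistics in this mail.
total = 308
outcomes = {
    "smooth": 244,
    "returned to previous frequency": 53,
    "picked an unrelated frequency": 1,
    "serious failure": 10,
}
failures = {
    "recovered by iwconfig": 2,
    "resolved by iwlist scan": 1,
    "mystery case, module reset": 1,
    "firmware crash": 6,
}

# The categories should add up to the totals they are part of.
assert sum(outcomes.values()) == total
assert sum(failures.values()) == outcomes["serious failure"]


def percent(part, whole):
    """Share of part in whole, in percent."""
    return 100.0 * part / whole


for name, count in outcomes.items():
    print(f"{name}: {count} ({percent(count, total):.0f}%)")

crashes = failures["firmware crash"]
serious = outcomes["serious failure"]
print(f"firmware crashes: {percent(crashes, total):.0f}% of all changes, "
      f"{percent(crashes, serious):.0f}% of serious failures")
```

With only ten serious failures in the sample, these shares are of
course still very rough.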

I'm now running a longer experiment that should turn up more data.

- Werner


