What now? How to proceed debugging this GSM/kernel problem?

Mike (mwester) mwester at dls.net
Sun Apr 20 07:19:11 CEST 2008


  So close.  So very close...

I've been expending much effort in fixing up the many suspend/resume 
issues that have been crippling the poor GTA01.  At this point, the 
major ones are all resolved, but I've encountered a difficult one.

So I have questions.  This email is long, so for those with shorter 
attention spans I'll cut to the chase:

a) Where is the list of known problems with the different GSM modem 
firmwares, and the list of fixes in each firmware?

b) Is there a document that describes what we (or anyone else, for that 
matter) know about the problems or unexpected behavior of our version of 
the S3C2410 SoC UARTs, especially with regard to auto flow control and 
overruns?

c) What are the real-world parameters for the GSM modem with regard to 
flow-control:

   1) How much data can it buffer internally?

   2) What does it do when its internal buffers fill -- does it throw 
data away?  Does it transmit despite being flow-controlled?  Is there an 
  AT command available to alter or control how it behaves?  Can it 
report internal buffer overflows?  What could trigger it to transmit 
despite being flow-controlled?

   3) Do different firmware versions for this modem behave differently 
in regard to flow control, and if so, how so?

d) If I were to acquire a scope, are the CTS, RTS, and other signals 
between the GSM and the UART accessible on the debug board?  Are there 
pads on the phone itself?

e) Can we do anything with gsmd, and/or phonekit, to work around the 
problem in the meantime?


See below for details on the issue.


-----------------------------


The device works now - most of the time.  Once in a while, when waking 
from suspend on an incoming SMS or call, the device encounters an 
overrun error on the serial port.

This is a Very Bad Thing, but it's made even worse since it does not 
seem as if gsmd can recover, and the app (phonekit?) that runs the show 
can't seem to recover when gsmd barfs.  So this chain of errors leaves 
the device dead, and you don't know when this has happened.

Here's a trace from some kernel debugging code.  This trace shows a good 
exchange waking to receive an SMS, followed by a failed attempt:

4481.625: ->  .
4481.630: <- ..OK..
4481.635: -> AT+CMGD=1,0.
4481.725: <- ..OK..

Normal communication with the GSM; we send a newline, modem says "OK", 
we say "AT+CMGD=1,0", modem says "OK" -- you get the idea.  (The numbers 
on the left represent seconds since some distant point in time; 
resolution is in steps of .005 seconds.)

4511.315: Ps SijSIJG

Normal suspend occurs here - this jibberish means something, but its all 
good -- this is what we expect and want to see.

4511.325: Pr GSi
4511.420: S  M
4511.420: Pr JSij
4511.420: <- ..%CPRI: 1,2....+CMTI: "SM",1..

Normal resume here - reverse order from the suspend, the "S  M" line 
that is interjected simply marks a debug point I put in that proves that 
hardware flow control is being turned back on during the resume.  So we 
know that the hardware flow control is enabled during the resume 
sequence.  The last line shows the reason for the resume: we have an 
incoming SMS.

4513.555: ->  .
4513.560: <- ..OK..
4513.685: -> AT+CMGR=1.
4513.810: <- ..+CMGR: 0,,33..07912160130300F4040B918174570101F00003804091
4513.815: <- 3231010A0FF3701E744787E92C10B94E2EFF00....OK..
4513.935: ->  .
4513.940: <- ..OK..
4513.945: -> AT+CMGD=1,0.
4514.020: <- ..OK..

The SMS is read and handled.  That's all good.

Now compare that to the next entries in the log.  We suspend, as before. 
  We resume, as before.  The reason for the resume is clear (incoming 
data from the modem).  But notice the "E# O" line -- that signifies that 
at that point in the data stream the S3C2410 UART reported an Overrun 
error.  The data that follows seems to confirm that -- if you count the 
characters, you'll come up with 16 characters, and you'll notice that 
the CMTI message is truncated (one character short; the message is 
supposed to be terminated by "\r\n", the "\n" is missing (the 
non-printing characters are replaced by a period in the log, but I went 
and checked what's really there in the binary data)).

4593.280: Ps SijSIJG
4593.290: Pr GSi
4593.385: S  M
4593.385: Pr JSij
4593.385: <- ..%CPRI: 1,2..
4593.610: E# O
4593.610: <- ..+CMTI: "SM",1.
4704.930: ->  .
4704.935: <- ..OK..
4707.935: -> AT.
4707.940: <- ..OK..

So that's it.  At this point, the phone is dead.  The GSM is working, 
and you'll see some data from it from time to time.  In fact, if you 
look at the time stamps, you can see that it appears that gsmd sent 
something to the device 100 seconds later, and the GSM responded -- but 
gsmd didn't see it, and terminated; it declared the modem dead!  And 
when that happens, then whatever is running the GUI on the phone reports 
an I/O error and doesn't even attempt to restart gsmd.  Aaargh!!  So 
this simple overrun results in total failure of the phone parts of the 
device!  [Log snippets from both of those will be tacked to the end of 
this email.]

----

So, first of all we have a major problem down in the kernel -- there's 
no reason for an overflow to have occurred.  Auto-flow-control was 
enabled, and confirmed to be enabled by reading the status back from the 
UART (that's the gibberish parts you see up there).  We can see exactly 
where the 16-byte FIFO filled up and the overrun occurred, but there's 
no reason for it do have done so unless there's a hardware issue with 
the UART, or there's a bug in the GSM itself.

Second, there's no reason that gsmd can't deal with this.  It should 
have recovered from a short read, and re-established synchronization 
with the modem (it's not dead -- use "cu" to connect at this point and 
you can talk to it all day; it's clearly gsmd that got confused and 
can't talk!).

Thirdly, when gsmd exits, it was my understanding that the dialer or 
whatever the "Home" application is would restart it; that doesn't seem 
to happen.

----

Logs follow:

==============================================================

Sat Apr 19 20:49:37 2008 <1> machine_gta01.c:142:atcmd_wakeup_modem() 
try to wake up
Sat Apr 19 20:49:37 2008 <1> atcmd.c:654:atcmd_submit() submitting 
command `AT+CMGD=1,0'
Sat Apr 19 20:49:37 2008 <1> machine_gta01.c:134:wakeup_timer() Create 
wake up timer
Sat Apr 19 20:49:37 2008 <1> atcmd.c:265:ml_parse() buf=`OK'(2)
Sat Apr 19 20:49:37 2008 <1> atcmd.c:634:remove_timer() Get respond 
before timeout, remove timer!
Sat Apr 19 20:49:37 2008 <1> atcmd.c:227:atcmd_done() Calling final 
cmd->cb()
Sat Apr 19 20:49:37 2008 <1> machine_gta01.c:88:null_wakeup_cb() The 
wake up callback!!
Sat Apr 19 20:49:37 2008 <1> atcmd.c:237:atcmd_done() Clearing mlbuf
Sat Apr 19 20:49:37 2008 <1> atcmd.c:593:discard_timer() Create discard 
timer
Sat Apr 19 20:49:37 2008 <1> atcmd.c:265:ml_parse() buf=`OK'(2)
Sat Apr 19 20:49:37 2008 <1> atcmd.c:634:remove_timer() Get respond 
before timeout, remove timer!
Sat Apr 19 20:49:37 2008 <1> atcmd.c:227:atcmd_done() Calling final 
cmd->cb()
Sat Apr 19 20:49:37 2008 <1> usock.c:56:usock_cmd_enqueue() enqueueing 
usock cmd 0x31ed8 for user 0x31fa8
Sat Apr 19 20:49:37 2008 <1> atcmd.c:237:atcmd_done() Clearing mlbuf
Sat Apr 19 20:49:37 2008 <1> usock.c:1665:gsmd_usock_user_cb() 
successfully sent cmd 0x31ed8 to user 0x31fa8, freeing
Sat Apr 19 20:51:23 2008 <1> atcmd.c:265:ml_parse() buf=`%CPRI: 1,2'(10)
Sat Apr 19 20:51:23 2008 <1> atcmd.c:334:ml_parse() extd reply `%CPRI: 
1,2' to cmd `NONE', must be unsolicited
Sat Apr 19 20:51:23 2008 <1> unsolicited.c:70:usock_evt_send() entering 
evt=13
Sat Apr 19 20:51:23 2008 <1> usock.c:56:usock_cmd_enqueue() enqueueing 
usock cmd 0x320b0 for user 0x31e70
Sat Apr 19 20:51:23 2008 <1> usock.c:56:usock_cmd_enqueue() enqueueing 
usock cmd 0x3b4d8 for user 0x31fa8
Sat Apr 19 20:51:23 2008 <1> usock.c:1665:gsmd_usock_user_cb() 
successfully sent cmd 0x3b4d8 to user 0x31fa8, freeing
Sat Apr 19 20:51:23 2008 <1> usock.c:1665:gsmd_usock_user_cb() 
successfully sent cmd 0x320b0 to user 0x31e70, freeing
Sat Apr 19 20:53:13 2008 <1> gsmd.c:116:alive_interval_tmr_cb() interval 
expired, starting next alive inquiry
Sat Apr 19 20:53:13 2008 <1> atcmd.c:651:atcmd_submit() extra-submiting 
command
Sat Apr 19 20:53:13 2008 <1> machine_gta01.c:142:atcmd_wakeup_modem() 
try to wake up
Sat Apr 19 20:53:13 2008 <1> atcmd.c:654:atcmd_submit() submitting 
command `AT'
Sat Apr 19 20:53:13 2008 <1> machine_gta01.c:134:wakeup_timer() Create 
wake up timer
Sat Apr 19 20:53:16 2008 <1> machine_gta01.c:96:wakeup_timeout() Wakeup 
time out!!
Sat Apr 19 20:53:46 2008 <1> gsmd.c:75:alive_tmr_cb() gsmd_alive timer 
expired
Sat Apr 19 20:53:46 2008 <8> gsmd.c:78:alive_tmr_cb() modem dead!


==============================================================

** (neod:2470): DEBUG: mainmenu powersave reset
** (neod:2470): DEBUG: mainmenu powersave reset
** (neod:2470): DEBUG: mainmenu powersave reset
** (neod:2470): DEBUG: mainmenu powersave reset
** (neod:2470): DEBUG: mainmenu powersave reset

(openmoko-messages:2760): Pango-WARNING **: 
pango_layout_set_markup_with_accel: Error on line 1: Character ' ' is 
not valid at the start of an entity name; the & character begins an 
entity; if this ampersand isn't supposed to be an entity, escape it as &amp;
** (matchbox-panel-2:2474): DEBUG: battery_applet: signal_filter
** (matchbox-panel-2:2474): DEBUG: (unknown dbus message, ignoring)
** (matchbox-panel-2:2474): DEBUG: usb_applet: signal_filter
** (matchbox-panel-2:2474): DEBUG: (unknown dbus message, ignoring)
** (neod:2470): DEBUG: power pressed for 0
** (neod:2470): DEBUG: mainmenu powersave reset
** (neod:2470): DEBUG: requested to set display brightness 100
** (neod:2470): DEBUG: writing '5156' to 
'/sys/class/backlight/gta01-bl/brightness'
** (neod:2470): DEBUG: active Window = 20971523 ('Messages')

(neod:2470): Gtk-CRITICAL **: gtk_window_is_active: assertion 
`GTK_IS_WINDOW (window)' failed
** (matchbox-panel-2:2474): DEBUG: CIPHER event: data = 62944, 2, 0
** (matchbox-panel-2:2474): DEBUG: gsm_applet_update_cipher_status: 
status = 62944
** (matchbox-panel-2:2474): DEBUG: battery_applet: timeout
** (matchbox-panel-2:2474): DEBUG: battery_applet: timeout
** (matchbox-panel-2:2474): DEBUG: battery_applet: timeout

** (phone-kit:2472): WARNING **: Unhandled IO condition, bailing

==============================================================




More information about the openmoko-kernel mailing list