What now? How to proceed debugging this GSM/kernel problem?
Mike (mwester)
mwester at dls.net
Sun Apr 20 07:19:11 CEST 2008
So close. So very close...
I've been expending much effort in fixing up the many suspend/resume
issues that have been crippling the poor GTA01. At this point, the
major ones are all resolved, but I've encountered a difficult one.
So I have questions. This email is long, so for those with shorter
attention spans I'll cut to the chase:
a) Where is the list of known problems with the different GSM modem
firmwares, and the list of fixes in each firmware?
b) Is there a document that describes what we (or anyone else, for that
matter) know about the problems or unexpected behavior of our version of
the S3C2410 SoC UARTs, especially with regard to auto flow control and
overruns?
c) What are the real-world parameters for the GSM modem with regard to
flow-control:
1) How much data can it buffer internally?
2) What does it do when its internal buffers fill -- does it throw
data away? Does it transmit despite being flow-controlled? Is there an
AT command available to alter or control how it behaves? Can it
report internal buffer overflows? What could trigger it to transmit
despite being flow-controlled?
3) Do different firmware versions for this modem behave differently
in regard to flow control, and if so, how so?
d) If I were to acquire a scope, are the CTS, RTS, and other signals
between the GSM and the UART accessible on the debug board? Are there
pads on the phone itself?
e) Can we do anything with gsmd, and/or phonekit, to work around the
problem in the meantime?
See below for details on the issue.
-----------------------------
The device works now - most of the time. Once in a while, when waking
from suspend on an incoming SMS or call, the device encounters an
overrun error on the serial port.
This is a Very Bad Thing, but it's made even worse since it does not
seem as if gsmd can recover, and the app (phonekit?) that runs the show
can't seem to recover when gsmd barfs. So this chain of errors leaves
the device dead, and you don't know when this has happened.
Here's a trace from some kernel debugging code. This trace shows a good
exchange waking to receive an SMS, followed by a failed attempt:
4481.625: -> .
4481.630: <- ..OK..
4481.635: -> AT+CMGD=1,0.
4481.725: <- ..OK..
Normal communication with the GSM; we send a newline, modem says "OK",
we say "AT+CMGD=1,0", modem says "OK" -- you get the idea. (The numbers
on the left represent seconds since some distant point in time;
resolution is in steps of .005 seconds.)
4511.315: Ps SijSIJG
Normal suspend occurs here - this jibberish means something, but its all
good -- this is what we expect and want to see.
4511.325: Pr GSi
4511.420: S M
4511.420: Pr JSij
4511.420: <- ..%CPRI: 1,2....+CMTI: "SM",1..
Normal resume here - reverse order from the suspend, the "S M" line
that is interjected simply marks a debug point I put in that proves that
hardware flow control is being turned back on during the resume. So we
know that the hardware flow control is enabled during the resume
sequence. The last line shows the reason for the resume: we have an
incoming SMS.
4513.555: -> .
4513.560: <- ..OK..
4513.685: -> AT+CMGR=1.
4513.810: <- ..+CMGR: 0,,33..07912160130300F4040B918174570101F00003804091
4513.815: <- 3231010A0FF3701E744787E92C10B94E2EFF00....OK..
4513.935: -> .
4513.940: <- ..OK..
4513.945: -> AT+CMGD=1,0.
4514.020: <- ..OK..
The SMS is read and handled. That's all good.
Now compare that to the next entries in the log. We suspend, as before.
We resume, as before. The reason for the resume is clear (incoming
data from the modem). But notice the "E# O" line -- that signifies that
at that point in the data stream the S3C2410 UART reported an Overrun
error. The data that follows seems to confirm that -- if you count the
characters, you'll come up with 16 characters, and you'll notice that
the CMTI message is truncated (one character short; the message is
supposed to be terminated by "\r\n", the "\n" is missing (the
non-printing characters are replaced by a period in the log, but I went
and checked what's really there in the binary data)).
4593.280: Ps SijSIJG
4593.290: Pr GSi
4593.385: S M
4593.385: Pr JSij
4593.385: <- ..%CPRI: 1,2..
4593.610: E# O
4593.610: <- ..+CMTI: "SM",1.
4704.930: -> .
4704.935: <- ..OK..
4707.935: -> AT.
4707.940: <- ..OK..
So that's it. At this point, the phone is dead. The GSM is working,
and you'll see some data from it from time to time. In fact, if you
look at the time stamps, you can see that it appears that gsmd sent
something to the device 100 seconds later, and the GSM responded -- but
gsmd didn't see it, and terminated; it declared the modem dead! And
when that happens, then whatever is running the GUI on the phone reports
an I/O error and doesn't even attempt to restart gsmd. Aaargh!! So
this simple overrun results in total failure of the phone parts of the
device! [Log snippets from both of those will be tacked to the end of
this email.]
----
So, first of all we have a major problem down in the kernel -- there's
no reason for an overflow to have occurred. Auto-flow-control was
enabled, and confirmed to be enabled by reading the status back from the
UART (that's the gibberish parts you see up there). We can see exactly
where the 16-byte FIFO filled up and the overrun occurred, but there's
no reason for it do have done so unless there's a hardware issue with
the UART, or there's a bug in the GSM itself.
Second, there's no reason that gsmd can't deal with this. It should
have recovered from a short read, and re-established synchronization
with the modem (it's not dead -- use "cu" to connect at this point and
you can talk to it all day; it's clearly gsmd that got confused and
can't talk!).
Thirdly, when gsmd exits, it was my understanding that the dialer or
whatever the "Home" application is would restart it; that doesn't seem
to happen.
----
Logs follow:
==============================================================
Sat Apr 19 20:49:37 2008 <1> machine_gta01.c:142:atcmd_wakeup_modem()
try to wake up
Sat Apr 19 20:49:37 2008 <1> atcmd.c:654:atcmd_submit() submitting
command `AT+CMGD=1,0'
Sat Apr 19 20:49:37 2008 <1> machine_gta01.c:134:wakeup_timer() Create
wake up timer
Sat Apr 19 20:49:37 2008 <1> atcmd.c:265:ml_parse() buf=`OK'(2)
Sat Apr 19 20:49:37 2008 <1> atcmd.c:634:remove_timer() Get respond
before timeout, remove timer!
Sat Apr 19 20:49:37 2008 <1> atcmd.c:227:atcmd_done() Calling final
cmd->cb()
Sat Apr 19 20:49:37 2008 <1> machine_gta01.c:88:null_wakeup_cb() The
wake up callback!!
Sat Apr 19 20:49:37 2008 <1> atcmd.c:237:atcmd_done() Clearing mlbuf
Sat Apr 19 20:49:37 2008 <1> atcmd.c:593:discard_timer() Create discard
timer
Sat Apr 19 20:49:37 2008 <1> atcmd.c:265:ml_parse() buf=`OK'(2)
Sat Apr 19 20:49:37 2008 <1> atcmd.c:634:remove_timer() Get respond
before timeout, remove timer!
Sat Apr 19 20:49:37 2008 <1> atcmd.c:227:atcmd_done() Calling final
cmd->cb()
Sat Apr 19 20:49:37 2008 <1> usock.c:56:usock_cmd_enqueue() enqueueing
usock cmd 0x31ed8 for user 0x31fa8
Sat Apr 19 20:49:37 2008 <1> atcmd.c:237:atcmd_done() Clearing mlbuf
Sat Apr 19 20:49:37 2008 <1> usock.c:1665:gsmd_usock_user_cb()
successfully sent cmd 0x31ed8 to user 0x31fa8, freeing
Sat Apr 19 20:51:23 2008 <1> atcmd.c:265:ml_parse() buf=`%CPRI: 1,2'(10)
Sat Apr 19 20:51:23 2008 <1> atcmd.c:334:ml_parse() extd reply `%CPRI:
1,2' to cmd `NONE', must be unsolicited
Sat Apr 19 20:51:23 2008 <1> unsolicited.c:70:usock_evt_send() entering
evt=13
Sat Apr 19 20:51:23 2008 <1> usock.c:56:usock_cmd_enqueue() enqueueing
usock cmd 0x320b0 for user 0x31e70
Sat Apr 19 20:51:23 2008 <1> usock.c:56:usock_cmd_enqueue() enqueueing
usock cmd 0x3b4d8 for user 0x31fa8
Sat Apr 19 20:51:23 2008 <1> usock.c:1665:gsmd_usock_user_cb()
successfully sent cmd 0x3b4d8 to user 0x31fa8, freeing
Sat Apr 19 20:51:23 2008 <1> usock.c:1665:gsmd_usock_user_cb()
successfully sent cmd 0x320b0 to user 0x31e70, freeing
Sat Apr 19 20:53:13 2008 <1> gsmd.c:116:alive_interval_tmr_cb() interval
expired, starting next alive inquiry
Sat Apr 19 20:53:13 2008 <1> atcmd.c:651:atcmd_submit() extra-submiting
command
Sat Apr 19 20:53:13 2008 <1> machine_gta01.c:142:atcmd_wakeup_modem()
try to wake up
Sat Apr 19 20:53:13 2008 <1> atcmd.c:654:atcmd_submit() submitting
command `AT'
Sat Apr 19 20:53:13 2008 <1> machine_gta01.c:134:wakeup_timer() Create
wake up timer
Sat Apr 19 20:53:16 2008 <1> machine_gta01.c:96:wakeup_timeout() Wakeup
time out!!
Sat Apr 19 20:53:46 2008 <1> gsmd.c:75:alive_tmr_cb() gsmd_alive timer
expired
Sat Apr 19 20:53:46 2008 <8> gsmd.c:78:alive_tmr_cb() modem dead!
==============================================================
** (neod:2470): DEBUG: mainmenu powersave reset
** (neod:2470): DEBUG: mainmenu powersave reset
** (neod:2470): DEBUG: mainmenu powersave reset
** (neod:2470): DEBUG: mainmenu powersave reset
** (neod:2470): DEBUG: mainmenu powersave reset
(openmoko-messages:2760): Pango-WARNING **:
pango_layout_set_markup_with_accel: Error on line 1: Character ' ' is
not valid at the start of an entity name; the & character begins an
entity; if this ampersand isn't supposed to be an entity, escape it as &
** (matchbox-panel-2:2474): DEBUG: battery_applet: signal_filter
** (matchbox-panel-2:2474): DEBUG: (unknown dbus message, ignoring)
** (matchbox-panel-2:2474): DEBUG: usb_applet: signal_filter
** (matchbox-panel-2:2474): DEBUG: (unknown dbus message, ignoring)
** (neod:2470): DEBUG: power pressed for 0
** (neod:2470): DEBUG: mainmenu powersave reset
** (neod:2470): DEBUG: requested to set display brightness 100
** (neod:2470): DEBUG: writing '5156' to
'/sys/class/backlight/gta01-bl/brightness'
** (neod:2470): DEBUG: active Window = 20971523 ('Messages')
(neod:2470): Gtk-CRITICAL **: gtk_window_is_active: assertion
`GTK_IS_WINDOW (window)' failed
** (matchbox-panel-2:2474): DEBUG: CIPHER event: data = 62944, 2, 0
** (matchbox-panel-2:2474): DEBUG: gsm_applet_update_cipher_status:
status = 62944
** (matchbox-panel-2:2474): DEBUG: battery_applet: timeout
** (matchbox-panel-2:2474): DEBUG: battery_applet: timeout
** (matchbox-panel-2:2474): DEBUG: battery_applet: timeout
** (phone-kit:2472): WARNING **: Unhandled IO condition, bailing
==============================================================
More information about the openmoko-kernel
mailing list