GSoC Project Status Update 04: Speech Recognition in Openmoko

saurabh gupta saurabhgupta1403 at
Sun Jun 29 23:34:04 CEST 2008

Hello Asheesh,

On Mon, Jun 30, 2008 at 2:41 AM, Asheesh Laroia <openmoko at> wrote:

> On Sun, 29 Jun 2008, saurabh gupta wrote:
> > Besides this, some modification is being done in the noise rejection
> > part, since it can degrade performance severely. I will use the zero
> > crossing rate and short-term energy algorithms for end point detection.
> > My model will also use a left-to-right HMM. But to properly train HMM
> > models, one needs more than one training sequence. It means that in
> > speaker-dependent recognition, to train any word, one needs to utter
> > the same word two or three times so that proper modeling of the HMM
> > parameters takes place. When more than one training sequence is used
> > for training, the Baum-Welch or K-means segmental method gives better
> > modeling of the HMM parameters.
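To illustrate the left-to-right structure I mentioned, here is a minimal sketch (illustration only, not the actual project code; the 50/50 split between staying and advancing is an arbitrary starting guess that Baum-Welch would re-estimate):

```python
import numpy as np

def left_right_transmat(n_states):
    """Transition matrix for a left-to-right HMM: each state can only
    stay where it is or advance to the next state, never move back."""
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = 0.5      # self-loop
        A[i, i + 1] = 0.5  # advance to the next state
    A[-1, -1] = 1.0        # final state absorbs
    return A
```

The zeros below the diagonal are what make the model left-to-right: training never has to rule those transitions out, because the structure already forbids them.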
> The training problem is interesting.  Here is my idea; please let me know
> if it's bogus:
> The user utters a phrase and the HMM classifies it as meaning something.
> We can wait a short while to see if the user does something to indicate
> that this classification is incorrect.  If there is no such action, and if
> the HMM had low confidence of its classification, train it on the
> utterance just issued so that next time it would be more confident (and
> presumably catch further variants).

> Obviously, there is the danger of over-training.  It seems we can mitigate
> that through (1) our detection that the utterance was correctly classified
> by the HMM, given that the user didn't do anything to correct it, and (2)
> perhaps limiting the system to only do this re-training if the counter of
> how many training data have been used for this particular classification
> is below some constant.  That constant could decay over time, for example,
> to allow us to gently migrate to varying patterns (and so that if a phone
> transfers owners it would gracefully switch to the new patterns).
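Your confidence-gated retraining idea, with the counter cap, could be sketched roughly like this (all names and thresholds here are made up for illustration, not tuned values):

```python
# Sketch of confidence-gated retraining with a per-word cap.
CONFIDENCE_FLOOR = 0.6   # below this, the HMM counts as "unsure"
MAX_EXTRA_TRAIN = 5      # cap on automatic retrainings per word

retrain_count = {}       # word -> number of automatic retrainings so far

def maybe_retrain(word, confidence, user_corrected):
    """Decide whether to fold the last utterance back into training."""
    if user_corrected:
        return False     # classification was wrong: never reinforce it
    if confidence >= CONFIDENCE_FLOOR:
        return False     # already confident; avoid over-training
    if retrain_count.get(word, 0) >= MAX_EXTRA_TRAIN:
        return False     # counter cap, as suggested above
    retrain_count[word] = retrain_count.get(word, 0) + 1
    return True
```

Decaying MAX_EXTRA_TRAIN over time, as you suggest, would only need the constant replaced by a function of elapsed time since the word was created.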

You have identified a real and important problem in training. I thought to
handle it this way: whenever the user runs the application, the speech
recognition GUI will ask whether to enter training or recognition mode. In
training mode, after the user utters a word, the GUI will ask for the same
word again, and so on. The user has to provide the training word three
times (I have assumed that constant to be three) before the word is fully
created in the vocabulary. If the user terminates the application or
mishandles it before all three sequences are recorded, the application
will not save the word. However, there is no easy way to detect the
mishandling: if the user neither terminates the application nor speaks the
training word again, the application can pick up a loud noise, mistake it
for the training word, and produce a wrong result. This is always a big
problem in speech-related applications, since handling environment noise
as well as end point detection is quite difficult in real-world scenarios.
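For reference, the zero crossing rate and short-term energy approach I mentioned looks roughly like this as a sketch (frame length and both thresholds are illustrative placeholders, not tuned for the Freerunner's ADC):

```python
import numpy as np

def endpoints(signal, frame_len=160, energy_thresh=0.01, zcr_thresh=0.25):
    """Crude end point detection: mark a frame as speech when its
    short-term energy is above a threshold and its zero crossing rate
    is low (high ZCR at low energy is more likely background noise).
    Returns (start_sample, end_sample) of the speech region, or None."""
    n_frames = len(signal) // frame_len
    speech = []
    for i in range(n_frames):
        frame = signal[i * frame_len:(i + 1) * frame_len]
        energy = np.mean(frame ** 2)                        # short-term energy
        zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2  # sign changes per sample
        speech.append(energy > energy_thresh and zcr < zcr_thresh)
    idx = [i for i, s in enumerate(speech) if s]
    if not idx:
        return None
    return idx[0] * frame_len, (idx[-1] + 1) * frame_len
```

A real implementation would also need hangover frames around the detected region so that weak word onsets and trailing fricatives are not clipped.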

> Thoughts?
> > Next To Do:
> > 1) Porting the whole code to the Openmoko platform
> > 2) Testing with the real ADC channel of the Freerunner
> > 3) Proper testing of noise handling and recognition on the Freerunner
> Your Next To Do list looks pretty great and full enough even without my
> suggestion, but I'm still curious what you and others think. (-:
> -- Asheesh.
> --
> The chief cause of problems is solutions.
>                -- Eric Sevareid
> _______________________________________________
> Openmoko community mailing list
> community at

Saurabh Gupta
Electronics and Communication Engg.
NSIT,New Delhi, India
"Problem is something which does have a solution, else it is called