speech -> text on FR?

Brandon Kruse bkruse at openmoko.org
Mon Jun 16 06:00:27 CEST 2008


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Dan Staley wrote:
| I actually just interfaced with the Sphinx project at one of the
| research positions I hold.  It is actually a very well written interface
| (for the most part...there were a few things poorly documented and/or
| implemented) But anyway, I found the java version of the project (Sphinx
| 4 http://cmusphinx.sourceforge.net/sphinx4/ ) to be pretty easy to
| build/interface with.
|
| The benefit of using the HMMs and models and methods that Sphinx
| implements is that anyone in their programs should be able to specify a
| grammar (similar to a simplified regex) that they want to be recognized
| and then the interpreter should be able to be user independent...meaning
| anyone can speak the phrase into the phone and get the desired output.
| Speech training wouldn't be required.  I found that once you set it up
| correctly, the Sphinx engine is very powerful, and usually identifies
| the spoken words no matter who says them (we found it even seemed to
| work decently well with a variety of different accents).
|
| -Dan Staley
|
| On Sun, 2008-06-15 at 19:07 -0400, Ajit Natarajan wrote:
|> Hello,
|>
|> I know nothing about speech recognition, so if the following won't work,
|> please let me know (gently :) ).
|>
|> I understand that there is a project called Sphinx in CMU which attempts
|> speech recognition.  It seems pretty complex.  I couldn't get it to work
|> on my Linux desktop.  I'm not sure if it would work on an FR since it
|> may need a lot of CPU horsepower and memory.
|>
|> I see a speech project on the OM projects page.  To me, it seems like
|> the project is attempting command recognition, e.g., voice dialing.
|> However, it would be great if the FR can function as a rudimentary
|> dictation machine, i.e., allow the user to speak and convert to text.
|>
|> Perhaps the following may work.
|>
|> 1. Ask the user to speak some standard words.  Record the speech and
|>     establish the mapping from the words to the corresponding speech.
|>     It may even be good to maintain separate databases for different
|>     purposes, e.g., one for UNIX command lines, one for emails, and a
|>     third for technical documents.
|>
|> 2. The speech recognizer then functions similar to a keyboard in that it
|>     converts speech to text which it then enters into the application
|>     that has focus.
|>
|> 3. The user must speak word by word.  The speech recognizer finds the
|>     closest match for the speech by checking against the recordings made
|>     in step 1 (and step 4).  The user may need to set the database from
|>     which the match must be made.
|>
|> 4. If there is no close match, or if the user is unhappy with the
|>     selection made in step 3, the user can type in the correct word.  A
|>     new record can be added to the appropriate database.
|>
|> The process may be frustrating for the user at first, but over time, the
|> speech recognition should become better and better.
|>
|> The separate databases may be needed, for example, because the word
|> period should usually translate to the symbol `.' except when writing
|> about time periods when it should translate to the word `period'.
|>
|> I do not know what the storage requirements would be to maintain this
|> database.  I do not know if the closest match algorithm in step 3 is
|> even possible.  But if we could get a good dictation engine, that would
|> be a killer app, in my opinion.  No more typing!  No more carpal tunnel
|> injuries.  No more having to worry about small on screen keyboards that
|> challenge finger typing.
|>
|> Thanks.
|>
|> Ajit

As with other speech-to-text engines (as someone else already
mentioned), it works best when the engine knows that the speaker must
have said something from a list of pre-defined commands, rather than any
word in general.
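
To illustrate why a pre-defined list helps, here is a minimal sketch in
Python. It is not how Sphinx works internally: a real engine applies the
constraint inside the decoder through its grammar, while this only snaps
already-recognized text to the nearest command. The command list and the
function name match_command are made up for the example.

```python
import difflib

# Hypothetical command list; a real application would supply its own.
COMMANDS = ["call home", "hang up", "open contacts", "read messages"]


def match_command(hypothesis, commands=COMMANDS, cutoff=0.6):
    """Return the command closest to the recognized text, or None.

    `hypothesis` stands in for whatever text the recognizer produced.
    `cutoff` is the minimum similarity (0..1) required to accept a match.
    """
    cleaned = hypothesis.lower().strip()
    matches = difflib.get_close_matches(cleaned, commands, n=1, cutoff=cutoff)
    return matches[0] if matches else None
```

With a short list, a slightly misheard phrase such as "call hone" still
lands on "call home", and something unrelated is rejected instead of
being guessed at.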

It is also very good at deciding between two words, e.g. "yes" or "no",
which is more useful than you would think if you design your user
interface the right way.
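
A rough sketch of what such a two-way decision could look like on the
application side, assuming the recognizer hands back some kind of
confidence score per candidate word. The scores, the margin value and
the function name decide are all assumptions for illustration, not part
of any Sphinx API.

```python
def decide(scores, margin=0.2):
    """Pick the higher-scoring candidate, or None if it is too close to call.

    `scores` maps each candidate word to a hypothetical confidence in [0, 1].
    `margin` is how far ahead the winner must be before we trust it.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (best, best_score), (_, runner_up_score) = ranked[0], ranked[1]
    if best_score - runner_up_score < margin:
        return None  # ambiguous: the interface should ask the user again
    return best
```

Returning None for a close call lets the interface repeat the question
rather than act on a coin flip, which is where the interface design
matters.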

They also have a mobile-oriented Sphinx library, which seems to be very
lightweight and might be worth looking into.

One thing I thought of: when someone tells you a number over the phone,
the phone could record it and add it to the address book.
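
As a sketch of the last step only, assuming the recognizer has already
turned the caller's speech into a text transcript, pulling the digits
out could look roughly like this. The word table and the function name
digits_from_transcript are made up for the example, and treating "oh" as
zero is an assumption that will misfire on the interjection.

```python
# Spoken digit words mapped to digit characters (English only).
DIGITS = {
    "zero": "0", "oh": "0", "one": "1", "two": "2", "three": "3",
    "four": "4", "five": "5", "six": "6", "seven": "7", "eight": "8",
    "nine": "9",
}


def digits_from_transcript(text):
    """Return each unbroken run of spoken digit words as a digit string."""
    runs, current = [], []
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in DIGITS:
            current.append(DIGITS[word])
        elif current:
            # A non-digit word ends the current run.
            runs.append("".join(current))
            current = []
    if current:
        runs.append("".join(current))
    return runs
```

The application would still need to ask the user to confirm the number
before saving it to the address book.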

Lots of cool stuff you could do :)

- -brandon
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFIVeVaWSn2Kv7ZyAoRAoOKAJ9V2psUqf9TniZYMUbPp83hvm9lOgCfSaDI
qlZ6A+HqDGzZDKpUDaj+oDA=
=Wgj4
-----END PGP SIGNATURE-----



