IBM released a modified, multimodal version of the Opera web browser for
the older-style Zaurus (Embedix Linux) that supports voice interaction
tags: the "WebSphere Everyplace Multimodal Environment".  I've played
around with it, and it works pretty well.

By using XML (XHTML plus VoiceXML, actually) and defining limited-domain
voice tags within a document, it can distinguish spoken numbers, names,
pizza toppings, etc. without any training.  The engine should be able to
handle, for example, a screenful of 9-16 icons by name plus basic menus.
As long as each item consists of a distinct series of phonemes,
recognition is smooth.  (It doesn't need to hear the difference between
'whiter' and 'writer'; it's not speech-to-text.)
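The "distinct series of phonemes" point can be checked before you put a
vocabulary into a page.  Below is a minimal sketch, assuming hand-written
ARPAbet-style transcriptions (the VOCABULARY dict, confusable_pairs and the
min_distance cutoff are all hypothetical, not part of IBM's toolkit).  It
flags word pairs whose phoneme sequences are only an edit or so apart.  A
real recognizer scores audio against acoustic models, so phoneme edit
distance is only a rough proxy for confusability.

```python
from itertools import combinations

# Hypothetical, hand-written ARPAbet-style transcriptions (stress marks
# dropped); a real engine would use its own pronunciation lexicon.
VOCABULARY = {
    "whiter": ["W", "AY", "T", "ER"],
    "writer": ["R", "AY", "T", "ER"],
    "pepperoni": ["P", "EH", "P", "ER", "OW", "N", "IY"],
    "mushrooms": ["M", "AH", "SH", "R", "UW", "M", "Z"],
    "olives": ["AA", "L", "IH", "V", "Z"],
}


def phoneme_distance(a, b):
    """Levenshtein edit distance between two phoneme sequences."""
    prev = list(range(len(b) + 1))
    for i, pa in enumerate(a, start=1):
        curr = [i]
        for j, pb in enumerate(b, start=1):
            cost = 0 if pa == pb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def confusable_pairs(vocab, min_distance=2):
    """Return (word1, word2, distance) for pairs whose phoneme sequences
    differ by fewer than min_distance edits (an assumed, arbitrary cutoff)."""
    pairs = []
    for (w1, p1), (w2, p2) in combinations(sorted(vocab.items()), 2):
        d = phoneme_distance(p1, p2)
        if d < min_distance:
            pairs.append((w1, w2, d))
    return pairs


if __name__ == "__main__":
    print(confusable_pairs(VOCABULARY))
```

Run as-is, only 'whiter' and 'writer' (one substituted phoneme) get
flagged; the pizza toppings are far enough apart that a limited-domain
grammar containing just those should have no trouble telling them apart.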

I, for one, have found the 'voice tags' on my cell phones irritating, but
I've always wanted to be able to just recite a number and store or dial
it, or fire up the calculator and run some calculations, without pressing
buttons or navigating menus.  Between that and Flite (the Festival Lite
speech synthesis engine, available for the Zaurus and various ARM Linux
distros), you have the underpinnings of some very interesting possibilities.


