[SHR] illume predictive keyboard is too slow

Olof Sjobergh olofsj at gmail.com
Fri Jan 30 08:31:43 CET 2009


On Fri, Jan 30, 2009 at 4:25 AM, The Rasterman Carsten Haitzler
<raster at rasterman.com> wrote:
> On Thu, 29 Jan 2009 08:30:44 +0100 Olof Sjobergh <olofsj at gmail.com> said:
>
>> On Wed, Jan 28, 2009 at 11:16 PM, The Rasterman Carsten Haitzler
>> <raster at rasterman.com> wrote:
>> > On Wed, 28 Jan 2009 18:59:32 +0100 "Marco Trevisan (Treviño)"
>> > <mail at 3v1n0.net> said:
>> >
>> >> Olof Sjobergh wrote:
>> >> > Unless I missed something big (which I hope I didn't, but I wouldn't
>> >> > be surprised if I did), this is not fixable with the current
>> >> > dictionary lookup design. Raster talked about redesigning the
>> >> > dictionary format, so I guess we have to wait until he gets around to
>> >> > it (or someone else does it).
>> >>
>> >> I think that too. Maybe using something like a "trie" [1] to store the
>> >> words could help (both for word matching and for compressing the
>> >> dictionary).
>> >> Too hard?
>> >>
>> >> [1] http://en.wikipedia.org/wiki/Trie
>> >
>> > the problem here comes with having multiple displays for a single match.
>> > let me take japanese as an example (i hope you have the fonts to see this
>> > at least - though there is no need to understand beyond knowing that there
>> > are a lot of matches that are visibly different):
>> >
>> > sakana ->
>> >  さかな 茶菓な 肴 魚 サカナ 坂な 差かな 左かな 査かな 鎖かな
>> >
>> > unlike simple decimation of é -> e and ë -> e and è -> e etc. you need 1
>> > ascii input string matching one of MANY very different matches. the
>> > european case of
>> >
>> > vogel -> Vogel Vögel
>> >
>> > is a simplified version of the above. the reason i wanted "decimation" to
>> > match a simple roman text (ascii) string is that this is a pretty
>> > universal thing. that's how japanese, chinese and even some korean input
>> > methods work. it also works for european languages too. europeans are NOT
>> > used to the idea of a dictionary guessing/selecting system when they type -
>> > but the asians are. they are always typing and selecting. the smarts come
>> > with the dictionary system selecting the right one more often than not by
>> > default, or the right selection you want being only 1 or 2 keystrokes away.
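
A rough sketch of that decimation in Python (unicodedata here is just a
stand-in for whatever tables the dict format would ship with, and this
only covers the european case - kana and kanji need explicit mapping
tables):

    import unicodedata

    def decimate(word):
        # NFD splits 'ö' into 'o' + a combining diaeresis; drop the marks.
        decomposed = unicodedata.normalize("NFD", word)
        return "".join(c for c in decomposed if not unicodedata.combining(c))

    # One normalised key maps to many visibly different candidates.
    candidates = {}
    for w in ["Vogel", "Vögel"]:
        candidates.setdefault(decimate(w).lower(), []).append(w)

    print(candidates["vogel"])   # ['Vogel', 'Vögel']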
>> >
>> > i was hoping to be able to keep a SIMPLE ascii qwerty keyboard for as much
>> > as possible - so you can just type and it will work and offer the
>> > selections as it's trying to guess anyway - it can present the multiple
>> > accented versions too. this limits the need for special keyboards - it
>> > doesn't obviate them, but allows more functionality out of the box. in the
>> > event users explicitly select an accented char - i.e. a non-ascii
>> > character, it should not "decimate". it should try to match exactly that
>> > char.
>> >
>> > so if you add those keys and use them or flip to another key layout to
>> > select them - you get what you expect. but if i am to redo the dict - the
>> > api is very generic - just the internals and format need changing to be
>> > able to do the above. the cool bit is.. if i manage the above... it has
>> > almost solved asian languages too - and input methods... *IF* the vkbd is
>> > also able to talk to a complex input method (XIM/SCIM/UIM etc.) as
>> > keystroke faking won't let you type chinese characters... :) but in
>> > principle the dictionary and lookup scheme will work - it's then just
>> > mechanics of sending the data to the app in a way it can use it.
>> >
>> > so back to the trie... the trie would only be useful for the ascii matching
>> > - i need something more complex. it just combines the data with the match
>> > tree (letters are inline). i need a match tree + lookup table to other
>> > matches to display - and possibly several match entries (all the matches to
>> > display also need to be in the tree pointing to a smaller match list).
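
To make the shape of that concrete, here is a hypothetical version of
such a match tree plus a shared candidate table (names and layout
invented, not the actual plan):

    # Match tree over normalised input; terminal nodes point into a
    # shared table of display candidates instead of storing one word.
    class TrieNode:
        def __init__(self):
            self.children = {}   # letter -> TrieNode
            self.matches = []    # indices into candidate_table

    candidate_table = []         # (display_word, frequency) pairs
    root = TrieNode()

    def add(match, display, freq):
        node = root
        for ch in match:
            node = node.children.setdefault(ch, TrieNode())
        candidate_table.append((display, freq))
        node.matches.append(len(candidate_table) - 1)

    def lookup(word):
        node = root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return []
        return sorted((candidate_table[i] for i in node.matches),
                      key=lambda t: -t[1])

    add("vogel", "Vogel", 123)
    add("vogel", "Vögel", 45)
    print(lookup("vogel"))   # [('Vogel', 123), ('Vögel', 45)]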
>> >
>> > --
>> > ------------- Codito, ergo sum - "I code, therefore I am" --------------
>> > The Rasterman (Carsten Haitzler)    raster at rasterman.com
>>
>> I think most problems could be solved by using a dictionary format
>> similar to what you describe above, i.e. something like:
>>
>> match : candidate1 candidate2; frequency
>> for example:
>> vogel : Vogel Vögel; 123
>>
>> That would mean you can search on the normalised word, where a simple
>> strcmp works fine and will be fast enough. To keep the dictionary from
>> getting too large, the following shorter syntax could also be accepted:
>> eat; 512     // No candidates, just show the match as is
>> har här hår; 1234    // Also show the match itself as a candidate
>>
>> If you think this would be good enough, I could try to implement it.
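
If it helps the discussion, a quick sketch of how such lines could be
parsed (assuming ';' separates off the frequency, ':' separates match
from candidates, and without a ':' every listed word is shown,
including the match itself):

    def parse_line(line):
        line = line.split("//")[0].strip()   # drop comments
        if not line:
            return None
        body, freq = line.rsplit(";", 1)
        if ":" in body:
            match, cands = body.split(":", 1)
            candidates = cands.split()
        else:
            words = body.split()
            match, candidates = words[0], words
        return match.strip(), candidates, int(freq)

    print(parse_line("vogel : Vogel Vögel; 123"))
    # ('vogel', ['Vogel', 'Vögel'], 123)
    print(parse_line("har här hår; 1234"))
    # ('har', ['har', 'här', 'hår'], 1234)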
>>
>> Another problem with languages like Swedish, and also Japanese, is the
>> heavy use of conjugation. For example, in Japanese the verbs 食べる and
>> 考える can both be conjugated in the same way like this:
>> 食べる 食べました 食べた 食べている 食べていた 食べています 食べていました
>> 考える 考えました 考えた 考えている 考えていた 考えています 考えていました
>>
>> Another example, the Swedish nouns:
>> bil bilen bilar bilarna bilens bilarnas
>>
>> But including all these forms in a dictionary makes it very large,
>> which is impractical. So some way to indicate possible conjugations
>> would be good, but it would make the dictionary format a lot more
>> complex.
>
> the real problem is... how on EARTH will such a dictionary get written? who
> will write all of that? the advantage of the simple "just list lots of words
> and ALL their forms" approach is that it's easy - it can be generated by
> just scanning documents - indexing words and counting the frequency of each
> (throw enough documents/books/newspaper articles etc. at it and you get a
> good cross-section of the language). once you have a structured format that
> requires you to list a match string - then all the possible visible matches
> (as per the japanese etc. examples above) - and even then have to list
> conjugations... who will write that - how do you get the info? this is all
> fine and well designing in a vacuum for something we can in theory discuss
> and say "well if the dict had this word like this.. and this one like this..
> etc. - it'd work" - which it just may - very well.. the problem is adding
> the 1000's and 1000's of those words and all that data in. as i don't see a
> good solution to that - i'm not redesigning a dict format until i know that
> dicts can be sanely created in such a format... "build it and they will
> come" doesn't work... there's no point building it if you know they won't
> come, as it's pretty much never going to happen that the dicts can be
> created (that are useful).
>
> so it's not really size - it's just all that data. (if you have the
> conjugated versions you can start compressing via: bil ~en ~ar ~arna ~ens
> ~arnas) though you will need frequency usage data for each conjugated form
> for matching etc.
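
Expanding that kind of suffix compression back into full forms is
mechanical; a tiny sketch (the '~' entry syntax is taken straight from
the line above, per-form frequencies left out):

    def expand(entry):
        parts = entry.split()
        stem = parts[0]
        return [stem] + [stem + p.lstrip("~") for p in parts[1:]]

    print(expand("bil ~en ~ar ~arna ~ens ~arnas"))
    # ['bil', 'bilen', 'bilar', 'bilarna', 'bilens', 'bilarnas']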
>
> right now the only thing that makes sense is a trie flattened to a file you
> can mmap, with the letters in each trie node already expanded to ucs2 (not
> utf8) for simple matching. but this doesn't solve mapping accents
> (decimating them) etc. etc.
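
To make the mmap idea concrete, a sketch of how one flattened node
could be packed (field widths and layout invented):

    import struct

    def pack_node(letter, child_offsets):
        # Node header: letter as ucs2 (uint16), child count (uint16),
        # then one uint32 file offset per child node.
        data = struct.pack("<HH", ord(letter), len(child_offsets))
        data += b"".join(struct.pack("<I", off) for off in child_offsets)
        return data

    node = pack_node("v", [64, 128])
    letter, nchildren = struct.unpack_from("<HH", node, 0)
    print(chr(letter), nchildren)   # v 2

Matching a keystroke against a node is then a single 16-bit compare,
with no utf8 decoding in the hot path.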
>
> perhaps a dict needs to have a trie and then a bunch of rules on mappings...
> but this precludes kanji "sanely" as at least in japanese every kanji has a
> load of readings - and mapping back will be royally nasty and probably not work
> as most will be invalid for the combination...
>
> --
> ------------- Codito, ergo sum - "I code, therefore I am" --------------
> The Rasterman (Carsten Haitzler)    raster at rasterman.com
>

I agree that trying to handle conjugations sanely is too much trouble.
And as you point out, the data is hard to find or generate (though
anthy, which does Japanese input, uses such data; I don't know how it
was generated).

But I think a dictionary format in plain utf8 that includes the
normalised words as well as any candidates to display would be the
best way. Then the dictionary itself could choose which characters to
normalise and which to leave as is. So for Swedish, you can leave å, ä
and ö as they are but normalise é, à etc. Searching would be as simple
as in your original implementation (no need to convert from multibyte
format).

Example of how this could look:
har 1234                    // Simple words with freq data
här 1345
hår 1223
a : à 12                     // Normalised word, show only à as candidate

This way, you could also include abbreviations if you want, for example:
e är 11111                  // Show both e and är as candidates when writing "e"
mvh : "med vänliga hälsningar" 1234    // Abbreviated sentence

And generating new dictionaries would be simple. It's easy to write a
small script that, given a list of words (e.g. the current dictionary)
and a normalisation table, produces a correctly formatted and sorted
dictionary. Even today you still have to figure out how to sort the
dictionary, so a small script that does everything might even simplify
things a bit.
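
A minimal sketch of such a generation script (the file name handling,
the "word frequency" input format and the normalisation table contents
are all made up here):

    # dictgen.py: word list + normalisation table in, sorted dict out.
    import sys

    # Per-language choice: normalise é/è/à but leave å, ä, ö alone
    # for Swedish (table contents are just an example).
    NORMALISE = {"é": "e", "è": "e", "à": "a"}

    def normalise(word):
        return "".join(NORMALISE.get(c, c) for c in word)

    entries = {}
    # Input assumed to be one "word frequency" pair per line.
    for line in open(sys.argv[1], encoding="utf-8"):
        word, freq = line.split()
        entries.setdefault(normalise(word), []).append((word, int(freq)))

    for key in sorted(entries):
        cands = entries[key]
        freq = sum(f for _, f in cands)   # summed here for simplicity
        if len(cands) == 1 and cands[0][0] == key:
            print(f"{key} {freq}")
        else:
            print(f"{key} : {' '.join(w for w, _ in cands)} {freq}")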

Those are just a few ideas. If people like the approach, I might even
try to implement it. =)

Best regards,

Olof



