[SHR] illume predictive keyboard is too slow

Carsten Haitzler (The Rasterman) raster at rasterman.com
Fri Jan 30 04:25:08 CET 2009


On Thu, 29 Jan 2009 08:30:44 +0100 Olof Sjobergh <olofsj at gmail.com> said:

> On Wed, Jan 28, 2009 at 11:16 PM, The Rasterman Carsten Haitzler
> <raster at rasterman.com> wrote:
> > On Wed, 28 Jan 2009 18:59:32 +0100 "Marco Trevisan (Treviño)"
> > <mail at 3v1n0.net> said:
> >
> >> Olof Sjobergh wrote:
> >> > Unless I missed something big (which I hope I didn't, but I wouldn't
> >> > be surprised if I did), this is not fixable with the current
> >> > dictionary lookup design. Raster talked about redesigning the
> >> > dictionary format, so I guess we have to wait until he gets around to
> >> > it (or someone else does it).
> >>
> >> I think that too. Maybe using something like a "trie" [1] to archive the
> >> words could help (both for words matching and for compressing the
> >> dictionary).
> >> Too hard?
> >>
> >> [1] http://en.wikipedia.org/wiki/Trie
> >
> > the problem here comes with having multiple displays for a single match.
> > let me take japanese as an example (i hope you have the fonts to see this
> > at least - though there is no need to understand beyond knowing that there
> > are a lot of matches that are visibly different):
> >
> > sakana ->
> >  さかな 茶菓な 肴 魚 サカナ 坂な 差かな 左かな 査かな 鎖かな 鎖かな
> >
> > unlike simple decimation of é -> e and ë -> e and è -> e etc. you need 1
> > ascii input string matching one of MANY very different matches. the
> > european case of
> >
> > vogel -> Vogel Vögel
> >
> > is a simplified version of the above. the reason i wanted "decimation to
> > match a simple roman text (ascii) string is - that this is a pretty
> > universal thing. thats how japanese, chinese and even some korean input
> > methods work. it also works for european languages too. europeans are NOT
> > used to the idea of a dictionary guessing/selecting system when they type -
> > but the asians are. they are always typing and selecting. the smarts come
> > with the dictionary system selecting the right one more often than not by
> > default or the right selection you want being only 1 or 2 keystrokes away.
> >
> > i was hoping to be able to keep a SIMPLE ascii qwerty keyboard for as much
> > as possible - so you can just type and it will work and offer the
> > selections as it's trying to guess anyway - it can present the multiple
> > accented versions too. this limits the need for special keyboards - doesn't
> > obviate it, but allows more functionality out of the box. in the event
> > users explicitly select an accented char - ie a non-ascii character, it
> > should not "decimate". it should try match exactly that char.
> >
> > so if you add those keys and use them or flip to another key layout to
> > select them - you get what you expect. but if i am to redo the dict - the
> > api is very generic - just the internals and format need changing to be
> > able to do the above. the cool bit is.. if i manage the above... it has
> > almost solved asian languages too - and input methods... *IF* the vkbd is
> > also able to talk to a complex input method (XIM/SCIM/UIM etc.) as
> > keystroke faking wont let you type chinese characters... :) but in
> > principle the dictionary and lookup scheme will work - its then just
> > mechanics of sending the data to the app in a way it can use it.
> >
> > so back to the trie... the trie would only be useful for the ascii matching
> > - i need something more complex. it just combines the data with the match
> > tree (letters are inline). i need a match tree + lookup table to other
> > matches to display - and possibly several match entries (all the matches to
> > display also need to be in the tree pointing to a smaller match list).
> >
> > --
> > ------------- Codito, ergo sum - "I code, therefore I am" --------------
> > The Rasterman (Carsten Haitzler)    raster at rasterman.com
> 
> I think most problems could be solved by using a dictionary format
> similar to what you describe above, i.e. something like:
> 
> match : candidate1 candidate2; frequency
> for example:
> vogel : Vogel Vögel; 123
> 
> That would mean you can search on the normalised word where simple
> strcmp works fine and will be fast enough. To not make it too large
> for example the following syntax could also be accepted:
> eat; 512     // No candidates, just show the match as is
> har här hår; 1234    // Also show the match itself as a candidate
> 
> If you think this would be good enough, I could try to implement it.
> 
> Another problem with languages like Swedish, and also Japanese, is the
> heavy use of conjugation. For example, in Japanese the verbs 食べる and
> 考える can both be conjugated in the same way like this:
> 食べる 食べました 食べた 食べている 食べていた 食べています 食べていまし
> た考える 考えました 考えた 考えている 考えていた 考えています 考えていま
> した
> 
> Another example, the Swedish nouns:
> bil bilen bilar bilarna bilens bilarnas
> 
> But including all these forms in a dictionary makes it very large,
> which is impractical. So some way to indicate possible conjugations
> would be good, but it would make the dictionary format a lot more
> complex.

the real problem is... how on EARTH will such a dictionary get written? who
will write all of that? the advantage to the simple "just list lots of words
and ALL their forms" is easy - it can be generated by just scanning documents -
indexing words and counting frequency of each (throw enough
documents/books/newspaper articles etc. at it and you get a good cross-section
of the language). once you have a structured format that requires you list a
match string - then all the possible visible matches (as per the japanese etc.
examples above) and even then have to list conjugations ... who will write that
- how do you get the info. this is all find and well designing in a vacuum for
something we can in theory discuss and say "well if the dict had this word like
this.. and this one like this .. etc. - it'd work" which it just may - very
well.. the problem is.. adding the 1000's and 1000's of those words and all
that data in. as i dont see a good solution to that - i'm not redesigning a
dict format until i know that dicts can be sanely created in such a format...
"build it and they will come" doesn't work... no point building it... if you
know they won't come as its pretty much never going to happen that the dicts
can be created (that are useful).

so it's not really size - it's just all that data. (if u have the conjugated
versions you can start compressing via: bil ~en ~ar ~arna ~ens ~arnas) though
you will need frequency usage for each conjugated form for matching etc.

right now the only thing that makes sense is a trie flattened to a file u can
mmap with the letters in each trie node being already expanded to ucs2 (not in
utf8) for simple matching. but this doesn't solve mapping accents (decimating
them) etc. etc.

perhaps a dict needs to have a trie and then a bunch of rules on mappings...
but this precludes kanji "sanely" as at least in japanese every kanji has a
load of readings - and mapping back will be royally nasty and probably not work
as most will be invalid for the combination...

-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    raster at rasterman.com





More information about the community mailing list