[SHR] illume predictive keyboard is too slow

Carsten Haitzler (The Rasterman) raster at rasterman.com
Fri Jan 30 20:12:55 CET 2009


On Fri, 30 Jan 2009 08:31:43 +0100 Olof Sjobergh <olofsj at gmail.com> said:

> On Fri, Jan 30, 2009 at 4:25 AM, The Rasterman Carsten Haitzler
> <raster at rasterman.com> wrote:
> > On Thu, 29 Jan 2009 08:30:44 +0100 Olof Sjobergh <olofsj at gmail.com> said:
> >
> >> On Wed, Jan 28, 2009 at 11:16 PM, The Rasterman Carsten Haitzler
> >> <raster at rasterman.com> wrote:
> >> > On Wed, 28 Jan 2009 18:59:32 +0100 "Marco Trevisan (Treviño)"
> >> > <mail at 3v1n0.net> said:
> >> >
> >> >> Olof Sjobergh wrote:
> >> >> > Unless I missed something big (which I hope I didn't, but I wouldn't
> >> >> > be surprised if I did), this is not fixable with the current
> >> >> > dictionary lookup design. Raster talked about redesigning the
> >> >> > dictionary format, so I guess we have to wait until he gets around to
> >> >> > it (or someone else does it).
> >> >>
> >> >> I think that too. Maybe using something like a "trie" [1] to archive the
> >> >> words could help (both for words matching and for compressing the
> >> >> dictionary).
> >> >> Too hard?
> >> >>
> >> >> [1] http://en.wikipedia.org/wiki/Trie
> >> >
> >> > the problem here comes with having multiple displays for a single match.
> >> > let me take japanese as an example (i hope you have the fonts to see this
> >> > at least - though there is no need to understand beyond knowing that
> >> > there are a lot of matches that are visibly different):
> >> >
> >> > sakana ->
> >> >  さかな 茶菓な 肴 魚 サカナ 坂な 差かな 左かな 査かな 鎖かな 鎖
> >> > かな
> >> >
> >> > unlike simple decimation of é -> e and ë -> e and è -> e etc. you need 1
> >> > ascii input string matching one of MANY very different matches. the
> >> > european case of
> >> >
> >> > vogel -> Vogel Vögel
> >> >
> >> > is a simplified version of the above. the reason i wanted "decimation to
> >> > match a simple roman text (ascii) string is - that this is a pretty
> >> > universal thing. thats how japanese, chinese and even some korean input
> >> > methods work. it also works for european languages too. europeans are NOT
> >> > used to the idea of a dictionary guessing/selecting system when they
> >> > type - but the asians are. they are always typing and selecting. the
> >> > smarts come with the dictionary system selecting the right one more
> >> > often than not by default or the right selection you want being only 1
> >> > or 2 keystrokes away.
> >> >
> >> > i was hoping to be able to keep a SIMPLE ascii qwerty keyboard for as
> >> > much as possible - so you can just type and it will work and offer the
> >> > selections as it's trying to guess anyway - it can present the multiple
> >> > accented versions too. this limits the need for special keyboards -
> >> > doesn't obviate it, but allows more functionality out of the box. in the
> >> > event users explicitly select an accented char - ie a non-ascii
> >> > character, it should not "decimate". it should try match exactly that
> >> > char.
> >> >
> >> > so if you add those keys and use them or flip to another key layout to
> >> > select them - you get what you expect. but if i am to redo the dict - the
> >> > api is very generic - just the internals and format need changing to be
> >> > able to do the above. the cool bit is.. if i manage the above... it has
> >> > almost solved asian languages too - and input methods... *IF* the vkbd is
> >> > also able to talk to a complex input method (XIM/SCIM/UIM etc.) as
> >> > keystroke faking wont let you type chinese characters... :) but in
> >> > principle the dictionary and lookup scheme will work - its then just
> >> > mechanics of sending the data to the app in a way it can use it.
> >> >
> >> > so back to the trie... the trie would only be useful for the ascii
> >> > matching
> >> > - i need something more complex. it just combines the data with the match
> >> > tree (letters are inline). i need a match tree + lookup table to other
> >> > matches to display - and possibly several match entries (all the matches
> >> > to display also need to be in the tree pointing to a smaller match list).
> >> >
> >> > --
> >> > ------------- Codito, ergo sum - "I code, therefore I am" --------------
> >> > The Rasterman (Carsten Haitzler)    raster at rasterman.com
> >>
> >> I think most problems could be solved by using a dictionary format
> >> similar to what you describe above, i.e. something like:
> >>
> >> match : candidate1 candidate2; frequency
> >> for example:
> >> vogel : Vogel Vögel; 123
> >>
> >> That would mean you can search on the normalised word where simple
> >> strcmp works fine and will be fast enough. To not make it too large
> >> for example the following syntax could also be accepted:
> >> eat; 512     // No candidates, just show the match as is
> >> har här hår; 1234    // Also show the match itself as a candidate
> >>
> >> If you think this would be good enough, I could try to implement it.
> >>
> >> Another problem with languages like Swedish, and also Japanese, is the
> >> heavy use of conjugation. For example, in Japanese the verbs 食べる and
> >> 考える can both be conjugated in the same way like this:
> >> 食べる 食べました 食べた 食べている 食べていた 食べています 食べてい
> >> ました考える 考えました 考えた 考えている 考えていた 考えています 考
> >> えていました
> >>
> >> Another example, the Swedish nouns:
> >> bil bilen bilar bilarna bilens bilarnas
> >>
> >> But including all these forms in a dictionary makes it very large,
> >> which is impractical. So some way to indicate possible conjugations
> >> would be good, but it would make the dictionary format a lot more
> >> complex.
> >
> > the real problem is... how on EARTH will such a dictionary get written? who
> > will write all of that? the advantage to the simple "just list lots of words
> > and ALL their forms" is easy - it can be generated by just scanning
> > documents - indexing words and counting frequency of each (throw enough
> > documents/books/newspaper articles etc. at it and you get a good
> > cross-section of the language). once you have a structured format that
> > requires you list a match string - then all the possible visible matches
> > (as per the japanese etc. examples above) and even then have to list
> > conjugations ... who will write that
> > - how do you get the info. this is all find and well designing in a vacuum
> > for something we can in theory discuss and say "well if the dict had this
> > word like this.. and this one like this .. etc. - it'd work" which it just
> > may - very well.. the problem is.. adding the 1000's and 1000's of those
> > words and all that data in. as i dont see a good solution to that - i'm not
> > redesigning a dict format until i know that dicts can be sanely created in
> > such a format... "build it and they will come" doesn't work... no point
> > building it... if you know they won't come as its pretty much never going
> > to happen that the dicts can be created (that are useful).
> >
> > so it's not really size - it's just all that data. (if u have the conjugated
> > versions you can start compressing via: bil ~en ~ar ~arna ~ens ~arnas)
> > though you will need frequency usage for each conjugated form for matching
> > etc.
> >
> > right now the only thing that makes sense is a trie flattened to a file u
> > can mmap with the letters in each trie node being already expanded to ucs2
> > (not in utf8) for simple matching. but this doesn't solve mapping accents
> > (decimating them) etc. etc.
> >
> > perhaps a dict needs to have a trie and then a bunch of rules on mappings...
> > but this precludes kanji "sanely" as at least in japanese every kanji has a
> > load of readings - and mapping back will be royally nasty and probably not
> > work as most will be invalid for the combination...
> >
> > --
> > ------------- Codito, ergo sum - "I code, therefore I am" --------------
> > The Rasterman (Carsten Haitzler)    raster at rasterman.com
> >
> 
> I agree that trying to handle conjugations sanely is too much trouble.
> And as you point out, the data is hard to find or generate (though for
> example anthy that does Japanese input uses it, don't know how
> though).
> 
> But I think a dictionary format in plain utf8 that includes the
> normalised words as well as any candidates to display would be the
> best way. Then the dictionary itself could choose which characters to
> normalise and which to leave as is. So for Swedish, you can leave å, ä
> and ö as they are but normalise é, à etc. Searching would be as simple
> as in your original implementation (no need to convert from multibyte
> format).

the problem is - the dict in utf8 means searching is slow as you do it in utf8
space. the dict is mmaped() to save ram - if it wasnt it'd need to be allocated
in non-swappable ram (its a phone - it has no swap) and thus a few mb of your
ram goes into the kbd dict at all times. by using mmap you leave it to the
kernels paging system to figure it out.

so as such a dict change will mean a non-ascii format in future for this
reason. but there will then need to be a tool to generate such a file.

> Example of how this could look:
> har 1234                    // Simple words with freq data
> här 1345
> hår 1223
> a : à 12                     // Normalised word, show only à as candidate
> 
> This way, you could also include abbreviations if you want, example:
> e  är 11111                                           // Show both e
> and är as candidates when writing "e".
> mvh : "med vänliga hälsningar" 1234       // Abbreviated sentence
> 
> And generation of new dictionaries would be simple. It's easy to write
> a small script that given a list of words (eg. the current dictionary)
> and a normalisation table gives a correctly formated and sorted
> dictionary. Even today you still have to figure out how to sort the
> dictionary, so a small script that does everything might even simplify
> it a bit.
> 
> That's just my few ideas. If someone likes it, I might even try to
> implement it. =)
> 
> Best regards,
> 
> Olof
> 


-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    raster at rasterman.com





More information about the community mailing list