[SHR] illume predictive keyboard is too slow

Carsten Haitzler (The Rasterman) raster at rasterman.com
Tue Feb 3 06:25:39 CET 2009


On Mon, 2 Feb 2009 19:39:52 +0200 Kostis Anagnostopoulos <ankostis at gmail.com>
said:

> On Sun 01 Feb 2009 00:31:09 Carsten Haitzler wrote:
> > On Fri, 30 Jan 2009 21:16:57 +0100 Olof Sjobergh <olofsj at gmail.com> said:
> > > On Fri, Jan 30, 2009 at 8:12 PM, The Rasterman Carsten Haitzler
> > > <raster at rasterman.com> wrote:
> > > > On Fri, 30 Jan 2009 08:31:43 +0100 Olof Sjobergh <olofsj at gmail.com>
> > > > said:
> > > >> But I think a dictionary format in plain utf8 that includes the
> > > >> normalised words as well as any candidates to display would be the
> > > >> best way. Then the dictionary itself could choose which characters to
> > > >> normalise and which to leave as is. So for Swedish, you can leave å, ä
> > > >> and ö as they are but normalise é, à etc. Searching would be as simple
> > > >> as in your original implementation (no need to convert from multibyte
> > > >> format).
> > > >
> > > > the problem is - the dict in utf8 means searching is slow, as you do
> > > > it in utf8 space. the dict is mmap()ed to save ram - if it wasn't,
> > > > it'd need to be allocated in non-swappable ram (it's a phone - it has
> > > > no swap) and thus a few mb of your ram would go into the kbd dict at
> > > > all times. by using mmap you leave it to the kernel's paging system
> > > > to figure it out.
> > > >
> > > > as such, a dict change will mean a non-ascii format in future, for
> > > > this reason. but there will then need to be a tool to generate such a
> > > > file.
> > >
> > > Searching in utf8 doesn't have to be slow. Simple strcmp works fine on
> > > multibyte utf8 strings as well, and should be as fast as the dictionary
> > > was before the multibyte-to-widechar conversions were added. But if you
> > > have some other idea in mind, please don't let me disturb. =)
> >
> > the problem is - it ISN'T a simple key->value lookup. it's a
> > possible-match tree built on-the-fly. that means you jump about examining
> > 1 character at a time. the problem here is that 1 char may be 1 byte or
> > more, and that makes it really nasty. if it were a simple key lookup for
> > a given simple string - life would be easy. this is possible - but then
> > you'd have to generate ALL permutations first, then look ALL of them up.
> > if you weed out permutations AS you look them up, you can weed out
> > something like 90% of them, as you KNOW there are no words starting with
> > qz... so as you go through qa... qs... qx... qz... you can easily stop
> > all the combinations with qs, qx and qz, as no words begin with those. if
> > you have an 8 letter word with 8 possible letters per character, then
> > each impossible 2-letter prefix you cut avoids 8^6 lookups - all
> > permutations of the other 6 letters. that's 262144 lookups avoided...
> > just there, for 1 impossible permutation prefix. now add it up over all
> > of them.
> 
> Do you consider this paper relevant?
> http://citeseer.ist.psu.edu/schulz02fast.html
> "Fast String Correction with Levenshtein-Automata", (2002),  Klaus Schulz, 
> Stoyan Mihov
> 
> It actually uses tries to avoid exhaustively generating and comparing all
> permutations of the input word (typed keys); instead it traverses *only*
> known words and accumulates errors until a max-errors limit gets exceeded,
> at which point that path dies.

not sure that's that good.. that will drop possible matches - the current
scheme walks the tree of known words using the permutation list to pick paths
- it won't follow paths that don't exist, so that's already done. i was just
saying that you need the permutation list per letter + the walking of the data
to be inherently combined, as without that you need to generate every
permutation and throw it at a 1 key -> value lookup hash. it still uses a trie
(a tree with the letters inlined as part of the node struct). :)
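
to illustrate what i mean by combining them, here's a minimal sketch
(hypothetical node struct and names - NOT the illume code, and in-memory
rather than mmaped). each key press gives a small set of candidate letters;
the walk follows the trie and the candidate sets together, so any branch with
no dictionary words behind it dies immediately:

#include <stdio.h>

#define MAXCAND 8 /* candidate letters per pressed key (assumption) */

typedef struct _Node Node;
struct _Node
{
   char  letter;        /* letter on the edge into this node */
   int   is_word;       /* non-zero if a word ends here */
   Node *child, *next;  /* first child + next sibling */
};

/* cand[i] holds the candidate letters for key press i, ncand the number
 * of presses. pruning happens for free: we only ever follow children
 * that actually exist in the trie AND are in the candidate set. */
static void
walk(const Node *n, char cand[][MAXCAND], int ncand, int depth, char *buf)
{
   if (!n) return;
   if (depth == ncand)
     {
        if (n->is_word) printf("match: %s\n", buf);
        return;
     }
   for (const Node *c = n->child; c; c = c->next)
     for (int i = 0; cand[depth][i]; i++)
       if (c->letter == cand[depth][i])
         {
            buf[depth] = c->letter;
            buf[depth + 1] = 0;
            walk(c, cand, ncand, depth + 1, buf);
         }
}

int
main(void)
{
   /* hand-built trie for the words "at" and "as" (made-up data) */
   Node s = { 's', 1, NULL, NULL };
   Node t = { 't', 1, NULL, &s };
   Node a = { 'a', 0, &t, NULL };
   Node root = { 0, 0, &a, NULL };
   /* 2 key presses, each with a few neighbouring candidate letters */
   char cand[2][MAXCAND] = { "aqs", "tgr" };
   char buf[3];
   walk(&root, cand, 2, 0, buf); /* prints "match: at" only */
   return 0;
}

the "as" branch exists in the trie, but 's' isn't a candidate for the 2nd
press, so it's cut without ever being generated as a permutation.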

just reading the abstract tho.. the document is 67 pages i have to dig
through...
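
from the abstract alone the core idea looks roughly like this (my rough guess,
reusing the Node struct from the sketch above - and only handling
substitutions; a real levenshtein automaton also handles inserted/deleted
letters, which is what the paper builds the machinery for):

/* typed[] is what the user actually hit. walk only known words (trie
 * paths), accumulating errors; a path dies as soon as its error count
 * would pass max_err. */
static void
fuzzy_walk(const Node *n, const char *typed, int len, int depth,
           int err, int max_err, char *buf)
{
   if (!n) return;
   if (depth == len)
     {
        if (n->is_word) printf("match (%d errors): %s\n", err, buf);
        return;
     }
   for (const Node *c = n->child; c; c = c->next)
     {
        int e = err + (c->letter != typed[depth]);
        if (e > max_err) continue; /* this path dies */
        buf[depth] = c->letter;
        buf[depth + 1] = 0;
        fuzzy_walk(c, typed, len, depth + 1, e, max_err, buf);
     }
}

with the trie above, fuzzy_walk(&root, "qt", 2, 0, 0, 1, buf) finds "at" with
1 error and kills the "as" path at the 2nd letter.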

> It describes a mathematical model for correcting typos,
> but since i have already implemented it (in java)
> i now think it can be retrofitted to perform what you describe in:
> http://wiki.openmoko.org/wiki/Illume_keyboard

sure - but can it be implemented so all the data is mmaped from files? that's
the biggest problem. the first dict for illume (before the current one) used a
27-way-per-node tree - lookups were hyper-fast, but it ate ram. i went to the
opposite end, where i just mmaped the text file and built a very small 2-level
char offset lookup table to avoid ram usage. this isn't that fast - but it was
ok. i know i could improve the parsing by having it all in ucs2 to avoid
slower utf8 decoding, and with line jump-tables built into the file it'd avoid
scanning a whole line to jump to the next entry when a match fails. as such
it's more a matter of just having a fast dict format that can be mmaped and
walked easily while spooling off the permutations of chars per letter (and
thus being able to spot a match and calculate its relative distance).
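
to give a feel for that 2nd approach, here's a rough standalone sketch
(hypothetical code, not the actual illume dict format - it assumes a plain
sorted lowercase word-per-line ascii file and builds the 2-level offset table
once at load):

#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/stat.h>

static int
slot(char c) /* map a byte to 0-25 for a-z, 26 for anything else */
{
   if ((c >= 'a') && (c <= 'z')) return c - 'a';
   return 26;
}

int
main(int argc, char **argv)
{
   struct stat st;
   int fd;

   if (argc < 2) return 1;
   fd = open(argv[1], O_RDONLY);
   if (fd < 0) return 1;
   if (fstat(fd, &st) < 0) return 1;
   /* map the whole dict read-only: the kernel can drop and re-fault
    * these pages under memory pressure, unlike malloc'd ram */
   const char *dict = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
   if (dict == MAP_FAILED) return 1;

   /* 2-level char offset table: off[a][b] = byte offset of the first
    * word starting with letters a, b (-1 if no such words exist) */
   static long off[27][27];
   for (int i = 0; i < 27; i++)
     for (int j = 0; j < 27; j++) off[i][j] = -1;
   for (long pos = 0; pos < st.st_size; )
     {
        int a = slot(dict[pos]);
        int b = (pos + 1 < st.st_size) ? slot(dict[pos + 1]) : 26;
        if (off[a][b] < 0) off[a][b] = pos;
        while ((pos < st.st_size) && (dict[pos] != '\n')) pos++;
        pos++; /* skip the newline */
     }

   /* a lookup jumps straight to the region for a 2-letter prefix, so an
    * impossible prefix like "qz" costs one table read and nothing else */
   long p = off[slot('t')][slot('h')];
   if (p >= 0) printf("words starting with \"th\" begin at offset %ld\n", p);

   munmap((void *)dict, st.st_size);
   close(fd);
   return 0;
}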

> Keep up the good work.
>   Kostis


-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    raster at rasterman.com




