New Rasterman Image...

Thu Oct 2 11:27:20 CEST 2008

On Thu, 02 Oct 2008 00:59:20 -0700 Steve Mosher <steve at openmoko.com> babbled:

aha! a "decent" frequency corpus (a few thousand words). i'll merge this with
the default us dict (and remove the small one as now that's useless).

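a rough sketch of what such a merge could look like in python, assuming both files already use the dict format described in the quoted mail below (one word per line, optionally followed by a usage count). the helper names parse_dict and merge_dicts are hypothetical and not part of illume. treating a missing count as 0 and adding up the counts of duplicate words are assumptions too. the BNC lists would presumably need converting to this format first, which is not shown here.

```python
def parse_dict(lines):
    """Parse 'word' or 'word count' lines into a {word: count} mapping."""
    freqs = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        # assumption: a word listed without a count gets a count of 0
        count = int(parts[1]) if len(parts) > 1 else 0
        freqs[word] = freqs.get(word, 0) + count
    return freqs


def merge_dicts(base, extra):
    """Return a new mapping holding the summed counts of both dictionaries."""
    merged = dict(base)
    for word, count in extra.items():
        # assumption: counts for a word present in both dicts are added
        merged[word] = merged.get(word, 0) + count
    return merged


default_dict = parse_dict(["word1", "word2", "word3"])
corpus_dict = parse_dict(["word1 20", "word2 434", "word4 1"])
print(merge_dicts(default_dict, corpus_dict))
# {'word1': 20, 'word2': 434, 'word3': 0, 'word4': 1}
```

words known only to the plain list keep a count of 0, so they stay valid candidates but never outrank a word with real frequency data.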
> Hey raster, How's it going.
> 
> I promised you some frequency data a while back.
> http://ucrel.lancs.ac.uk/bncfreq/flists.html
> http://ucrel.lancs.ac.uk/bncfreq/lists/1_2_all_freq.txt
> 
> there are others as well
> 
> Carsten Haitzler (The Rasterman) wrote:
> > On Wed, 1 Oct 2008 21:05:53 -0600 "Ori Pessach" <opessach at gmail.com>
> > babbled:
> > 
> >> I understand what it's doing. It's not doing it well. I tried it for shell
> > 
> > i disagree. it works like a charm for me - as per my previous mail - i can
> > use it while walking down the street. more than i can say for pretty much
> > any other virtual keyboard i have available to me.
> > 
> >> input, and it was an unusable mess. I tried it for text messaging, and it
> > 
> > why someone would use a language dictionary-based corrective keyboard for
> > shell input beats me! in this case i call "silly user - using a motorcycle
> > to deliver elephants" line :) use the terminal keyboard. use a stylus.
> > thats what it was meant for. :)
> > 
> >> was an unusable mess. It has no model of the likelihood of erroneous input
> > 
> > it does. it absolutely does. maybe your fingers are incredibly off-center?
> > here is the algorithm (and if u don't believe me - code is there to be
> > read):
> > 
> > it stores a press POINT (x,y). it looks for all keys whose center point is
> > WITHIN f distance of x,y (f being the fuzz value - the .kbd file for the
> > qwerty Default keyboard is 135 units wide, with fuzz radius of 20, so that's
> > about 1/3rd of the keyboard that it searches through for a likely match).
> > a likelihood factor (distance) is allocated per key found, based on distance
> > (0 == most likely, > 0 less likely the greater the value). each press is
> > done this way EXCEPT if u hold for 0.25 sec then drag to select a key
> > explicitly in zoom mode - then the ONLY key available for that word slot is
> > that letter selected given a distance of 0. as you type all permutations of
> > letters are searched and put into a list - with each permutation given a
> > distance metric based on the letters used (simply addition of the
> > distances). now this is combined with the dictionary's frequency metric
> > (multiplied by an inverse) so the more likely the word is to be used the
> > lower its distance becomes. words are sorted from most to least likely
> > based on this metric then listed with most likely in the middle of the
> > list, least likely to the left/right ends - which you may not see. the
> > vertical list lists all matches from most to least likely (top to bottom)
> > with 1 exception - EXACTLY what u typed is at the top. it absolutely has a
> > fairly good idea of likelihood of error and likelihood of usage of a word
> > etc. etc.
> > 
> > eg:
> > 
> > Press | Guess+dist
> > e       e+0 w+1 r+2 d+2 s+1
> > r       r+0 t+1 e+2 f+1 g+2 d+3
> > k       k+0 l+1 o+2 i+3 j+3
> > d       d+0 f+1 s+1 e+1 c+1 r+2 w+2
> > 
> > so "erkd" has distance 0 - but it's not a word in the dictionary at all, so
> > thrown out. "wrkd" has distance 1, but not a word, "srkd" same, "etkd",
> > "efkd", "erld", etc. etc.
> > 
> > in the end it produces a list where most likely "world" ends up the word
> > with other options too - and this is a much simplified list. mostly the list
> > for candidate letters per input letter is about 10-12 letters. so u have
> > 12*12*12*12 permutations for a 4 letter word - of which a fraction of that
> > space is legitimate words. each permutation has a likelihood value based on
> > press distance and on frequency of usage of that word in language in
> > general in the dictionary.
> > 
> > mind you - i AM talking about illume's keyboard, its algorithms as is in the
> > image i built. if you use something else i cannot comment as it's something
> > else.
> > 
> >> (relatively low) and instead appears to look for the word with the closest
> >> minimum edit distance to the user's input. This is nuts. I have never -
> > 
> > it's not - as the edit distance is the likelihood of error. you likely press
> > the key you want - or near it. thus keys near where you pressed are more
> > likely than those further away. to limit the search, only keys up to a
> > certain distance are considered. chances are that you do this:
> > 
> > fingerprint:
> >   ___
> >  /~~~\
> >  |~~~|
> >  |~~~|
> >   \x/
> >    "
> > 
> > where "x" is the pressure point reported on the touchscreen. the only info
> > the touchscreen reports is the pressure point - nothing else. you think u
> > press somewhere else, but don't. you know what u pressed by what key "pops
> > up" - that lets u know pretty well how good your pressing of the screen is.
> > this is just a hardware limit of a resistive touchscreen. the point of
> > greatest pressure is used - not the middle point of the area in which skin
> > contacts the screen. get the gpe-sketchbook and try pressing with the flat of
> > your finger and see just how far off your press point is. it may surprise
> > you.
> > 
> > as i said - it does have all the model and code and even data to do proper
> > correction based on many factors. i do NOT have a dictionary with frequency
> > info for all of english - there is a "small" english dict (5000 words) with
> > some frequency info in it i managed to gather, but its very small.
> > 
> > if you don't believe me - read the code, or do better. patches accepted,
> > but i think the problem is just that the dictionary has no frequency info
> > by default (a matter of simple lack of data) or how you press the screen. i
> > suggest you pay close attention to how you type and see. yes the "black
> > word" (in the black box) may not be always the word u want - but its most
> > often that word or a word right next to it - as you use it it will learn.
> > if you are using it for non-english stuff then you need a different
> > dictionary.
> > 
> >> literally - gotten the word I typed in. In the common use case, of a user
> >> who enters a correct word, it invariably gets it wrong.
> >>
> >> Understanding what it's doing doesn't make it less of a nuisance.
> > 
> > it does have a concept of frequency of words. i just don't have any DATA for
> > that. the dict format it handles is:
> > word1
> > word2
> > word3
> > 
> > OR
> > word1 20
> > word2 434
> > word3 1
> > 
> > etc.
> > look at the personal dict file. ~/.e/e/dicts-dynamic/personal.dic
> > 
> > it saves usage frequency. this affects lookup likelihood. btw - for me it
> > gets the word most of the time or the word is not the most likely but at
> > least listed as one of the most likely. use it for a bit and it learns and
> > gets better. if you wish to generate a dictionary with frequency info -
> > please do so! i made it really easy.
> > 
> 

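a rough python sketch of the matching described in the quoted mail above: each press yields candidate letters with a distance, every permutation is built and its distances added up, non-dictionary words are thrown out, and what is left is weighted by word frequency. this is NOT illume's actual code. the name rank_candidates, the sample word counts and the exact weighting (1 + distance) / (1 + frequency) are assumptions made up for illustration, since the mail only says the distance is combined with an inverse of the frequency. the press table is the one from the mail.

```python
from itertools import product

# candidate letters and distances per press, copied from the table in the mail
PRESSES = [
    {"e": 0, "w": 1, "r": 2, "d": 2, "s": 1},
    {"r": 0, "t": 1, "e": 2, "f": 1, "g": 2, "d": 3},
    {"k": 0, "l": 1, "o": 2, "i": 3, "j": 3},
    {"d": 0, "f": 1, "s": 1, "e": 1, "c": 1, "r": 2, "w": 2},
]

# hypothetical dictionary: word -> usage count (numbers are invented)
WORDS = {"self": 500, "weld": 20, "egos": 5, "work": 900}


def rank_candidates(presses, dictionary):
    """Return dictionary words reachable from the presses, most likely first."""
    ranked = []
    # try every permutation of one candidate letter per press
    for combo in product(*(p.items() for p in presses)):
        word = "".join(letter for letter, _ in combo)
        if word not in dictionary:
            continue  # not a word, so thrown out
        distance = sum(dist for _, dist in combo)  # simple addition
        # assumption: lower score is better, frequent words get a lower score
        score = (1 + distance) / (1 + dictionary[word])
        ranked.append((score, word))
    ranked.sort()
    return [word for _, word in ranked]


print(rank_candidates(PRESSES, WORDS))
# ['self', 'weld', 'egos']
```

"weld" has the smaller press distance (4 against 5 for "self"), but the much higher count of "self" in the made-up dictionary pulls it to the front. "work" never shows up because "o" is not among the candidates for the second press.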
-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    raster at rasterman.com