New Rasterman Image...

Steve Mosher steve at openmoko.com
Thu Oct 2 18:37:58 CEST 2008


I used to have a bunch of them when I was doing an NLG (natural language
generation) pet project. I sent you a link to US names as well, from the
US census.
For personal dictionaries, people could just run a simple word frequency
analysis on their archived email (there are GPL programs that do this,
I believe, but it's dead easy to write yourself) and import their email
contacts into the database.
(speeling mistakes might require some work, like the one I just did)
If you had access to archived chats or chat logs you could pick up
things like LOL, PITA, etc., or logs of SMS. There are some studies on
word frequency in SMS but I haven't found an online resource.
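As a rough illustration of that kind of pass, here is a minimal Python sketch
that counts word frequencies across plain-text mail archives or chat logs and
prints them as "word count" lines (the file handling and output format are
just my guesses at what would be useful, not any particular GPL tool):

#!/usr/bin/env python
# count word frequencies in plain-text mail/chat archives and print
# "word count" lines, most frequent first
import re
import sys
from collections import Counter

def word_counts(paths):
    counts = Counter()
    word_re = re.compile(r"[A-Za-z']+")
    for path in paths:
        with open(path, errors="ignore") as f:
            for line in f:
                # skip quoted text so only the user's own words are counted
                if line.startswith(">"):
                    continue
                counts.update(w.lower() for w in word_re.findall(line))
    return counts

if __name__ == "__main__":
    for word, count in word_counts(sys.argv[1:]).most_common():
        print(word, count)

Run over SMS or IRC logs the same way, it would pick up the LOL/PITA style
shorthand with real counts attached.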

Carsten Haitzler (The Rasterman) wrote:
> On Thu, 02 Oct 2008 00:59:20 -0700 Steve Mosher <steve at openmoko.com> babbled:
> 
> aha! a "decent" frequency corpus (a few thousand words). i'll merge this with
> the default us dict (and remove the small one, as it's now useless).
> 
>> Hey raster, how's it going?
>>
>> I promised you some frequency data a while back.
>> http://ucrel.lancs.ac.uk/bncfreq/flists.html
>> http://ucrel.lancs.ac.uk/bncfreq/lists/1_2_all_freq.txt
>>
>> there are others as well
>>
>> Carsten Haitzler (The Rasterman) wrote:
>>> On Wed, 1 Oct 2008 21:05:53 -0600 "Ori Pessach" <opessach at gmail.com>
>>> babbled:
>>>
>>>> I understand what it's doing. It's not doing it well. I tried it for shell
>>> i disagree. it works like a charm for me - as per my previous mail - i can
>>> use it while walking down the street. more than i can say for pretty much
>>> any other virtual keyboard i have available to me.
>>>
>>>> input, and it was an unusable mess. I tried it for text messaging, and it
>>> why someone would use a language dictionary-based corrective keyboard for
>>> shell input beats me! in this case i invoke the "silly user - using a
>>> motorcycle to deliver elephants" line :) use the terminal keyboard. use a
>>> stylus. that's what it was meant for. :)
>>>
>>>> was an unusable mess. It has no model of the likelihood of erroneous input
>>> it does. it absolutely does. maybe your fingers are incredibly off-center?
>>> here is the algorithm (and if u don't believe me - the code is there to be
>>> read):
>>>
>>> it stores a press POINT (x,y). it looks for all keys whose center point is
>>> WITHIN f distance of x,y (f being the fuzz value - the .kbd file for the
>>> qwerty Default keyboard is 135 units wide, with a fuzz radius of 20, so
>>> that's about 1/3rd of the keyboard that it searches through for a likely
>>> match). a likelihood factor (distance) is allocated per key found, based
>>> on distance (0 == most likely; the greater the value, the less likely).
>>> each press is done this way EXCEPT if u hold for 0.25 sec then drag to
>>> select a key explicitly in zoom mode - then the ONLY key available for
>>> that word slot is the letter selected, given a distance of 0. as you type,
>>> all permutations of letters are searched and put into a list - with each
>>> permutation given a distance metric based on the letters used (simply the
>>> addition of the distances). now this is combined with the dictionary's
>>> frequency metric (multiplied by an inverse) so the more likely the word is
>>> to be used, the lower its distance becomes. words are sorted from most to
>>> least likely based on this metric, then listed with the most likely in the
>>> middle of the list, least likely toward the left/right ends - which you
>>> may not see. the vertical list shows all matches from most to least likely
>>> (top to bottom) with 1 exception - EXACTLY what u typed is at the top. it
>>> absolutely has a fairly good idea of the likelihood of error and the
>>> likelihood of usage of a word etc. etc.
>>>
>>> eg:
>>>
>>> Press | Guess+dist
>>> e       e+0 w+1 r+2 d+2 s+1
>>> r       r+0 t+1 e+2 f+1 g+2 d+3
>>> k       k+0 l+1 o+2 i+3 j+3
>>> d       d+0 f+1 s+1 e+1 c+1 r+2 w+2
>>>
>>> so "erkd" has distance 0 - but it's not a word in the dictionary at all,
>>> so it's thrown out. "rwkd" has distance 1, but not a word; "srkd" same;
>>> "etkd", "efkd", "erld", etc. etc.
>>>
>>> in the end it produces a list where "world" most likely ends up as the
>>> chosen word, with other options too - and this is a much simplified list.
>>> mostly the list of candidate letters per input letter is about 10-12
>>> letters, so u have 12*12*12*12 permutations for a 4 letter word - of which
>>> only a fraction of that space is legitimate words. each permutation has a
>>> likelihood value based on press distance and on the frequency of usage of
>>> that word in the language in general, as given by the dictionary.
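A rough Python sketch of the search described above - not the actual illume
code; the key layout, fuzz radius and the way distance and frequency are
combined here are invented purely for illustration:

from itertools import product

def candidate_keys(press, key_centers, fuzz):
    # return (letter, distance) for every key whose center lies within
    # fuzz units of the press point
    x, y = press
    found = []
    for letter, (kx, ky) in key_centers.items():
        d = ((kx - x) ** 2 + (ky - y) ** 2) ** 0.5
        if d <= fuzz:
            found.append((letter, d))
    return found

def rank_words(presses, key_centers, fuzz, dictionary):
    # dictionary maps word -> usage frequency (higher = more common)
    per_press = [candidate_keys(p, key_centers, fuzz) for p in presses]
    scored = []
    for combo in product(*per_press):
        word = "".join(letter for letter, _ in combo)
        if word not in dictionary:
            continue                      # not a word: thrown out
        distance = sum(d for _, d in combo)
        # combine press distance with the inverse of usage frequency so a
        # common word's effective distance shrinks (the +1 keeps exact hits
        # from all collapsing to zero - a simplification of my own)
        score = (distance + 1.0) / dictionary[word]
        scored.append((score, word))
    scored.sort()
    return [word for _, word in scored]

An explicitly zoom-selected letter would simply replace that press's candidate
list with a single (letter, 0.0) entry, and the literal sequence of best-guess
letters would still be shown first regardless of score, as described above.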
>>>
>>> mind you - i AM talking about illume's keyboard and its algorithms, as in
>>> the image i built. if you use something else i cannot comment, as it's
>>> something else.
>>>
>>>> (relatively low) and instead appears to look for the word with the closest
>>>> minimum edit distance to the user's input. This is nuts. I have never -
>>> it's not - the edit distance is the likelihood of error. you likely press
>>> the key you want - or near it. thus keys near where you pressed are more
>>> likely than those further away. to limit the search, only keys up to a
>>> certain distance away are considered. chances are that you do this:
>>>
>>> fingerprint:
>>>   ___
>>>  /~~~\
>>>  |~~~|
>>>  |~~~|
>>>   \x/
>>>    "
>>>
>>> where "x" is the pressure point reported on the touchscreen. the only info
>>> the touchscreen reports is the pressure point - nothing else. you think u
>>> press somewhere else, but don't. you know what u pressed by what key "pops
>>> up" - that lets u know pretty well how good your pressing of the screen is.
>>> this is just a hardware limit of a resistive touchscreen. the point of
>>> greatest pressure is used - not the middle point of the area in which skin
>>> contacts the screen. get gpe-sketchbook and try pressing with the flat of
>>> your finger and see just how far off your press point is. it may surprise
>>> you.
>>>
>>> as i said - it does have all the model and code and even data to do proper
>>> correction based on many factors. i do NOT have a dictionary with frequency
>>> info for all of english - there is a "small" english dict (5000 words) with
>>> some frequency info in it that i managed to gather, but it's very small.
>>>
>>> if you don't believe me - read the code, or do better. patches accepted,
>>> but i think the problem is just that the dictionary has no frequency info
>>> by default (a matter of simple lack of data), or in how you press the
>>> screen. i suggest you pay close attention to how you type and see. yes,
>>> the "black word" (in the black box) may not always be the word u want -
>>> but it's most often that word or a word right next to it - and as you use
>>> it, it will learn. if you are using it for non-english stuff then you need
>>> a different dictionary.
>>>
>>>> literally - gotten the word I typed in. In the common use case of a user
>>>> who enters a correct word, it invariably gets it wrong.
>>>>
>>>> Understanding what it's doing doesn't make it less of a nuisance.
>>> it does have a concept of frequency of words. i just don't have any DATA
>>> for that. the dict formats it handles are:
>>> word1
>>> word2
>>> word3
>>>
>>> OR
>>> word1 20
>>> word2 434
>>> word3 1
>>>
>>> etc.
>>> look at the personal dict file. ~/.e/e/dicts-dynamic/personal.dic
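Loading either flavour of that format takes only a few lines; a Python sketch
(defaulting a missing count to 1 is my own assumption, not necessarily what
the keyboard itself does):

def load_dict(path):
    # read "word" or "word frequency" lines into a word -> frequency map
    freqs = {}
    with open(path) as f:
        for line in f:
            parts = line.split()
            if not parts:
                continue
            freqs[parts[0]] = int(parts[1]) if len(parts) > 1 else 1
    return freqs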
>>>
>>> it saves usage frequency. this affects lookup likelihood. btw - for me it
>>> gets the word most of the time, or if not, the word is at least listed as
>>> one of the most likely. use it for a bit and it learns and gets better. if
>>> you wish to generate a dictionary with frequency info - please do so! i
>>> made it really easy.
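In that spirit, a small Python sketch that turns a whitespace-separated
frequency list (like the BNC lists linked earlier) into the "word count"
lines the dict format accepts - the column positions are an assumption, so
inspect the actual file and adjust word_col/freq_col:

import sys

def convert(freq_list_path, word_col=0, freq_col=1):
    # emit "word count" lines from a whitespace-separated frequency list;
    # word_col and freq_col are guesses - check the source file first
    with open(freq_list_path, errors="ignore") as f:
        for line in f:
            parts = line.split()
            if len(parts) <= max(word_col, freq_col):
                continue
            word, freq = parts[word_col], parts[freq_col]
            if not word.isalpha():
                continue
            try:
                # frequencies in such lists are often per-million and may
                # be fractional, so round to an integer count
                count = int(round(float(freq)))
            except ValueError:
                continue
            if count > 0:
                print(word.lower(), count)

if __name__ == "__main__":
    convert(sys.argv[1])

The output could go straight into a dict file, or be merged with counts from
a personal mail archive.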
>>>
>> _______________________________________________
>> Openmoko community mailing list
>> community at lists.openmoko.org
>> http://lists.openmoko.org/mailman/listinfo/community
>>
> 
> 



