New Rasterman Image...

Fri Oct 3 01:34:28 CEST 2008

On Thu, 02 Oct 2008 09:37:58 -0700 Steve Mosher <steve at openmoko.com> babbled:

> I used to have a bunch of them when I was doing an NLG (natural language
> generation) pet project. I sent you a link to US names as well, from the
> US census.
> For personal dictionaries, people could just run a simple word frequency
> analysis on their archived email (there are GPL programs that do this,
> I believe, but it's dead easy to write yourself) and import their email
> contacts into the database.

in fact that is the idea of the 3 dictionaries the keyboard has: "system"
(the base language - eg english), "personal" (any words the user types in at
all go in here - they inherit the frequency they had before but now gain in
count as they get used more), and the "generated" dictionary
- ~/.e/e/dicts-dynamic/data.dic - this file is expected to be regularly
generated from the user's sms's, emails, contact list etc., containing words
from their every-day activity - so that friend with a strange name... gets
their name into the dictionary pool this way. :)
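
a minimal sketch of how such a generated file could be produced, assuming it uses the same "word count" line layout quoted further down for the personal dict. the function name build_generated_dict, the tokenising regex and the sample messages are all made up for illustration - this is not illume code:

```python
import re
from collections import Counter

# assumption: a "word" is a run of letters, optionally joined by apostrophes
WORD_RE = re.compile(r"[^\W\d_]+(?:'[^\W\d_]+)*")

def build_generated_dict(texts):
    """Count word usage across texts and return 'word count' lines,
    most frequent first (ties broken alphabetically)."""
    counts = Counter()
    for text in texts:
        counts.update(WORD_RE.findall(text.lower()))
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [f"{word} {count}" for word, count in ordered]

# hypothetical input: a few sms-style messages
messages = ["lol see you at the pub", "the pub at 8? lol", "zbigniew says hi"]
lines = build_generated_dict(messages)
print("\n".join(lines))
```

the resulting lines could then be written out to the data.dic path above; how often to regenerate the file is left to whatever does the gathering.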

> ( speeling mistakes might require some work, like the one I just did)
> If you had access to archived chats or chat logs you could pick up
> things like LOL, PITA, etc, or logs of SMS. There are some studies on 
> word frequency in SMS but I haven't found an online resource.

yup. and as above. the idea of the generated dictionary would be to capture
these live as usage happens :) there's only so much hunting of data i can (and
will) do as a developer. it's a never-ending game of "gather more data into
dictionaries". i need to put in the mechanisms by which this is possible
(already done) and leave it up to the vast pool of users to fill that in for
each language, country, region etc. :)

> Carsten Haitzler (The Rasterman) wrote:
> > On Thu, 02 Oct 2008 00:59:20 -0700 Steve Mosher <steve at openmoko.com>
> > babbled:
> > 
> > aha! a "decent" frequency corpus (a few thousand words). i'll merge this
> > with the default us dict (and remove the small one as now that's useless).
> > 
> >> Hey raster, How's it going.
> >>
> >> I promised you some frequency data a while back.
> >> http://ucrel.lancs.ac.uk/bncfreq/flists.html
> >> http://ucrel.lancs.ac.uk/bncfreq/lists/1_2_all_freq.txt
> >>
> >> there are others as well
> >>
> >> Carsten Haitzler (The Rasterman) wrote:
> >>> On Wed, 1 Oct 2008 21:05:53 -0600 "Ori Pessach" <opessach at gmail.com>
> >>> babbled:
> >>>
> >>>> I understand what it's doing. It's not doing it well. I tried it for
> >>>> shell
> >>> i disagree. it works like a charm for me - as per my previous mail - i can
> >>> use it while walking down the street. more than i can say for pretty much
> >>> any other virtual keyboard i have available to me.
> >>>
> >>>> input, and it was an unusable mess. I tried it for text messaging, and it
> >>> why someone would use a language dictionary-based corrective keyboard for
> >>> shell input beats me! in this case i use the "silly user - using a
> >>> motorcycle to deliver elephants" line :) use the terminal keyboard. use a
> >>> stylus. that's what it was meant for. :)
> >>>
> >>>> was an unusable mess. It has no model of the likelihood of erroneous
> >>>> input
> >>> it does. it absolutely does. maybe your fingers are incredibly off-center?
> >>> here is the algorithm (and if u don't believe me - code is there to be
> >>> read):
> >>>
> >>> it stores a press POINT (x,y). it looks for all keys whose center point is
> >>> WITHIN f distance of x,y (f being the fuzz value - the .kbd file for the
> >>> qwerty Default keyboard is 135 units wide, with fuzz radius of 20, so
> >>> that's about 1/3rd of the keyboard that it searches through for a likely
> >>> match). likelihood factors (distance) per key found is allocated based on
> >>> distance (0 == most likely, > 0 less likely the greater the value). each
> >>> press is done this way EXCEPT if u hold for 0.25 sec then drag to select
> >>> a key explicitly in zoom mode - then the ONLY key available for that word
> >>> slot is that letter selected given a distance of 0. as you type all
> >>> permutations of letters are searched and put into a list - with each
> >>> permutation given a distance metric based on the letters used (simply
> >>> addition of the distances). now this is combined with the dictionary's
> >>> frequency metric (multiplied by an inverse) so the more likely the word
> >>> is to be used the lower its distance becomes. words are sorted from most
> >>> to least likely based on this metric then listed with most likely in the
> >>> middle of the list, least likely to the left/right ends - which you may
> >>> not see. the vertical list lists all matches from most to least likely
> >>> (top to bottom) with 1 exception - EXACTLY what u typed is at the top. it
> >>> absolutely has a fairly good idea of likelihood of error and likelihood
> >>> of usage of a word etc. etc.
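
the per-press key lookup described above can be sketched roughly like this. it is an illustration under assumptions, not the illume source: the key centre coordinates are invented (not read from a real .kbd file), plain euclidean distance is assumed as the distance measure, and the names candidates and explicit are hypothetical. only the fuzz radius of 20 comes from the text:

```python
import math

FUZZ = 20.0  # fuzz radius quoted above for the default qwerty .kbd

# hypothetical key centres (x, y) in layout units, invented for this sketch
KEY_CENTRES = {
    "w": (20.0, 10.0), "e": (33.0, 10.0), "r": (46.0, 10.0), "t": (59.0, 10.0),
    "s": (26.0, 25.0), "d": (39.0, 25.0), "f": (52.0, 25.0),
}

def candidates(x, y, fuzz=FUZZ):
    """Return (key, distance) pairs for every key whose centre lies within
    fuzz of the press point, nearest (most likely) first."""
    found = []
    for key, (kx, ky) in KEY_CENTRES.items():
        dist = math.hypot(kx - x, ky - y)  # assumed: straight-line distance
        if dist <= fuzz:
            found.append((key, dist))
    return sorted(found, key=lambda pair: pair[1])

def explicit(key):
    """Hold-and-drag selection: the chosen key is the only candidate, distance 0."""
    return [(key, 0.0)]

hits = candidates(34.0, 11.0)  # a press slightly off the centre of "e"
print([key for key, _ in hits])
```

with these made-up coordinates "t" and "f" fall outside the fuzz radius and are never considered for that press.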
> >>>
> >>> eg:
> >>>
> >>> Press | Guess+dist
> >>> e       e+0 w+1 r+2 d+2 s+1
> >>> r       r+0 t+1 e+2 f+1 g+2 d+3
> >>> k       k+0 l+1 o+2 i+3 j+3
> >>> d       d+0 f+1 s+1 e+1 c+1 r+2 w+2
> >>>
> >>> so "erkd" has distance 0 - but it's not a word in the dictionary at all,
> >>> so thrown out. "wrkd" has distance 1, but not a word, "srkd" same, "etkd",
> >>> "efkd", "erld", etc. etc.
> >>>
> >>> in the end it produces a list where most likely "world" ends up the word
> >>> with other options too - and this is a much simplified list. mostly the
> >>> list for candidate letters per input letter is about 10-12 letters. so u
> >>> have 12*12*12*12 permutations for a 4 letter word - of which a fraction
> >>> of that space is legitimate words. each permutation has a likelihood
> >>> value based on press distance and on frequency of usage of that word in
> >>> language in general in the dictionary.
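
a rough sketch of the combine-and-rank step, using the press table from the example above. the word list and its counts are invented, and the score (1 + distance) / frequency is only my stand-in for "multiplied by an inverse" - the real formula is in the illume code, not here. the name rank is hypothetical too:

```python
from itertools import product

# candidate letters and distances per press, copied from the table above
PRESSES = [
    {"e": 0, "w": 1, "r": 2, "d": 2, "s": 1},
    {"r": 0, "t": 1, "e": 2, "f": 1, "g": 2, "d": 3},
    {"k": 0, "l": 1, "o": 2, "i": 3, "j": 3},
    {"d": 0, "f": 1, "s": 1, "e": 1, "c": 1, "r": 2, "w": 2},
]

# hypothetical dictionary: word -> usage count (made-up numbers)
DICTIONARY = {"self": 400, "weld": 30, "egos": 20, "stow": 10, "tree": 500}

def rank(presses, dictionary):
    """Try every letter combination, keep real words, sort most likely first."""
    scored = []
    for combo in product(*(press.items() for press in presses)):
        word = "".join(letter for letter, _ in combo)
        freq = dictionary.get(word)
        if freq is None:
            continue  # e.g. "erkd": distance 0 but not a word, so thrown out
        dist = sum(d for _, d in combo)  # simple addition of key distances
        scored.append(((1 + dist) / freq, word))  # assumed scoring formula
    return [word for _, word in sorted(scored)]

ranked = rank(PRESSES, DICTIONARY)
print(ranked)
```

here 5*6*5*7 = 1050 combinations are tried and only 4 survive the dictionary check; "tree" never shows up because no press offers the letters it needs.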
> >>>
> >>> mind you - i AM talking about illume's keyboard, its algorithms as is in
> >>> the image i built. if you use something else i cannot comment as it's
> >>> something else.
> >>>
> >>>> (relatively low) and instead appears to look for the word with the
> >>>> closest minimum edit distance to the user's input. This is nuts. I have
> >>>> never -
> >>> it's not - as the edit distance is the likelihood of error. you likely
> >>> press the key you want - or near it. thus keys near where you pressed are
> >>> more likely than those further away. to limit search distance only up to a
> >>> certain distance is searched. chances are that you do this:
> >>>
> >>> fingerprint:
> >>>   ___
> >>>  /~~~\
> >>>  |~~~|
> >>>  |~~~|
> >>>   \x/
> >>>    "
> >>>
> >>> where "x" is the pressure point reported on the touchscreen. the only info
> >>> the touchscreen reports is the pressure point - nothing else. you think u
> >>> press somewhere else, but don't. you know what u pressed by what key "pops
> >>> up" - that lets u know pretty well how good your pressing of the screen is.
> >>> this is just a hardware limit of a resistive touchscreen. the point of
> >>> greatest pressure is used - not the middle point of the area in which skin
> >>> contacts the screen. get the gpe-sketchbook and try pressing with the flat
> >>> of your finger and see just how far off your press point is. it may surprise
> >>> you.
> >>>
> >>> as i said - it does have all the model and code and even data to do proper
> >>> correction based on many factors. i do NOT have a dictionary with
> >>> frequency info for all of english - there is a "small" english dict (5000
> >>> words) with some frequency info in it i managed to gather, but its very
> >>> small.
> >>>
> >>> if you don't believe me - read the code, or do better. patches accepted,
> >>> but i think the problem is just that the dictionary has no frequency info
> >>> by default (a matter of simple lack of data) or how you press the screen.
> >>> i suggest you pay close attention to how you type and see. yes the "black
> >>> word" (in the black box) may not be always the word u want - but its most
> >>> often that word or a word right next to it - as you use it it will learn.
> >>> if you are using it for non-english stuff then you need a different
> >>> dictionary.
> >>>
> >>>> literally - gotten the word I typed in. In the common use case, of a user
> >>>> who enters a correct word, it invariably gets it wrong.
> >>>>
> >>>> Understanding what it's doing doesn't make it less of a nuisance.
> >>> it does have a concept of frequency of words. i just don't have any DATA
> >>> for that. the dict format it handles is:
> >>> word1
> >>> word2
> >>> word3
> >>>
> >>> OR
> >>> word1 20
> >>> word2 434
> >>> word3 1
> >>>
> >>> etc.
> >>> look at the personal dict file. ~/.e/e/dicts-dynamic/personal.dic
> >>>
> >>> it saves usage frequency. this affects lookup likelihood. btw - for me it
> >>> gets the word most of the time or the word is not the most likely but at
> >>> least listed as one of the most likely. use it for a bit and it learns and
> >>> gets better. if you wish to generate a dictionary with frequency info -
> >>> please do so! i made it really easy.
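
for anyone generating such a file, here is a small sketch of reading the two layouts shown above (bare words, or word followed by a count). the function name parse_dict is hypothetical, and giving bare words a default count of 1 is my assumption, not something the keyboard is known to do:

```python
def parse_dict(text, default_count=1):
    """Parse 'word' or 'word count' lines into a {word: count} mapping."""
    entries = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        # assumed: a word listed without a count gets default_count
        count = int(parts[1]) if len(parts) > 1 else default_count
        entries[word] = count
    return entries

sample = "word1 20\nword2 434\nword3\n"
parsed = parse_dict(sample)
print(parsed)
```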
> >>>
> >> _______________________________________________
> >> Openmoko community mailing list
> >> community at lists.openmoko.org
> >> http://lists.openmoko.org/mailman/listinfo/community

-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    raster at rasterman.com