ASU keyboards, again

Carsten Haitzler (The Rasterman) raster at
Thu Aug 28 01:50:25 CEST 2008

On Wed, 27 Aug 2008 23:12:59 +0200 "Marco Trevisan (Treviño)" <mail at>

> Carsten Haitzler (The Rasterman) wrote:
> > On Wed, 27 Aug 2008 16:00:50 +0200 "Marco Trevisan (Treviño)"
> > <mail at> babbled:
> >> Well, generally for small words there's a correction list, but it's not 
> >> always complete, and often it suggests words very different from the one 
> >> I'd like to write - but not that one. So maybe it doesn't search the 
> >> whole dictionary. How could I try that?
> >> However, my fingers are not so great...
> >> If you want I can send you my dictionaries, so you'll be able to test 
> >> them better.
> > 
> > hmm. is this english? i am wondering if non-ascii chars are messing it up
> > or not. your dictionary may be useful - i have just been going off my
> > 98,000 or so entry dict from /usr/share/dict/words, which seems big enough
> > for me and has pretty much everything in it... for english anyway. as it's
> > used for spellchecking i kind of assumed it'd be good enough for typing up
> > sms's and emails :) at least in my tests it is listing all the completions
> > i'd expect it to. did you sort -f the illume dict (case-insensitive sort)?
> Yes, it's sorted, and it's an Italian dictionary (so few non-ascii chars); 
> that's why it has so many words. Consider that an Italian dictionary has 
> about 120,000 words to be inflected.
> So from a verb in the infinitive form I can extract about 50 different 
> words, and from nouns and adjectives about 3 each.
> But here (as in the more common Western languages), in most cases, 
> only the suffix differs.
> Imho, a way to reduce the size would be allowing a rule to set suffixes 
> and prefixes (for compound words), which would reduce the dictionary size.
> So, for example, in my dictionary, instead of using 50 lines for each 
> verb I would use only one; e.g.:
> Italian verb "parlare" (to talk) would be (not complete)
>   parl{o,i,a,iamo,ate,ano,avo,avi,ava,avamo,avate,avano,ai,asti,ò,ammo, \
>        aste,arono,erò,erai,erà,eremo,erete,eranno,erei,eresti,erebbe, \
>        eremmo,ereste,erebbero,ii,iamo,iate,ino,assi,asse,assimo, \
>        assero,ino,ando,ante,ato,ata,ati}
> Italian noun "casa" (house) would be
>   cas{a,e}
> Italian adjective "libero" (free [as freedom]) would be
>   liber{a,i,o}

yup yup. don't worry - i understand why :) i speak several languages myself
(not italian - but i did study latin, and speak french, german, english,
japanese, and a usable level of portuguese). i definitely get the language
issues - for both european and asian languages :) yes, the above would reduce
dictionary size - but it would make parsing it much harder.
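
for what it's worth, here's a rough python sketch of what expanding one entry
of the `stem{suffix,...}` format Marco proposes might look like (this is his
proposed syntax, not anything illume actually parses):

```python
def expand(entry):
    # Expand a "stem{a,b,c}" dictionary line into its full word forms;
    # a plain line with no braces is returned as-is.
    if "{" not in entry:
        return [entry]
    stem, rest = entry.split("{", 1)
    suffixes = rest.rstrip("}").split(",")
    return [stem + s for s in suffixes]

print(expand("cas{a,e}"))      # ['casa', 'case']
print(expand("liber{a,i,o}"))  # ['libera', 'liberi', 'libero']
```

the expansion itself is trivial - the harder part is that the expanded forms
are no longer in sorted order on disk, which breaks the simple jump-and-scan
lookup described below.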

right now it's nice and simple and should work with pretty much every language
i can think of that doesn't use input methods and composition (i.e.
japanese/chinese, where you use romaji or pinyin as phonetic representations
of words).

the good bit is:

1. i can mmap() the file trivially.
2. i can build a quick lookup table by scanning through lines and the first 2
chars of each line - using this 2-char "hash" lookup to jump quickly into the
mmap()ed file - then do a (hopefully short) linear search. i keep the search
state between iterations, so it starts where it left off last time to save
more walking.
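
the two-step lookup might look roughly like this in python (a sketch of the
idea only - the real thing is C working over an mmap()ed text file, not an
in-memory list, and this omits the iterative search-state part):

```python
def build_index(words):
    # Jump table: first two chars (lowercased) -> index of the first
    # word in the case-insensitively sorted list with that prefix.
    index = {}
    for i, w in enumerate(words):
        key = w[:2].lower()
        if key not in index:
            index[key] = i
    return index

def complete(words, index, prefix):
    # Jump via the 2-char table, then scan linearly until we leave
    # the bucket of words sharing those two chars.
    key = prefix[:2].lower()
    start = index.get(key)
    if start is None:
        return []
    matches = []
    for w in words[start:]:
        lw = w.lower()
        if lw.startswith(prefix.lower()):
            matches.append(w)
        elif lw[:2] != key:
            break
    return matches

words = sorted(["casa", "case", "parlare", "parli", "parlo", "libero"],
               key=str.lower)
print(complete(words, build_index(words), "parl"))
# ['parlare', 'parli', 'parlo']
```

this is why the case-insensitive sort matters: all completions of a prefix sit
in one contiguous run, so the linear scan after the jump stays short.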

> BTW I don't know if this would improve the keyboard typo-fixing work 
> (maybe yes if also the suffixes/[prefixes?] are sorted)

hmm - no. as long as it is sorted (case-insensitive) at all, then it should
work as the algorithm is simple.

> Anyway, let me know if I should send you the dict I have.

it's italian - right?

> > illume's dict is 6mb? hmm i guess the raw text there has a lot of
> > redundancy :)
> Yes, and this happens because of the things shown above. And I've done 
> only part of the work; I guess that the final dictionary will double in 
> size. And it won't contain any proper names (city names, acronyms...).

hmm. ok. well, apart from the efficiency of dict size and search lengths, a
simple dict-format dictionary should work fine. maybe some utf8 handling etc.
is busted and words with accents get dumped or stopped at. what i do need is a
small set of examples to work from. i can create my own :) i never tested with
anything other than ascii chars (no accents/umlauts etc.), so that's why i
suspect them.

> The standard Italian linux dictionary (/usr/share/dict/italian) weighs 
> 1.2MB but it's mostly incomplete.

aaah. ok. i guess that's not great quality then :)

> > i tried to keep the dictionary simple in illume but am always willing to
> > look at other ways to improve it. though the keyboard is not really a
> > focus of mine - it's something along the way, so there may come a time
> > when i go "well - you want it better.. please.. send a patch!"... but it's
> > fresh on my plate now, so it's active :)
> And this is a great thing, since this phone, without a great virtual 
> keyboard (like the one you're doing), won't be as usable/cool as it 
> should be. Imho this is the killer tool of illume.

thanks :) though really.. there is much more to illume :)

> >> Another thing I'd like to suggest is that imho the backspace/space 
> >> right-to-left/left-to-right dragging is too long. If you try writing 
> >> with your thumbs you'll notice that it's hard to delete a word... Imho 
> >> they should be more sensitive.
> > 
> > from illume's TODO file (in svn):
> > 
> > * kbd needs drag for backspace/next word etc. to be shorter
> > 
> > :) already there. :) well - as with accent normalising - there's a marker
> > there because i realise something needs to be done.
> Nice! :P

hehehe - i just haven't done it. that's all. accent char normalising is easy:

ñ -> n
é -> e
ö -> o

etc. - just strip any accent (and convert to lower case). what i was wondering
about:

æ -> ?
ß -> ? (maybe s?)

and some others where i don't have a simple 1:1 normalisation mapping. so i
kept it a simple tolower() and put the FIXME in. :)
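
a quick python sketch of the normalising described above - unicode NFD
decomposition handles the plain accent cases, plus a small table for the
letters with no combining-mark form (the ß -> ss and æ -> ae choices here are
illustrative guesses, not anything illume ships):

```python
import unicodedata

# Letters that don't decompose into base char + combining accent;
# these mappings are assumptions for illustration only.
SPECIAL = {"ß": "ss", "æ": "ae", "œ": "oe"}

def normalize(word):
    out = []
    for ch in word.lower():
        if ch in SPECIAL:
            out.append(SPECIAL[ch])
            continue
        # NFD splits e.g. 'é' into 'e' + combining acute accent;
        # drop the combining marks (category "Mn") and keep the base.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed
                           if unicodedata.category(c) != "Mn"))
    return "".join(out)

print(normalize("ñ"), normalize("é"), normalize("Öl"))  # n e ol
```

with something like this the dictionary can stay fully accented and only the
match keys get normalised, so "parlo" still completes to "parlò".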

Carsten Haitzler (The Rasterman) <raster at>
