ASU keyboards, again
Carsten Haitzler (The Rasterman)
raster at openmoko.org
Thu Aug 28 01:50:25 CEST 2008
On Wed, 27 Aug 2008 23:12:59 +0200 "Marco Trevisan (Treviño)" <mail at 3v1n0.net>
babbled:
> Carsten Haitzler (The Rasterman) wrote:
> > On Wed, 27 Aug 2008 16:00:50 +0200 "Marco Trevisan (Treviño)"
> > <mail at 3v1n0.net> babbled:
> >> Well, generally for small words there's a correction list, but it's not
> >> always complete and often there are words very different from the one
> >> I'd like to write, but not that one. So maybe it doesn't search in all
> >> the dictionary. Could I try that?
> >> However my fingers are not so great...
> >> If you want I can send you my dictionaries, so you'll be able to test
> >> them in a better way.
> >
> > hmm. is this english? i am wondering if non-ascii chars are messing it up or
> > not. your dictionary may be useful - i have just been going off my 98,000
> > or so entry dict from /usr/share/dict/words which seems to be big enough
> > for me and has pretty much everything in it... for english anyway.
> > as its used for spellchecking i kind of assumed it'd be good enough for
> > typing up sms's and emails :) at least in my tests it is listing all the
> > completions i'd expect it to. did you sort -f the illume dict?
> > (non-case-sensitive sort)?
>
> Yes it's sorted and it's an Italian dictionary (so few non-ascii chars);
> that's why it has so many words. Consider that an Italian dictionary has
> about 120000 words to be declined.
> So from a verb in the infinitive form I can extract about 50 different
> words, from nouns and from adjectives about 3 for each.
> But here (like in the more common Western languages), in most cases,
> only the suffix differs.
>
> Imho, a way to reduce the size would be allowing a rule to set suffix
> and prefix (for composed words) that would reduce the dictionary size.
> So, for example, in my dictionary instead of using 50 lines for each
> verb I would use only one line per verb; i.e.:
>
> Italian verb "parlare" (to talk) would be (not complete)
> parl{o,i,a,iamo,ate,ano,avo,avi,ava,avamo,avate,avano,ai,asti,ò,ammo, \
> aste,arono,erò,erai,erà,eremo,erete,eranno,erei,eresti,erebbe, \
> eremmo,ereste,erebbero,ii,iamo,iate,ino,assi,asse,assimo, \
> assero,ino,ando,ante,ato,ata,ati}
>
> Italian noun "casa" (house) would be
> cas{a,e}
>
> Italian adjective "libero" (free [as freedom]) would be
> liber{a,i,o}
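
For illustration, the stem{suffix,...} notation quoted above could be expanded
back into one word per line with something like the sketch below. This is only
a guess at how such a rule format might be read; expand_rule is a hypothetical
helper, not anything in illume, and it assumes one rule per string with a
single brace group at the end (line continuations already joined).

```python
def expand_rule(rule):
    """Expand 'stem{s1,s2,...}' into full words (hypothetical rule format)."""
    rule = rule.strip()
    if "{" not in rule:
        # A plain word with no suffix group expands to itself.
        return [rule]
    if not rule.endswith("}"):
        raise ValueError("unterminated suffix group: %r" % rule)
    stem, _, rest = rule.partition("{")
    suffixes = [s.strip() for s in rest[:-1].split(",")]
    words = []
    seen = set()
    for suffix in suffixes:
        word = stem + suffix
        # Drop repeats (a suffix may be listed twice) but keep the order.
        if word not in seen:
            seen.add(word)
            words.append(word)
    return words


if __name__ == "__main__":
    for rule in ("cas{a,e}", "liber{a,i,o}"):
        print(rule, "->", expand_rule(rule))
```

The expanded words would still have to be merged and re-sorted before a
line-per-word lookup could use them, which is part of the extra parsing cost.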
yup yup. don't worry - i understand why :) i speak several languages myself
(not italian - but i did study latin, and speak french, german, english,
japanese, some usable level of portuguese). i definitely get the language
issues - for both european and asian languages :) yes. the above would reduce
dictionary size. it would make parsing it much harder.
right now it's nice and simple and should work with pretty much every language i
can think of that doesn't use input methods and composition (as japanese/chinese
do, where you type romaji or pinyin as phonetic representations of words).
the good bit is:
1. i can mmap() the file trivially.
2. i can build a quick lookup table by scanning through lines and the first 2
chars of each line - use this 2 char "hash" lookup to jump quickly to the right
point in the mmap()ed file - then do a (hopefully short) linear search. i keep
the search position between lookups, so the next search starts where the last
one left off, which saves more walking.
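
A rough sketch of the two points above, in Python rather than illume's C, and
not the actual illume code. build_index and completions are made-up names. It
assumes one word per line, sorted so that lines sharing the same case-folded
first two bytes sit next to each other, and it only folds ASCII case. It works
on bytes, so a two-byte UTF-8 character fills the whole key. The "start where
it left off" caching is left out.

```python
import mmap


def build_index(buf):
    """Map the lowercased first two bytes of each line to the offset of the
    first line that starts with them."""
    index = {}
    pos = 0
    size = len(buf)
    while pos < size:
        end = buf.find(b"\n", pos)
        if end == -1:
            end = size
        key = bytes(buf[pos:min(pos + 2, end)]).lower()
        if key and key not in index:
            index[key] = pos
        pos = end + 1
    return index


def completions(buf, index, prefix, limit=10):
    """Jump to the two-byte bucket, then walk lines until the bucket ends."""
    want = prefix.encode("utf-8").lower()
    key = want[:2]
    if len(key) == 2:
        pos = index.get(key)
        if pos is None:
            return []
    else:
        pos = 0  # prefix too short for the table: scan from the top
    out = []
    size = len(buf)
    while pos < size and len(out) < limit:
        end = buf.find(b"\n", pos)
        if end == -1:
            end = size
        line = bytes(buf[pos:end])
        low = line.lower()
        if len(key) == 2 and low[:2] != key:
            break  # left the bucket, nothing further can match
        if low.startswith(want):
            out.append(line.decode("utf-8"))
        pos = end + 1
    return out


def open_dict(path):
    """Map a dictionary file read-only; the result works with both functions."""
    with open(path, "rb") as f:
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```

Both functions only need find(), slicing and len(), so they accept either a
plain bytes object or the mmap returned by open_dict().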
> BTW I don't know if this would improve the keyboard typo-fixing work
> (maybe yes if also the suffixes/[prefixes?] are sorted)
hmm - no. as long as it is sorted (case-insensitive) at all, then it should
work as the algorithm is simple.
> Anyway, let me know if I should send you the dict I have.
it's italian - right?
> > illume's dict is 6mb? hmm i guess the raw text there has a lot of
> > redundancy :)
> Yes and this happens because of the things shown above. And I've made
> only a part of the work; i guess that the final dictionary will double
> this size. And it won't contain any proper names (city names, acronyms...).
hmm. ok. well apart from efficiency of dict size and search lengths a simple
dict-format dictionary should be able to work fine. maybe some utf8 handling
etc. is busted and words with accents get dumped or stopped at. what i do need
is a small set of examples to work from. i can create my own :) i never tested
anything with anything other than ascii chars (no accents/umlauts etc.) so
that's why i suspect them.
> The standard Italian linux dictionary (/usr/share/dict/italian) "weighs"
> 1.2 MB but it's quite incomplete.
aaah. ok. i guess that's not great quality then :)
> > i tried to keep the dictionary simple in illume but am always willing to
> > look at other ways to improve it. though the keyboard is not really a focus
> > of mine
> > - it's something along the way so there may come a time when i go "well- you
> > want it better.. please.. send a patch!"... but its fresh on my plate now,
> > so it's active :)
>
> And this is a great thing, since this phone without a great virtual
> keyboard (like the one you're doing) won't be as usable/cool as it should
> be. Imho this is the killer tool of illume.
thanks :) though really.. there is much more to illume :)
> >> Another thing I'd like to suggest is that imho the backspace/space
> >> right-left/left-right drag distance is too long. If you try writing with
> >> your thumbs you'll notice that it's hard to delete a word... Imho the
> >> gestures should be more sensitive.
> >
> > from illume's TODO file (in svn):
> >
> > * kbd needs drag for backspace/next word etc. to be shorter
> >
> > :) already there. :) well - as with accent normalising - there is a marker
> > that i realise something needs to be done.
>
> Nice! :P
hehehe - i just haven't done it. that's all. accent char normalising is easy:
ñ -> n
é -> e
ö -> o
etc. - just strip any accent (and convert to lower case). what i was wondering
was:
æ -> ?
ß -> ? (maybe s?)
and some others where i don't have a simple 1 : 1 normalisation mapping. so i
kept it a simple tolower() and put the FIXME in. :)
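
One possible way to do the accent stripping described above, sketched in
Python with the standard unicodedata module (not illume's code; normalise and
its extra argument are hypothetical). It decomposes each character, drops the
combining marks and lowercases the rest. Characters with no decomposition,
such as æ and ß, pass through unchanged unless the caller supplies a mapping
for them, so the open question stays open.

```python
import unicodedata


def normalise(word, extra=None):
    """Strip accents and lowercase; 'extra' is an optional caller-supplied
    mapping for characters that have no accent decomposition."""
    # NFD splits e.g. 'ñ' into 'n' followed by a combining tilde.
    decomposed = unicodedata.normalize("NFD", word)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    lowered = stripped.lower()
    if extra:
        lowered = "".join(extra.get(c, c) for c in lowered)
    return lowered


if __name__ == "__main__":
    for w in ("Treviño", "é", "ö", "æ", "ß"):
        print(w, "->", normalise(w))
```

Whatever the mapping ends up being, the same function would have to be applied
to both the typed input and the dictionary words before comparing them.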
--
Carsten Haitzler (The Rasterman) <raster at openmoko.org>