Illume keyboard dictionary sorting and normalization

Carsten Haitzler (The Rasterman) raster at rasterman.com
Wed Jan 7 00:25:39 CET 2009


On Tue, 06 Jan 2009 15:43:35 +0100 Pander <pander at users.sourceforge.net>
babbled:

> Carsten Haitzler (The Rasterman) wrote:
> > On Tue, 6 Jan 2009 11:49:55 +0100 "Olof Sjobergh" <olofsj at gmail.com>
> > babbled:
> > 
> >> Hi,
> >>
> >> I'm working on a Swedish dictionary and keyboard for Illume, but I'm
> >> having some trouble with sorting of utf8 chars in the dictionary. I
> >> can't seem to get the sorting right. Looking at the code, Illume sorts
> >> the dictionary after first normalizing the strings according to the
> >> internal normalization table. Is there any way to reproduce this
> >> sorting with the sort command? I've tried with a few different locales
> >> (C, en_US.utf8) which all make the unix sort command work differently.
> >> But no matter what I try words don't show up correctly.
> > 
> > sort -f i think does it... i think...
> > 
> >> Another issue I found is that the built in normalization table is not
> >> very good for typing Swedish text. On a standard Swedish qwerty
> >> layout, we have three additional letters (å, ä and ö). These are used
> >> very frequently in Swedish and there are many common words that have
> >> different meanings if spelled with a, å or ä (for example har, här and
> >> hår are all very common words). But in Illume these are all normalized
> >> to a. Writing Swedish with a US qwerty layout and then having to
> >> select aåä manually after the dictionary lookup is a pain, since many
> >> common words will have to be selected from the lookup list each time.
> >>
> >> Instead, what you want is a Swedish qwerty layout (which is very
> >> simple to implement as a .kbd file), and not normalize åäö for the
> >> Swedish dictionary lookup. So the normalization table would really
> >> need to be configurable, either as a part of the dictionary or the
> >> .kbd file. I suppose this problem exists for other languages as well.
> >> If I were to work on such a change, what would be the best approach?
> > 
> > hmm interesting i was just going off german/french and portuguese on this
> > where i thought i could get away with simple normalisation and a basic
> > qwerty layout
> > - with selecting the matches (Vogel/Vögel for example). making the table
> > part of the dictionary does make a lot of sense of course. the dict format
> > does need to change to make it a lot faster and intl-char friendly. i
> > avoided this at the time as i'd need to efficiently encode a b-tree in the
> > file and be able to mmap () it efficiently and use it.
> 
> Mapping of cafe to café (French) and Vogel to Vögel (German) is indeed
> handy; this functionality would be useful internationally for most languages.
> 
> What about mapping Koeln to Köln etcetera? This would be handy for
> German only. Like the above story is (maybe) specific for Swedish.

yup. i've gone over this before. i think the solution is a dict change. you
have a match string and a list of possible outputs:

vogel -> Vogel,Vögel
koln -> Köln
koeln -> Köln

etc. etc. - this allows arbitrary mappings from 1 string to any other. should
cover a whole HOST of languages (japanese, chinese and korean included if using
the romanised input methods of these languages). again - whole dict format
change would be needed and it'd be much harder to create dicts.
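
The match-string idea above can be sketched roughly as follows. This is only
an illustration under assumptions, not Illume's dictionary format: build_dict,
lookup and ENTRIES are hypothetical names, and a real implementation would
need an on-disk layout that can be mmap'ed rather than a Python dict.

```python
from collections import defaultdict

# Hypothetical (match string, output word) pairs, taken from the
# examples in this mail. Match strings are plain lowercase ASCII.
ENTRIES = [
    ("vogel", "Vogel"),
    ("vogel", "Vögel"),
    ("koln", "Köln"),
    ("koeln", "Köln"),
]


def build_dict(entries):
    """Group output words under their match string, keeping input order."""
    table = defaultdict(list)
    for key, word in entries:
        if word not in table[key]:
            table[key].append(word)
    return dict(table)


def lookup(table, typed):
    """Return the candidate outputs for what the user typed (may be empty)."""
    return table.get(typed.lower(), [])


table = build_dict(ENTRIES)
print(lookup(table, "vogel"))
print(lookup(table, "Koeln"))
```

Because each match string carries its own list of outputs, no shared
normalization table is needed: a Swedish dict could simply list har, här and
hår under separate match strings.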

> Perhaps an optional config file can be provided for the dictionaries
> that need one. Keeping this info outside the dict itself eases sorting
> of the dict and upgrading dicts. I would keep this optional config
> surely independent of the .kbd keyboard configs.
> 
> Raster, the dicts I'm making for Dutch will be a large version (250.000
> words) and a small version. Do you have an indication how many words is
> advisable for the small version?

you don't really need a small one - i used the small english one because 1. it
was simpler to check my match results in a small set of data and 2. it used less
ram in my initial "in memory only" dict code. in the end there likely needs to
be a major dict format and data content change to basically support all this
stuff. but once done it should cover a whole slew of languages.

> However it would be desirable that each .kbd file can indicate:
> - predictive mode is not possible, e.g. for numeric keyboards. I don't
> want it to remember my PIN, credit card number, etcetera. (numeric
> keyboard, a real one, without the é, ë, ..)

outputting keysyms instead of strings (like Terminal.kbd) bypasses the dict. so
this is how it is effectively turned off.

> - predictive mode is default on, but user can temporarily disable it,
> e.g. when going into a shell (alpha keyboard)

that's what Terminal.kbd is for... ?

> - predictive mode is default off, but user can temporarily enable it,
> e.g. when typing prose inside a shell (terminal keyboard)

of course this can be done - the problem is - where do i "conveniently" attach
all the controls. i guess if no word is composed currently ^ on the top-left
can pop up a control panel.

but for now - kbd is not on my radar - got other things to do at the moment. :(

-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    raster at rasterman.com




