Illume keyboard dictionary sorting and normalization

Pander pander at users.sourceforge.net
Tue Jan 6 15:43:35 CET 2009


Carsten Haitzler (The Rasterman) wrote:
> On Tue, 6 Jan 2009 11:49:55 +0100 "Olof Sjobergh" <olofsj at gmail.com> babbled:
> 
>> Hi,
>>
>> I'm working on a Swedish dictionary and keyboard for Illume, but I'm
>> having some trouble with sorting of utf8 chars in the dictionary. I
>> can't seem to get the sorting right. Looking at the code, Illume sorts
>> the dictionary after first normalizing the strings according to the
>> internal normalization table. Is there any way to reproduce this
>> sorting with the sort command? I've tried with a few different locales
>> (C, en_US.utf8) which all make the unix sort command work differently.
>> But no matter what I try words don't show up correctly.
> 
> sort -f i think does it... i think...
> 
>> Another issue I found is that the built in normalization table is not
>> very good for typing Swedish text. On a standard Swedish qwerty
>> layout, we have three additional letters (å, ä and ö). These are used
>> very frequently in Swedish and there are many common words that have
>> different meanings if spelled with a, å or ä (for example har, här and
>> hår are all very common words). But in Illume these are all normalized
>> to a. Writing Swedish with a US qwerty layout and then having to
>> select aåä manually after the dictionary lookup is a pain, since many
>> common words will have to be selected from the lookup list each time.
>>
>> Instead, what you want is a Swedish qwerty layout (which is very
>> simple to implement as a .kbd file), and not normalize åäö for the
>> Swedish dictionary lookup. So the normalization table would really
>> need to be configurable, either as a part of the dictionary or the
>> .kbd file. I suppose this problem exists for other languages as well.
>> If I were to work on such a change, what would be the best approach?
> 
> hmm interesting i was just going off german/french and portuguese on this where
> i thought i could get away with simple normalisation and a basic qwerty layout
> - with selecting the matches (Vogel/Vögel for example). making the table part
> of the dictionary does make a lot of sense of course. the dict format does need
> to change to make it a lot faster and intl-char friendly. i avoided this at the
> time as i'd need to efficiently encode a b-tree in the file and be able to
> mmap() it efficiently and use it.

Mapping cafe to café (French) and Vogel to Vögel (German) is indeed
handy; this functionality would be useful for most languages.

What about mapping Koeln to Köln, etcetera? This would be handy for
German only, just as the issue described above is (maybe) specific to
Swedish.
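
To make the idea concrete, here is a minimal sketch in Python (not
Illume's C code). It assumes that normalization means folding case and
dropping combining accents. The names normalize, build_index and lookup,
and the keep parameter, are hypothetical and are not taken from Illume;
keep lists the letters a language wants left untouched.

```python
import unicodedata
from collections import defaultdict


def normalize(word, keep=""):
    """Fold case and strip accents, except for characters listed in `keep`."""
    out = []
    # NFC first, so that accented letters arrive as single code points
    for ch in unicodedata.normalize("NFC", word.lower()):
        if ch in keep:
            out.append(ch)  # language-specific letter, left untouched
            continue
        # NFD splits "é" into "e" + combining acute; drop the combining marks
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


def build_index(words, keep=""):
    """Group dictionary words under their normalized form."""
    index = defaultdict(list)
    for w in words:
        index[normalize(w, keep)].append(w)
    return index


def lookup(index, typed, keep=""):
    """Return all dictionary words whose normalized form matches the input."""
    return index.get(normalize(typed, keep), [])
```

With an empty keep, typing "vogel" offers both Vogel and Vögel, and har,
här and hår all collapse to the same key. With keep="åäö" the three
Swedish words stay distinct, which is the behaviour Olof asks for.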

Perhaps an optional config file can be provided for the dictionaries
that need one. Keeping this info outside the dict itself makes it easier
to sort and to upgrade the dicts. I would definitely keep this optional
config independent of the .kbd keyboard configs.
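
A rough sketch of what such an optional config could drive, again in
Python. The JSON format, the "keep" and "expand" fields and all function
names are assumptions made up for illustration; nothing like this exists
in Illume today. The same normalized key is used for sorting, so a
dictionary can be sorted by a small script instead of relying on the
locale-dependent sort command.

```python
import json
import unicodedata

# Hypothetical contents of an optional per-dictionary config file.
# The format and field names are assumptions, not an existing Illume format.
GERMAN_CONFIG = '{"keep": "", "expand": {"ö": "oe", "ä": "ae", "ü": "ue", "ß": "ss"}}'
SWEDISH_CONFIG = '{"keep": "åäö", "expand": {}}'


def load_rules(text=None):
    """Parse the optional config; no config means plain accent stripping."""
    rules = json.loads(text) if text else {}
    return rules.get("keep", ""), rules.get("expand", {})


def sort_key(word, keep="", expand=None):
    """Normalized form used for both sorting and lookup."""
    expand = expand or {}
    out = []
    for ch in unicodedata.normalize("NFC", word.lower()):
        if ch in keep:
            out.append(ch)          # letter the language wants to keep
        elif ch in expand:
            out.append(expand[ch])  # e.g. "ö" -> "oe", so Köln matches Koeln
        else:
            d = unicodedata.normalize("NFD", ch)
            out.append("".join(c for c in d if not unicodedata.combining(c)))
    return "".join(out)


def sort_dictionary(words, config_text=None):
    """Sort by normalized key, using the word itself to break ties."""
    keep, expand = load_rules(config_text)
    return sorted(words, key=lambda w: (sort_key(w, keep, expand), w))
```

Note that with the German "expand" rules Vögel is keyed as "voegel", so
it no longer matches a typed "vogel". A real implementation would
probably have to index both forms; this sketch does not attempt that.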

Raster, the dicts I'm making for Dutch will come in a large version
(250,000 words) and a small version. Do you have an indication of how
many words are advisable for the small version?

However, it would be desirable for each .kbd file to be able to indicate
one of the following:
- predictive mode is not possible, e.g. for numeric keyboards. I don't
want it to remember my PIN, credit card number, etcetera. (numeric
keyboard, a real one, without the é, ë, ...)
- predictive mode is on by default, but the user can temporarily disable
it, e.g. when going into a shell (alpha keyboard)
- predictive mode is off by default, but the user can temporarily enable
it, e.g. when typing prose inside a shell (terminal keyboard)
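
The three cases above could be modelled roughly as follows. This is only
a Python sketch of the state handling; the names Prediction and
KeyboardState are hypothetical, and how a .kbd file would actually
declare the policy is left open.

```python
from enum import Enum


class Prediction(Enum):
    NEVER = "never"              # e.g. numeric keyboard: no lookup, nothing remembered
    DEFAULT_ON = "default-on"    # e.g. alpha keyboard
    DEFAULT_OFF = "default-off"  # e.g. terminal keyboard


class KeyboardState:
    """Runtime state of one keyboard layout."""

    def __init__(self, policy):
        self.policy = policy
        self.enabled = policy is Prediction.DEFAULT_ON

    def toggle(self):
        """User toggle; ignored when the layout forbids prediction."""
        if self.policy is not Prediction.NEVER:
            self.enabled = not self.enabled
        return self.enabled
```

The point of a separate NEVER value is that the user toggle cannot
switch prediction on by accident for a layout used to enter PINs.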


