Illume dictionary for Dutch (Nederlands)

Carsten Haitzler (The Rasterman) raster at rasterman.com
Thu Nov 20 00:14:22 CET 2008


On Wed, 19 Nov 2008 23:25:22 +0100 Pander <pander at users.sourceforge.net>
babbled:

> Hi all,
> 
> Together with http://opentaal.org , I'm working on a special Illume
> dictionary for Dutch word completion. It will be available in the near
> future.
> 
> Of course this particular word list is very long: it contains about
> 250,000 words and has a typical loooong tail. Many words or compositions
> occur seldom in everyday use.
> 
> What would be a good cut off point in number of words, also in terms of
> performance?
> 
> The Portuguese list contains 56,609 words. Is this workable? How many
> does the English contain?

english is about 98,000, but remember english has very few changes in words for
conjugation. i think i need to change the dict format to account for languages
that have more of them, and to compress better. i also need to add a separate
entered text -> visible word mapping. this would cover blind qwerty entry for
accented words, i.e.:

(german)
fass -> Faß
brotchen -> Brötchen

(french)
cafe -> café
etage -> étage
francais -> Français

(japanese)
sakana -> さかな | 魚 | 肴 | 坂な | 茶菓な | 阪な | 差かな | 左かな | 差かな  |
査かな | 鎖かな | サカナ | sakana
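
for the german and french cases above, the typed key can in principle be
derived from the display word automatically. a rough sketch in python (not
illume's actual code; the function names typed_key and build_mapping are made
up for illustration). it assumes that stripping combining accents and
case-folding is enough, which does not hold for japanese, where a reading
dictionary would be needed instead:

```python
import unicodedata


def typed_key(display_word):
    """Derive the plain-qwerty key a user might type for a display word.

    Hypothetical helper: assumes accent stripping plus case folding is
    sufficient, which only suits scripts like German or French.
    """
    # casefold() lowercases and also expands the German sharp s to "ss"
    folded = display_word.casefold()
    # NFD splits accented letters into base letter + combining mark
    decomposed = unicodedata.normalize("NFD", folded)
    # drop the combining marks, keeping only the base letters
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def build_mapping(display_words):
    """Group display words under the key they would be typed as."""
    mapping = {}
    for word in display_words:
        mapping.setdefault(typed_key(word), []).append(word)
    return mapping
```

with this, build_mapping(["cafe", "café"]) puts both display forms under the
single typed key "cafe".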

note that some languages can have 1 romanised input match multiple
(different) displays of that word (japanese is king at this. chinese could
likely be similar if using pinyin). right now the dict format doesn't allow for
this. sure, i can extend it with a list of displayed words. currently the
non-freq format is:

cafe
etage

with freq:

cafe 126
etage 98

i can add a display list:

cafe 126 cafe café
etage 98 étage
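
a reader for the three line shapes above (key only, key + freq, key + freq +
display list) could look roughly like this. it is only a sketch under some
assumptions that the format description does not settle: fields are separated
by whitespace, words contain no spaces, a missing freq is treated as 0, and a
missing display list means the key is displayed as-is. parse_entry is a
made-up name, not an existing illume function:

```python
def parse_entry(line):
    """Parse one dictionary line into (key, freq, displays).

    Assumed shapes (whitespace separated):
        key
        key freq
        key freq display [display ...]
    """
    fields = line.split()
    key = fields[0]
    # assumption: entries without a frequency get 0
    freq = int(fields[1]) if len(fields) > 1 else 0
    # assumption: with no display list, the typed key is shown unchanged
    displays = fields[2:] or [key]
    return key, freq, displays
```

for example, parse_entry("cafe 126 cafe café") gives
("cafe", 126, ["cafe", "café"]), while parse_entry("etage 98") falls back to
displaying "etage" itself.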

but the file will get bigger and bigger and get harder to auto-generate from
input data. right now i am unsure of the exact strategy to take... but i'd like
to cover as many languages as i can with 1 format and have minimal dict size
overhead etc.

-- 
------------- Codito, ergo sum - "I code, therefore I am" --------------
The Rasterman (Carsten Haitzler)    raster at rasterman.com




