Illume dictionary for Dutch (Nederlands)

Pander pander at users.sourceforge.net
Thu Nov 20 10:40:46 CET 2008


Hi all,

I intend to generate the following:
- a: full list, utf-8 (for 8 bit SMS and regular use, default)
- b: full list, utf-8 GSM 03.38[1] (for 7 bit SMS)
- c: truncated list, utf-8 (for 8 bit SMS and regular use)
- d: truncated list, utf-8 GSM 03.38[1] (for 7 bit SMS, default)

[1] The utf-8 characters in this list are within the 7-bit range of GSM
03.38, see http://en.wikipedia.org/wiki/Short_message_service#GSM Note
that more characters

a and b will both have 250,000 words.
b will be a conversion, remapping and normalisation of a.
c and d are truncations and normalisations of a and b respectively.
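
A minimal sketch of how b, c and d could be derived from a. Everything here is an assumption for illustration: the function names, the choice to remap a character by stripping its accent when it is outside GSM 03.38, and the idea that the full list is already sorted by frequency so that truncation means keeping the first N words. In this sketch, words that collapse to the same spelling after remapping are kept only once, so b can end up slightly shorter than a.

```python
import unicodedata

# Letters of the GSM 03.38 default alphabet, plus apostrophe and hyphen.
# Assumption: only word characters matter for a dictionary, so digits and
# other punctuation from the default alphabet are left out.
GSM_WORD_CHARS = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    "abcdefghijklmnopqrstuvwxyz"
    "ÄÅÆÇÉÑÖØÜßàäåæèéìñòöøùü"
    "'-"
)


def to_gsm(word):
    """Remap a word to GSM 03.38 characters, or return None if impossible."""
    out = []
    for ch in unicodedata.normalize("NFC", word):
        if ch in GSM_WORD_CHARS:
            out.append(ch)
            continue
        # Drop combining marks, e.g. "ï" -> "i", "ë" -> "e".
        base = "".join(
            c for c in unicodedata.normalize("NFD", ch)
            if not unicodedata.combining(c)
        )
        if not base or any(c not in GSM_WORD_CHARS for c in base):
            return None
        out.append(base)
    return "".join(out)


def unique(words):
    """Remove duplicates while keeping the original order."""
    return list(dict.fromkeys(words))


def build_lists(full_words, limit):
    """Return lists a, b, c, d; full_words is assumed sorted by frequency."""
    a = unique(unicodedata.normalize("NFC", w) for w in full_words)
    b = unique(g for g in map(to_gsm, a) if g is not None)
    return {"a": a, "b": b, "c": a[:limit], "d": b[:limit]}
```

For example, build_lists(words, 50000) would give the four lists with c and d cut off at a hypothetical 50,000 words; the right cut-off is exactly the open question in this thread.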

For utf-16, a simple conversion of the utf-8 files can be used, but I'll
leave this for now. This could result in two extra files.
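
The conversion could look roughly like this. It is only a sketch: the function name and file names are hypothetical, and it assumes the whole word list fits in memory.

```python
from pathlib import Path


def utf8_to_utf16(src, dst):
    """Re-encode a UTF-8 word list as UTF-16 (written with a byte order mark)."""
    text = Path(src).read_text(encoding="utf-8")
    Path(dst).write_text(text, encoding="utf-16")
```

Usage would be something like utf8_to_utf16("nl-full.txt", "nl-full-utf16.txt"), where both file names are made up for the example.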

Note that neither extended nor non-extended ASCII is available. Would that
be desirable? It could result in four extra files.

So, I can come up with 10 different files. Which do you think are the
most useful?

Regards,

Pander

On Thu, November 20, 2008 08:58, Rui Miguel Silva Seabra wrote:
> On Thu, Nov 20, 2008 at 03:02:41AM +0100, "Marco Trevisan (Treviño)"
> wrote:
>> Pander wrote:
>> > Of course this particular word list is very long and contains about
>> > 250,000 words and has a typical loooong tail. Many words are
>> > compositions or occur seldom in everyday use.
>> >
>> > What would be a good cut off point in number of words, also in terms
>> > of performance?
>> >
>> > The Portuguese list contains 56,609 words. Is this workable? How many
>> > does the English contain?
>>
>> The Italian one can also count 500'000 words (in short), but I only get
>> a well-working dictionary by using a smaller one (about 150'000 words,
>> which I selected by their Google popularity).
>>
>> Btw I've written more complete posts about this on the list...
>
> Well, since my list was based on a million words taken from the most
> printed daily newspaper in Portugal (I didn't count but still I removed
> a lot of non words like numbers, etc...) already with frequency data, my
> job was so much easier... :)
>
> As for writing SMS/text messages... I haven't found yet a word that
> wasn't there (in fact my problem is that it so often is the first of
> several matches so I have to use the menu on the left) but I must
> confess to not be one of those whose primary use of the phone is
> SMS/text!
>
> Rui
>
> --
> Frink!
> Today is Prickle-Prickle, the 32nd day of The Aftermath in the YOLD 3174
> + No matter how much you do, you never do enough -- unknown
> + Whatever you do will be insignificant,
> | but it is very important that you do it -- Gandhi
> + So let's do it...?