New Rasterman Image...

"Marco Trevisan (Treviño)" mail at 3v1n0.net
Wed Oct 8 19:15:44 CEST 2008


Carsten Haitzler (The Rasterman) wrote:
> On Thu, 02 Oct 2008 20:52:00 +0200 "Marco Trevisan (Treviño)" <mail at 3v1n0.net>
> babbled:
>> So this is a little utility I wrote [1] to check the frequency of each
>> word and writing back a new dictionary with frequency data.
>>
>> To run it you need php-cli (I guess v5 or above), set the given options,
>> do "php words-popularity.php" and wait the work to be finished! :P
>>
>> It could be a long work, but it should give good results.
> 
> yes. it would. who wants to run it? :)

I've done it for about 420000 words. Dividing the work across 5 shells went
quite well and took a few hours, but now Google has blocked it. I didn't know
that I wasn't allowed to do that :/.
I figure we should change our source :P.

> nb. i checked illume's kbd code - it does have issues with utf8 keysequences in
> sorted dicts. if you have any it'll fail to keep looking for more words so you
> need to remove anything utf8 from your dict :( yes - i know. bad. i need to
> address this. and the change in dict format i am sure 1. makes this now simple,
> 2. compresses the dict, 3. speeds it up, 4. solves this problem. :) but i just
> need to do it - no time right now :(

Yes, I agree with this. A better compressed format would improve
performance and make it possible to add more words. I think the
Qtopia DAWG format is a good example of this.
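
To make the idea a bit more concrete, here is a rough Python sketch of why
a DAWG-style structure is smaller than a plain word list or trie: words
that share prefixes share nodes, and identical suffix subtrees are merged
too. This is only an illustration. It is neither the Qtopia on-disk format
nor illume's code, it ignores frequency data, and every name in it
(Node, build_trie, minimize, count_nodes, contains) is made up for the
example.

```python
class Node:
    """One state of the word graph: outgoing edges plus an end-of-word flag."""
    __slots__ = ("edges", "final")

    def __init__(self):
        self.edges = {}      # character -> Node
        self.final = False   # True if a word ends at this node


def build_trie(words):
    """Build a plain trie: words with a common prefix share that prefix's nodes."""
    root = Node()
    for word in words:
        node = root
        # Python strings iterate by code point, so non-ASCII letters
        # are handled as single characters rather than as raw bytes.
        for ch in word:
            node = node.edges.setdefault(ch, Node())
        node.final = True
    return root


def minimize(node, registry):
    """Merge identical subtrees bottom-up, turning the trie into a DAWG.

    `registry` maps a node signature to the one canonical node kept for it.
    """
    for ch, child in list(node.edges.items()):
        node.edges[ch] = minimize(child, registry)
    # Children are already canonical here, so their identity is a valid
    # stand-in for the whole subtree below them.
    signature = (
        node.final,
        tuple(sorted((ch, id(child)) for ch, child in node.edges.items())),
    )
    return registry.setdefault(signature, node)


def count_nodes(root):
    """Count the distinct nodes reachable from root."""
    seen = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if id(node) in seen:
            continue
        seen.add(id(node))
        stack.extend(node.edges.values())
    return len(seen)


def contains(root, word):
    """Return True if word is stored in the graph."""
    node = root
    for ch in word:
        node = node.edges.get(ch)
        if node is None:
            return False
    return node.final


if __name__ == "__main__":
    words = ["tap", "taps", "top", "tops", "città"]
    trie = build_trie(words)
    print("trie nodes:", count_nodes(trie))
    dawg = minimize(trie, {})
    print("dawg nodes:", count_nodes(dawg))
    print("contains 'tops':", contains(dawg, "tops"))
```

Storing a per-word frequency would need extra care, since merged suffix
nodes are shared by several words; the sketch leaves that out on purpose.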

I just hope you'll find some time for it soon :P.

-- 
Treviño's World - Life and Linux
http://www.3v1n0.net/




