ASU keyboards, again
"Marco Trevisan (Treviño)"
mail at 3v1n0.net
Thu Aug 28 12:33:44 CEST 2008
Carsten Haitzler (The Rasterman) wrote:
> On Thu, 28 Aug 2008 04:37:23 +0200 "Marco Trevisan (Treviño)" <mail at 3v1n0.net>
>> Carsten Haitzler (The Rasterman) wrote:
>>> yup yup. don't worry - i understand why :) i speak several languages myself
>>> (not italian - but i did study latin, and speak french, german, english,
>>> japanese, some usable level of portuguese). i definitely get the language
>>> issues - for both european and asian languages :) yes. the above would
>>> reduce dictionary size. it would make parsing it much harder.
>> I suspected this :/. I had hoped to be wrong...
> i am thinking about this... i have some ideas that may improve this... this is
> my thought train:
> right now format is either:
> word 123\n
> word 23\n
> (sorted case-insensitive).
> the numbers are "frequency of use" so those used more will have more primary
> position in the match list
> 1. add a line skip byte at the start of the line - means skipping to the next
> line will be much faster (just jump N bytes as per the byte - if line > 255
> bytes then byte-jump == 0 and skip the slow way until newline (shouldn't be very
> common))
> 2. extend the line to be:
> word NNN match1 match2 match3 ~suffix1 ~suffix2\n
Ok, but on the other hand this kind of format wouldn't track the
frequency of use of derived words (words composed of a base plus a given
suffix); that, actually, is one of the good points of the current scheme.
> so now we have the ability to match and "append" a suffix. suffix is ~XXX and
> full replacement words are just listed. this should remain fast as i only
> "lookup" on the first word on the line that is the initial match - so it
> builds a list of candidates. the problem is that once you exceed the "base" it
> needs to dynamically build matches for all combinations of base + extension.
> also for full replacements (as in the last 2 lines) it needs to be able to
> match these as well, so they end up being full entries too. the real problem is
> generating such a dictionary - i tried to keep the dict format so simple that
> it was trivial to generate. but it'd solve your problem.
Well, before testing my heavy dictionary with illume, I hoped it would
work well, but I knew that all this redundancy could cause problems in
parsing (both from the performance point of view and from the memory
usage one).
I figure this kind of implementation could help in these situations
(which I don't think are so uncommon; I guess that - at least for
other Latin-derived languages - the dictionary file would be much
bigger than the ones in /usr/share/dict).
> anyway. if i am going to go expand the dictionary format, i really need to be
> careful. i kept it simple because i didn't want to solve the worlds dictionary
> problems - i did want to keep it basic but working. as best i can tell the OM
> userbase is still mainly western-speaking (yes - i know we have people here
> from asia! :) not forgetting! just looking at dealing with the majority first!)
> anyway... i am mulling this over. the byte-skip may solve some performance
> issues, but this means i now need a special dict generator tool. i was trying
> to avoid that :(
Yes, I figured this. Maybe you could support both formats (the
current scheme and the improved one). The majority wouldn't need any
tool to generate the dict other than "sort -f".
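I mean, for the current one-word-per-line format, generating a dict from a
"word frequency" list really needs nothing more than (file names here are
just examples):

```shell
# Case-insensitive sort of "word frequency" lines, which is all the
# current dict format requires.
sort -f words.txt > dictionary.txt
```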
> as per above - your idea of having a list of suffixes led me down
> the above path. i have a feeling it still isn't perfect, but it's an
> improvement. it means the dict now knows about prefix and suffix and so when u
> type the "root" of a word that is conjugated, the dict can even offer the
> conjugated forms as matches. that's good for western languages
Yes, it is, and then prediction (as opposed to typo fixing) would be
easier too.
Treviño's World - Life and Linux