Automatically splitting the planet by language
Alex Osborne
ato at meshy.org
Tue Oct 7 03:38:54 CEST 2008
Hello everyone,
I've been experimenting with splitting the planet's feed by language.
Here's the result:
http://meshy.org/~ato/planet_om/
As you can see, it's doing a pretty good job with at least English,
French and German. I've got it running as an hourly cron job, so we'll
see whether it keeps performing this well. planet_split.tar.gz contains
the source of the script. All I'm doing to detect the language is feeding
the text of each post into a Python port [1] of TextCat [2].
Cheers,
Alex
[1] http://thomas.mangin.me.uk/data/source/ngram.py
[2] http://odur.let.rug.nl/~vannoord/TextCat/