Automatically splitting the planet by language

Alex Osborne ato at meshy.org
Tue Oct 7 03:38:54 CEST 2008


Hello everyone,

I've been experimenting with splitting the planet's feed by language. 
Here's the result:

http://meshy.org/~ato/planet_om/

As you can see, it's doing a pretty good job with at least English,
French and German.  I've got it running from an hourly cron job, so
we'll see whether it continues to perform this well.  The
planet_split.tar.gz file contains the source of the script.  All I'm
doing to detect the language is feeding the text of each post into a
Python port [1] of TextCat [2].
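
For anyone curious how this style of detection works, here is a rough
sketch of the n-gram ranking idea that TextCat is based on.  It is not
the code from planet_split.tar.gz and it does not use the API of the
port in [1]; the function names, the tiny sample texts and the
max_n/top values are all made up for illustration.  A real setup would
build its profiles from much larger training texts.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=3, top=300):
    """Map each of the `top` most frequent character n-grams to its rank."""
    counts = Counter()
    # Letters only; digits and punctuation carry little language signal.
    for word in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically so ties are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top])}


def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile get a fixed penalty."""
    max_penalty = len(lang_profile)
    total = 0
    for gram, rank in doc_profile.items():
        if gram in lang_profile:
            total += abs(rank - lang_profile[gram])
        else:
            total += max_penalty
    return total


def guess_language(text, profiles):
    """Return the key of the profile closest to the text's own profile."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))


# Hypothetical, far-too-small training samples, for illustration only.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and then it goes "
          "home to sleep because this is what the fox likes to do",
    "fr": "le renard brun saute par-dessus le chien paresseux et puis il "
          "rentre chez lui pour dormir parce que c'est ce qu'il aime faire",
    "de": "der schnelle braune fuchs springt über den faulen hund und dann "
          "geht er nach hause um zu schlafen weil er das gerne macht",
}
PROFILES = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}

if __name__ == "__main__":
    print(guess_language("ich gehe heute nach hause und schlafe", PROFILES))
```

The language whose profile has the smallest distance wins.  With
samples this small the guesses will be unreliable on short posts, so
treat it as a demonstration of the idea only.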

Cheers,

Alex


[1] http://thomas.mangin.me.uk/data/source/ngram.py
[2] http://odur.let.rug.nl/~vannoord/TextCat/


