[WikiReader] Sharing compiling sources.

David Reyes Samblas Martinez david at tuxbrain.com
Fri Nov 27 23:13:42 CET 2009


Sorry for the wait, Thomas.
I was working on the broken pipe issue that stops the parser
when it hits an error. I have applied a quick and dirty workaround
using a try/except block, so now the process does not stop: it just
skips the faulty article and keeps going :) It logs the faulty ones in
a text file (title and position) for later forensics. My first
guess is that it is not a UTF-8 encoding issue but rather an
unexpected formatting tag that the PHP parser doesn't know how to deal with.
I am currently parsing the German Wikipedia, with more than 1.3 million articles:

Count: 1043000
Failing count: 2

and it keeps going. I suppose we can sacrifice two articles in exchange
for having one million available now :)
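
For anyone who wants to see the idea without opening the attachment, here
is a minimal sketch of the skip-and-log workaround described above. It is
not the code from the attached ArticleParser.py: the names parse_all,
parse_one and flaky_parser are made up for illustration, and the log
layout (a "~title" line followed by a position line) is an assumption
modeled on the list of faulty articles further down in this mail.

```python
import io


def parse_all(articles, parse_one, log):
    """Parse every (title, position, text) tuple, skipping articles that fail.

    parse_one is a hypothetical per-article parser that raises on bad input.
    log is any writable text file object.
    """
    count = 0
    failed = 0
    for title, position, text in articles:
        count += 1
        try:
            parse_one(title, text)
        except Exception:
            # Do not abort the whole run: record the article and move on.
            failed += 1
            log.write("~%s\n%d\n" % (title, position))
    return count, failed


def flaky_parser(title, text):
    # Stand-in for the real parser: it chokes on one made-up tag.
    if "{{bad" in text:
        raise ValueError("unexpected formatting tag in %s" % title)


log = io.StringIO()  # a real run would open a text file instead
articles = [
    ("Good", 10, "plain text"),
    ("Broken", 20, "some {{bad tag"),
    ("Also good", 30, "more text"),
]
count, failed = parse_all(articles, flaky_parser, log)
print("Count:", count)
print("Failing count:", failed)
```

Catching every Exception is as quick and dirty as it sounds; a more
careful version would catch only the errors the parser is known to raise.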

As you requested, I uploaded my working compiled tools[1], but without
any XML sources; it's about 113 MB. If you already have working tools on
your system, you just have to replace
host-tools/offline-renderer/ArticleParser.py with the one attached to this
mail, and you can forget about crying like a child whose ice cream has
fallen on the floor when, after more than 24 hours and hundreds of thousands
of articles parsed, you see an ugly Python error backtrace,
blah blah blah, instead of your desired file :)

By the way, faultyarticles.txt is saved in the same
host-tools/offline-renderer directory (I was too lazy to add a
parameter to change that, so I hardcoded the file name.
Yes... don't waste your typing on correcting that bad habit, I know).

If you are curious which articles in the German wiki are causing trouble
in dewiki-latest-pages-articles.xml (dated 2009-11-20):

~Storck Bicycle
832673
~Musculus serratus posterior inferior
857334
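
If the list grows, a few lines of Python can turn it into (title,
position) pairs for later inspection. This is only a sketch: it assumes
faultyarticles.txt uses exactly the layout shown above (a line starting
with "~" holding the title, then a line holding the position), and
read_faulty is a hypothetical helper, not part of the tools.

```python
def read_faulty(lines):
    """Collect (title, position) pairs from '~title' / 'position' line pairs."""
    entries = []
    title = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith("~"):
            title = line[1:]
        elif title is not None:
            entries.append((title, int(line)))
            title = None
    return entries


# The two entries from this mail, in the assumed file layout.
sample = """~Storck Bicycle
832673
~Musculus serratus posterior inferior
857334
"""

entries = read_faulty(sample.splitlines())
for title, position in entries:
    print(position, title)
```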

Regards. I hope to upload the German Wikipedia on Sunday... so it
will be available on Monday. Sorry for the wait, but my Asymmetric DSL
is very asymmetric, and uploading 1.5-2 GB (expected file size) will take
a bunch of hours.

For those who want to compile their own, go for it :) The
Quickreference in the doc directory of the source is all you need to
start working. Just remember that if you have a 64-bit system you
will have to follow the 64-bit method to compile the tools.

Regards
[1]http://tuxbrain.org/downloads/wikireader/wikireaderbinaries20091127_dsamblas_modified_trycatch.tar.bz2
David Reyes Samblas Martinez
http://www.tuxbrain.com
Open ultraportable & embedded solutions
Openmoko, Openpandora,  Arduino
Hey, watch out!!! There's a linux in your pocket!!!




2009/11/27 Thomas HOCEDEZ <thomas.hocedez at free.fr>:
> Thomas HOCEDEZ wrote:
>>
>> Hi David,
>>
>> Can you share your scripts & configs to do the same in French (and other
>> languages) ?
>> Thanks
>>
>> Thomas
>>
>>
>
> As the mailing list seems to be broken (or users started hibernating for
> winter...) I found by myself the way to compile things step by step.
> I'm now rendering the French Wikipedia. As it started a few minutes ago,
> the result will be available during the weekend (I hope).
>
> I'll also post the way I managed to do so ! (I'm at the office for now, and
> I'm leaving...)
>
> Regards to you all !
>
> Thomas
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ArticleParser.py_dsamblas_modified_try_catch.tar.bz2
Type: application/x-bzip2
Size: 2845 bytes
Desc: not available
Url : http://lists.openmoko.org/pipermail/community/attachments/20091127/a79077ef/attachment.bin 

