[WikiReader] Sharing compiling sources.

Tilman Baumann tilman at baumann.name
Mon Nov 30 12:03:57 CET 2009


Can you maybe release this as a patch?
I'd like to integrate this on GitHub, but I fear I might miss something if I
try to pick out the changes by hand.
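
For what it's worth, a patch like this is just a unified diff between the original and the modified ArticleParser.py. Here is a minimal sketch using Python's difflib. The helper name make_patch and the two sample file contents are made up for illustration; they are not taken from the real file:

```python
import difflib


def make_patch(old_text, new_text, path):
    """Return a unified diff (as one string) between two versions of a file."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff = difflib.unified_diff(
        old_lines,
        new_lines,
        fromfile="a/" + path,
        tofile="b/" + path,
    )
    return "".join(diff)


# Hypothetical before/after contents, only to show the shape of the output.
old = "result = parse(article)\n"
new = (
    "try:\n"
    "    result = parse(article)\n"
    "except Exception:\n"
    "    pass\n"
)

patch = make_patch(old, new, "host-tools/offline-renderer/ArticleParser.py")
print(patch)
```

A diff in this form can be applied to a checkout with the usual patch tooling, which avoids copying the changes over by hand.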


David Reyes Samblas Martinez wrote:
> Sorry for the wait Thomas,
> I was working to solve the broken pipe issue that stops the parser
> when it finds an error. I have applied a quick and dirty workaround
> using the try-catch technique, so now the process does not stop: it just
> skips the faulty article and keeps going :) It logs the faulty ones in
> a text file (title and position) for later forensics. My first
> guess is that this is not a UTF-8 encoding issue but rather an
> unexpected formatting tag that the PHP parser doesn't know how to deal with.
> Currently parsing the German Wikipedia, with more than 1.3 million articles:
> Count: 1043000
> Failing count: 2
> and it keeps going. I suppose we can sacrifice two articles to have
> one million available now :)
> As you requested, I uploaded my working compiled tools [1], but without
> any XML sources; it's about 113 MB. If you already have working tools on
> your system, you just have to replace
> host-tools/offline-renderer/ArticleParser.py with the one attached to this
> mail, and you can forget about crying like a child whose ice cream has
> fallen on the floor when, after more than 24 hours of parsing and hundreds
> of thousands of articles processed, you see an ugly Python error backtrace,
> blablabla, and not your desired file :)
> By the way, faultyarticles.txt is saved in the same
> host-tools/offline-renderer directory (I'm too lazy to add a
> parameter to change that, so I hardcoded the name of the file;
> yes... don't waste typing on correcting that bad habit, I know).
> If you are curious which articles on the German wiki are causing
> trouble,
> in dewiki-latest-pages-articles.xml (date 2009-11-20):
> ~Storck Bicycle
> 832673
> ~Musculus serratus posterior inferior
> 857334
> Regards. I hope to upload the German Wikipedia on Sunday, and it
> will be available on Monday. Sorry for the wait, but my asymmetric DSL
> is very asymmetric, and uploading 1.5-2 GB (expected file size) will take
> a bunch of hours.
> For those who want to compile their own, go for it :) The
> Quickreference in the doc directory of the source is all you need to
> start working. Just remember that if you have a 64-bit system you
> will have to follow the 64-bit method to compile the tools.
> Regards
> [1]http://tuxbrain.org/downloads/wikireader/wikireaderbinaries20091127_dsamblas_modified_trycatch.tar.bz2
> David Reyes Samblas Martinez
> http://www.tuxbrain.com
> Open ultraportable & embedded solutions
> Openmoko, Openpandora,  Arduino
> Hey, watch out!!! There's a linux in your pocket!!!
> 2009/11/27 Thomas HOCEDEZ <thomas.hocedez at free.fr>:
>> Thomas HOCEDEZ wrote:
>>> Hi David,
>>> Can you share your scripts & configs to do the same in French (and
>>> other languages)?
>>> Thanks
>>> Thomas
>> As the mailing list seems to be broken (or users started hibernating for
>> winter...), I found by myself the way to compile things step by step.
>> I'm now rendering the French Wikipedia. As it started a few minutes ago,
>> the result will be available during the weekend (I hope).
>> I'll also post the way I managed to do so! (I'm at the office for now, and
>> I'm leaving...)
>> Regards to you all !
>> Thomas
> _______________________________________________
> Openmoko community mailing list
> community at lists.openmoko.org
> http://lists.openmoko.org/mailman/listinfo/community
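
The skip-and-log workaround David describes above can be sketched roughly as follows. This is not his actual ArticleParser.py change, only an illustration of the idea: parse_article is a hypothetical stand-in for the real parser call, and the log layout (a "~title" line followed by a position line) is an assumption modelled on the two entries he lists:

```python
import io


def parse_article(title, text):
    """Hypothetical stand-in for the real parser; raises on input it cannot handle."""
    if "<badtag>" in text:
        raise ValueError("unexpected formatting tag")
    return text.upper()


def parse_all(articles, log):
    """Parse (title, position, text) tuples, skipping and logging any failures.

    Returns (count, failing_count). One bad article no longer aborts the run.
    """
    count = 0
    failed = 0
    for title, position, text in articles:
        count += 1
        try:
            parse_article(title, text)
        except Exception:
            # Record the faulty article and carry on with the next one.
            failed += 1
            log.write("~%s\n%d\n" % (title, position))
    return count, failed


# Usage with an in-memory log; a real run would open faultyarticles.txt instead.
log = io.StringIO()
articles = [("A", 1, "ok"), ("B", 2, "<badtag>"), ("C", 3, "fine")]
count, failed = parse_all(articles, log)
print("Count:", count)
print("Failing count:", failed)
```

Catching every exception is deliberately blunt: it trades a few lost articles for a run that finishes, and the log keeps enough detail to investigate the failures afterwards.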

