[WikiReader] Sharing compiling sources.

Thomas Hocedez thomas.hocedez at free.fr
Sat Nov 28 00:17:32 CET 2009


Don't be sorry, everyone can have a professional and personal life
too ;-) !

As I'm not familiar with Python and was stuck in a dead end, skipping the
faulty article and retrying was the solution I had chosen too, but I never
got around to implementing it. Thank you for showing me how.
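
For reference, here is a rough sketch of how I understand the skip-and-log
approach described in David's mail below. The names process_articles, articles
and parse are just placeholders I made up, not the real functions from
ArticleParser.py:

import codecs

def process_articles(articles, parse, log_path='faultyarticles.txt'):
    # 'articles' yields (title, text) pairs and 'parse' is whatever routine
    # renders one article; both are stand-ins for the real tool chain.
    count = 0
    failing = 0
    log = codecs.open(log_path, 'w', 'utf-8')
    for position, (title, text) in enumerate(articles):
        try:
            parse(title, text)
        except Exception:
            # skip the faulty article, note its title and position,
            # and keep going instead of aborting the whole run
            failing += 1
            log.write(u'~%s\n%d\n' % (title, position))
        count += 1
        if count % 1000 == 0:
            print('Count: %d' % count)
            print('Failing count: %d' % failing)
    log.close()
    return count, failing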

French Wikipedia parsing crashed just after "Tintin and Milou" (I don't
know if you know that comic strip; its name might be different in Spanish.
It's about a reporter who leads investigations). I'll read the Wikipedia
page to find out what could be wrong with it.
The faulty page is quite big and is a "discussion" page, not really
fun to read. I found what might be wrong: it contains the Spanish
inverted question mark (a ? upside down, i.e. ¿). Other pages (so far)
show lots of strange uppercase accented letters (an uppercase S with an
acute accent, an E too ...), if this can help...
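
In case it helps track this down, here is a tiny standalone script (nothing
from the WikiReader tools, and the file name is made up) that I could run on
the suspect page to list every non-ASCII character it contains:

# -*- coding: utf-8 -*-
# List the non-ASCII characters found in a UTF-8 text file, with counts,
# to see whether the inverted question mark or accented capitals show up.
import codecs
from collections import defaultdict

def non_ascii_chars(path):
    counts = defaultdict(int)
    f = codecs.open(path, 'r', 'utf-8')
    for line in f:
        for ch in line:
            if ord(ch) > 127:
                counts[ch] += 1
    f.close()
    return counts

if __name__ == '__main__':
    for ch, n in sorted(non_ascii_chars('suspect_page.txt').items()):
        print('U+%04X x%d' % (ord(ch), n))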


With your help, a first (but incomplete) release will be generated this
weekend! (My friends will no longer look at me as a silly guy reading the US
Wikipedia in the closet.)

Good night, and thanks again; my computer will do the rest. (And I'll keep an
eye on the problems tomorrow.)

cheers

Thomas



On Friday, 27 November 2009 at 23:13 +0100, David Reyes Samblas Martinez
wrote:
> Sorry for the wait, Thomas.
> I was working on fixing the broken-pipe issue that stops the parser
> when it finds an error. I have applied a quick and dirty workaround
> using a try-catch technique, and now the process does not stop: it just
> skips the faulty article and keeps going :) It logs the faulty ones in
> a text file (title and position) for later forensics. My first guess
> is that it is not an encoding issue with UTF-8; it is more likely an
> unexpected formatting tag the PHP parser doesn't know how to deal with.
> It is currently parsing the German Wikipedia, with more than 1.3 million articles:
> 
> Count: 1043000
> Failing count: 2
> 
> and it keeps going. I suppose we can sacrifice two articles to have
> one million available now :)
> 
> As you requested, I uploaded my working compiled tools [1]. Even without
> any XML sources it's about 113 MB, but if you already have working tools on
> your system you just have to replace
> host-tools/offline-renderer/ArticleParser.py with the one attached to this
> mail, and you can stop crying like a child whose ice cream has
> fallen to the floor when, after more than 24 hours of parsing hundreds of
> thousands of articles, you see an ugly Python error backtrace
> instead of your desired file :)
> 
> By the way, faultyarticles.txt is saved in the same
> host-tools/offline-renderer directory. (I'm too lazy to add a
> parameter to change that, and I hardcoded the name of the file;
> yes... don't waste typing on correcting that bad habit, I know.)
> 
> If you are curious which articles in the German wiki are causing trouble
> in dewiki-latest-pages-articles.xml (dated 2009-11-20):
> 
> ~Storck Bicycle
> 832673
> ~Musculus serratus posterior inferior
> 857334
> 
> Regards. I hope to upload the German Wikipedia on Sunday... and it
> will be available on Monday. Sorry for the wait, but my asymmetric DSL
> is very asymmetric, and uploading 1.5-2 GB (the expected file size) will
> take quite a few hours.
> 
> For those who want to compile their own, go for it :) The
> Quickreference in the doc directory of the source is all you need to
> get started. Just remember that if you have a 64-bit system you
> will have to follow the 64-bit method to compile the tools.
> 
> Regards
> [1]http://tuxbrain.org/downloads/wikireader/wikireaderbinaries20091127_dsamblas_modified_trycatch.tar.bz2
> David Reyes Samblas Martinez
> http://www.tuxbrain.com
> Open ultraportable & embedded solutions
> Openmoko, Openpandora,  Arduino
> Hey, watch out!!! There's a linux in your pocket!!!
> 
> 
> 
> 
> 2009/11/27 Thomas HOCEDEZ <thomas.hocedez at free.fr>:
> > Thomas HOCEDEZ wrote:
> >>
> >> Hi David,
> >>
> >> Can you share your scripts & configs to do the same in French (and other
> >> languages)?
> >> Thanks
> >>
> >> Thomas
> >>
> >>
> >
> > As the mailing list seems to be broken (or users have started hibernating
> > for the winter...), I found my own way to compile things step by step.
> > I'm currently rendering the French Wikipedia. As it started a few minutes
> > ago, the result will be available during the weekend (I hope).
> >
> > I'll also post the way I managed to do it! (I'm at the office right now,
> > and I'm leaving...)
> >
> > Regards to you all!
> >
> > Thomas
> >



