[WikiReader] Sharing compiling sources.

Tim Besard tim.besard at gmail.com
Mon Nov 30 16:01:39 CET 2009


Hi,

It seems that the Dutch Wikipedia contains some article titles with
non-ASCII characters, which crash the parser after all because of the
"system echo" call in the exception handler. Changing the offending line to
    os.system('echo \"%s\" >> fault_articles.txt' % title.encode("utf8"))
fixes the issue.
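
The echo call still passes the title through a shell, so a title that
contains a double quote or a backtick could break the command again. A
minimal sketch of writing the log entry directly from Python instead is
below. The names log_faulty_article, render_all and render_one are
hypothetical and are not taken from ArticleParser.py; the log layout
(title on one line, position on the next) is likewise an assumption.

```python
import io


def log_faulty_article(title, position, path="fault_articles.txt"):
    """Append one faulty article (title, then position) to a UTF-8 log file."""
    # io.open encodes the text itself, so no shell is involved and quotes
    # or non-ASCII characters in the title cannot break anything.
    with io.open(path, "a", encoding="utf-8") as log:
        log.write(u"%s\n%d\n" % (title, position))


def render_all(articles, render_one, log_path="fault_articles.txt"):
    """Render (title, position, text) tuples, skipping the ones that fail.

    render_one is a stand-in for whatever call renders a single article.
    Returns the number of rendered and of skipped articles.
    """
    rendered = failed = 0
    for title, position, text in articles:
        try:
            render_one(title, text)
            rendered += 1
        except Exception:
            # Record the article for later inspection and keep going.
            log_faulty_article(title, position, log_path)
            failed += 1
    return rendered, failed
```

With this approach the title has to be a unicode string when it reaches
log_faulty_article, and the explicit encode("utf8") is no longer needed.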

Tim

On Monday 30-11-2009 at 14:49 [timezone +0100], David Reyes
Samblas Martinez wrote:
> Here you go :)
> 
> David Reyes Samblas Martinez
> http://www.tuxbrain.com
> Open ultraportable & embedded solutions
> Openmoko, Openpandora,  Arduino
> Hey, watch out!!! There's a linux in your pocket!!!
> 
> 
> 
> 
> 2009/11/30 Tilman Baumann <tilman at baumann.name>:
> > Hi,
> >
> > Could you maybe release this as a patch?
> > I would like to integrate this on GitHub, but I fear I might miss
> > something if I try to pick out the changes by hand.
> >
> > Thanks
> >
> > David Reyes Samblas Martinez wrote:
> >> Sorry for the wait, Thomas.
> >> I was working on the broken pipe issue that stops the parser
> >> when it finds an error. I have applied a quick and dirty workaround
> >> using try/except, so the process no longer stops: it just skips the
> >> faulty article and keeps going :) It logs the faulty ones (title and
> >> position) in a text file for later forensics. My first guess is that
> >> this is not a UTF-8 encoding issue but rather an unexpected formatting
> >> tag that the PHP parser doesn't know how to deal with.
> >> I am currently parsing the German Wikipedia, with more than 1.3 million
> >> articles:
> >>
> >> Count: 1043000
> >> Failing count: 2
> >>
> >> and it keeps going. I suppose we can sacrifice two articles to have
> >> one million available now :)
> >>
> >> As you requested, I uploaded my working compiled tools [1], without
> >> any XML sources; the archive is about 113 MB. But if you already have
> >> working tools on your system, you just have to replace
> >> host-tools/offline-renderer/ArticleParser.py with the one attached to
> >> this mail. Then you can forget about crying like a child whose ice
> >> cream has fallen on the floor when, after more than 24 hours and
> >> hundreds of thousands of articles parsed, you see an ugly Python error
> >> backtrace, blablabla, instead of your desired file :)
> >>
> >> By the way, faultyarticles.txt is saved in the same
> >> host-tools/offline-renderer directory (I was too lazy to add a
> >> parameter to change that, so I hardcoded the file name. Yes... don't
> >> waste typing on correcting that bad habit, I know).
> >>
> >> If you are curious which articles of the German wiki are causing
> >> trouble in dewiki-latest-pages-articles.xml (dated 2009-11-20):
> >>
> >> ~Storck Bicycle
> >> 832673
> >> ~Musculus serratus posterior inferior
> >> 857334
> >>
> >> I hope to upload the German Wikipedia on Sunday, so it will be
> >> available on Monday. Sorry for the wait, but my Asymmetric DSL is very
> >> asymmetric, and uploading 1.5-2 GB (expected file size) will take a
> >> bunch of hours.
> >>
> >> For those who want to compile their own, go for it :) The
> >> Quickreference in the doc directory of the source is all you need to
> >> start working. Just remember that if you have a 64-bit system you
> >> will have to follow the 64-bit method to compile the tools.
> >>
> >> Regards
> >> [1]http://tuxbrain.org/downloads/wikireader/wikireaderbinaries20091127_dsamblas_modified_trycatch.tar.bz2
> >> David Reyes Samblas Martinez
> >> http://www.tuxbrain.com
> >> Open ultraportable & embedded solutions
> >> Openmoko, Openpandora,  Arduino
> >> Hey, watch out!!! There's a linux in your pocket!!!
> >>
> >>
> >>
> >>
> >> 2009/11/27 Thomas HOCEDEZ <thomas.hocedez at free.fr>:
> >>> Thomas HOCEDEZ wrote:
> >>>>
> >>>> Hi David,
> >>>>
> >>>> Can you share your scripts & configs to do the same in French (and
> >>>> other languages)?
> >>>> Thanks
> >>>>
> >>>> Thomas
> >>>>
> >>>>
> >>>
> >>> As the mailing list seems to be broken (or users started hibernating for
> >>> winter...), I found out by myself how to compile things step by step.
> >>> I'm now rendering the French Wikipedia. As it started a few minutes ago,
> >>> the result should be available during the weekend (I hope).
> >>>
> >>> I'll also post how I managed to do so! (I'm at the office for now, and
> >>> I'm leaving...)
> >>>
> >>> Regards to you all !
> >>>
> >>> Thomas
> >>>
> >> _______________________________________________
> >> Openmoko community mailing list
> >> community at lists.openmoko.org
> >> http://lists.openmoko.org/mailman/listinfo/community
> >>
> >
> >




