[WikiReader] Sharing compiling sources.

David Reyes Samblas Martinez david at tuxbrain.com
Mon Nov 30 16:13:31 CET 2009


I have modified my Python config to use UTF-8 by default, which is
why I did not notice any conversion errors. Thanks for the tip, Tim.
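
For readers who have not changed Python's default encoding, here is a minimal sketch of logging a faulty title without going through the shell at all, so the encoding is stated explicitly. It assumes Python 3 and a Unicode title; the function name log_faulty_title and its path parameter are hypothetical and are not part of ArticleParser.py.

```python
import io


def log_faulty_title(title, path="fault_articles.txt"):
    """Append one article title to the log file, encoded as UTF-8.

    Hypothetical helper: writing the file directly avoids the shell
    'echo' call and any dependence on the interpreter's default encoding.
    """
    with io.open(path, "a", encoding="utf-8") as log:
        log.write(title + u"\n")
```

Writing the file directly also sidesteps shell quoting problems with titles that contain quotes or other special characters.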
David Reyes Samblas Martinez
http://www.tuxbrain.com
Open ultraportable & embedded solutions
Openmoko, Openpandora,  Arduino
Hey, watch out!!! There's a linux in your pocket!!!




2009/11/30 Tim Besard <tim.besard at gmail.com>:
> Hi,
>
> It seems that the Dutch Wikipedia contains some non-ASCII characters,
> which crash the parser after all because of the "system echo" in the
> exception handler. Changing the offending line to
>    os.system('echo \"%s\" >> fault_articles.txt' % title.encode("utf8"))
> fixes the issue.
>
> Tim
>
> On Monday 30-11-2009 at 14:49 [timezone +0100], David Reyes
> Samblas Martinez wrote:
>> Here you go :)
>>
>>
>> 2009/11/30 Tilman Baumann <tilman at baumann.name>:
>> > Hi,
>> >
>> > Can you maybe release this as a patch?
>> > I would like to integrate this on GitHub, but I fear I might miss
>> > something if I try to pick out the changes by hand.
>> >
>> > Thanks
>> >
>> > David Reyes Samblas Martinez wrote:
>> >> Sorry for the wait, Thomas.
>> >> I was working on the broken pipe issue that stops the parser
>> >> when it finds an error. I have applied a quick and dirty workaround
>> >> using try/except, so the process no longer stops: it just skips
>> >> the faulty article and keeps going :) It logs the faulty ones in
>> >> a text file (title and position) for later forensics. My first
>> >> guess is that this is not a UTF-8 encoding issue but rather an
>> >> unexpected formatting tag that the PHP parser doesn't know how to
>> >> deal with.
>> >> I am currently parsing the German Wikipedia, with more than 1.3
>> >> million articles:
>> >>
>> >> Count: 1043000
>> >> Failing count: 2
>> >>
>> >> and it keeps going. I suppose we can sacrifice two articles to have
>> >> one million available now :)
>> >>
>> >> As you requested, I uploaded my working compiled tools [1], without
>> >> any XML sources; it's about 113 MB. If you already have working tools
>> >> on your system, you just have to replace
>> >> host-tools/offline-renderer/ArticleParser.py with the one attached to
>> >> this mail. Then you can forget about crying like a child whose ice
>> >> cream has fallen to the floor when, after more than 24 hours of
>> >> parsing and hundreds of thousands of articles processed, you see an
>> >> ugly Python error backtrace, blah blah blah, instead of your desired
>> >> file :)
>> >>
>> >> By the way, faultyarticles.txt is saved in the same
>> >> host-tools/offline-renderer directory (I was too lazy to add a
>> >> parameter to change that, so I hardcoded the name of the file.
>> >> Yes... don't waste typing on correcting that bad habit, I know).
>> >>
>> >> If you are curious about which articles on the German wiki are
>> >> causing trouble, in dewiki-latest-pages-articles.xml (dated
>> >> 2009-11-20):
>> >>
>> >> ~Storck Bicycle
>> >> 832673
>> >> ~Musculus serratus posterior inferior
>> >> 857334
>> >>
>> >> I hope to upload the German Wikipedia on Sunday, so it should be
>> >> available on Monday. Sorry for the wait, but my asymmetric DSL
>> >> is very asymmetric, and uploading 1.5-2 GB (the expected file size)
>> >> will take a bunch of hours.
>> >>
>> >> For those who want to compile their own, go for it :) The
>> >> Quickreference in the doc directory of the source is all you need to
>> >> start working. Just remember that if you have a 64-bit system, you
>> >> will have to follow the 64-bit method to compile the tools.
>> >>
>> >> Regards
>> >> [1]http://tuxbrain.org/downloads/wikireader/wikireaderbinaries20091127_dsamblas_modified_trycatch.tar.bz2
>> >>
>> >> 2009/11/27 Thomas HOCEDEZ <thomas.hocedez at free.fr>:
>> >>> Thomas HOCEDEZ wrote:
>> >>>>
>> >>>> Hi David,
>> >>>>
>> >>>> Can you share your scripts & configs to do the same in French (and
>> >>>> other languages)?
>> >>>> Thanks
>> >>>>
>> >>>> Thomas
>> >>>>
>> >>>>
>> >>>
>> >>> As the mailing list seems to be broken (or users have started
>> >>> hibernating for winter...), I found out by myself how to compile
>> >>> things step by step.
>> >>> I'm now rendering the French Wikipedia. As it started a few minutes
>> >>> ago, the result will be available during the weekend (I hope).
>> >>>
>> >>> I'll also post how I managed to do so! (I'm at the office for now,
>> >>> and I'm leaving...)
>> >>>
>> >>> Regards to you all !
>> >>>
>> >>> Thomas
>> >>>
>> >> _______________________________________________
>> >> Openmoko community mailing list
>> >> community at lists.openmoko.org
>> >> http://lists.openmoko.org/mailman/listinfo/community
>> >>
>
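
The skip-and-log workaround described in the quoted thread can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in the attached ArticleParser.py: it assumes Python 3, and the names parse_all, parse_article, articles, and log_path are hypothetical. The log layout (title prefixed with "~", then the position on the next line) mirrors the two faulty entries listed in the thread.

```python
def parse_all(articles, parse_article, log_path="faultyarticles.txt"):
    """Parse every article, skipping and logging the ones that fail.

    articles      -- iterable of (title, text) pairs (hypothetical input shape)
    parse_article -- callable that raises an exception on a faulty article
    Returns (count, failing): articles seen and articles skipped.
    """
    count = 0
    failing = 0
    with open(log_path, "a", encoding="utf-8") as log:
        for position, (title, text) in enumerate(articles):
            count += 1
            try:
                parse_article(title, text)
            except Exception:
                # Record title and position for later inspection,
                # then carry on with the next article.
                failing += 1
                log.write("~%s\n%d\n" % (title, position))
    return count, failing
```

Catching every exception is deliberately blunt: it trades precise error handling for a run that survives a 24-hour parse, which matches the "quick and dirty" intent described in the thread.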


