[wikireader] Error parsing the Spanish Wikipedia

David Reyes Samblas Martinez david at tuxbrain.com
Fri Oct 30 00:54:50 CET 2009


Great! :) Good to see you are working on this! Please count on me for
any testing that needs doing. I will try to take a look at the code myself
to kill the bug, but I have neither the time nor the expertise, so no promises :P
David Reyes Samblas Martinez
http://www.tuxbrain.com
Open ultraportable & embedded solutions
Openmoko, Openpandora,  Arduino
Hey, watch out!!! There's a linux in your pocket!!!
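
For reference, here is a minimal sketch of one way to cope with the mixed
Latin-1/UTF-8 input that Sean describes in the quoted message, without
editing site.py. It is not the WikiReader code: decode_mixed and
make_utf8_writer are hypothetical helper names, and the sketch targets a
current Python 3 rather than the Python 2.6 used in this thread (there,
wrapping the stream with codecs.getwriter('utf-8') is the closest
equivalent). The fallback is a heuristic: Latin-1 bytes that happen to form
valid UTF-8 will still be read as UTF-8.

```python
import io


def decode_mixed(raw):
    """Decode bytes as UTF-8, falling back to Latin-1 (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this never raises.
        return raw.decode("latin-1")


def make_utf8_writer(binary_stream):
    """Wrap a binary stream so text is always written as UTF-8,
    whatever the interpreter's default encoding is (hypothetical helper)."""
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


if __name__ == "__main__":
    buf = io.BytesIO()
    out = make_utf8_writer(buf)
    out.write(decode_mixed(b"Espa\xc3\xb1a") + "\n")  # UTF-8 input
    out.write(decode_mixed(b"Espa\xf1a") + "\n")      # Latin-1 input
    out.flush()
    print(buf.getvalue())
```

Choosing the encoding at the point where the stream is opened keeps the
behaviour the same on every machine, which a change to site.py does not.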




2009/10/30 Sean Moss-Pultz <sean at openmoko.com>:
> David
>
> We're working on exactly the same thing now :-)
>
> I'll ask Chris to email the list once we get past it. I think the
> problem is with the mixture of different encodings (Latin-1 and
> UTF-8) in the Spanish Wikipedia and the way our code is handling this.
> For some reason Python's print (at times) wants to default to ASCII,
> even after we explicitly tell it to use UTF-8.
>
>  -Sean
>
>
> On Fri, Oct 30, 2009 at 4:50 AM, David Reyes Samblas Martinez
> <david at tuxbrain.com> wrote:
>>
>> Hi, I'm trying to generate the files for a Spanish Wikipedia on the WR.
>> After successfully compiling the source from git and solving some
>> annoyances with UTF-8 encoding in Python, I hit an error something like this:
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in
>> position....: ordinal not in range(128)
>> This was solved by changing the default encoding from "ascii" to "utf8" in the
>> /usr/lib/python2.6/site.py file.
>> After this I was able to run the following command without problems:
>> make DESTDIR=image WORKDIR=work
>> XML_FILES=xml-file-samples/eswiki-latest-pages-articles.xml index
>> parse render combine
>>
>> Everything seemed fine for a few hours (about 6-7 h) of parsing the
>> 700000 articles in Spanish, but then ... the horror
>> Count: 380000
>> Traceback (most recent call last):
>>  File "./ArticleParser.py", line 224, in <module>
>>    main()
>>  File "./ArticleParser.py", line 172, in main
>>    process_article_text(title.encode('utf-8'),  f.read(length), newf)
>>  File "./ArticleParser.py", line 218, in process_article_text
>>    newf.write(text + '\n')
>> IOError: [Errno 32] Broken pipe
>> make[1]: *** [parse] Error 1
>> make[1]: se sale del directorio
>> `/OE/Proyectos/tuxbrain/productos/wikireader/wikireader/host-tools/offline-renderer'
>> make: *** [parse] Error 2
>>
>> I have relaunched the process in the (faint) hope that it was a
>> temporary fault, but if anyone has a clue it would be helpful.
>>
>> BTW, I am documenting this whole process to make a step-by-step howto on
>> how to put Wikipedia in other languages on the WikiReader.
>>
>>
>>
>> David Reyes Samblas Martinez
>> http://www.tuxbrain.com
>> Open ultraportable & embedded solutions
>> Openmoko, Openpandora,  Arduino
>> Hey, watch out!!! There's a linux in your pocket!!!
>>
>> _______________________________________________
>> Openmoko community mailing list
>> community at lists.openmoko.org
>> http://lists.openmoko.org/mailman/listinfo/community
>
>
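
On the "IOError: [Errno 32] Broken pipe" in the traceback above: that errno
is raised when a process writes to a pipe whose reading end has already been
closed, so the process on the other side of newf had most likely exited
before the parser did. The sketch below only illustrates how to surface that
condition. It assumes newf is a pipe to a child process (the thread does not
confirm this), feed_lines is a hypothetical name, and the code targets
Python 3 on a POSIX system.

```python
import subprocess
import sys


def feed_lines(cmd, lines):
    """Write lines to cmd's stdin; return (lines_written, exit_code).

    Hypothetical helper: stops quietly when the child closes its end of
    the pipe, so the caller can inspect the child's exit status.
    """
    proc = subprocess.Popen(cmd, stdin=subprocess.PIPE)
    written = 0
    try:
        for line in lines:
            proc.stdin.write(line.encode("utf-8") + b"\n")
            proc.stdin.flush()
            written += 1
    except BrokenPipeError:
        # The reader went away; the real cause is in the child process.
        pass
    finally:
        try:
            proc.stdin.close()
        except BrokenPipeError:
            pass
    return written, proc.wait()


if __name__ == "__main__":
    # A child that exits at once stands in for a crashed consumer.
    child = [sys.executable, "-c", "import sys; sys.exit(3)"]
    count, status = feed_lines(child, ("x" * 100 for _ in range(100000)))
    print("wrote %d lines, child exited with status %d" % (count, status))
```

A non-zero exit status from the reader points at where to look next, since
the traceback from the writing side only shows the symptom.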


