[wikireader] Error parsing the Spanish Wikipedia

Sean Moss-Pultz sean at openmoko.com
Fri Oct 30 00:20:51 CET 2009


David

We're working on exactly the same thing now :-)

I'll ask Chris to email the list once we get past it. I think the
problem is the mixture of encodings (Latin-1 and UTF-8) in the
Spanish Wikipedia and the way our code handles it. For some reason
Python's print sometimes falls back to ASCII, even after we
explicitly tell it to use UTF-8.
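
To illustrate the mixed-encoding problem, here is a minimal sketch and not
the code in ArticleParser.py. It assumes each title arrives as raw bytes in
either UTF-8 or Latin-1; the helper names decode_mixed and write_line are
made up for this example. It tries UTF-8 first, falls back to Latin-1, and
encodes explicitly on output so that no default codec is involved:

```python
import io


def decode_mixed(raw):
    """Decode bytes as UTF-8, falling back to Latin-1 (hypothetical helper)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a code point, so this cannot fail,
        # but it will mis-decode text that was in some third encoding.
        return raw.decode("latin-1")


def write_line(stream, text):
    """Write text to a binary stream as UTF-8, bypassing any default codec."""
    stream.write(text.encode("utf-8") + b"\n")


# The same title, stored in two different encodings.
utf8_title = u"Espa\u00f1a".encode("utf-8")      # b'Espa\xc3\xb1a'
latin1_title = u"Espa\u00f1a".encode("latin-1")  # b'Espa\xf1a'

out = io.BytesIO()
for raw in (utf8_title, latin1_title):
    write_line(out, decode_mixed(raw))
```

Both titles end up as identical UTF-8 lines in out. The fallback is only a
guess: a byte sequence that is valid UTF-8 by coincidence will be decoded as
UTF-8 even if it was really Latin-1.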

  -Sean


On Fri, Oct 30, 2009 at 4:50 AM, David Reyes Samblas Martinez
<david at tuxbrain.com> wrote:
>
> Hi, I'm trying to generate the files for a Spanish Wikipedia on the WR,
> after successfully compiling the source from git and solving some
> annoyances with UTF-8 encoding in Python. The error was something like this:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in
> position....: ordinal not in range(128)
> This was solved by changing the default encoding from "ascii" to "utf8" in
> the /usr/lib/python2.6/site.py file.
> After this I was able to run the following command successfully:
> make DESTDIR=image WORKDIR=work
> XML_FILES=xml-file-samples/eswiki-latest-pages-articles.xml index
> parse render combine
>
> Everything seemed fine for several hours (about 6-7 h) of parsing the
> 700000 articles in Spanish, but then ... the horror:
> Count: 380000
> Traceback (most recent call last):
>  File "./ArticleParser.py", line 224, in <module>
>    main()
>  File "./ArticleParser.py", line 172, in main
>    process_article_text(title.encode('utf-8'),  f.read(length), newf)
>  File "./ArticleParser.py", line 218, in process_article_text
>    newf.write(text + '\n')
> IOError: [Errno 32] Broken pipe
> make[1]: *** [parse] Error 1
> make[1]: se sale del directorio
> `/OE/Proyectos/tuxbrain/productos/wikireader/wikireader/host-tools/offline-renderer'
> make: *** [parse] Error 2
>
> I have relaunched the process in the (faint) hope that it was a
> temporary fault, but if anyone has a clue it would be helpful.
>
> BTW, I am documenting this whole process to write a step-by-step howto on
> how to put Wikipedia in other languages on the WikiReader.
>
>
>
> David Reyes Samblas Martinez
> http://www.tuxbrain.com
> Open ultraportable & embedded solutions
> Openmoko, Openpandora,  Arduino
> Hey, watch out!!! There's a linux in your pocket!!!
>
> _______________________________________________
> Openmoko community mailing list
> community at lists.openmoko.org
> http://lists.openmoko.org/mailman/listinfo/community
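
On the "IOError: [Errno 32] Broken pipe" in the traceback above: errno 32
(EPIPE) generally means the process reading the other end of the pipe has
gone away. The traceback does not show which reader exited or why. The
sketch below reproduces the symptom on a POSIX system, assuming a
hypothetical downstream reader that stops after one line; it is not the
WikiReader pipeline:

```python
import errno
import subprocess
import sys

# Hypothetical stand-in for a downstream stage: reads one line, then exits.
child = subprocess.Popen(
    [sys.executable, "-c", "import sys; sys.stdin.readline()"],
    stdin=subprocess.PIPE,
)

child.stdin.write(b"first article\n")
child.stdin.flush()
child.wait()  # the reader has exited; nobody is left on the pipe

pipe_broke = False
error_number = None
try:
    child.stdin.write(b"second article\n")
    child.stdin.flush()
except (IOError, OSError) as exc:
    # Python ignores SIGPIPE by default, so the failed write surfaces
    # as an exception carrying errno.EPIPE instead of killing the writer.
    pipe_broke = True
    error_number = exc.errno
finally:
    try:
        child.stdin.close()
    except (IOError, OSError):
        pass  # close() retries the flush and can fail the same way

print("pipe broke: %s, errno: %s" % (pipe_broke, error_number))
```

If the writer sees this error, the place to look is usually the exit status
and stderr of the reading process, not the writer itself.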


