[wikireader] Error parsing the Spanish Wikipedia

David Reyes Samblas Martinez david at tuxbrain.com
Thu Oct 29 21:50:43 CET 2009


Hi, I'm trying to generate the files for a Spanish Wikipedia on the WR.
After successfully compiling the source from git, I had to work around
some annoyances with UTF-8 encoding in Python. The error was something
like this:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in
position....: ordinal not in range(128)
This was solved by changing the default encoding from "ascii" to "utf8"
in the /usr/lib/python2.6/site.py file.
After this I was able to run the following command without problems:
make DESTDIR=image WORKDIR=work
XML_FILES=xml-file-samples/eswiki-latest-pages-articles.xml index
parse render combine
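
As an alternative to editing site.py, a UnicodeDecodeError like the one
above can usually be avoided by decoding byte strings explicitly as UTF-8
at the point where they enter the program. The snippet below is only a
sketch of that idea: the helper name to_text and the sample string are made
up for illustration and are not taken from the WikiReader tools.

```python
# Sample UTF-8 bytes: the "ñ" encodes as 0xc3 0xb1, so the data contains
# the same 0xc3 byte that the error message complains about.
raw = u"Espa\u00f1a".encode("utf-8")


def to_text(data, encoding="utf-8"):
    """Hypothetical helper: decode bytes explicitly, pass text through unchanged."""
    if isinstance(data, bytes):
        return data.decode(encoding)
    return data


# Decoding with the "ascii" codec fails on any byte >= 0x80.
try:
    raw.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding with the codec the data was actually written in succeeds.
text = to_text(raw)
```

Decoding at the boundary keeps the fix inside the script, so it does not
depend on a modified system-wide Python installation.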

Everything seemed fine for the first few hours (about 6-7 h) of parsing
the 700000 articles in Spanish, but then... the horror:
Count: 380000
Traceback (most recent call last):
  File "./ArticleParser.py", line 224, in <module>
    main()
  File "./ArticleParser.py", line 172, in main
    process_article_text(title.encode('utf-8'),  f.read(length), newf)
  File "./ArticleParser.py", line 218, in process_article_text
    newf.write(text + '\n')
IOError: [Errno 32] Broken pipe
make[1]: *** [parse] Error 1
make[1]: Leaving directory
`/OE/Proyectos/tuxbrain/productos/wikireader/wikireader/host-tools/offline-renderer'
make: *** [parse] Error 2
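
Errno 32 (EPIPE) means that a write went to a pipe whose reading end had
already been closed, so the traceback above suggests that whatever process
was consuming the parser's output had exited first. The sketch below
reproduces that situation in isolation and shows one way to report the
reader's exit status. It assumes a POSIX system; the function name feed
and the deliberately failing child command are hypothetical and unrelated
to ArticleParser.py.

```python
import errno
import subprocess
import sys


def feed(lines, argv):
    """Write lines to a child's stdin; return (lines_written, child_exit_status)."""
    # bufsize=0 gives an unbuffered pipe, so each write reaches the OS at once.
    proc = subprocess.Popen(argv, stdin=subprocess.PIPE, bufsize=0)
    written = 0
    try:
        for line in lines:
            proc.stdin.write(line)
            written += 1
    except (IOError, OSError) as exc:
        # EPIPE: the reader has gone away. Anything else is re-raised.
        if exc.errno != errno.EPIPE:
            raise
    finally:
        try:
            proc.stdin.close()
        except (IOError, OSError):
            pass
    return written, proc.wait()


# Hypothetical reader that exits with status 3 without reading its input.
child = [sys.executable, "-c", "import sys; sys.exit(3)"]
# About 100 MB of input, far more than a pipe buffer can hold.
data = (b"x" * 1023 + b"\n" for _ in range(100000))
written, status = feed(data, child)
```

A non-zero status here points at the reading process, which is the place
to look for the real cause (for example, its own error output).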

I have relaunched the process in the (faint) hope that it was a
temporary fault, but if anyone has a clue, it would be helpful.

BTW, I am documenting this whole process to write a step-by-step howto on
how to put Wikipedia in other languages on the WikiReader.



David Reyes Samblas Martinez
http://www.tuxbrain.com
Open ultraportable & embedded solutions
Openmoko, Openpandora,  Arduino
Hey, watch out!!! There's a linux in your pocket!!!


