[wikireader] Error parsing the Spanish Wikipedia

Sean Moss-Pultz sean at openmoko.com
Fri Oct 30 15:32:36 CET 2009


On Fri, Oct 30, 2009 at 4:50 AM, David Reyes Samblas Martinez
<david at tuxbrain.com> wrote:
> Hi, I'm trying to generate the files for the Spanish Wikipedia on the WR.
> After successfully compiling the source from git, I first had to sort out
> some annoyances with UTF-8 encoding in Python. The error was something like this:
> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in
> position....: ordinal not in range(128)
> I solved this by changing the default encoding from "ascii" to "utf8" in the
> /usr/lib/python2.6/site.py file.
> After that I was able to run this command:
> make DESTDIR=image WORKDIR=work
> XML_FILES=xml-file-samples/eswiki-latest-pages-articles.xml index
> parse render combine
>
> Everything seemed fine for several hours (about 6-7 h) while parsing the
> 700000 Spanish articles, but then ... the horror:
> Count: 380000
> Traceback (most recent call last):
>  File "./ArticleParser.py", line 224, in <module>
>    main()
>  File "./ArticleParser.py", line 172, in main
>    process_article_text(title.encode('utf-8'),  f.read(length), newf)
>  File "./ArticleParser.py", line 218, in process_article_text
>    newf.write(text + '\n')
> IOError: [Errno 32] Broken pipe
> make[1]: *** [parse] Error 1
> make[1]: Leaving directory
> `/OE/Proyectos/tuxbrain/productos/wikireader/wikireader/host-tools/offline-renderer'
> make: *** [parse] Error 2

OK that's fixed now. Chris already checked in the code. Our build
worked fine. We need to do a few more tweaks and then we can post a
(super) early test image. Give us until early this coming week.

  -Sean
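
A note on the UnicodeDecodeError quoted above: editing site.py changes the default encoding for every Python program on the machine. A narrower alternative is to decode byte strings explicitly as UTF-8 at the point where they are read. The snippet below is only a minimal sketch of that idea, not the change that was made to the WikiReader tools; the sample string and the variable names are made up for illustration.

```python
# -*- coding: utf-8 -*-
# Hypothetical sample: the UTF-8 bytes for "España". The first byte of the
# encoded "ñ" is 0xc3, the same byte reported in the quoted error.
raw = b'Espa\xc3\xb1a'

# Decoding as ASCII fails on any byte >= 0x80.
try:
    raw.decode('ascii')
    ascii_ok = True
except UnicodeDecodeError:
    ascii_ok = False

# Decoding explicitly as UTF-8 works without touching the global default.
text = raw.decode('utf-8')

# Encode explicitly again before writing bytes back out.
round_trip = text.encode('utf-8')
```

Decoding and encoding at the boundaries like this keeps the behaviour the same on machines where site.py has not been modified.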



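On the "Broken pipe" error in the traceback: errno 32 (EPIPE) is what a process gets when it writes to a pipe whose reading end has already been closed, so the traceback suggests that whatever was consuming the parser's output had exited first. The sketch below reproduces that condition on a POSIX system with a bare os.pipe(). It is not the fix that was checked in, and write_line is a hypothetical helper used only for illustration.

```python
import errno
import os


def write_line(fd, data):
    """Write data to fd; return False if the reading end is gone (EPIPE)."""
    try:
        os.write(fd, data)
        return True
    except (IOError, OSError) as e:
        if e.errno == errno.EPIPE:
            return False
        raise  # any other error is unexpected here


# Create a pipe and close the reading end right away, which stands in for
# a reader process that has already exited.
r, w = os.pipe()
os.close(r)

# Python ignores SIGPIPE by default, so the failed write surfaces as an
# exception with errno 32 instead of killing the process.
reader_gone = not write_line(w, b'text\n')
os.close(w)
```

Seeing this error usually means the process on the other end of the pipe is the one to investigate, since the writer is only reporting that its reader disappeared.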