[Wikireader] Error on processing the German Wikipedia

David Reyes Samblas Martinez david at tuxbrain.com
Fri Nov 20 18:25:57 CET 2009


yes :(
David Reyes Samblas Martinez
http://www.tuxbrain.com
Open ultraportable & embedded solutions
Openmoko, Openpandora,  Arduino
Hey, watch out!!! There's a linux in your pocket!!!




2009/11/20 Tilman Baumann <tilman at baumann.name>:
>
> David Reyes Samblas Martinez wrote:
>> Don't hold your breath :( It is failing at Count: 832000
>
> Same error as mine?
>
>> 2009/11/20 Tilman Baumann <tilman at baumann.name>:
>>>
>>> David Reyes Samblas Martinez wrote:
>>>> Well, the Spanish one gave me the same error before, but now it works.
>>> Any idea what solved it? Or is it just random, and will it go away if I
>>> try it again? :)
>>>
>>>> I'm parsing the German Wikipedia right now (Count: 173000); let's see
>>>> what happens :)
>>>
>>> I would definitely be interested in the results...
>>>
>>>> Note: I'm parsing the 2009-Nov-11 dump:
>>>> http://download.wikipedia.org/dewiki/latest/dewiki-latest-pages-articles.xml.bz2
>>>>
>>>> Regards
>>>>
>>>> 2009/11/20 Tilman Baumann <tilman at baumann.name>:
>>>>> Can you reproduce this with a neutral locale?
>>>>>  export LC_ALL=C
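A minimal sketch of applying the neutral locale to a single command instead of exporting it for the whole shell. It assumes Python 3 on a POSIX host; the helper name `run_with_c_locale` is hypothetical and not part of the WikiReader tools.

```python
import os
import subprocess
import sys


def run_with_c_locale(command):
    """Run command with LC_ALL=C and return its standard output as text."""
    env = dict(os.environ)   # copy, so the parent's own locale is untouched
    env["LC_ALL"] = "C"      # LC_ALL overrides LANG and the other LC_* variables
    result = subprocess.run(command, env=env, stdout=subprocess.PIPE,
                            universal_newlines=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Stand-in child that prints the locale setting it actually receives.
    child = [sys.executable, "-c", "import os; print(os.environ['LC_ALL'])"]
    print(run_with_c_locale(child))
```

With the locale forced this way, tools such as awk and make should print their messages in English, which makes the logs easier to compare between machines.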
>>>>>
>>>>> I'm trying the same at the moment. I had a lot of hiccups, caused by
>>>>> many things, among them missing tools and not enough memory.
>>>>>
>>>>> This is where I'm currently stuck with the German Wikipedia:
>>>>>
>>>>> Count: 823000
>>>>> Count: 824000
>>>>> Count: 825000
>>>>> Count: 826000
>>>>> Count: 827000
>>>>> Count: 828000
>>>>> Count: 829000
>>>>> Count: 830000
>>>>> Count: 831000
>>>>> Count: 832000
>>>>> Count: 833000
>>>>> Traceback (most recent call last):
>>>>>  File "./ArticleParser.py", line 203, in <module>
>>>>>    main()
>>>>>  File "./ArticleParser.py", line 168, in main
>>>>>    process_article_text(title.encode('utf-8'),  f.read(length), newf)
>>>>>  File "./ArticleParser.py", line 197, in process_article_text
>>>>>    newf.write(text + '\n')
>>>>> IOError: [Errno 32] Broken pipe
>>>>> make[1]: *** [parse] Error 1
>>>>> make[1]: Leaving directory `/home/tilli/wikireader/host-tools/offline-renderer'
>>>>> make: *** [parse] Error 2
>>>>>
>>>>> I suppose it failed somewhere in PARSER_COMMAND
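Errno 32 only says that the process reading the pipe has gone away, so the real failure is in that reader. Below is a minimal sketch, assuming Python 3 on a POSIX host, of a writer that catches the broken pipe and reports the reader's exit status. The helper `feed` and the stand-in child command are hypothetical; this is not how ArticleParser.py is written.

```python
import subprocess
import sys


def feed(command, chunks):
    """Write byte chunks to command's stdin.

    Returns (chunks_written, exit_status) so the caller can see how far the
    writer got and why the reader stopped.
    """
    proc = subprocess.Popen(command, stdin=subprocess.PIPE)
    written = 0
    try:
        for chunk in chunks:
            proc.stdin.write(chunk)
            proc.stdin.flush()
            written += 1
    except BrokenPipeError:
        pass  # the reader exited; its exit status is collected below
    finally:
        try:
            proc.stdin.close()  # may try to flush leftover data and fail again
        except BrokenPipeError:
            pass
    return written, proc.wait()


if __name__ == "__main__":
    # Stand-in reader that dies immediately with status 3 without reading.
    reader = [sys.executable, "-c", "import sys; sys.exit(3)"]
    data = (b"x" * 65536 for _ in range(1000))
    count, status = feed(reader, data)
    print("chunks written:", count, "reader exit status:", status)
```

A non-zero exit status from the reader points at the command on the other end of the pipe, which is where any error output should be looked for.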
>>>>>
>>>>>
>>>>> Before that, the following steps went through without failing:
>>>>> make
>>>>> make DESTDIR=image WORKDIR=work XML_FILES=dewiki-20091028-pages-articles.xml index
>>>>>
>>>>>
>>>>> David Reyes Samblas Martinez wrote:
>>>>>> After the "success" of the Spanish Wikipedia (the indexing part is
>>>>>> still to be resolved), I started working on the German Wikipedia:
>>>>>> http://download.wikipedia.org/dewiki/latest/dewiki-latest-pages-meta-current.xml.bz2
>>>>>>
>>>>>> but it fails at the first step with the following error:
>>>>>>
>>>>>> #make DESTDIR=image WORKDIR=work XML_FILES=dewiki-latest-pages-meta-current.xml index parse render combine
>>>>>>
>>>>>> awk: cmd. line:1: fatal: cannot open file `work/counts.text' for reading (No such file or directory)
>>>>>> cd host-tools/offline-renderer && make index \
>>>>>> XML_FILES="/OE/Proyectos/tuxbrain/productos/wikireader/wikireader/dewiki-latest-pages-meta-current.xml" RENDER_BLOCK="0" \
>>>>>> WORKDIR="/OE/Proyectos/tuxbrain/productos/wikireader/wikireader/work" DESTDIR="/OE/Proyectos/tuxbrain/productos/wikireader/wikireader/image"
>>>>>> make[1]: Entering directory `/OE/Proyectos/tuxbrain/productos/wikireader/wikireader/host-tools/offline-renderer'
>>>>>> ./ArticleIndex.py \
>>>>>> --article-index="/OE/Proyectos/tuxbrain/productos/wikireader/wikireader/work/articles.db" \
>>>>>> --article-offsets="/OE/Proyectos/tuxbrain/productos/wikireader/wikireader/work/offsets.db" \
>>>>>> --article-counts="/OE/Proyectos/tuxbrain/productos/wikireader/wikireader/work/counts.text" \
>>>>>> --prefix="/OE/Proyectos/tuxbrain/productos/wikireader/wikireader/image/pedia" /OE/Proyectos/tuxbrain/productos/wikireader/wikireader/dewiki-latest-pages-meta-current.xml
>>>>>> Traceback (most recent call last):
>>>>>>   File "./ArticleIndex.py", line 611, in <module>
>>>>>>     main()
>>>>>>   File "./ArticleIndex.py", line 172, in main
>>>>>>     limit = processor.process(f, limit)
>>>>>>   File "/OE/Proyectos/tuxbrain/productos/wikireader/wikireader/host-tools/offline-renderer/FileScanner.py", line 141, in process
>>>>>>     if '#' == body[0] and 'redirect' == body[1:9].lower():
>>>>>> IndexError: string index out of range
>>>>>> Flushing databases
>>>>>> Writing: files
>>>>>> Time: 0s
>>>>>> Writing: articles
>>>>>> Time: 0s
>>>>>> Writing: offsets
>>>>>> Time: 0s
>>>>>> Loading: articles
>>>>>> Time: 0s
>>>>>> Loading: offsets and files
>>>>>> Time: 0s
>>>>>> make[1]: *** [index] Error 1
>>>>>> make[1]: Leaving directory `/OE/Proyectos/tuxbrain/productos/wikireader/wikireader/host-tools/offline-renderer'
>>>>>> make: *** [index] Error 2
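The IndexError in the traceback above comes from `body[0]` being evaluated on an empty article body. Below is a minimal sketch of a check that tolerates empty text, assuming the intent is only to detect a leading `#redirect` in any letter case. The function name `is_redirect` is hypothetical, and this is not necessarily how FileScanner.py was fixed.

```python
def is_redirect(body):
    """Return True if the article text starts with '#redirect' (any case)."""
    # Slicing never raises on a short or empty string, unlike body[0],
    # so an empty body simply compares unequal and yields False.
    return body[:9].lower() == '#redirect'


if __name__ == "__main__":
    for sample in ['', '#', '#REDIRECT [[Berlin]]', 'Plain article text']:
        print(repr(sample), is_redirect(sample))
```

For non-empty text this gives the same answer as the original two-part comparison, since lowercasing leaves the leading `#` unchanged.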
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>>
>>>>>> _______________________________________________
>>>>>> Openmoko community mailing list
>>>>>> community at lists.openmoko.org
>>>>>> http://lists.openmoko.org/mailman/listinfo/community
>>>>>>


