New archive file format (was: [omgps] collect feature requests)

Thu Jul 2 09:15:22 CEST 2009

On Thu, Jul 2, 2009 at 8:42 AM, Alexander Shulgin<alex.shulgin at gmail.com> wrote:
> I fail to see how is this true for normal tar files (vs. data read
> from pipe).  Can you elaborate please?

Yepp, of course;)

A tar archive does not contain the byte positions of the files inside it.
That means accessing a file inside the archive requires reading through
everything before it and determining where each file ends (and you test
whether you are at the desired file by reading its header).

It simply lacks a TOC (table of contents).

So accessing the last file in the archive requires reading the whole archive.
You can read about it here:
http://en.wikipedia.org/wiki/Tar_(file_format)#Format_details

A simplified view of a tar archive:
[1. file header][1. file][2. file header][2. file][3. file header][3. file]

So how do you read the third file from the archive? You read the archive
up to the [3. file header], your test is successful (is it the right file?),
and you read the file itself. You see? You have read the whole archive
just to access the last item inside.
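
To make the walk concrete, here is a minimal sketch in Python. It is an illustration, not code from omgps. The member names and contents are made up for the demo, and the header offsets (name in bytes 0-100, octal size in bytes 124-136) are taken from the ustar layout described on the Wikipedia page above. On a seekable file the data blocks can be skipped with seek(), but as far as I can tell there is still no way around visiting every header before the wanted one:

```python
import io
import tarfile

BLOCK = 512  # tar stores headers and data in 512-byte blocks


def build_demo_tar():
    """Build a small uncompressed tar in memory (hypothetical demo content)."""
    buf = io.BytesIO()
    members = [("a.txt", b"A" * 10), ("b.txt", b"B" * 600), ("c.txt", b"C" * 5)]
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name, data in members:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def find_member(raw, wanted):
    """Walk the headers one by one; return (data, headers_visited)."""
    f = io.BytesIO(raw)
    visited = 0
    while True:
        header = f.read(BLOCK)
        if len(header) < BLOCK or header == b"\0" * BLOCK:
            raise KeyError(wanted)  # end of archive reached, nothing found
        visited += 1
        name = header[0:100].rstrip(b"\0").decode("ascii")
        size = int(header[124:136].rstrip(b"\0 ").decode("ascii"), 8)
        if name == wanted:
            return f.read(size), visited
        # Not the right file: skip its data, padded to a multiple of 512 bytes.
        f.seek((size + BLOCK - 1) // BLOCK * BLOCK, io.SEEK_CUR)


data, visited = find_member(build_demo_tar(), "c.txt")
print(len(data), "bytes found after visiting", visited, "headers")
```

The count of visited headers grows with the position of the file in the archive, which is the cost a TOC would avoid.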

>> Zip support accessing each files in the archive, although
>> it compress the file by default.
>
> Pardon my ignorance, but wouldn't zip -0 do the trick for your purpose? :)

It will do, more or less; however, there are three main problems with it:

1. You can only obtain a whole file from the archive, so you can't read
   just a part of it. If you packed, let's say, a 700MB file into the zip,
   you would run out of memory on the Neo. At least this is the case with
   the standard Python zipfile module.

2. There is no random access feature, at least not in the standard
   Python modules.

3. A significant amount of processor time is wasted when accessing a file
   (many computations are required). Btw, it needs to be benchmarked on
   the Neo to see how much worse it is.
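
For comparison, a small sketch of what the zipfile module offers when ZipFile.open() is available (it was added in Python 2.6; whether the Python on the Neo has it is an assumption I have not checked). The archive is written with ZIP_STORED, which corresponds to zip -0, and the member names are made up. The offset comes from the zip central directory, which plays the role of the TOC, and read(n) returns only the first n bytes instead of the whole member:

```python
import io
import zipfile


def build_stored_zip():
    """Build an uncompressed (stored) zip in memory with hypothetical members."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_STORED) as zf:
        zf.writestr("first.bin", b"x" * 1000)
        zf.writestr("last.bin", b"0123456789" * 100)
    return buf.getvalue()


def read_prefix(raw, name, nbytes):
    """Return (offset of the member's local header, first nbytes of its data)."""
    with zipfile.ZipFile(io.BytesIO(raw)) as zf:
        info = zf.getinfo(name)       # looked up in the central directory
        with zf.open(name) as member:  # file-like object, read in pieces
            return info.header_offset, member.read(nbytes)


offset, prefix = read_prefix(build_stored_zip(), "last.bin", 16)
print("header offset:", offset, "prefix:", prefix)
```

This only addresses reading a member piece by piece from its start; it says nothing about the processor time of point 3, which still needs measuring on the device.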

Best regards,
  Laszlo