New archive file format (was: [omgps] collect feature requests)

Thu Jul 2 00:48:07 CEST 2009

add another wow from here :o

2009/7/1 jeremy jozwik <jerjoz.forums at gmail.com>:
> wow
>
> On Wed, Jul 1, 2009 at 2:20 PM, Laszlo
> KREKACS<laszlo.krekacs.list at gmail.com> wrote:
>>> I dont want to compress at all. The 118MB for me is perfect. I only
>>> want to pack the directory into a file. But not compressing.
>>> Im thinking about tar or ar.
>>
>> Hi!
>>
>> I have studied all the available archive and compression options.
>> Most notably tar[1][2][4][6] and zip file format [3].
>> They are the most common archive types. I read also ar (dpkg
>> and ipkg uses it) and cpio format. So I did my homework, and
>> made some researches.
>>
>> Our requirements:
>> - no compression (no wasted cpu time)
>> - random access (no slow waiting time and memory issue)
>> - readily available module/library for easy of integrating
>>  (best: no additional package is required to install on the phone)
>>
>> Tar completely fail at random access, simply it lacks the
>> table of content, so accessing the last file in the archive
>> requires reading the whole content before it.
>>
>> Zip support accessing each files in the archive, although
>> it compress the file by default.
>>
>> There are dar[5] and xar[7], which meets our random access
>> criteria. However dar needs to be ported to the device, and
>> xar is still in development (that means limited python support
>> for example).
>>
>> So I wrote down the most dumb archive fileformat ever;)
>> When I wrote the specification, I only had one goal:
>> make it so simple, that everybody can implement it,
>> so no need to wait for ready-made library.
>>
>> It is called KISS fileformat (keep it simple and stupid),
>> the preferred extension would be filename.kiss
>>
>> You can read it here, I also included it (at the end of mail)
>>  for reference:
>> http://pastebin.com/m608acaeb
>>
>> I think it is suitable for our map tile usage.
>>
>> What do you think?
>>
>> Best regards,
>>  Laszlo
>>
>> [1]: http://en.wikipedia.org/wiki/Tar_(file_format)
>> [2]: http://www.python.org/doc/2.5.2/lib/module-tarfile.html
>> [3]: http://www.python.org/doc/2.5.2/lib/module-zipfile.html
>> [4]: http://en.wikipedia.org/wiki/Comparison_of_file_archivers
>> [5]: http://en.wikipedia.org/wiki/DAR_(Disk_Archiver)
>> [6]: http://en.wikipedia.org/wiki/Archive_formats
>> [7]: http://code.google.com/p/xar/
>>
>> KISS archive fileformat specification:
>>
>> # KISS archive format (Keep It Simple and Stupid)
>>
>> ## General properties
>> - blocksize: 512 bytes
>> - only store filename (and directory if any) and content
>> - first file contains the filenames
>> - header: start block, end block, position of last block
>>
>> ## Overall file structure
>> [header][filenames][1. file][2. file][3. file]
>>
>> ## [header]
>> [SB][EB][POS] [SB][EB][POS] [SB][EB][POS] etc..
>> [ 4][ 4][  2] [ 4][ 4][  2] [ 4][ 4][  2] etc..
>> [   header  ] [ filenames ] [1. file    ] etc..
>>
>> SB (start block): 4 byte
>> EB (end block): 4 byte
>> POS (position of last block): 2 byte
>>
>> All numbers are stored big-endian. That means most significant bit first.
>> Example:
>> 613 dec = 265 hex = \00 \00 \02 \65 (4 bytes)
>> 130411 dec = 1FD6B hex = \00 \01 \FD \6B (4 bytes)
>>
>> Note:
>> The remaining part of the header block MUST be filled with zero bytes.
>> You will always have remaining part in the block, simply each file
>> takes 10 bytes. (512/10 = 51 and 2 bytes left)
>>
>> ## [filenames]
>> UTF-8 text for each filename, delimited with '\n' byte.
>> The directory structure is preserved too.
>> [name of 1. file]['\n'][name of 2. file]['\n'][name of 3. file] etc..
>>
>> Some examples:
>> this is a file.txt
>> this2.tar.gz
>> this3.html
>> images/loller.html
>> weird_dir/this\/files contains\/several\\ slashes.txt
>>
>> Special characters:
>> '\n': You cant have '\n' character in the filename. It is preserved.
>>      (it is not supported in most filesystems anyway)
>> '/': directory delimiter. To save directory structure.
>> '\/': if the filename itself contains an / character
>> '\\': if the filename itself contains a \ character
>>
>>
>> ## [X. file]
>> The file content as is.
>>
>>
>> ## FAQ:
>> Q: Why another archive format?
>> A: Because it is the most dumb format ever;)
>>
>> Q: Why not tar, ar, zip, [name archive type here]?
>> A: Short answer: widely used archive format are not suited for random access
>>                 with no compression.
>>   Long answer: tar: there is no index, reading the last file of the archive
>>                     requires reading the whole file before it.
>>                zip: individual files are compressed, which means: processortime
>>                xar: it would fit the requirements, but it is not widely
>>                     supported, and not in every language.
>>
>> Q: I use X language does KISS supported there?
>> A: The fileformat is so simple, it is intented, every programmer
>>   could implement it in "no time".
>>
>> Q: Does compression supported?
>> A: No. But you can compress the whole file,
>>   just like in tar case: filename.kiss.bz2. Use it for file sharing.
>>
>> Q: Do advanced features (rights, symlinks, hardlinks, user/group/other) are
>>   preserved?
>> A: No. It was not the goal of this archive. Although you can implement it, just
>>   write those informations in the first file. It is not recommended.
>>
>> Q: If the original file is not multiple of 512 bytes, how it will look in the
>>   archive, how many bytes will it take?
>> A: Lets have an example. We have three files:
>>   768bytes file, 1024 bytes, 2047 bytes
>>   First file (768 bytes) will take two blocks: 2*512 = 1024 bytes
>>   Second file (1024 bytes) will take two blocks too: 2*512 = 1024 bytes
>>   Third file (2047 bytes) will take four blocks: 4*512 = 2048 bytes
>>   Lets name the files:
>>            - "first filename.extension", (24 bytes)
>>            - "second try", (10 bytes)
>>            - "I want a sexy name.txt", (22 bytes)
>>   The [filenames] section:
>>   [24 bytes][1 byte][10 bytes][1 byte][22 bytes] = 58bytes = 1 block
>>
>>   The header ([SB][EB][POS]):
>>   [00 00 00 00][00 00 00 00][00 31] (this is the header itself, start at the
>>                                      0. block, ends at 0. block, and
>>                                      the header is 50 bytes long. That means
>>                                      start at the 0. byte and
>>                                      ends at the at the 49. byte.
>>                                      (0..49 = 50 bytes, which is 0x31))
>>   [00 00 00 01][00 00 00 01][00 39] (58 byte long, that means 0..57 bytes
>>                                      and 57 dec = 0x39 hexa)
>>   [00 00 00 02][00 00 00 03][00 FF]  (768-512 = 256, so 0..255 bytes.
>>                                       255dec = FF hexa)
>>   [00 00 00 04][00 00 00 05][01 FF]  (1024 bytes = 2 blocks, the
>> second is full)
>>   [00 00 00 06][00 00 00 09][01 FE]  ( 2047-(3*512) = 511. 510dec = 1FE hexa)
>>
>>
>>   The overall filesize:
>>   [header][filenames][1. file][2. file][3. file]
>>   [ 1    ][  1      ][  2    ][   2   ][   4   ] = 10 blocks = 5120B = 5kB
>>
>> Q: How is it filled the unused part of the block (if the file is
>>   not multiple of 512 bytes) ?
>> A: It can be random bytes. But should be zero bytes. Or checksum if there is
>>   enough space left (see section "An insane idea for checksums").
>>
>> ## Implementation advices
>>
>> 1. Count the files what you want to archive -> you know how much
>>   space is required by the header. 1-49 files requires one block for the header
>>   (10 bytes for the header, 10 bytes for the filenames section, x*10 bytes
>>   for the files itself. Maximum x is 49 for one block)
>> 2. dump 0xFF for the header (look at the "tape archiving" to understand why FF)
>> 3. Generate the filenames section and write the filenames section.
>> 4. Attache each file to the archive, and generate the header
>>   on-the-fly (in memory)
>> 4. Overwrite the header with valid data.
>>
>> ## An insane idea for checksums (integrity checking)
>>
>> Here is the idea, write the checksum at the remaining space, if the
>> file is multiple of 512 byte, write two checksums at the next end of file,
>> if there is no enough space, write it at the next file, and so on.
>> If each file was multiple of 512 bytes, or there are not enough space
>> at the end of each files. There will be no checksum for some of the last files
>> (but it is always better then having no checksums at all).
>> Which should be rare, but if you are worried about it, you can always add
>> a new file with all the necessary informations.
>>
>> This section is not mandatory for the fileformat. So if you are brave enough,
>> implement it! If you dont care, no worries.
>>
>> CRC32 is 4 bytes (32 bits) long.
>>
>> I think 4bytes should be safe enough;)
>> A little example code in python how to calculate it:
>> import binascii
>>
>> def crc2hex(crc):
>>    res=''
>>    for i in range(4):
>>        t=crc & 0xFF
>>        crc >>= 8
>>        res='%02X%s' % (t, res)
>>    return res
>>
>> if __name__=='__main__':
>>    test='hello world! and Python too ;)'
>>    crc=binascii.crc32(test)
>>    print 'CRC:', crc
>>    hex_str = crc2hex(crc)
>>    print 'CRC in hex:', hex_str
>>    print 'in byte representation: ', hex_str.decode("hex")
>>
>> MD5sum is 16bytes long.
>>
>> The CRC32 (4bytes) is recommended. It is enough to detect inconsistencies.
>>
>>
>> ## An another insane idea for tape archiving
>>
>> Who the hell use tapes these days?;)
>> So if the first block is filled with FF hexa, it means the header is at the
>> end of the archive file, the tailer is more right term;)
>> So when you archive the tape, you cant reverse and
>> write the header at the beginning of file.
>> In that case, the header (at the end of file) is in REVERSE order.
>> So the last 4 bytes tells where the header begins. So no need to search for it.
>> Simply read the last 4 bytes, determine where the header
>> begins(reverse order!),
>> and read those blocks. Reverse the byte orders, and thats way you can process
>> it normally.
>>
>> _______________________________________________
>> Openmoko community mailing list
>> community at lists.openmoko.org
>> http://lists.openmoko.org/mailman/listinfo/community
>>
>
> _______________________________________________
> Openmoko community mailing list
> community at lists.openmoko.org
> http://lists.openmoko.org/mailman/listinfo/community
>


-- 
David Reyes Samblas Martinez
http://www.tuxbrain.com
Open ultraportable & embedded solutions
Openmoko, Openpandora,  Arduino
Hey, watch out!!! There's a linux in your pocket!!!