New archive file format (was: [omgps] collect feature requests)

Wed Jul 1 23:20:40 CEST 2009

> I dont want to compress at all. The 118MB for me is perfect. I only
> want to pack the directory into a file. But not compressing.
> Im thinking about tar or ar.

Hi!

I have studied all the available archive and compression options.
Most notably tar[1][2][4][6] and zip file format [3].
They are the most common archive types. I read also ar (dpkg
and ipkg uses it) and cpio format. So I did my homework, and
made some researches.

Our requirements:
- no compression (no wasted cpu time)
- random access (no slow waiting time and memory issue)
- readily available module/library for easy of integrating
  (best: no additional package is required to install on the phone)

Tar completely fail at random access, simply it lacks the
table of content, so accessing the last file in the archive
requires reading the whole content before it.

Zip support accessing each files in the archive, although
it compress the file by default.

There are dar[5] and xar[7], which meets our random access
criteria. However dar needs to be ported to the device, and
xar is still in development (that means limited python support
for example).

So I wrote down the most dumb archive fileformat ever;)
When I wrote the specification, I only had one goal:
make it so simple, that everybody can implement it,
so no need to wait for ready-made library.

It is called KISS fileformat (keep it simple and stupid),
the preferred extension would be filename.kiss

You can read it here, I also included it (at the end of mail)
 for reference:
http://pastebin.com/m608acaeb

I think it is suitable for our map tile usage.

What do you think?

Best regards,
 Laszlo

[1]: http://en.wikipedia.org/wiki/Tar_(file_format)
[2]: http://www.python.org/doc/2.5.2/lib/module-tarfile.html
[3]: http://www.python.org/doc/2.5.2/lib/module-zipfile.html
[4]: http://en.wikipedia.org/wiki/Comparison_of_file_archivers
[5]: http://en.wikipedia.org/wiki/DAR_(Disk_Archiver)
[6]: http://en.wikipedia.org/wiki/Archive_formats
[7]: http://code.google.com/p/xar/

KISS archive fileformat specification:

# KISS archive format (Keep It Simple and Stupid)

## General properties
- blocksize: 512 bytes
- only store filename (and directory if any) and content
- first file contains the filenames
- header: start block, end block, position of last block

## Overall file structure
[header][filenames][1. file][2. file][3. file]

## [header]
[SB][EB][POS] [SB][EB][POS] [SB][EB][POS] etc..
[ 4][ 4][  2] [ 4][ 4][  2] [ 4][ 4][  2] etc..
[   header  ] [ filenames ] [1. file    ] etc..

SB (start block): 4 byte
EB (end block): 4 byte
POS (position of last block): 2 byte

All numbers are stored big-endian. That means most significant bit first.
Example:
613 dec = 265 hex = \00 \00 \02 \65 (4 bytes)
130411 dec = 1FD6B hex = \00 \01 \FD \6B (4 bytes)

Note:
The remaining part of the header block MUST be filled with zero bytes.
You will always have remaining part in the block, simply each file
takes 10 bytes. (512/10 = 51 and 2 bytes left)

## [filenames]
UTF-8 text for each filename, delimited with '\n' byte.
The directory structure is preserved too.
[name of 1. file]['\n'][name of 2. file]['\n'][name of 3. file] etc..

Some examples:
this is a file.txt
this2.tar.gz
this3.html
images/loller.html
weird_dir/this\/files contains\/several\\ slashes.txt

Special characters:
'\n': You cant have '\n' character in the filename. It is preserved.
      (it is not supported in most filesystems anyway)
'/': directory delimiter. To save directory structure.
'\/': if the filename itself contains an / character
'\\': if the filename itself contains a \ character

## [X. file]
The file content as is.

## FAQ:
Q: Why another archive format?
A: Because it is the most dumb format ever;)

Q: Why not tar, ar, zip, [name archive type here]?
A: Short answer: widely used archive format are not suited for random access
                 with no compression.
   Long answer: tar: there is no index, reading the last file of the archive
                     requires reading the whole file before it.
                zip: individual files are compressed, which means: processortime
                xar: it would fit the requirements, but it is not widely
                     supported, and not in every language.

Q: I use X language does KISS supported there?
A: The fileformat is so simple, it is intented, every programmer
   could implement it in "no time".

Q: Does compression supported?
A: No. But you can compress the whole file,
   just like in tar case: filename.kiss.bz2. Use it for file sharing.

Q: Do advanced features (rights, symlinks, hardlinks, user/group/other) are
   preserved?
A: No. It was not the goal of this archive. Although you can implement it, just
   write those informations in the first file. It is not recommended.

Q: If the original file is not multiple of 512 bytes, how it will look in the
   archive, how many bytes will it take?
A: Lets have an example. We have three files:
   768bytes file, 1024 bytes, 2047 bytes
   First file (768 bytes) will take two blocks: 2*512 = 1024 bytes
   Second file (1024 bytes) will take two blocks too: 2*512 = 1024 bytes
   Third file (2047 bytes) will take four blocks: 4*512 = 2048 bytes
   Lets name the files:
            - "first filename.extension", (24 bytes)
            - "second try", (10 bytes)
            - "I want a sexy name.txt", (22 bytes)
   The [filenames] section:
   [24 bytes][1 byte][10 bytes][1 byte][22 bytes] = 58bytes = 1 block

   The header ([SB][EB][POS]):
   [00 00 00 00][00 00 00 00][00 31] (this is the header itself, start at the
                                      0. block, ends at 0. block, and
                                      the header is 50 bytes long. That means
                                      start at the 0. byte and
                                      ends at the at the 49. byte.
                                      (0..49 = 50 bytes, which is 0x31))
   [00 00 00 01][00 00 00 01][00 39] (58 byte long, that means 0..57 bytes
                                      and 57 dec = 0x39 hexa)
   [00 00 00 02][00 00 00 03][00 FF]  (768-512 = 256, so 0..255 bytes.
                                       255dec = FF hexa)
   [00 00 00 04][00 00 00 05][01 FF]  (1024 bytes = 2 blocks, the
second is full)
   [00 00 00 06][00 00 00 09][01 FE]  ( 2047-(3*512) = 511. 510dec = 1FE hexa)

   The overall filesize:
   [header][filenames][1. file][2. file][3. file]
   [ 1    ][  1      ][  2    ][   2   ][   4   ] = 10 blocks = 5120B = 5kB

Q: How is it filled the unused part of the block (if the file is
   not multiple of 512 bytes) ?
A: It can be random bytes. But should be zero bytes. Or checksum if there is
   enough space left (see section "An insane idea for checksums").

## Implementation advices

1. Count the files what you want to archive -> you know how much
   space is required by the header. 1-49 files requires one block for the header
   (10 bytes for the header, 10 bytes for the filenames section, x*10 bytes
   for the files itself. Maximum x is 49 for one block)
2. dump 0xFF for the header (look at the "tape archiving" to understand why FF)
3. Generate the filenames section and write the filenames section.
4. Attache each file to the archive, and generate the header
   on-the-fly (in memory)
4. Overwrite the header with valid data.

## An insane idea for checksums (integrity checking)

Here is the idea, write the checksum at the remaining space, if the
file is multiple of 512 byte, write two checksums at the next end of file,
if there is no enough space, write it at the next file, and so on.
If each file was multiple of 512 bytes, or there are not enough space
at the end of each files. There will be no checksum for some of the last files
(but it is always better then having no checksums at all).
Which should be rare, but if you are worried about it, you can always add
a new file with all the necessary informations.

This section is not mandatory for the fileformat. So if you are brave enough,
implement it! If you dont care, no worries.

CRC32 is 4 bytes (32 bits) long.

I think 4bytes should be safe enough;)
A little example code in python how to calculate it:
import binascii

def crc2hex(crc):
    res=''
    for i in range(4):
        t=crc & 0xFF
        crc >>= 8
        res='%02X%s' % (t, res)
    return res

if __name__=='__main__':
    test='hello world! and Python too ;)'
    crc=binascii.crc32(test)
    print 'CRC:', crc
    hex_str = crc2hex(crc)
    print 'CRC in hex:', hex_str
    print 'in byte representation: ', hex_str.decode("hex")

MD5sum is 16bytes long.

The CRC32 (4bytes) is recommended. It is enough to detect inconsistencies.

## An another insane idea for tape archiving

Who the hell use tapes these days?;)
So if the first block is filled with FF hexa, it means the header is at the
end of the archive file, the tailer is more right term;)
So when you archive the tape, you cant reverse and
write the header at the beginning of file.
In that case, the header (at the end of file) is in REVERSE order.
So the last 4 bytes tells where the header begins. So no need to search for it.
Simply read the last 4 bytes, determine where the header
begins(reverse order!),
and read those blocks. Reverse the byte orders, and thats way you can process
it normally.