Indexing large tar files for fast access using python

I recently needed to get some data out of a large tar file, about 5gb in size, that I didn’t want to extract, as it contained many thousands of small files. Unfortunately the tar format was not designed to be indexed, since it was meant for backups on magnetic tapes (tar stands for tape archive). The gnu tar has a command for retrieving single files, but it needs to go through the whole tar each time, which was just too slow.

So I decided to write a little tool, that would index all files inside the archive and write that index to another file. Now I can access each file within the tar in just a second, instead of 15 minutes. Introducing the tarindexer!

UPDATE: The project is now up on github under GPL v3
https://github.com/devsnd/tarindexer

For everyone who just wants to get on with their lives, here is the download:

Download: tarindexer.tar.gz

Usage:
create index file:
tarindexer -i tarfile.tar indexfile.index
lookup file using indexfile (prints file to stdout):
tarindexer -l tarfile.tar indexfile.index lookuppath

Now something on how it works; The tarindexer uses python’s tarfile module to crawl through the files (a.k.a. members, as they can be symlinks or folders as well). I looked inside the source or the tarfile module to find out where the current reading position was (TarInfo.offset_data). and writing that number along with the name and the filesize of that member into the index file. No magic here.

Later, when looking up a file, the program just tries to find the path inside the index file, finds out the position and size of that file, and then seeks to the position inside the tar.

But as I went along I noticed that the tarfile module ate all my RAM. It took only few seconds for the swap to burst… A short search on the internet revealed, that it could be fixed easily by throwing away the references to the members, e.g.:

Now the RAM usage was back to normal, I could finally access the data from the tar rapidly and everybody’s happy.

8 thoughts on “Indexing large tar files for fast access using python

  1. Exactly what I was looking for. The download link unfortunately doesn’t work but I think there is enough information in the article to reproduce. Thanks.

  2. Nice! Two questions, will it work for compressed files?? Also, could I hijack your code for use in my own project (license?). Thanks

  3. Looks like a great tool, but I am strugling to do the same for binary data.

    The script can decode ASCII data. Is there any option in Python to decode other type of binary data.

  4. Useful info. Fortunate me I found your website unintentionally, and
    I am shocked why this twist of fate did not took place
    in advance! I bookmarked it.

Leave a Reply

Your email address will not be published. Required fields are marked *