Indexing large tar files for fast access using Python

I recently needed to get some data out of a large tar file, about 5 GB in size, that I didn’t want to extract because it contained many thousands of small files. Unfortunately, the tar format was not designed to be indexed, since it was meant for backups on magnetic tape (tar stands for "tape archive"). GNU tar can retrieve a single file, but it has to go through the whole archive each time, which was just too slow.

So I decided to write a little tool that indexes all the files inside the archive and writes that index to a separate file. Now I can access any file within the tar in just a second instead of 15 minutes. Introducing the tarindexer!
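To give a rough idea of how such an index can work, here is a minimal sketch using Python's standard tarfile module. It is not the tarindexer code itself. The function names (build_index, save_index, load_index, read_member) and the JSON index format are made up for illustration, and it assumes an uncompressed tar, since seeking to a byte offset only works when the archive is not gzipped or otherwise compressed.

```python
import json
import tarfile


def build_index(tar_path):
    """Scan the archive once and map each regular file's name
    to (byte offset of its data, size in bytes)."""
    index = {}
    # "r:" opens the tar without transparent decompression (assumption:
    # the archive is a plain, uncompressed tar).
    with tarfile.open(tar_path, "r:") as tar:
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def save_index(index, index_path):
    """Write the index to a separate file (JSON is an illustrative choice)."""
    with open(index_path, "w", encoding="utf-8") as f:
        json.dump(index, f)


def load_index(index_path):
    """Load an index previously written by save_index."""
    with open(index_path, "r", encoding="utf-8") as f:
        return json.load(f)


def read_member(tar_path, index, name):
    """Fetch one file's bytes by seeking straight to its data."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)
```

The slow full scan happens only once, in build_index. After that, read_member needs a single seek and read, no matter where in the archive the file sits.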

UPDATE: The project is now up on GitHub under the GPL v3:
https://github.com/devsnd/tarindexer
