The Sanity Resort: Constant Tarfiles For Resumable Downloads

Our Situation

We have a Java-based backend server that provides daily changing content to mobile applications as gzipped tar archives. The content can change over the course of the day; some applications need to download the complete content in its new version, while others only need the diff between two versions. In addition, old content from previous days has to be available in all those variations too. That means it is not really feasible to have every possible tar file pregenerated and stored in our backend (yes, Hadoop and friends stand ready to help, but at the moment this is not an option for us for several reasons).

So what we do is store the different versions of our content, calculate the needed tar files on request, deliver them to the requesting app and put each file into a caching server to reduce load. Of course we cannot cache everything permanently, so the entries time out from time to time.

Our Problem

Some applications reported download issues that we at first blamed on general mobile network problems, but after a while it became evident that there was more to it. After some debugging on the application side it turned out that in some cases the app was not able to resume an interrupted download because the checksum of the file had changed.

The Underlying Cause

This issue was quite a surprise for us, as we were sure that for every kind of request the corresponding file was created from exactly the same content every time. But once we bypassed our cache, the effect could be reproduced immediately by downloading the same file twice from our server. So whenever a cache entry timed out between the first attempt and the resume, the app was served a different file. The mighty diff tool told me that the two archives differed even though their extracted contents did not (phew...).
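The "build it twice and compare" check is easy to automate. Here is a minimal sketch, assuming the archive generation can be wrapped in a `Supplier<byte[]>`; the class and method names are made up for illustration and are not part of our code base. It compares SHA-256 digests of two builds and, if they differ, reports the first differing byte offset.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.function.Supplier;

// Hypothetical helper, not taken from the original project.
final class ReproducibilityCheck {

    /** Returns the SHA-256 digest of the given bytes. */
    static byte[] sha256(byte[] data) {
        try {
            return MessageDigest.getInstance("SHA-256").digest(data);
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is one of the algorithms every Java platform has to provide.
            throw new IllegalStateException(e);
        }
    }

    /** Builds the archive twice and reports whether both builds are byte-identical. */
    static boolean isReproducible(Supplier<byte[]> archiveBuilder) {
        byte[] first = sha256(archiveBuilder.get());
        byte[] second = sha256(archiveBuilder.get());
        return MessageDigest.isEqual(first, second);
    }

    /** Returns the index of the first differing byte, or -1 if the arrays are equal. */
    static int firstDifference(byte[] a, byte[] b) {
        int common = Math.min(a.length, b.length);
        for (int i = 0; i < common; i++) {
            if (a[i] != b[i]) {
                return i;
            }
        }
        // One array is a prefix of the other: they differ where the shorter one ends.
        return a.length == b.length ? -1 : common;
    }
}
```

For a gzipped tar the offset is only meaningful on the uncompressed tar data, so decompress both files before calling `firstDifference`.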

My first suspect was gzip, and a quick Google search brought up some hints about the timestamp field in the gzip header, which makes up the first 10 bytes of an archive. So I fired up my hex editor to check the starting sequences of the two files, only to be disappointed because they were exactly the same. A look at the library's source code for GZipOutputStream (we use the Apache compressor lib) revealed that this implementation sets a fixed timestamp for all gzip archives (all fields are set to 0, which I think is the right thing to do).
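The same check can be done without a hex editor. In the gzip format the timestamp (MTIME) sits in bytes 4 to 7 of the header, stored little-endian. Below is a minimal sketch that reads that field; it uses the JDK's `java.util.zip.GZIPOutputStream` as a stand-in for the library mentioned above, and the class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper, not taken from the original project.
final class GzipHeaderCheck {

    /** Reads the 4-byte little-endian MTIME field at offset 4 of a gzip header. */
    static long readGzipMtime(byte[] gz) {
        if (gz.length < 10 || (gz[0] & 0xff) != 0x1f || (gz[1] & 0xff) != 0x8b) {
            throw new IllegalArgumentException("not a gzip stream");
        }
        return (gz[4] & 0xffL)
                | (gz[5] & 0xffL) << 8
                | (gz[6] & 0xffL) << 16
                | (gz[7] & 0xffL) << 24;
    }

    /** Compresses data with the JDK's GZIPOutputStream (a stand-in, see the text above). */
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }
}
```

If `readGzipMtime` returns the same value for two builds of the same content, the gzip header is off the hook and the difference has to be inside the compressed payload.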

Comparing the differences between the archives in my hex editor against the tar format docs made it clear that the culprit was the mtime header field of the tar entries. In our code we construct each tar entry from the file path and then add bytes from a stream to it. The constructor we invoke sets the entry's mtime to the current time, so it is different every time the archive is created. Eureka!!
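For reference, in the classic and ustar header layout every entry starts with a 512-byte header block, and the mtime is stored at offset 136 as 12 bytes of ASCII octal digits (seconds since the epoch). Here is a minimal sketch that decodes it from a raw header block; the class name is made up, and it deliberately ignores the binary (base-256) encoding that some tar writers use for very large values.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not taken from the original project.
final class TarHeaderMtime {

    static final int HEADER_LENGTH = 512;
    static final int MTIME_OFFSET = 136;
    static final int MTIME_LENGTH = 12;

    /** Decodes the octal mtime field of a tar header block into epoch seconds. */
    static long readMtime(byte[] header) {
        if (header.length < HEADER_LENGTH) {
            throw new IllegalArgumentException("a tar header block is 512 bytes long");
        }
        String field = new String(header, MTIME_OFFSET, MTIME_LENGTH, StandardCharsets.US_ASCII);
        // The field is terminated or padded with NUL bytes or spaces.
        String digits = field.replace('\0', ' ').trim();
        return digits.isEmpty() ? 0L : Long.parseLong(digits, 8);
    }
}
```

Two archives built from the same content should return the same value here for every entry; in our case they did not.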

The Solution

Now that was rather easy to fix: we calculate a timestamp value based on the date the content was created (with a few tweaks) that is constant for each single file, and set that as mtime for each tar entry we create. Et voilà :-)
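The idea boils down to deriving the mtime from the content instead of from the wall clock. A minimal sketch, assuming the content date is available as a `LocalDate` and that midnight UTC is good enough; the "few tweaks" mentioned above are not shown, and the class and method names are made up for illustration.

```java
import java.time.LocalDate;
import java.time.ZoneOffset;

// Hypothetical helper, not taken from the original project.
final class StableMtime {

    /**
     * Returns a timestamp in epoch seconds that depends only on the content date,
     * so the same content always gets the same mtime no matter when the archive is built.
     * Assumption: midnight UTC of the content date is used as the reference point.
     */
    static long forContentDate(LocalDate contentDate) {
        return contentDate.atStartOfDay(ZoneOffset.UTC).toEpochSecond();
    }
}
```

Tar stores mtime in seconds while `java.util.Date` works in milliseconds, so a conversion is needed if the tar library expects a `Date`.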

Sometimes They Come Back

A few weeks later we got reports about some strange download issues again, this time only with certain content packages, and it was absolutely not obvious where those came from. We had a lot of red herrings to hunt, and during one of those hunts I noticed that files that should be identical in some cases simply weren't. Again diff and hex editor came to the rescue, and I found that for some entries the mtime headers again did not match!!! So we dug into the code again... Did I forget to handle any tar entries? Nope, everything was covered... Any issues with entries being added twice or changed in other methods? Nope... nothing... In addition, after extracting the tar all files had the desired timestamp, with no sign of the other timestamps at all... Oh how I swore...

So back to checking the archive contents... then the tar docs... and again the archive contents... Finally I found a clue: a classic tar entry has only 100 bytes reserved for its name field, which contains not only the file name but also the complete path to the file within the directory hierarchy. The offending files lay within a rather deep directory hierarchy and had pretty long names themselves. To turn my suspicion into certainty I added a copy of a file with the correct timestamp to the content and altered its name to exceed the 100 byte limit. In the resulting tar the copy again had the invalid timestamp, and just then I noticed something else: for the copy's name there were two header entries, while the original file only had one.
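A quick way to find out which entries are affected is to measure the encoded length of each entry path against the 100-byte name field. A minimal sketch, with made-up names, assuming the paths are encoded as UTF-8; it treats a name of exactly 100 bytes as too long, which is a conservative assumption since tar writers differ on that boundary case.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the original project.
final class TarNameCheck {

    /** Size of the name field in a classic tar header. */
    static final int NAME_FIELD_LENGTH = 100;

    /** True if the full entry path (directories included) is likely to need a long-name header. */
    static boolean needsLongNameHeader(String entryPath) {
        // Assumption: UTF-8 encoding, and "exactly 100 bytes" already counts as too long.
        return entryPath.getBytes(StandardCharsets.UTF_8).length >= NAME_FIELD_LENGTH;
    }

    /** Returns all entry paths that are likely to trigger an additional long-name header. */
    static List<String> findLongNames(List<String> entryPaths) {
        List<String> result = new ArrayList<>();
        for (String path : entryPaths) {
            if (needsLongNameHeader(path)) {
                result.add(path);
            }
        }
        return result;
    }
}
```

Note that the length is counted in bytes, not characters, so paths with non-ASCII characters hit the limit earlier.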

So I did more research about long file names and header fields and dug into the code of the TarOutputStream class. To handle long file names tar can use additional POSIX or GNU headers, and in both cases the TarOutputStream creates additional tar entries using the same constructor as we did in our code. So if you have long file names in your archive, it will contain additional tar entries that always have their creation time as mtime header value!! That also explains why the extracted files looked fine: the extra entries only carry the long name and never end up as files on disk. Oh man, that was mean...
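To make those hidden entries visible, it helps to list every header block in the archive, not just the ones that end up as files. In the GNU variant the extra entry is conventionally named `././@LongLink` and has type flag `L`; the POSIX (pax) variant uses type flag `x`. Below is a minimal sketch of such a scanner over an uncompressed tar held in memory. The class is made up for illustration, it does not verify header checksums, and it stops at the first block whose name field is empty.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the original project.
final class TarHeaderScan {

    /** One 512-byte header block: name, type flag and mtime in epoch seconds. */
    static final class Header {
        final String name;
        final char typeFlag;
        final long mtime;

        Header(String name, char typeFlag, long mtime) {
            this.name = name;
            this.typeFlag = typeFlag;
            this.mtime = mtime;
        }
    }

    /** Lists all header blocks, including long-name helper entries. */
    static List<Header> scan(byte[] tar) {
        List<Header> headers = new ArrayList<>();
        int pos = 0;
        // Simplification: a block starting with a NUL byte is treated as end of archive.
        while (pos + 512 <= tar.length && tar[pos] != 0) {
            String name = text(tar, pos, 100);
            long size = octal(tar, pos + 124, 12);
            long mtime = octal(tar, pos + 136, 12);
            char typeFlag = (char) (tar[pos + 156] & 0xff);
            headers.add(new Header(name, typeFlag, mtime));
            // Skip the header block plus the data, which is padded to a multiple of 512 bytes.
            pos += 512 + (int) ((size + 511) / 512 * 512);
        }
        return headers;
    }

    /** Reads a NUL-terminated (or full-length) text field. */
    private static String text(byte[] b, int off, int len) {
        int end = off;
        while (end < off + len && b[end] != 0) {
            end++;
        }
        return new String(b, off, end - off, StandardCharsets.UTF_8);
    }

    /** Reads an ASCII octal number field; base-256 encoded values are not handled. */
    private static long octal(byte[] b, int off, int len) {
        String digits = text(b, off, len).trim();
        return digits.isEmpty() ? 0L : Long.parseLong(digits, 8);
    }
}
```

Any header with type flag `L` or `x` whose mtime differs between two builds is exactly the kind of entry that bit us here.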

The Solution - For Real This Time

The only way to fix this was to copy the code of TarOutputStream into our project (and TarBuffer, as that class is a dependency of the former with default scope) and adjust it so that we can set a custom timestamp for the additional entries. Not my favorite way of fixing things, but unfortunately we had no other choice here... So beware of tars with long file names if you need to rely on the checksum...


Wednesday, 12 June 2013
