md5 checksums when uploading to file picker

Question

Background

I'm working on integrating an existing app with File Picker. In our existing setup we are relying on md5 checksums to ensure data integrity. As far as I can see File Picker does not provide any md5 when they respond to an upload against the REST API (nor using JavaScript client).

S3 storage, md5 and data integrity

We are using S3 for storage, and as far as I know you may provide S3 with an md5 checksum when storing files so that Amazon may verify and reject storing request if data seems to be wrong.

To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.

I have investigated the etag header which Amazon returns a bit, and found that it isn't clear what actually is returned as etag. The Java documentation states:

Gets the hex encoded 128-bit MD5 hash of this object's contents as computed by Amazon S3.

The Ruby documentation states:

Generally the ETAG is the MD5 of the object. If the object was uploaded using multipart upload then this is the MD5 all of the upload-part-md5s

Another place in their documentation I found this:

The entity tag is a hash of the object. The ETag only reflects changes to the contents of an object, not its metadata. The ETag is determined when an object is created. For objects created by the PUT Object operation and the POST Object operation, the ETag is a quoted, 32-digit hexadecimal string representing the MD5 digest of the object data. For other objects, the ETag may or may not be an MD5 digest of the object data. If the ETag is not an MD5 digest of the object data, it will contain one or more non-hexadecimal characters and/or will consist of less than 32 or more than 32 hexadecimal digits.

This seems to describe how etag is actually calculated on S3, and this stack overflow post seems to imply the same thing: Etag cannot be trusted to always be equal to the file MD5.

So - here are my questions

In general, how does file picker store files to s3? Is multipart post requests used?
I see that when I do a HEAD request against for example https://www.filepicker.io/api/file/<file handle> I do get an etag header back. The etag I get back do indeed match the md5 of the file I have uploaded. Are the headers returned more or less taken from S3 directly? Or is this actually an md5 calculated by filepicker which I can trust?
Is it possible to have an explicit statement of the md5 returned to clients of File Picker's API? For instance when we POST a file we get a JSON structure back including the URL to the file and it's size. Could md5 be included here?
Is it possible to provide File Picker with an md5 which in turn will be used when posting files to S3 so we can get an end-to-end check on files?

brettcvz brettcvz · Accepted Answer · 2014-03-14T03:11:20

Yes, we use the python boto library to be specific.
The ETag is pulled from S3.
& 4. It's been considered and is in our backlog, but hasn't been implemented yet.

md5 checksums when uploading to file picker

Background

S3 storage, md5 and data integrity

So - here are my questions

1 Answers