Getting zip/rar structure without full downloading

Question

Is it possible to find out what is inside an archive on a website without downloading it in full? For example, I want to know whether there is a PDF file inside. If there is, I will download that zip/rar; if not, I'll skip it. So, is it possible to fetch a small part of the archive and decode the folder/file structure from it?

What language and library are you using to decompress the zip/rar? It's possible that each will act differently. — Lynn Crumbling
I have not decided yet. I code in Java, so if you can suggest something, that would be great. — user3365404

Gfy · Accepted Answer · 2014-03-01T09:47:19

Yes, this is possible, but it also depends on the server you are downloading from: it has to support HTTP range requests, which you will need in order to fetch only the pieces of the file you are interested in.
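As a rough Java sketch of the range-request step (the class and method names are made up for illustration; `HttpURLConnection` is the standard JDK class): send a `Range` header and treat anything other than 206 Partial Content as a sign that the server does not honour ranges.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper, not part of any library.
class RangeFetch {
    /** Range header value asking for the last n bytes of a resource (a "suffix" range). */
    static String suffixRange(int n) {
        return "bytes=-" + n;
    }

    /** Range header value asking for bytes first..last, both inclusive. */
    static String byteRange(long first, long last) {
        return "bytes=" + first + "-" + last;
    }

    /** Fetches one range; throws unless the server answers 206 Partial Content. */
    static byte[] fetch(String url, String rangeValue) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestProperty("Range", rangeValue);
        try {
            int status = conn.getResponseCode();
            if (status != HttpURLConnection.HTTP_PARTIAL) {
                // A plain 200 would mean the server is about to send the whole file.
                throw new IOException("Range request not honoured: HTTP " + status);
            }
            try (InputStream in = conn.getInputStream()) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[8192];
                int n;
                while ((n = in.read(buf)) != -1) {
                    out.write(buf, 0, n);
                }
                return out.toByteArray();
            }
        } finally {
            conn.disconnect();
        }
    }
}
```

Something like `RangeFetch.fetch(url, RangeFetch.suffixRange(65557))` would then ask for the last 65,557 bytes of the file, which is a 22-byte ZIP end record plus the largest comment a ZIP can carry (65,535 bytes).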

For ZIP files, you want the central directory records at the end of the file. Fetch a chunk from the end of the file and look for the End of Central Directory record (EOCD). If the archive has no comment, the EOCD is exactly the last 22 bytes, and it starts with the signature 0x06054b50 (stored little-endian, so the bytes on disk are 50 4B 05 06). This record holds the offset at which the central directory starts, relative to the start of the archive, along with the directory's size. Check whether your first request already covers the whole central directory; if not, request the missing part. After that, you only have to interpret the central directory file headers to see whether there is a PDF file inside the ZIP. Information about the file format can be found on the Wikipedia page or in one of the references listed there.
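The ZIP steps above could look roughly like this in Java. This is a minimal sketch, not a full parser: the class and method names are invented, ZIP64 archives are not handled, and file names are decoded as UTF-8 for simplicity (the format's default is CP437 unless the UTF-8 flag is set). The field offsets follow the published ZIP layout.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library.
class ZipTail {
    static final int EOCD_SIG = 0x06054b50; // End of central directory record
    static final int CEN_SIG = 0x02014b50;  // Central directory file header

    /** Index of the EOCD inside tail (the last bytes of the archive), or -1. */
    static int findEocd(byte[] tail) {
        ByteBuffer b = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN);
        for (int i = tail.length - 22; i >= 0; i--) {
            // Also require the comment length to reach exactly to the end of the
            // file, so a signature inside the comment is not mistaken for the EOCD.
            if (b.getInt(i) == EOCD_SIG
                    && (b.getShort(i + 20) & 0xFFFF) == tail.length - i - 22) {
                return i;
            }
        }
        return -1;
    }

    /** Returns {offset, size} of the central directory, read from the EOCD. */
    static long[] centralDirectoryRange(byte[] tail, int eocd) {
        ByteBuffer b = ByteBuffer.wrap(tail).order(ByteOrder.LITTLE_ENDIAN);
        long size = b.getInt(eocd + 12) & 0xFFFFFFFFL;
        long offset = b.getInt(eocd + 16) & 0xFFFFFFFFL;
        return new long[] {offset, size};
    }

    /** File names listed in the raw central directory bytes. */
    static List<String> names(byte[] cd) {
        ByteBuffer b = ByteBuffer.wrap(cd).order(ByteOrder.LITTLE_ENDIAN);
        List<String> out = new ArrayList<>();
        int p = 0;
        while (p + 46 <= cd.length && b.getInt(p) == CEN_SIG) {
            int nameLen = b.getShort(p + 28) & 0xFFFF;
            int extraLen = b.getShort(p + 30) & 0xFFFF;
            int commentLen = b.getShort(p + 32) & 0xFFFF;
            if (p + 46 + nameLen > cd.length) {
                break; // truncated input
            }
            // Assumption: UTF-8 names; see the note above about CP437.
            out.add(new String(cd, p + 46, nameLen, StandardCharsets.UTF_8));
            p += 46 + nameLen + extraLen + commentLen;
        }
        return out;
    }
}
```

If the offset returned by `centralDirectoryRange` lies before the start of the chunk you fetched, one more range request for `offset` to `offset + size - 1` gets the rest; after that, checking the names for a `.pdf` suffix is a plain string test.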

Doing the same for RAR files is harder because there is no single place that holds all the metadata. You need to check the file header blocks, which are spread throughout the RAR. If the archive contains only one file, you can just grab the first X bytes and check that. See the RAR TechNote.txt for how to parse a RAR file.
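For the RAR case, a hedged Java sketch of walking the header blocks might look like the following. It assumes the older RAR 4.x layout described in TechNote.txt (7-byte base header, file header type 0x74, name starting at byte 32 of the header); RAR 5 archives use a different format and are not covered. The class name is invented, and names are cut at the first zero byte and decoded as UTF-8, which is a simplification.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of any library. RAR 4.x only.
class Rar4Walk {
    static final int FILE_HEAD = 0x74;

    static final class Result {
        final List<String> names = new ArrayList<>();
        long nextHeader; // absolute archive offset of the first header not yet parsed
    }

    /** Walks the blocks in buf, which holds archive bytes starting at baseOffset. */
    static Result walk(byte[] buf, long baseOffset) {
        ByteBuffer b = ByteBuffer.wrap(buf).order(ByteOrder.LITTLE_ENDIAN);
        Result r = new Result();
        r.nextHeader = baseOffset;
        long pos = 0;
        while (pos + 7 <= buf.length) {
            int p = (int) pos;
            int type = b.get(p + 2) & 0xFF;
            int flags = b.getShort(p + 3) & 0xFFFF;
            int headSize = b.getShort(p + 5) & 0xFFFF;
            if (headSize < 7 || pos + headSize > buf.length) {
                break; // malformed, or header only partly downloaded
            }
            long addSize = 0;
            if ((flags & 0x8000) != 0) { // block is followed by addSize bytes of data
                if (headSize < 11) break;
                addSize = b.getInt(p + 7) & 0xFFFFFFFFL;
            }
            if (type == FILE_HEAD) {
                if (headSize < 32) break;
                int nameSize = b.getShort(p + 26) & 0xFFFF;
                int nameAt = p + 32;
                if ((flags & 0x100) != 0) { // large file: two extra 4-byte size fields
                    if (headSize < 40) break;
                    addSize |= (b.getInt(p + 32) & 0xFFFFFFFFL) << 32;
                    nameAt += 8;
                }
                if (nameAt + nameSize > p + headSize) break;
                int len = 0; // keep only the part before an optional zero byte
                while (len < nameSize && buf[nameAt + len] != 0) len++;
                r.names.add(new String(buf, nameAt, len, StandardCharsets.UTF_8));
            }
            pos += headSize + addSize; // skip the header and the packed data
            r.nextHeader = baseOffset + pos;
        }
        return r;
    }
}
```

The idea is to fetch a small chunk from the start, call `walk`, and then request another small range beginning at `nextHeader`, repeating until the archive ends. Each round trip skips over the packed file data, so only the headers are downloaded.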

I've done the same thing for RAR files, but from Usenet, based on an NZB file. The resulting RAR metadata is collected in an SRR file. That and other RAR-related code can be found in the pyReScene project. Doing the same over HTTP will be a lot easier because you can ignore the yEnc encoding and can be more precise in selecting byte ranges.
