I've done a bulk download from archive.org using wget, set up to fetch all the files for each IDENTIFIER into its own folder:
wget -r -H -nc -np -nH -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'
This results in folders organised like so under the download root, for example:
./IDENTIFIER1/file.blah
./IDENTIFIER1/something.la
./IDENTIFIER1/thumbnails/IDENTIFIER_thumb001.gif
./IDENTIFIER1/thumbnails/IDENTIFIER_thumb002.gif
./IDENTIFIER1/IDENTIFIER_files.xml
./IDENTIFIER2/etc.etc
./IDENTIFIER2/blah.blah
./IDENTIFIER2/thumbnails/IDENTIFIER_thumb001.gif
etc
The IDENTIFIER is the name of an archive.org item (a collection of files), so each folder also contains a file called IDENTIFIER_files.xml, which holds checksums for every file in that folder, wrapped in various XML tags.
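For anyone who hasn't opened one of these manifests, the layout looks roughly like the sketch below (file names and hash values here are made up for illustration; check a real *_files.xml for the exact attributes):

```xml
<files>
  <file name="thumbnails/IDENTIFIER_thumb001.gif" source="original">
    <md5>d41d8cd98f00b204e9800998ecf8427e</md5>
    <sha1>da39a3ee5e6b4b0d3255bfef95601890afd80709</sha1>
  </file>
  <file name="file.blah" source="original">
    <md5>9e107d9d372bb6826bd81d3542a419d6</md5>
    <sha1>2fd4e1c67a2d28fced849ee1bb76e7391b93eb12</sha1>
  </file>
</files>
```

The name attribute is relative to the item's folder, which is why the paths match the local tree produced by wget.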
Since this is a bulk download and there are hundreds of files, the idea is to write some sort of script (preferably bash? Edit: Maybe PHP?) that can select each .xml file and scrape it for the hashes, then test them against the files to reveal any corrupted, failed or modified downloads.
For example:
From archive.org/details/NuclearExplosion, XML is:
https://archive.org/download/NuclearExplosion/NuclearExplosion_files.xml
If you check that link you can see that the XML offers both MD5 and SHA1 hashes, as well as the relative file paths in the file tag (which will be the same as locally).
So, how do we:

1. For each IDENTIFIER folder, select and scrape the XML for each filename and the checksum of choice;
2. Actually test the checksum for each file;
3. Log failed checksums to a file that lists only the failed IDENTIFIERs (say a file called ./RetryIDs.txt), so a download reattempt can be tried using that list:

wget -r -H -nc -np -nH -e robots=off -l1 -i ./RetryIDs.txt -B 'http://archive.org/download/'
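The three steps above can be sketched in plain shell. This is only a sketch, untested against a real archive.org dump: it scrapes the `<md5>` lines with awk rather than doing real XML parsing, assumes GNU md5sum is available, assumes one `<md5>` entry per `<file name="...">` element, and writes RetryIDs.txt into the root it is given.

```shell
#!/bin/sh
# Sketch: verify each IDENTIFIER folder against its *_files.xml manifest
# and append identifiers with any missing/corrupt file to RetryIDs.txt.
verify_tree() {
    root="${1:-.}"
    retry="$root/RetryIDs.txt"
    : > "$retry"                                   # start with an empty retry list

    for xml in "$root"/*/*_files.xml; do
        [ -e "$xml" ] || continue                  # glob matched nothing
        dir=${xml%/*}
        id=${dir##*/}

        # Emit "md5 filename" pairs. A text scrape, not real XML parsing.
        bad=$(awk -F'"' '
                /<file name=/ { name = $2 }
                /<md5>/       { line = $0
                                gsub(/.*<md5>|<\/md5>.*/, "", line)
                                print line, name }' "$xml" |
              while read -r md5 name; do
                  case "$name" in *_files.xml) continue ;; esac   # skip the manifest itself
                  path="$dir/$name"
                  if [ ! -f "$path" ]; then
                      echo "MISSING $id/$name"
                  elif [ "$(md5sum "$path" | cut -d' ' -f1)" != "$md5" ]; then
                      echo "CORRUPT $id/$name"
                  fi
              done)

        if [ -n "$bad" ]; then
            printf '%s\n' "$bad"                   # report each failed file
            echo "$id" >> "$retry"                 # one retry line per identifier
        fi
    done
}
```

After running it over the download root, ./RetryIDs.txt should contain one identifier per line, ready to feed back to the retry wget above. The awk scrape will break on unusual manifests, which is exactly why a language with a real XML parser would be more robust.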
Any leads on how to piece this together would be extremely helpful.
And an added incentive: if there is a solution, it would probably be a good idea to let archive.org know so they can put it on their blog. I'm sure I'm not the only one who will find this very useful!
Thanks all in advance.
Edit: Okay, so a bash script looks tricky. Could it be done with PHP?
Comment: I would suggest you not try to write a bash script to do this. Use a higher-level language (python, ruby, perl, whatever) that can do actual XML parsing. Then try writing something and come back with code and specific technical questions. – larsks