
I've done a bulk download from archive.org using wget, which was set up to fetch all of the files for each IDENTIFIER into its respective folder.

wget -r -H -nc -np -nH -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'

This results in folders organised like so under the download root, for example:

./IDENTIFIER1/file.blah
./IDENTIFIER1/something.la
./IDENTIFIER1/thumbnails/IDENTIFIER_thumb001.gif
./IDENTIFIER1/thumbnails/IDENTIFIER_thumb002.gif
./IDENTIFIER1/IDENTIFIER_files.xml

./IDENTIFIER2/etc.etc
./IDENTIFIER2/blah.blah
./IDENTIFIER2/thumbnails/IDENTIFIER_thumb001.gif

etc.

The IDENTIFIER is the name of a collection of files on archive.org; hence, in each folder there is also a file called IDENTIFIER_files.xml, which contains checksums for all the files in that folder, wrapped in various XML tags.
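
To illustrate, the inside of each IDENTIFIER_files.xml looks roughly like this (abridged, with made-up hashes):

    <files>
      <file name="file.blah" source="original">
        <md5>0123456789abcdef0123456789abcdef</md5>
        <sha1>0123456789abcdef0123456789abcdef01234567</sha1>
      </file>
      ...
    </files>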

Since this is a bulk download and there are hundreds of files, the idea is to write some sort of script (preferably bash? Edit: Maybe PHP?) that can select each .xml file and scrape it for the hashes, then test them against the files to reveal any corrupted, failed or modified downloads.

For example:

From archive.org/details/NuclearExplosion, the XML is:

https://archive.org/download/NuclearExplosion/NuclearExplosion_files.xml

If you check that link you can see the XML offers both MD5 and SHA1 hashes, as well as the relative path of each file in its file tag (which will be the same as the local path).

So. How do we:

  1. For each IDENTIFIER folder, select its XML file and scrape it for each filename and the checksum of choice;

  2. Actually test the checksum for each file;

  3. Log failed checksums to a file that lists only the failed IDENTIFIERs (say, a file called ./RetryIDs.txt), so a download reattempt can be tried using that list...

    wget -r -H -nc -np -nH -e robots=off -l1 -i ./RetryIDs.txt -B 'http://archive.org/download/'
    

Any leads on how to piece this together would be extremely helpful.

And as an added incentive: if there is a solution, it's probably a good idea to let archive.org know so they can put it on their blog. I'm sure I'm not the only one who will find this very useful!

Thanks all in advance.


Edit: Okay, so a bash script looks tricky. Could it be done with PHP?

Comment from larsks: For starters, you probably don't want to use a bash script to do this. Use a higher-level language (Python, Ruby, Perl, whatever) that can do actual XML parsing. Then try writing something and come back with code and specific technical questions.

1 Answer


If you really want to go the bash route, here's something to get you started. You can use the xml2 suite of tools to convert the XML into something more amenable to traditional shell scripting, and then do something like this:

#!/bin/sh

# Flatten the XML with xml2 ("path=value" lines), then pull out each
# file's name and SHA1 with awk. Note the quoted "$1" in case the
# path to the XML file contains spaces.
xml2 < "$1" | awk -F= '
    $1 == "/files/file/@name" {name=$2}
    $1 == "/files/file/sha1" {
        sha1=$2
        print name, sha1
    }
'

This will produce on standard output a list of filenames and their corresponding SHA1 checksums. That should get you substantially closer to a solution.
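
For example (made-up hashes):

file.blah 0123456789abcdef0123456789abcdef01234567
something.la fedcba9876543210fedcba9876543210fedcba98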

Actually using that output to validate the files is left as an exercise to the reader.
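
If you do want to push on in shell, though, here's a rough, untested sketch of how the remaining pieces could fit together. It assumes GNU sha1sum is available and that you run it from the download root; it reuses the extraction above, but rearranged into the "hash  filename" layout that sha1sum -c expects, and appends the folder name of anything that fails to ./RetryIDs.txt:

#!/bin/sh

# Sketch: verify each IDENTIFIER folder against its *_files.xml and
# collect the names of failing folders in ./RetryIDs.txt.

: > ./RetryIDs.txt    # start with an empty retry list

for xml in ./*/*_files.xml; do
    dir=$(dirname "$xml")
    checks=$(mktemp)
    # Emit "sha1  name" lines, the format sha1sum -c expects.
    # Entries without a <sha1> element are skipped automatically.
    xml2 < "$xml" | awk -F= '
        $1 == "/files/file/@name" {name=$2}
        $1 == "/files/file/sha1"  {print $2 "  " name}
    ' > "$checks"
    # Run the check from inside the folder so relative paths match.
    if ! (cd "$dir" && sha1sum -c --quiet "$checks"); then
        basename "$dir" >> ./RetryIDs.txt
    fi
    rm -f "$checks"
done

./RetryIDs.txt should then be ready to feed straight back into the wget command from the question.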