The National Speech Corpus is a Natural Language Processing corpus of Singaporeans speaking English, which can be found here: https://www.imda.gov.sg/programme-listing/digital-services-lab/national-speech-corpus.
When you sign up for the free corpus, you are directed to a Dropbox folder. The corpus is 1 TB and (as of this writing) has four parts. I only wanted to download PART 1, but even that part contains 1446 zip files, each of which is quite large. My question is: how do I programmatically download many large files from Dropbox onto a Linux (Ubuntu 16.04) VM using only the command line?
The directory tree for the relevant part looks like:
root
|-LEXICON
|-PART1
  |-DATA
    |-CHANNEL0
      |-WAVE
        |-SPEAKER0001.zip
        |-SPEAKER0002.zip
        ...
        |-SPEAKER1446.zip
I looked into a few different approaches:
1. Downloading the WAVE parent directory using a shared link via the wget command, as described in this question. This didn't work; I received the following error: Reusing existing connection to www.dropbox.com:443 HTTP request sent, awaiting response... 400 Bad Request 2021-01-06 23:09:06 ERROR 400: Bad Request. I assumed this was because the WAVE directory was too large for Dropbox to zip. (The first sketch after this list shows the command I used.)
2. Based on this post, downloading the HTML of the WAVE parent directory and extracting the direct links to the individual zip files from it. However, the direct links to the individual files were not in the HTML file.
3. Based on the same post as in (2), creating a shared link for each zip file using the Dropbox API. This seemed too cumbersome, but a rough sketch of the API route appears after this list.
4. Downloading the Linux Dropbox client and syncing only the relevant files, as outlined in this installation guide (see the last sketch below).
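For approach 1, this is roughly the command I ran; the shared-link URL is a placeholder, and appending ?dl=1 asks Dropbox to serve the folder as a single zip download. For the full WAVE directory it produced the 400 error above, though the same form may still work for individual files or smaller folders:

# Placeholder shared link; ?dl=1 requests a direct (zipped) download.
# This form returned "400 Bad Request" for the full WAVE directory,
# presumably because the folder is too large for Dropbox to zip.
wget --max-redirect=20 -O WAVE.zip "https://www.dropbox.com/sh/XXXXXXXX/YYYYYYYY?dl=1"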
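For approach 3, here is a rough sketch of what the API route could look like, in case it helps someone. Everything here is an assumption: it presumes the corpus folder is mounted in your own Dropbox account, that $TOKEN holds a Dropbox API access token, and that jq is installed. Also, instead of minting a shared link per file as the post suggests, it pulls each file straight through the /2/files/download endpoint:

# Assumptions: the WAVE folder is mounted in your own Dropbox, $TOKEN
# holds an API access token, and jq is installed. Paths are illustrative.
# A real script would also follow "has_more" via /2/files/list_folder/continue,
# since list_folder pages its results.
TOKEN="..."   # placeholder access token

curl -s -X POST https://api.dropboxapi.com/2/files/list_folder \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"path": "/PART1/DATA/CHANNEL0/WAVE"}' \
  | jq -r '.entries[].path_lower' \
  | while read -r path; do
      # Download each zip directly instead of creating per-file shared links.
      curl -s -X POST https://content.dropboxapi.com/2/files/download \
        -H "Authorization: Bearer $TOKEN" \
        -H "Dropbox-API-Arg: {\"path\": \"$path\"}" \
        -o "$(basename "$path")"
    done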
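For approach 4, the commands below mirror Dropbox's headless-install instructions for Linux; double-check the current installation page before running them, since URLs and steps may change. The folder names passed to exclude are illustrative; the point is to use selective sync so the whole 1 TB corpus doesn't come down:

# Fetch and unpack the headless Dropbox daemon (per Dropbox's install page).
cd ~ && wget -O - "https://www.dropbox.com/download?plat=lnx.x86_64" | tar xzf -

# The first run prints a URL; open it in a browser to link the VM to your account.
~/.dropbox-dist/dropboxd

# Dropbox's helper CLI for controlling the daemon.
wget -O dropbox.py "https://www.dropbox.com/download?dl=packages/dropbox.py"
chmod +x dropbox.py

# Exclude everything except the data you want, so only PART1 syncs.
# Folder names here are illustrative; run "exclude" from inside ~/Dropbox.
./dropbox.py exclude add LEXICON PART2 PART3 PART4
./dropbox.py status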
In the end, the fourth option worked for me, but I wanted to post this investigation for anyone who needs to download this dataset in the future. I would also like to hear whether anyone has a better approach.