I'm trying to use wget to elegantly & politely download all the pdfs from a website. The pdfs live in various sub-directories under the starting URL. It appears that the -A pdf option is conflicting with the -r option. But I'm not a wget expert! This command:
wget -nd -np -r site/path
faithfully traverses the entire site downloading everything downstream of path (not polite!). This command:
wget -nd -np -r -A pdf site/path
finishes immediately having downloaded nothing. Running that same command in debug mode:
wget -nd -np -r -A pdf -d site/path
reveals that the sub-directories are ignored with the debug message:
Deciding whether to enqueue "https://site/path/subdir1". https://site/path/subdir1 (subdir1) does not match acc/rej rules. Decided NOT to load it.
I think this means that the sub directories did not satisfy the "pdf" filter and were excluded. Is there a way to get wget to recurse into sub directories (of random depth) and only download pdfs (into a single local dir)? Or does wget need to download everything and then I need to manually filter for pdfs afterward?
UPDATE: thanks to everyone for their ideas. The solution was to use a two step approach including a modified version of this: http://mindspill.net/computing/linux-notes/generate-list-of-urls-using-wget/