wget recursion and file extraction

Question

I'm trying to use wget to elegantly and politely download all the PDFs from a website. The PDFs live in various sub-directories under the starting URL. The -A pdf option appears to conflict with the -r option, but I'm not a wget expert! This command:

wget -nd -np -r site/path

faithfully traverses the entire site downloading everything downstream of path (not polite!). This command:

wget -nd -np -r -A pdf site/path

finishes immediately having downloaded nothing. Running that same command in debug mode:

wget -nd -np -r -A pdf -d site/path

reveals that the sub-directories are ignored with the debug message:

Deciding whether to enqueue "https://site/path/subdir1". https://site/path/subdir1 (subdir1) does not match acc/rej rules. Decided NOT to load it.

I think this means that the sub-directories did not match the "pdf" accept filter and were excluded from the crawl. Is there a way to get wget to recurse into sub-directories (of arbitrary depth) and download only the PDFs (into a single local directory)? Or does wget need to download everything, leaving me to filter for PDFs manually afterward?

UPDATE: thanks to everyone for their ideas. The solution was to use a two step approach including a modified version of this: http://mindspill.net/computing/linux-notes/generate-list-of-urls-using-wget/

sicsmpr · Accepted Answer · 2020-09-02T19:26:35

UPDATE: thanks to everyone for their ideas. The solution was to use a two step approach including a modified version of this: http://mindspill.net/computing/linux-notes/generate-list-of-urls-using-wget/
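For readers who want to see what such a two-step approach might look like, here is a minimal sketch. It is not necessarily the script from the linked page or the exact commands used above. The function names and the list file name are made up for illustration, and it assumes the PDF URLs end in ".pdf" and that wget's log lines have the usual "--date time--  URL" shape. Step one crawls in spider mode without -A, so the sub-directories are not filtered out, and saves nothing. Step two downloads only the URLs that were collected.

```shell
#!/bin/sh
# Sketch of a two-step crawl. Function names are illustrative, not from the post.

# Read a wget log on stdin and print the unique URLs that end in .pdf.
# wget starts each request line with "--<date> <time>--  <url>",
# so the URL is the third whitespace-separated field.
extract_pdf_urls() {
    grep '^--' | awk '{ print $3 }' | grep -i '\.pdf$' | sort -u
}

# Step 1: walk the site in spider mode (nothing is saved), waiting one
# second between requests, and write the PDF URLs found to a list file.
#   $1 = starting URL, $2 = output list file
list_pdfs() {
    wget --spider -r -np -w 1 "$1" 2>&1 | extract_pdf_urls > "$2"
}

# Step 2: download only the listed URLs into the current directory.
#   $1 = list file produced by list_pdfs
fetch_pdfs() {
    wget -nd -w 1 -i "$1"
}
```

With those functions loaded, something like list_pdfs https://site/path/ pdf-urls.txt followed by fetch_pdfs pdf-urls.txt would run both steps. The -w 1 pause is one way to keep the crawl polite; adjust it to suit the server.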

