
wget recurses down to the second-to-bottom level of the site and goes no further. If I specify a bottom-level HTML file as the source, it parses that file and follows its links. I think this may be caused by the PDF files linked from the HTML documents living under a different root path on the server. I need to retrieve all the PDF files from the leaves of this hierarchy, since I am going to promote them together as part of a campaign for depression awareness.

I am using GNU Wget 1.19.4 built on linux-gnu.

I have tried --exclude, --exclude-directories, -l2, -l10, --continue and many other switches. I need to use the --include switches or wget grabs the entire site, and if I use -np it won't go "up" into /docs.
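To illustrate the -np problem: as I understand it, --no-parent anchors the crawl at the starting directory, so a run like this can never reach /docs at all:

# -np ("no parent") refuses to ascend above the starting directory,
# so everything under /docs is off limits from this start URL:
wget -r -np https://www.beyondblue.org.au/about-us/research-projects/research-projects/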

This command gets me the HTML files but does not follow the links in the bottom-most HTML files.

wget --mirror --include docs/default-source/research-project-files --include about-us/research-projects/research-projects/ https://www.beyondblue.org.au/about-us/research-projects/research-projects/

This command, when I manually specify the bottom-level HTML file, does get the PDF files linked from it.

wget --mirror --include docs/default-source/research-project-files --include about-us/research-projects/research-projects https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research

I want wget to visit all the HTML files in this branch, pull every PDF link out of them, and retrieve all the PDF files from /docs. Here is one of those HTML pages:

https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research

Here is one of the PDFs. The /docs directory has no index listing, so the files can only be reached through the HTML pages.

https://www.beyondblue.org.au/docs/default-source/research-project-files/online-forums-2015-report.pdf?sfvrsn=3d00adea_2
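One wrinkle I noticed is the ?sfvrsn= query string on the PDF links. As I understand it, -A/--accept matches against the trailing file name (which here ends in the query string, not .pdf), so a plain -A pdf might not match these, whereas --accept-regex is matched against the complete URL. If a filter turns out to be necessary, I would expect something along these lines to keep the HTML branch crawlable while accepting the PDFs (an untested sketch, and the pattern is my own guess):

# Accept pages inside the research-projects branch plus anything
# whose URL contains ".pdf", with or without a query string:
wget -r --include-directories=/about-us/research-projects/research-projects,/docs/default-source/research-project-files --accept-regex '/research-projects/|\.pdf' https://www.beyondblue.org.au/about-us/research-projects/research-projects/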

The best I can get wget to do is walk the site and get HTML files down to this level:

https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research
https://www.beyondblue.org.au/about-us/research-projects/research-projects/networks-of-advocacy-and-influence-peer-mentors-in-beyond-blue-s-mental-health-forums
... (150 of them in total)

It seems like a depth-limiting setting or a path-traversal restriction of some kind. I suspect it's an easy one to spot. Thanks again!

1 Answer


Alright, it looks like wget may crawl breadth-first, meaning it fetches everything at the current level before recursing into the pages it has found. I'm not sure of this, but when I let the command below run to completion, it retrieved all the leaf HTML files first and only then recursed into them.

wget -r --verbose --include /docs/default-source/research-project-files/,/about-us/research-projects/research-projects/ https://www.beyondblue.org.au/about-us/research-projects/research-projects/

So when I ran this earlier and killed it once it appeared to stall at the bottom HTML layer without fetching the PDFs, I was simply stopping it too early.
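If a single pass ever proves unreliable, a two-pass approach should also work: mirror the HTML branch first, then pull the PDF URLs out of the saved pages and feed them back to wget. This is an untested sketch, and the grep pattern assumes the links appear as double-quoted absolute or root-relative hrefs in the HTML:

# Pass 1: mirror only the HTML branch.
wget -r --include-directories=/about-us/research-projects/research-projects https://www.beyondblue.org.au/about-us/research-projects/research-projects/

# Pass 2: extract the quoted PDF links from the saved pages, turn
# root-relative paths into absolute URLs, and fetch them into the
# current directory (-nd skips recreating the directory tree).
grep -rhoE '"[^"]*/docs/default-source/research-project-files/[^"]*"' www.beyondblue.org.au/about-us/research-projects/research-projects/ | tr -d '"' | sed 's|^/|https://www.beyondblue.org.au/|' | sort -u | wget -nd -i -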