
wget recurses down to the second-to-bottom level of the site and goes no further. If I specify a bottom-level HTML file as the source, it parses that file and follows its links. I think this may be caused by the PDF files linked from the HTML documents living under a different root path on the server. I need to retrieve all the PDF files from the leaves of this hierarchy, since I am going to promote them together as part of a campaign for depression awareness.

I am using GNU Wget 1.19.4 built on linux-gnu.

I have tried --exclude, --exclude-directories, -l2, -l10, --continue and many other switches. I need to use the --include switches or wget grabs the entire site, and if I use -np it won't go "up" into /docs.
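To illustrate the -np problem: as I understand it, --no-parent anchors the crawl at the starting directory, so a run like this can never reach /docs at all:

# -np ("no parent") refuses to ascend above the starting directory,
# so everything under /docs is off limits from this start URL:
wget -r -np https://www.beyondblue.org.au/about-us/research-projects/research-projects/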

This command gets me the HTML files but does not follow the links in the bottom-most HTML files.

wget --mirror --include docs/default-source/research-project-files --include about-us/research-projects/research-projects/ https://www.beyondblue.org.au/about-us/research-projects/research-projects/

This command, when I manually specify the bottom-level HTML file, does get the PDF files linked from it.

wget --mirror --include docs/default-source/research-project-files --include about-us/research-projects/research-projects https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research

I want wget to visit all the HTML files in this branch, pull every PDF link out of them, and retrieve all the PDF files from /docs. Here is one of those HTML pages:

https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research

Here is one of the PDFs. The /docs directory has no index listing, so the files can only be reached through the HTML pages.

https://www.beyondblue.org.au/docs/default-source/research-project-files/online-forums-2015-report.pdf?sfvrsn=3d00adea_2
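One wrinkle I noticed is the ?sfvrsn= query string on the PDF links. As I understand it, -A/--accept matches against the trailing file name (which here ends in the query string, not .pdf), so a plain -A pdf might not match these, whereas --accept-regex is matched against the complete URL. If a filter turns out to be necessary, I would expect something along these lines to keep the HTML branch crawlable while accepting the PDFs (an untested sketch, and the pattern is my own guess):

# Accept pages inside the research-projects branch plus anything
# whose URL contains ".pdf", with or without a query string:
wget -r --include-directories=/about-us/research-projects/research-projects,/docs/default-source/research-project-files --accept-regex '/research-projects/|\.pdf' https://www.beyondblue.org.au/about-us/research-projects/research-projects/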

The best I can get wget to do is walk the site and get HTML files down to this level:

https://www.beyondblue.org.au/about-us/research-projects/research-projects/online-forums-user-research
https://www.beyondblue.org.au/about-us/research-projects/research-projects/networks-of-advocacy-and-influence-peer-mentors-in-beyond-blue-s-mental-health-forums
... (150 of them in total)

It seems like a depth-limiting setting or a path-traversal restriction of some kind. I suspect it's an easy one to spot. Thanks again!

1 Answer


Alright, it looks like wget may crawl breadth-first, meaning it fetches everything at the current level before recursing into the pages it has found. I'm not sure of this, but when I let the command below run to completion, it retrieved all the leaf HTML files first and only then recursed into them.

wget -r --verbose --include /docs/default-source/research-project-files/,/about-us/research-projects/research-projects/ https://www.beyondblue.org.au/about-us/research-projects/research-projects/

So when I ran this earlier and killed it once it appeared to stall at the bottom HTML layer without fetching the PDFs, I was simply stopping it too early.
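If a single pass ever proves unreliable, a two-pass approach should also work: mirror the HTML branch first, then pull the PDF URLs out of the saved pages and feed them back to wget. This is an untested sketch, and the grep pattern assumes the links appear as double-quoted absolute or root-relative hrefs in the HTML:

# Pass 1: mirror only the HTML branch.
wget -r --include-directories=/about-us/research-projects/research-projects https://www.beyondblue.org.au/about-us/research-projects/research-projects/

# Pass 2: extract the quoted PDF links from the saved pages, turn
# root-relative paths into absolute URLs, and fetch them into the
# current directory (-nd skips recreating the directory tree).
grep -rhoE '"[^"]*/docs/default-source/research-project-files/[^"]*"' www.beyondblue.org.au/about-us/research-projects/research-projects/ | tr -d '"' | sed 's|^/|https://www.beyondblue.org.au/|' | sort -u | wget -nd -i -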