I wanted to download all the files from a webpage. I tried wget, but it was failing, so I decided to go the Python route and found this thread.
After reading it, I made a little command-line application, soupget, expanding on the excellent answers of PabloG and Stan and adding some useful options.
It uses BeautifulSoup to collect all the URLs on the page and then downloads the ones with the desired extension(s). It can also download multiple files in parallel.
Here it is:
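from __future__ import (division, absolute_import, print_function, unicode_literals)
import sys, os, argparse
from bs4 import BeautifulSoup

# urllib2 and urlparse are Python 2 module names; alias the Python 3 equivalents
try:
    import urllib.request as urllib2
    import urllib.parse as urlparse
except ImportError:
    import urllib2
    import urlparse

def collect_all_url(page_url, extensions):
    """Recovers all links in page_url checking for all the desired extensions"""
    conn = urllib2.urlopen(page_url)
    html = conn.read()
    soup = BeautifulSoup(html, 'lxml')
    links = soup.find_all('a')
    results = []
    for tag in links:
        link = tag.get('href', None)
        if link is not None:
            for e in extensions:
                if e in link:
                    # Absolute links (scheme and host present) are kept as they are
                    if bool(urlparse.urlparse(link).scheme) and bool(urlparse.urlparse(link).netloc):
                        results.append(link)
                    else:
                        # Relative links are resolved against the page URL
                        results.append(urlparse.urljoin(page_url, link))
                    break  # do not add the same link once per matching extension
    return results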
if __name__ == "__main__":
    # Command line arguments
    parser = argparse.ArgumentParser(
        description='Download all files from a webpage.')
    parser.add_argument(
        '-u', '--url',
        help='Page url to request')
    parser.add_argument(
        '-e', '--ext',
        nargs='+',
        help='Extension(s) to find')
    parser.add_argument(
        '-d', '--dest',
        default=None,
        help='Destination where to save the files')
    parser.add_argument(
        '-p', '--par',
        action='store_true', default=False,
        help="Turns on parallel download")
    args = parser.parse_args()

    # Recover the links to download
    all_links = collect_all_url(args.url, args.ext)

    # download_file(url, dest) is not defined in this listing; it is assumed
    # to come from the earlier answers this script builds on
    if not args.par:
        for l in all_links:
            try:
                filename = download_file(l, args.dest)
                print(filename)
            except Exception as e:
                print("Error while downloading: {}".format(e))
    else:
        from multiprocessing.pool import ThreadPool
        results = ThreadPool(10).imap_unordered(
            lambda x: download_file(x, args.dest), all_links)
        for p in results:
            print(p)
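The script calls download_file(l, args.dest) but the listing does not define it. If you do not have that helper at hand, a minimal sketch could look like the following. It assumes Python 3 and only the standard library; the helper filename_from_url and the index.html fallback name are my own choices for illustration, not part of the original script.

```python
import os
import shutil
import urllib.request
from urllib.parse import unquote, urlsplit


def filename_from_url(url):
    """Return the last path component of url, or 'index.html' if it has none."""
    name = os.path.basename(unquote(urlsplit(url).path))
    return name or "index.html"


def download_file(url, dest=None):
    """Download url into the directory dest (current directory if dest is None).

    Returns the path of the saved file.
    """
    dest = dest or os.getcwd()
    os.makedirs(dest, exist_ok=True)
    target = os.path.join(dest, filename_from_url(url))
    # Stream the response to disk instead of reading it all into memory.
    with urllib.request.urlopen(url) as response, open(target, "wb") as out:
        shutil.copyfileobj(response, out)
    return target
```

Note that this sketch overwrites an existing file with the same name and does not preserve timestamps.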
An example of its usage is:
python3 soupget.py -p -e <list of extensions> -d <destination_folder> -u <target_webpage>
And an actual example if you want to see it in action:
python3 soupget.py -p -e .xlsx .pdf .csv -u https://healthdata.gov/dataset/chemicals-cosmetics
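One thing to be aware of with the extensions passed via -e: the script matches them as substrings of each link (if e in link), so .pdf would also match a link such as about.pdf.html. If that turns out to be a problem, a stricter check could compare the end of the URL path instead. The sketch below is one way to do that, assuming Python 3; matches_extension and select_links are hypothetical names, not functions of soupget.

```python
from urllib.parse import urljoin, urlsplit


def matches_extension(link, extensions):
    """True if the path of link ends with one of extensions (case-insensitive)."""
    path = urlsplit(link).path.lower()
    return any(path.endswith(e.lower()) for e in extensions)


def select_links(page_url, hrefs, extensions):
    """Resolve hrefs against page_url and keep each matching URL once, in order."""
    selected = []
    for href in hrefs:
        # urljoin leaves absolute links untouched and resolves relative ones
        url = urljoin(page_url, href)
        if matches_extension(url, extensions) and url not in selected:
            selected.append(url)
    return selected
```

Because only the path is compared, a query string such as ?download=1 after the file name does not prevent a match.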
Among other things, wget (1) preserves timestamps, (2) auto-determines the filename from the URL, appending .1 (etc.) if the file already exists, and (3) has many other options, some of which you may have put in your .wgetrc. If you want any of those, you have to implement them yourself in Python, but it's simpler to just invoke wget from Python. – ShreevatsaR
import urllib.request; s = urllib.request.urlopen('http://example.com/').read().decode() – Basj
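For the "just invoke wget from Python" suggestion, a rough sketch using subprocess might look like this. It assumes wget is installed and on the PATH; build_wget_command and run_wget are hypothetical helper names, and the flag selection is only one plausible way to fetch linked files by suffix.

```python
import subprocess


def build_wget_command(page_url, extensions, dest="."):
    """Build a wget command that fetches files linked from page_url.

    extensions is a list such as ['.pdf', '.csv'].
    """
    accept = ",".join(e.lstrip(".") for e in extensions)
    return [
        "wget",
        "-r", "-l", "1",  # recursive retrieval, one level deep from the page
        "-nd",            # do not recreate the remote directory hierarchy
        "-N",             # turn on time-stamping
        "-A", accept,     # accept only files with these suffixes
        "-P", dest,       # directory prefix to save into
        page_url,
    ]


def run_wget(page_url, extensions, dest="."):
    """Run wget; raises subprocess.CalledProcessError on a non-zero exit."""
    subprocess.run(build_wget_command(page_url, extensions, dest), check=True)
```

Passing the arguments as a list avoids going through a shell, so the URL does not need quoting.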