I personally prefer using scrapy and selenium and dockerizing both in separate containers. This way you can install both with minimal hassle and crawl modern websites that almost all contain javascript in one form or another. Here's an example:
Use the scrapy startproject
to create your scraper and write your spider, the skeleton can be as simple as this:
import scrapy
class MySpider(scrapy.Spider):
name = 'my_spider'
start_urls = ['https://somewhere.com']
def start_requests(self):
yield scrapy.Request(url=self.start_urls[0])
def parse(self, response):
# do stuff with results, scrape items etc.
# now were just checking everything worked
print(response.body)
The real magic happens in the middlewares.py. Overwrite two methods in the downloader middleware, __init__
and process_request
, in the following way:
# import some additional modules that we need
import os
from copy import deepcopy
from time import sleep
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
class SampleProjectDownloaderMiddleware(object):
def __init__(self):
SELENIUM_LOCATION = os.environ.get('SELENIUM_LOCATION', 'NOT_HERE')
SELENIUM_URL = f'http://{SELENIUM_LOCATION}:4444/wd/hub'
chrome_options = webdriver.ChromeOptions()
# chrome_options.add_experimental_option("mobileEmulation", mobile_emulation)
self.driver = webdriver.Remote(command_executor=SELENIUM_URL,
desired_capabilities=chrome_options.to_capabilities())
def process_request(self, request, spider):
self.driver.get(request.url)
# sleep a bit so the page has time to load
# or monitor items on page to continue as soon as page ready
sleep(4)
# if you need to manipulate the page content like clicking and scrolling, you do it here
# self.driver.find_element_by_css_selector('.my-class').click()
# you only need the now properly and completely rendered html from your page to get results
body = deepcopy(self.driver.page_source)
# copy the current url in case of redirects
url = deepcopy(self.driver.current_url)
return HtmlResponse(url, body=body, encoding='utf-8', request=request)
Dont forget to enable this middlware by uncommenting the next lines in the settings.py file:
DOWNLOADER_MIDDLEWARES = {
'sample_project.middlewares.SampleProjectDownloaderMiddleware': 543,}
Next for dockerization. Create your Dockerfile
from a lightweight image (I'm using python Alpine here), copy your project directory to it, install requirements:
# Use an official Python runtime as a parent image
FROM python:3.6-alpine
# install some packages necessary to scrapy and then curl because it's handy for debugging
RUN apk --update add linux-headers libffi-dev openssl-dev build-base libxslt-dev libxml2-dev curl python-dev
WORKDIR /my_scraper
ADD requirements.txt /my_scraper/
RUN pip install -r requirements.txt
ADD . /scrapers
And finally bring it all together in docker-compose.yaml
:
version: '2'
services:
selenium:
image: selenium/standalone-chrome
ports:
- "4444:4444"
shm_size: 1G
my_scraper:
build: .
depends_on:
- "selenium"
environment:
- SELENIUM_LOCATION=samplecrawler_selenium_1
volumes:
- .:/my_scraper
# use this command to keep the container running
command: tail -f /dev/null
Run docker-compose up -d
. If you're doing this the first time it will take a while for it to fetch the latest selenium/standalone-chrome and the build your scraper image as well.
Once it's done, you can check that your containers are running with docker ps
and also check that the name of the selenium container matches that of the environment variable that we passed to our scraper container (here, it was SELENIUM_LOCATION=samplecrawler_selenium_1
).
Enter your scraper container with docker exec -ti YOUR_CONTAINER_NAME sh
, the command for me was docker exec -ti samplecrawler_my_scraper_1 sh
, cd into the right directory and run your scraper with scrapy crawl my_spider
.
The entire thing is on my github page and you can get it from here