0 votes

I'm getting updates from thousands of web pages. There can be multiple pages on the same domain. I've set DOWNLOAD_DELAY to 1 second so I don't overload servers.

The spider works well, but if there are, for example, 100 URLs of the same domain next to each other, crawling slows down because the spider has to wait 1 second after each request.

Is it possible to make it crawl the next URLs with a different domain so the spider doesn't have to wait?

For example:

CONCURRENT_REQUESTS = 3
DOWNLOAD_DELAY = 1

URLS: A.com/1, A.com/2, A.com/3, A.com/4, B.com/1, B.com/2, B.com/3

The spider will start by scraping the first three URLs. That will take at least three seconds because of the download delay. But it would be faster if it processed B.com/1 instead of A.com/2 (for example).

class MainSpider(scrapy.Spider):
    ...

    def __init__(self, scraping_round, frequencies=None):
        super(MainSpider, self).__init__()
        ...

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse, errback=self.err, dont_filter=True)

Maybe I should reorder the urls list.


2 Answers

2 votes

Definitely yes, reordering the list of scheduled requests would help. It can be done like this:

import random

class MainSpider(scrapy.Spider):
    # ....

    def start_requests(self):
        # Shuffle so that URLs from the same domain are unlikely to end up next to each other
        random.shuffle(self.urls)
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse, errback=self.err, dont_filter=True)

Unfortunately, reordering requests that are created later on while crawling is more difficult, but maybe this already helps.
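
If random shuffling still leaves same-domain URLs next to each other by chance, a more deterministic option is to interleave the URLs by domain before yielding the requests. This is only a rough sketch (interleave_by_domain is a made-up helper name, and it assumes Python 3), not something Scrapy provides out of the box:

import itertools
from collections import defaultdict
from urllib.parse import urlparse

def interleave_by_domain(urls):
    # Group URLs by domain, then take one URL per domain in round-robin order,
    # so consecutive requests rarely hit the same domain.
    by_domain = defaultdict(list)
    for url in urls:
        by_domain[urlparse(url).netloc].append(url)
    result = []
    for batch in itertools.zip_longest(*by_domain.values()):
        result.extend(u for u in batch if u is not None)
    return result

In start_requests you would then iterate over interleave_by_domain(self.urls) instead of self.urls.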

Another fix: massively increase CONCURRENT_REQUESTS.

The Scrapy documentation suggests setting CONCURRENT_REQUESTS to at least 100 if you want to crawl many domains in parallel:

https://doc.scrapy.org/en/latest/topics/broad-crawls.html#increase-concurrency
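
As a minimal sketch (the exact numbers are assumptions to be tuned for your crawl, not values from your project), the relevant settings would look like this:

# settings.py
CONCURRENT_REQUESTS = 100           # global limit, high enough to keep many domains busy
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # still only one request at a time per domain
DOWNLOAD_DELAY = 1                  # 1 second between requests to the same domain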

The reason for this / detailed explanation

Based on the source code of scrapy/core/downloader.py, engine.py, scraper.py and scrapy/core/downloader/handlers/http11.py, it seems Scrapy fills its processing queue with up to CONCURRENT_REQUESTS requests from the scheduler and only checks the domains to enforce CONCURRENT_REQUESTS_PER_DOMAIN further down the processing chain.

If the scheduler contains a bunch of requests for the same domain in a row, multiple requests for that domain can be pulled into the processing queue at once, effectively blocking the processing of other domains. This is especially likely to happen if CONCURRENT_REQUESTS is very low, as in your example.

This is a known issue described here: https://github.com/scrapy/scrapy/issues/2474

Alternative solutions

An even better solution than increasing CONCURRENT_REQUESTS to a very high value would be to use https://github.com/scrapinghub/frontera as a crawl frontier ... which basically does what you suggested: reorder the scheduled requests for optimal processing.

0 votes

The DOWNLOAD_DELAY setting is applied per website.

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.

from docs: https://doc.scrapy.org/en/latest/topics/settings.html#download-delay

So what you want should work by default. When the spider starts, it queues every URL in start_urls immediately and then sorts out the delays per website as it goes.
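
To illustrate (a hedged sketch, the numbers are assumptions): with a high enough global concurrency, the 1-second delay only throttles requests within each website, and the settings can also be set per spider via custom_settings:

import scrapy

class MainSpider(scrapy.Spider):
    name = "main"

    # Overrides the project settings for this spider only.
    custom_settings = {
        "DOWNLOAD_DELAY": 1,                  # wait 1 s between requests to the same website
        "CONCURRENT_REQUESTS": 100,           # keep requests to many different websites in flight
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # never hit a single website in parallel
    }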