0 votes

I'm getting updates from thousands of web pages. There can be multiple pages on the same domain. I've set DOWNLOAD_DELAY to 1 second so I don't overload servers.

The spider works well, but if there are, for example, 100 URLs of the same domain next to each other, crawling slows down because the spider has to wait 1 second after each request.

Is it possible to make it crawl the next URLs with a different domain so the spider doesn't have to wait?

For example:

CONCURRENT_REQUESTS = 3
DOWNLOAD_DELAY = 1

URLS: A.com/1, A.com/2, A.com/3, A.com/4, B.com/1, B.com/2, B.com/3

The spider will start by scraping the first three URLs. That will take at least three seconds because of the download delay. But it would be faster if it processed B.com/1 instead of A.com/2 (for example).

class MainSpider(scrapy.Spider):
    ...

    def __init__(self, scraping_round, frequencies=None):
        super(MainSpider, self).__init__()
        ...

    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse, errback=self.err, dont_filter=True)

Maybe I should reorder the urls list.


2 Answers

2 votes

Definitely yes, reordering the list of scheduled requests would help. It can be done like this:

import random

class MainSpider(scrapy.Spider):
    # ....

    def start_requests(self):
        # Shuffle so that URLs from the same domain are unlikely to end up next to each other
        random.shuffle(self.urls)
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse, errback=self.err, dont_filter=True)

Unfortunately, reordering requests that are created later on while crawling is more difficult, but maybe this already helps.
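
If random shuffling still leaves same-domain URLs next to each other by chance, a more deterministic option is to interleave the URLs by domain before yielding the requests. This is only a rough sketch (interleave_by_domain is a made-up helper name, and it assumes Python 3), not something Scrapy provides out of the box:

import itertools
from collections import defaultdict
from urllib.parse import urlparse

def interleave_by_domain(urls):
    # Group URLs by domain, then take one URL per domain in round-robin order,
    # so consecutive requests rarely hit the same domain.
    by_domain = defaultdict(list)
    for url in urls:
        by_domain[urlparse(url).netloc].append(url)
    result = []
    for batch in itertools.zip_longest(*by_domain.values()):
        result.extend(u for u in batch if u is not None)
    return result

In start_requests you would then iterate over interleave_by_domain(self.urls) instead of self.urls.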

Another fix: massively increase CONCURRENT_REQUESTS.

The Scrapy documentation suggests setting CONCURRENT_REQUESTS to at least 100 if you want to crawl many domains in parallel:

https://doc.scrapy.org/en/latest/topics/broad-crawls.html#increase-concurrency
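
As a minimal sketch (the exact numbers are assumptions to be tuned for your crawl, not values from your project), the relevant settings would look like this:

# settings.py
CONCURRENT_REQUESTS = 100           # global limit, high enough to keep many domains busy
CONCURRENT_REQUESTS_PER_DOMAIN = 1  # still only one request at a time per domain
DOWNLOAD_DELAY = 1                  # 1 second between requests to the same domain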

The reason for this / detailed explanation

Based on the source code of scrapy/core/downloader.py, engine.py, scraper.py and scrapy/core/downloader/handlers/http11.py, it seems Scrapy fills its processing queue with up to CONCURRENT_REQUESTS requests from the scheduler and only checks the domains to enforce CONCURRENT_REQUESTS_PER_DOMAIN further down the processing chain.

If the scheduler contains a bunch of requests for the same domain in a row, multiple requests for that domain can be pulled into the processing queue at once, effectively blocking the processing of other domains. This is especially likely to happen if CONCURRENT_REQUESTS is very low, as in your example.

This is a known issue described here: https://github.com/scrapy/scrapy/issues/2474

Alternative solutions

An even better solution than increasing CONCURRENT_REQUESTS to a very high value would be to use https://github.com/scrapinghub/frontera as a crawl frontier ... which basically does what you suggested: reorder the scheduled requests for optimal processing.

0 votes

The DOWNLOAD_DELAY setting is applied per website.

The amount of time (in secs) that the downloader should wait before downloading consecutive pages from the same website. This can be used to throttle the crawling speed to avoid hitting servers too hard.

from docs: https://doc.scrapy.org/en/latest/topics/settings.html#download-delay

So what you want should work by default. When the spider starts, it queues every URL in start_urls immediately and then sorts out the delays per website as it goes.
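
To illustrate (a hedged sketch, the numbers are assumptions): with a high enough global concurrency, the 1-second delay only throttles requests within each website, and the settings can also be set per spider via custom_settings:

import scrapy

class MainSpider(scrapy.Spider):
    name = "main"

    # Overrides the project settings for this spider only.
    custom_settings = {
        "DOWNLOAD_DELAY": 1,                  # wait 1 s between requests to the same website
        "CONCURRENT_REQUESTS": 100,           # keep requests to many different websites in flight
        "CONCURRENT_REQUESTS_PER_DOMAIN": 1,  # never hit a single website in parallel
    }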