I'm getting updates from thousands of web pages. There can be multiple pages on the same domain. I've set DOWNLOAD_DELAY
to 1 second so I don't overload the servers.
The spider works well, but if there are, for example, 100 URLs from the same domain next to each other in the list, crawling slows down because the spider has to wait 1 second after each request to that domain.
Is it possible to make it move on to URLs from a different domain in the meantime, so the spider doesn't have to wait?
For example:
CONCURRENT_REQUESTS = 3
DOWNLOAD_DELAY = 1
URLs: A.com/1, A.com/2, A.com/3, A.com/4, B.com/1, B.com/2, B.com/3
The spider will start by scraping the first three URLs. That takes at least three seconds because of the download delay, but it would be faster if it processed B.com/1 instead of A.com/2 (for example).
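For reference, this is roughly the configuration I have, assuming the two values live in the project's settings.py (they could equally go in the spider's custom_settings); everything else is left at Scrapy defaults:

# settings.py -- configuration from the example above
CONCURRENT_REQUESTS = 3   # at most three requests in flight at once
DOWNLOAD_DELAY = 1        # 1 second between consecutive requests to the same download slot (per domain by default)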
import scrapy

class MainSpider(scrapy.Spider):
    ...
    def __init__(self, scraping_round, frequencies=None):
        super(MainSpider, self).__init__()
    ...
    def start_requests(self):
        for url in self.urls:
            yield scrapy.Request(url=url, callback=self.parse, errback=self.err, dont_filter=True)
Maybe I should reorder the URLs list so that URLs from the same domain aren't next to each other? Something like the round-robin sketch below.
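A minimal sketch of that idea: group the start URLs by domain and interleave them round-robin, so consecutive requests hit different domains. The helper name interleave_by_domain is hypothetical, and it assumes self.urls contains full URLs with a scheme so urlparse can extract the domain:

from collections import defaultdict
from itertools import zip_longest
from urllib.parse import urlparse

def interleave_by_domain(urls):
    """Reorder urls so that consecutive entries belong to different domains."""
    by_domain = defaultdict(list)
    for url in urls:
        by_domain[urlparse(url).netloc].append(url)   # group URLs by domain
    # zip_longest takes one URL from each domain per round; drop the None padding
    rounds = zip_longest(*by_domain.values())
    return [url for group in rounds for url in group if url is not None]

start_requests would then iterate over interleave_by_domain(self.urls) instead of self.urls. I'm not sure whether this is the right approach or whether Scrapy has a setting for this.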