3 votes

When Scrapy shuts down, it forgets all the URLs it has seen. I want to give Scrapy a set of already-crawled URLs when it starts, so that it skips them. How can I add a rule to CrawlSpider so that it knows which URLs have already been visited?

The link extractor I am currently using:

SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)

Right now I just use parse to tell the spider which URLs to crawl. How can I tell Scrapy which URLs it should not visit?


2 Answers

3 votes

When Scrapy stops, it saves the fingerprints of crawled URLs in a requests.seen file inside the job directory. This is done by the dupe filter class, which prevents a URL from being crawled twice; if you restart the scraper with the same job directory, it will not crawl already-seen URLs. If you want to control this process, you can replace the default dupe filter class with your own. Another option is to add your own spider middleware.
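As a rough illustration of replacing the dupe filter, here is a minimal sketch of a subclass that pre-seeds the filter from a plain-text list of already-crawled URLs. The file name seen_urls.txt and the module path myproject.dupefilters are assumptions for this sketch, not something Scrapy provides; it also assumes a Scrapy version where the class lives in scrapy.dupefilters.

    # settings.py:  DUPEFILTER_CLASS = "myproject.dupefilters.SeededDupeFilter"
    from scrapy import Request
    from scrapy.dupefilters import RFPDupeFilter


    class SeededDupeFilter(RFPDupeFilter):
        """Dupe filter pre-seeded with URLs crawled in previous runs."""

        seen_urls_file = "seen_urls.txt"  # hypothetical file, one URL per line

        def __init__(self, path=None, debug=False, **kwargs):
            super().__init__(path, debug, **kwargs)
            try:
                with open(self.seen_urls_file) as f:
                    for line in f:
                        url = line.strip()
                        if url:
                            # Mark the URL's fingerprint as already seen so the
                            # scheduler drops new requests for it as duplicates.
                            self.fingerprints.add(
                                self.request_fingerprint(Request(url))
                            )
            except FileNotFoundError:
                pass  # no history from earlier runs; start with an empty filter

With this in place, any request whose fingerprint matches a URL in the file is filtered out exactly as if it had been crawled in the current run.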

0 votes

Scrapy's Jobs functionality allows you to pause and resume your spider. It persists the spider's state between runs and automatically skips duplicate requests when you restart.

See here for more information: Jobs: pausing and resuming crawls
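A minimal sketch of how this is enabled: point the JOBDIR setting at a directory where Scrapy can persist its state. The spider name and directory path below are placeholders.

    # settings.py -- persist scheduler state and seen-request fingerprints
    # between runs; "crawls/myspider-1" is just an example directory.
    JOBDIR = "crawls/myspider-1"

    # Equivalent one-off form from the command line:
    #   scrapy crawl myspider -s JOBDIR=crawls/myspider-1

Restarting the spider with the same JOBDIR picks up where the previous run left off and skips requests it has already seen.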