3 votes

When Scrapy shuts down, it forgets all the URLs it has seen. I want to give Scrapy a set of already-crawled URLs when it starts, so that it skips them. How can I add a rule to CrawlSpider so that it knows which URLs have already been visited?

The link extractor I am currently using:

SgmlLinkExtractor(allow=(), deny=(), allow_domains=(), deny_domains=(), restrict_xpaths=(), tags=('a', 'area'), attrs=('href',), canonicalize=True, unique=True, process_value=None)

Right now I just use parse to tell the spider which URLs to crawl. How can I tell Scrapy which URLs it should not visit?


2 Answers

3 votes

When Scrapy stops, it saves the fingerprints of crawled URLs in a requests.seen file inside the job directory. This is done by the dupe filter class, which prevents a URL from being crawled twice; if you restart the scraper with the same job directory, it will not crawl already-seen URLs. If you want to control this process, you can replace the default dupe filter class with your own. Another option is to add your own spider middleware.
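As a rough illustration of replacing the dupe filter, here is a minimal sketch of a subclass that pre-seeds the filter from a plain-text list of already-crawled URLs. The file name seen_urls.txt and the module path myproject.dupefilters are assumptions for this sketch, not something Scrapy provides; it also assumes a Scrapy version where the class lives in scrapy.dupefilters.

    # settings.py:  DUPEFILTER_CLASS = "myproject.dupefilters.SeededDupeFilter"
    from scrapy import Request
    from scrapy.dupefilters import RFPDupeFilter


    class SeededDupeFilter(RFPDupeFilter):
        """Dupe filter pre-seeded with URLs crawled in previous runs."""

        seen_urls_file = "seen_urls.txt"  # hypothetical file, one URL per line

        def __init__(self, path=None, debug=False, **kwargs):
            super().__init__(path, debug, **kwargs)
            try:
                with open(self.seen_urls_file) as f:
                    for line in f:
                        url = line.strip()
                        if url:
                            # Mark the URL's fingerprint as already seen so the
                            # scheduler drops new requests for it as duplicates.
                            self.fingerprints.add(
                                self.request_fingerprint(Request(url))
                            )
            except FileNotFoundError:
                pass  # no history from earlier runs; start with an empty filter

With this in place, any request whose fingerprint matches a URL in the file is filtered out exactly as if it had been crawled in the current run.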

0 votes

Scrapy's Jobs functionality allows you to pause and resume your spider. It persists the spider's state between runs and automatically skips duplicate requests when you restart.

See here for more information: Jobs: pausing and resuming crawls
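A minimal sketch of how this is enabled: point the JOBDIR setting at a directory where Scrapy can persist its state. The spider name and directory path below are placeholders.

    # settings.py -- persist scheduler state and seen-request fingerprints
    # between runs; "crawls/myspider-1" is just an example directory.
    JOBDIR = "crawls/myspider-1"

    # Equivalent one-off form from the command line:
    #   scrapy crawl myspider -s JOBDIR=crawls/myspider-1

Restarting the spider with the same JOBDIR picks up where the previous run left off and skips requests it has already seen.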