I am new to Scrapy and I am trying to crawl multiple sites, read from a text file, with CrawlSpider. I would like to limit the crawl depth per site and also cap the total number of pages crawled per site. Unfortunately, when the start_urls and allowed_domains attributes are set in __init__, response.meta['depth'] always seems to be zero (this doesn't happen when I scrape individual sites). Setting DEPTH_LIMIT in the settings file doesn't seem to do anything at all. When I remove the __init__ definition and simply set start_urls and allowed_domains as class attributes, everything works fine. Here is the code:
from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class DownloadSpider(CrawlSpider):
    name = 'downloader'

    rules = (
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
    )

    def __init__(self, urls_file, N=10):
        # Take the first N URLs from the file and build the domain list from their hostnames
        data = open(urls_file, 'r').readlines()[:N]
        self.allowed_domains = [urlparse(i).hostname.strip() for i in data]
        self.start_urls = ['http://' + domain for domain in self.allowed_domains]

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):
        print response.url
        print response.meta['depth']
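In case it matters, this is roughly how I run the spider (the file name is just an example; the file contains one full URL per line, e.g. http://example.com/):

    scrapy crawl downloader -a urls_file=urls.txt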
This results in response.meta['depth'] always being zero, and the crawler only crawls the very first page of each site in start_urls (i.e. it doesn't follow any links). So I have two questions:

1) How do I limit the crawl to a certain depth for each site in start_urls?

2) How do I limit the total number of pages crawled per site, irrespective of depth?
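For completeness, the DEPTH_LIMIT line in my settings.py looks roughly like this (the value 2 is just an example):

    DEPTH_LIMIT = 2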
Thanks!