I am working on scraping items from a number of websites (using Scrapy). The items I am trying to scrape are not always well defined and might appear inside free text, so I am using string matching to recognize them. However, this also yields some unwanted information along with my required data, and my scraper spends a long time scraping that unwanted information. To avoid this, I have put an upper limit on the number of items scraped: inside an "if" condition, I raise a CloseSpider() exception on reaching the upper limit. This approach worked fine while I had only one domain to scrape. How do I extend it to multiple domains?
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import CloseSpider

class CustomSpider(CrawlSpider):
    name = "myspider"
    start_urls = ['https://www.example1.com/']
    allowed_domains = ['www.example1.com']
    rules = [Rule(LinkExtractor(allow=()), callback='parse_info', follow=True)]

    def parse_info(self, response):
        scrape_count = self.crawler.stats.get_value('item_scraped_count')
        if scrape_count == 20:
            raise CloseSpider("Limit Reached")
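As an aside, for the single-domain case I believe Scrapy's built-in CloseSpider extension gives the same behaviour without a custom check, via the CLOSESPIDER_ITEMCOUNT setting (a sketch, assuming it goes in settings.py or the spider's custom_settings):

# settings.py
CLOSESPIDER_ITEMCOUNT = 20  # close the spider once 20 items have been scraped

It does not help with the multi-domain case, though, because it counts items globally.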
My question is how to extend this code to the following scenario:
class CustomSpider(CrawlSpider):
    name = "myspider"
    start_urls = ['https://www.example1.com/', 'https://www.example2.com/']
    allowed_domains = ['www.example1.com', 'www.example2.com']
    rules = [Rule(LinkExtractor(allow=()), callback='parse_info', follow=True)]

    def parse_info(self, response):
        # suggest a change in logic here: item_scraped_count is a single
        # global stat, so this stops after 20 items in total, not 20 per domain
        scrape_count = self.crawler.stats.get_value('item_scraped_count')
        if scrape_count == 20:
            raise CloseSpider("Limit Reached")
Is the item tied to a domain, i.e. does it have a field that indicates which domain it belongs to, something like item = {'domain': 'www.example2.com'}? – eLRuLL
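One way to change the logic (a minimal sketch, untested; the limit_per_domain attribute, the counter dict, and the placeholder item are my own additions, not part of the original spider): derive the domain of each response from response.url with urlparse, keep one counter per domain, stop yielding items for a domain once it reaches the cap, and raise CloseSpider only when every domain is done, because CloseSpider always stops the whole spider rather than a single domain.

from urllib.parse import urlparse

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.exceptions import CloseSpider

class CustomSpider(CrawlSpider):
    name = "myspider"
    start_urls = ['https://www.example1.com/', 'https://www.example2.com/']
    allowed_domains = ['www.example1.com', 'www.example2.com']
    rules = [Rule(LinkExtractor(allow=()), callback='parse_info', follow=True)]
    limit_per_domain = 20  # hypothetical name for the per-domain cap

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # one counter per domain instead of the single global stat
        self.domain_counts = {domain: 0 for domain in self.allowed_domains}

    def parse_info(self, response):
        domain = urlparse(response.url).netloc
        if self.domain_counts.get(domain, 0) >= self.limit_per_domain:
            # this domain has hit its cap; skip the page but keep crawling others
            return
        item = {'domain': domain, 'url': response.url}  # placeholder extraction
        self.domain_counts[domain] = self.domain_counts.get(domain, 0) + 1
        yield item
        # CloseSpider stops the entire crawl, so raise it only
        # once every domain has reached its cap
        if all(count >= self.limit_per_domain
               for count in self.domain_counts.values()):
            raise CloseSpider("Limit reached on all domains")

This also speaks to the comment above: the domain does not have to be a field on the item, since it can be recovered from response.url, although storing it (as the placeholder item does) makes filtering the output easier. If the domains should not share one crawl at all, an alternative is to run one spider per domain and keep the original global check.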