4 votes

I am working on scraping items from a number of websites (using Scrapy). The items I am trying to scrape are not always well defined and may be embedded in free text, so I use string matching to recognize them. However, this also yields some unwanted information along with the required data, and the scraper spends a long time processing it. To avoid this, I have put an upper limit on the number of items scraped: using an "if" condition, I raise a CloseSpider() exception once the limit is reached. This approach worked fine while I had only one domain to scrape. How do I extend it to multiple domains?

from scrapy.exceptions import CloseSpider
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CustomSpider(CrawlSpider):
    name = "myspider"
    start_urls = ['https://www.example1.com/']
    allowed_domains = ['www.example1.com']
    rules = [Rule(LinkExtractor(allow=()), callback='parse_info', follow=True)]

    def parse_info(self, response):
        scrape_count = self.crawler.stats.get_value('item_scraped_count')
        if scrape_count == 20:
            raise CloseSpider("Limit Reached")

My question is: how do I extend this code to the following scenario?

class CustomSpider(CrawlSpider):
    name = "myspider"
    start_urls = ['https://www.example1.com/', 'https://www.example2.com/']
    allowed_domains = ['www.example1.com', 'www.example2.com']
    rules = [Rule(LinkExtractor(allow=()), callback='parse_info', follow=True)]

    def parse_info(self, response):
        # suggest change in logic here
        scrape_count = self.crawler.stats.get_value('item_scraped_count')
        if scrape_count == 20:
            raise CloseSpider("Limit Reached")
It depends on how you are relating an item to a domain. Does it have a field that indicates which domain it belongs to, something like item = {'domain': 'www.example2.com'}? – eLRuLL
Currently it does nothing like that. Assuming I add that part, how can I achieve the desired logic? – user3797806

2 Answers

3 votes

See this toy example:

from __future__ import print_function

import collections
try:
    from urllib.parse import urlsplit  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    name = 'myspider'
    start_urls = ['http://quotes.toscrape.com/',
                  'http://webscraper.io/test-sites']
    allowed_domains = ['quotes.toscrape.com', 'webscraper.io']

    scraped_count = collections.defaultdict(int)  # requests let through, per domain
    limit = 10  # per-domain limit

    rules = [Rule(LinkExtractor(allow=()), callback='parse_page',
                  follow=True, process_request='process_request')]

    def parse_page(self, response):
        yield {
            'url': response.url
        }

    def process_request(self, request):
        url = urlsplit(request.url)[1]  # index 1 is the netloc (domain)
        if self.scraped_count[url] < self.limit:
            self.scraped_count[url] += 1
            return request
        else:
            print('Limit reached for {}'.format(url))

It keeps track of the number of requests scraped per domain in the scraped_count attribute, and the limit attribute holds the per-domain limit. The logic sits in the process_request method, which is passed as an argument to Rule and gets called for every request extracted by that rule (see the documentation). When you are over the limit, the request gets filtered out; otherwise it's returned unchanged and processed as usual.

If you need something more sophisticated, or something applicable to multiple spiders, I'd suggest extending the CloseSpider extension class, implementing the logic there, and replacing the default class in settings.py; a rough sketch follows.
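For example, a minimal sketch of such an extension, assuming you only want to close the spider once every allowed domain has hit the limit. The class name CloseSpiderPerDomain, the setting CLOSESPIDER_ITEMCOUNT_PER_DOMAIN and the module path myproject.extensions are all illustrative, not part of Scrapy:

# extensions.py -- a rough per-domain take on the CloseSpider idea
import collections
from urllib.parse import urlsplit

from scrapy import signals
from scrapy.exceptions import NotConfigured


class CloseSpiderPerDomain(object):

    def __init__(self, crawler, limit):
        self.crawler = crawler
        self.limit = limit
        self.counts = collections.defaultdict(int)
        crawler.signals.connect(self.item_scraped, signal=signals.item_scraped)

    @classmethod
    def from_crawler(cls, crawler):
        limit = crawler.settings.getint('CLOSESPIDER_ITEMCOUNT_PER_DOMAIN')
        if not limit:
            raise NotConfigured
        return cls(crawler, limit)

    def item_scraped(self, item, response, spider):
        # Count the item against the domain it was scraped from.
        self.counts[urlsplit(response.url).netloc] += 1
        # Close the spider once every allowed domain has reached the limit.
        if all(self.counts[domain] >= self.limit
               for domain in spider.allowed_domains):
            self.crawler.engine.close_spider(
                spider, 'closespider_itemcount_per_domain')

Then enable it in settings.py (the priority value 500 is arbitrary):

# settings.py
EXTENSIONS = {
    'myproject.extensions.CloseSpiderPerDomain': 500,
}
CLOSESPIDER_ITEMCOUNT_PER_DOMAIN = 20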

1 vote

You can use the CLOSESPIDER_ITEMCOUNT setting.
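For example, a minimal sketch of setting it on the spider from the question via custom_settings (the value 20 just mirrors the limit used there; note that the setting counts items for the whole crawl, not per domain):

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class CustomSpider(CrawlSpider):
    name = "myspider"
    start_urls = ['https://www.example1.com/', 'https://www.example2.com/']
    allowed_domains = ['www.example1.com', 'www.example2.com']
    rules = [Rule(LinkExtractor(allow=()), callback='parse_info', follow=True)]

    # Stop the whole crawl after 20 items have passed the item pipeline.
    custom_settings = {
        'CLOSESPIDER_ITEMCOUNT': 20,
    }

    def parse_info(self, response):
        yield {'url': response.url}  # placeholder item extraction

The Scrapy documentation describes the setting as follows: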

An integer which specifies a number of items. If the spider scrapes more than that amount and those items are passed by the item pipeline, the spider will be closed with the reason closespider_itemcount. Requests which are currently in the downloader queue (up to CONCURRENT_REQUESTS requests) are still processed. If zero (or non set), spiders won’t be closed by number of passed items.