
I'm trying to do a limited broad crawl (across multiple domains) with Scrapy, where I start on a summary/index page, follow specific links on that page, parse the linked sites, and then follow all links from those sites. The basic schema is:

1) Follow rule-selected links from start_url site (this site doesn't need to be parsed).

2) Parse the pages linked from start_url with custom method (def parse_it()).

3) Follow all links on the pages parsed under 2), and parse the linked pages.

I can do 1) and 2) easily with the CrawlSpider. But I have to define a link extractor rule that only follows the links I need (in the example below, actual op-ed pages from the NYT's opinion page). What then happens in 3) is that the Spider only follows links from the pages parsed under 2) that match the link extractor rule -- as would be expected. Here's the relevant code:

class xSpider(CrawlSpider):
    name = "x"
    start_urls = [
        "http://www.nytimes.com/pages/opinion"
    ]

    rules = (
        Rule(LinkExtractor(allow=(r'/\d{4}/\d{2}/\d{2}')), callback='parse_it', follow=True),
    )

    def parse_it(self, response):
        <my parse method>

My question is: how can I apply the rule above to the start URL, but then set a different rule for extracting links (allow=()) for the next rank of pages? I know that CrawlSpider has a parse_start_url method, but I don't see any obvious way of attaching the above rule only to the start URL and defining a different rule for subsequent pages.

Edit (thinking out loud): or is it just easier to do this with the Requests library and write a basic custom crawler?
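To sketch roughly what I mean (untested, and OpEdSpider / parse_linked are just placeholder names), the same two-stage logic could also stay inside Scrapy as a plain scrapy.Spider with explicit callbacks instead of rules:

import scrapy
from scrapy.linkextractors import LinkExtractor


class OpEdSpider(scrapy.Spider):
    # Same crawl as above, but each stage of the crawl gets its own callback.
    name = "x_manual"
    start_urls = ["http://www.nytimes.com/pages/opinion"]

    op_ed_links = LinkExtractor(allow=(r'/\d{4}/\d{2}/\d{2}',))
    all_links = LinkExtractor()

    def parse(self, response):
        # Stage 1: from the opinion index, only follow op-ed style URLs.
        for link in self.op_ed_links.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_it)

    def parse_it(self, response):
        # Stage 2: parse the op-ed page itself (item extraction omitted),
        # then follow every link on it with a different callback.
        for link in self.all_links.extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_linked)

    def parse_linked(self, response):
        # Stage 3: parse pages linked from the op-eds; no further following.
        pass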


1 Answer


You could add a new set of rules and use those rules in the parse_it method.

from scrapy.http import Request, HtmlResponse
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class xSpider(CrawlSpider):
    name = "x"
    start_urls = [
        "http://www.nytimes.com/pages/opinion"
    ]

    rules = (
        Rule(LinkExtractor(allow=(r'/\d{4}/\d{2}/\d{2}')),
             callback='parse_it', follow=True
        ),
    )
    # Second rule set, applied manually from parse_it. Callbacks must be given
    # as strings here, since `self` is not available at class-body level.
    other_rules = (
        Rule(LinkExtractor(), callback='parse_it', follow=True),
    )

    def _compile_other_rules(self, rules):
        def get_method(method):
            if callable(method):
                return method
            elif isinstance(method, str):
                return getattr(self, method, None)
        for rule in rules:
            rule.callback = get_method(rule.callback)
            rule.process_links = get_method(rule.process_links)
            rule.process_request = get_method(rule.process_request)
        return rules

    def parse_it(self, response):
        """Code 'borrowed' from CrawlSpider._requests_to_follow method,
        adapted for our needs
        """
        if not isinstance(response, HtmlResponse):
            return
        compiled_rules = self._compile_other_rules(self.other_rules)  # rules need to be compiled
        seen = set()
        for n, rule in enumerate(compiled_rules):
            links = [
                       l for l in rule.link_extractor.extract_links(response) 
                       if l not in seen
            ]
            if links and rule.process_links:
                links = rule.process_links(links)
            for link in links:
                seen.add(link)
                r = Request(url=link.url, callback=self._response_downloaded)
                r.meta.update(rule=n, link_text=link.text)
                yield rule.process_request(r)
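One caveat with this approach: the requests it yields reuse CrawlSpider._response_downloaded as the callback, and that method looks the rule up by index in self._rules (the compiled rules attribute), not in other_rules. With exactly one rule in each tuple the indices happen to line up, but if other_rules grows you would probably want to dispatch to the rule's own callback directly rather than rely on the index stored in meta.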