I'm trying to do a limited broad crawl (across multiple domains) with Scrapy, where I start on a summary/index page, follow specific links on that page, parse the linked pages, and then follow all links from those pages. It basically uses the following schema:
1) Follow rule-selected links from the start_url page (this page doesn't need to be parsed).
2) Parse the pages linked from start_url with a custom method (parse_it).
3) Follow all links on the pages parsed under 2), and parse the linked pages.
I can do 1) and 2) easily with the CrawlSpider. But I have to define a link extractor rule that only follows the links I need (in the example below, actual op-ed pages from the NYT's opinion page). What then happens in 3) is that the Spider only follows links from the pages parsed under 2) that match the link extractor rule -- as would be expected. Here's the relevant code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class xSpider(CrawlSpider):
    name = "x"

    start_urls = [
        "http://www.nytimes.com/pages/opinion"
    ]

    rules = (
        # Follow op-ed links (URLs containing a /YYYY/MM/DD date) and parse them
        Rule(LinkExtractor(allow=(r'/\d{4}/\d{2}/\d{2}',)), callback='parse_it', follow=True),
    )

    def parse_it(self, response):
        # <my parse method>
        ...
My question is: how can I apply the rule above to the start URL, but then set a different rule for extracting links (allow=()) for the subsequent rank of pages? I know that CrawlSpider has a parse_start_url method, but I don't see any obvious way of attaching the above rule only to the start URL and defining a different rule for subsequent pages.
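For concreteness, here's a rough, untested sketch of the kind of thing I'm imagining, dropping CrawlSpider in favour of a plain scrapy.Spider with chained callbacks (parse_linked is just a placeholder name for whatever I end up doing with the third-rank pages):

import scrapy
from scrapy.linkextractors import LinkExtractor

class xSpider(scrapy.Spider):
    name = "x"
    start_urls = ["http://www.nytimes.com/pages/opinion"]

    def parse(self, response):
        # Rank 1: from the start page, only follow op-ed links (date-pattern URLs)
        for link in LinkExtractor(allow=(r'/\d{4}/\d{2}/\d{2}',)).extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_it)

    def parse_it(self, response):
        # Rank 2: <my parse method> goes here, then follow *all* links on this page
        for link in LinkExtractor().extract_links(response):
            yield scrapy.Request(link.url, callback=self.parse_linked)

    def parse_linked(self, response):
        # Rank 3: parse the pages linked from the op-eds
        pass

But that throws away the Rule machinery entirely, which is why I'd rather find a CrawlSpider-native way to do it.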
Edit (thinking out loud): or would it just be easier to do this with the requests library and write a basic custom crawler?
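If I went that route, I'm picturing something minimal like this (sketch only, assuming requests plus BeautifulSoup, with no URL de-duplication, politeness delays, or error handling):

import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

START_URL = "http://www.nytimes.com/pages/opinion"
OPED_RE = re.compile(r'/\d{4}/\d{2}/\d{2}')

def get_links(url):
    # Return absolute URLs for every <a href> on the page
    soup = BeautifulSoup(requests.get(url).text, "html.parser")
    return [urljoin(url, a["href"]) for a in soup.find_all("a", href=True)]

# Rank 1: op-ed links (date-pattern URLs) from the start page
for oped_url in (u for u in get_links(START_URL) if OPED_RE.search(u)):
    # Rank 2: <my parse method> on the op-ed page would go here
    for linked_url in get_links(oped_url):
        # Rank 3: fetch and parse each page the op-ed links to
        pass

It works in principle, but then I lose Scrapy's scheduling, throttling, and dupe filtering, so I'd prefer to stay within Scrapy if the two-tier rules are possible.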