I am trying to build a CrawlSpider with Scrapy with the following features. Basically, my start url contains various lists of urls, divided into sections. I want to scrape just the urls from a specific section and then crawl them. To do this, I defined my link extractor using restrict_xpaths, in order to isolate the links I want to crawl from the rest. However, because of the restrict_xpaths, when the spider tries to crawl a link which is not the start url, it stops, since it does not find any links. So I tried to add another rule, which is supposed to ensure that the links outside the start url get crawled, through the use of deny_domains applied to the start url. However, this solution is not working. Can anyone suggest a possible strategy? Right now my rules are:

    rules = {Rule(LinkExtractor(restrict_xpaths=".//*[@id='mw-content-text']/ul[19]"), callback='parse_items', follow=True),
             Rule(LinkExtractor(deny_domains='...start url...'), callback='parse_items', follow=True)}

1 Answer


You're defining a Set by using {} around the pair of rules. Try making it a tuple with ():

    rules = (Rule(LinkExtractor(restrict_xpaths=".//*[@id='mw-content-text']/ul[19]"), callback='parse_items', follow=True),
             Rule(LinkExtractor(deny_domains='...start url...'), callback='parse_items', follow=True),)

Beyond that, you might want to pass 'unique=True' to the LinkExtractors (it is an argument of the link extractor, not of the Rule) to make sure that any links back to the "start url" are not followed. See BaseSgmlLinkExtractor.
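
For example, a minimal sketch of that first extractor (reusing the placeholder XPath from the question; note that 'unique=True' is already the default value):

    from scrapy.linkextractors import LinkExtractor

    # Sketch only: unique=True tells the link extractor to drop duplicate
    # links extracted from a response (this is the default behaviour).
    section_links = LinkExtractor(
        restrict_xpaths=".//*[@id='mw-content-text']/ul[19]",
        unique=True,
    )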

Also, using 'parse_items' as the callback for both Rules is a bit of a smell. Based on your explanation, I can't see that the first extractor would need a callback at all... it's just extracting links that should be added to the queue for the scraper to go fetch, right?

The real scraping for data that you want to use/persist generally happens in the 'parse_items' callback (at least that's the convention used in the docs).
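
Putting that together, a rough sketch of what the spider could look like (the spider name, start url, deny_domains value, and the fields yielded in parse_items are made-up placeholders, not taken from your question):

    from scrapy.spiders import CrawlSpider, Rule
    from scrapy.linkextractors import LinkExtractor


    class SectionSpider(CrawlSpider):
        # Hypothetical name and start url, just to keep the sketch self-contained.
        name = 'section_spider'
        start_urls = ['https://example.org/start-page']

        rules = (
            # First rule: only discover the links in the wanted section of the
            # start page; no callback, just follow them.
            Rule(LinkExtractor(restrict_xpaths=".//*[@id='mw-content-text']/ul[19]"),
                 follow=True),
            # Second rule: on the pages reached from there, scrape the data.
            # 'example.org' stands in for the start url's domain.
            Rule(LinkExtractor(deny_domains=('example.org',)),
                 callback='parse_items', follow=True),
        )

        def parse_items(self, response):
            # The actual scraping/persisting of data happens here.
            yield {
                'url': response.url,
                'title': response.css('title::text').extract_first(),
            }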