I am trying to build a CrawlSpider with Scrapy with the following features. My start url contains several lists of urls, divided up into sections. I want to scrape only the urls from one specific section and then crawl them. To do this, I defined my link extractor with restrict_xpaths, so that the links I want to crawl are isolated from the rest.

However, because of restrict_xpaths, when the spider crawls a link that is not the start url, it stops, since it does not find any links there. So I added a second rule, meant to ensure that links outside the start url get crawled, by applying deny_domains to the start url. This solution is not working either. Can anyone suggest a possible strategy?

Right now my rules are:
rules = (
    Rule(LinkExtractor(restrict_xpaths=".//*[@id='mw-content-text']/ul[19]"), callback='parse_items', follow=True),
    Rule(LinkExtractor(deny_domains='...start url...'), callback='parse_items', follow=True),
)
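For context, here is a minimal sketch of the whole spider as described above; the spider name, the example.org domain and start url, and the item extraction in parse_items are placeholders, not the real values:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class SectionSpider(CrawlSpider):
    name = 'section_spider'                       # placeholder name
    start_urls = ['http://example.org/start']     # placeholder start url

    rules = (
        # On the start page, only follow links from the one section of interest.
        Rule(LinkExtractor(restrict_xpaths=".//*[@id='mw-content-text']/ul[19]"),
             callback='parse_items', follow=True),
        # Intended to keep following links once the crawl has left the start url.
        Rule(LinkExtractor(deny_domains='example.org'),   # placeholder for the start url's domain
             callback='parse_items', follow=True),
    )

    def parse_items(self, response):
        # Placeholder extraction; the real callback scrapes the actual fields.
        yield {'url': response.url}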