Is there a way to tell Scrapy to stop crawling based on a condition found on a 2nd-level page? I am doing the following:

  1. I have a start_url to begin with (1st level page)
  2. I have set of urls extracted from the start_url using parse(self, response)
  3. Then I queue those links using Request, with parseDetailPage(self, response) as the callback
  4. In parseDetailPage (the 2nd-level page) I find out whether crawling should stop

Right now I am using CloseSpider() to accomplish this, but the problem is that the urls to be parsed are already queued by the time I start crawling the second-level pages, and I do not know how to remove them from the queue. Is there a way to sequentially crawl the list of links and then be able to stop in parseDetailPage?

start_urls = ["http://sfbay.craigslist.org/sof/"]

def __init__(self):
    self.job_in_range = True
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    results = hxs.select('//blockquote[@id="toc_rows"]')
    items = []
    if results:
        links = results.select('.//p[@class="row"]/a/@href')
        for link in links:
            nextUrl = link.extract()
            # Compare the extracted string, not the selector object
            if nextUrl == self.end_url:
                break
            if WPUtil.validateUrl(nextUrl):
                item = WoodPeckerItem()
                item['url'] = nextUrl
                # Queue the detail page; the item travels along in request.meta
                request = Request(nextUrl, meta={'item': item}, callback=self.parseDetailPage)
                items.append(request)
    else:
        self.log('Could not parse the document')
    return items

def parseDetailPage(self, response):
    if self.job_in_range is False:
        raise CloseSpider('End date reached - No more crawling for ' + self.name)
    hxs = HtmlXPathSelector(response)
    print response
    body = hxs.select('//article[@id="pagecontainer"]/section[@class="body"]')
    item = response.meta['item']
    item['postDate'] = body.select('.//section[@class="userbody"]/div[@class="postinginfos"]/p')[1].select('.//date/text()')[0].extract()
    item['jobTitle'] = body.select('.//h2[@class="postingtitle"]/text()')[0].extract()
    item['description'] = body.select('.//section[@class="userbody"]/section[@id="postingbody"]').extract()
    # The title must be extracted before it can be checked; use == for string comparison
    if item['jobTitle'] == 'Admin':
        self.job_in_range = False
        raise CloseSpider('Stop crawling')
    return item
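
For reference, here is a rough sketch of the sequential pattern I am asking about (the class name, stop condition, and selectors below are illustrative placeholders, not my actual spider): request only one detail page at a time and carry the remaining links along in meta, so that once the stop condition fires there is nothing left in the scheduler to cancel.

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.exceptions import CloseSpider

class SequentialSpider(BaseSpider):
    name = 'woodpecker_sequential'  # placeholder name
    start_urls = ['http://sfbay.craigslist.org/sof/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Collect all candidate detail-page urls up front, but only request
        # the first one; the rest ride along in meta until we decide to continue.
        links = hxs.select('//blockquote[@id="toc_rows"]//p[@class="row"]/a/@href').extract()
        return self.request_next(links)

    def request_next(self, pending):
        # Issue exactly one Request; nothing else is queued, so stopping
        # simply means not scheduling the next link.
        if not pending:
            return None
        return Request(pending[0], meta={'pending': pending[1:]},
                       callback=self.parse_detail)

    def parse_detail(self, response):
        hxs = HtmlXPathSelector(response)
        title = hxs.select('//h2[@class="postingtitle"]/text()').extract()
        if title and title[0].strip() == 'Admin':  # placeholder stop condition
            raise CloseSpider('Stop condition reached - no more crawling')
        results = []
        # ... build the WoodPeckerItem here as in parseDetailPage above
        #     and append it to results ...
        next_request = self.request_next(response.meta['pending'])
        if next_request is not None:
            results.append(next_request)
        return results

The trade-off is that only one detail-page request is ever in flight, so the second-level crawl runs serially.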

1 Answer


Do you mean that you would like to stop the spider and later resume it without re-parsing the urls that have already been parsed? If so, try the JOBDIR setting, which persists the scheduler's request queue to a directory on disk.
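
For example, something like the following (the spider name and directory are just placeholders):

# settings.py, or passed per run on the command line:
#   scrapy crawl woodpecker -s JOBDIR=crawls/woodpecker-1
# Persists the scheduler queue and the seen-request fingerprints to disk,
# so a stopped crawl can be resumed without re-queueing finished requests.
JOBDIR = 'crawls/woodpecker-1'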