Is there a way to tell Scrapy to stop crawling based on a condition found on a 2nd-level page? I am doing the following:

  1. I have a start_url to begin with (1st level page)
  2. I have set of urls extracted from the start_url using parse(self, response)
  3. Then I queue those links using Request, with parseDetailPage(self, response) as the callback
  4. In parseDetailPage (the 2nd-level page) I find out whether crawling should stop

Right now I am using CloseSpider() to accomplish this, but the problem is that the urls to be parsed are already queued by the time I start crawling the second-level pages, and I do not know how to remove them from the queue. Is there a way to sequentially crawl the list of links and then be able to stop in parseDetailPage?

start_urls = ["http://sfbay.craigslist.org/sof/"]

def __init__(self):
    self.job_in_range = True
def parse(self, response):
    hxs = HtmlXPathSelector(response)
    results = hxs.select('//blockquote[@id="toc_rows"]')
    items = []
    if results:
        links = results.select('.//p[@class="row"]/a/@href')
        for link in links:
            nextUrl = link.extract()
            # Compare the extracted string, not the selector object
            if nextUrl == self.end_url:
                break
            if WPUtil.validateUrl(nextUrl):
                item = WoodPeckerItem()
                item['url'] = nextUrl
                # Queue the detail page; the item travels along in request.meta
                request = Request(nextUrl, meta={'item': item}, callback=self.parseDetailPage)
                items.append(request)
    else:
        self.log('Could not parse the document')
    return items

def parseDetailPage(self, response):
    if self.job_in_range is False:
        raise CloseSpider('End date reached - No more crawling for ' + self.name)
    hxs = HtmlXPathSelector(response)
    print response
    body = hxs.select('//article[@id="pagecontainer"]/section[@class="body"]')
    item = response.meta['item']
    item['postDate'] = body.select('.//section[@class="userbody"]/div[@class="postinginfos"]/p')[1].select('.//date/text()')[0].extract()
    item['jobTitle'] = body.select('.//h2[@class="postingtitle"]/text()')[0].extract()
    item['description'] = body.select('.//section[@class="userbody"]/section[@id="postingbody"]').extract()
    # The title must be extracted before it can be checked; use == for string comparison
    if item['jobTitle'] == 'Admin':
        self.job_in_range = False
        raise CloseSpider('Stop crawling')
    return item
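
For reference, here is a rough sketch of the sequential pattern I am asking about (the class name, stop condition, and selectors below are illustrative placeholders, not my actual spider): request only one detail page at a time and carry the remaining links along in meta, so that once the stop condition fires there is nothing left in the scheduler to cancel.

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.exceptions import CloseSpider

class SequentialSpider(BaseSpider):
    name = 'woodpecker_sequential'  # placeholder name
    start_urls = ['http://sfbay.craigslist.org/sof/']

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        # Collect all candidate detail-page urls up front, but only request
        # the first one; the rest ride along in meta until we decide to continue.
        links = hxs.select('//blockquote[@id="toc_rows"]//p[@class="row"]/a/@href').extract()
        return self.request_next(links)

    def request_next(self, pending):
        # Issue exactly one Request; nothing else is queued, so stopping
        # simply means not scheduling the next link.
        if not pending:
            return None
        return Request(pending[0], meta={'pending': pending[1:]},
                       callback=self.parse_detail)

    def parse_detail(self, response):
        hxs = HtmlXPathSelector(response)
        title = hxs.select('//h2[@class="postingtitle"]/text()').extract()
        if title and title[0].strip() == 'Admin':  # placeholder stop condition
            raise CloseSpider('Stop condition reached - no more crawling')
        results = []
        # ... build the WoodPeckerItem here as in parseDetailPage above
        #     and append it to results ...
        next_request = self.request_next(response.meta['pending'])
        if next_request is not None:
            results.append(next_request)
        return results

The trade-off is that only one detail-page request is ever in flight, so the second-level crawl runs serially.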

1 Answer


Do you mean that you would like to stop the spider and later resume it without re-parsing the urls that have already been parsed? If so, try the JOBDIR setting, which persists the scheduler's request queue to a directory on disk.
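
For example, something like the following (the spider name and directory are just placeholders):

# settings.py, or passed per run on the command line:
#   scrapy crawl woodpecker -s JOBDIR=crawls/woodpecker-1
# Persists the scheduler queue and the seen-request fingerprints to disk,
# so a stopped crawl can be resumed without re-queueing finished requests.
JOBDIR = 'crawls/woodpecker-1'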