2 votes

I start a crawl with a CrawlSpider-derived class and pause it with Ctrl+C. When I execute the same command again to resume it, the crawl does not continue.

My start and resume command:

scrapy crawl mycrawler -s JOBDIR=crawls/test5_mycrawl

Scrapy creates the folder. The permissions are 777.

When I resume the crawl, it just outputs:

/home/adminuser/.virtualenvs/rg_harvest/lib/python2.7/site-packages/twisted/internet/_sslverify.py:184: UserWarning: You do not have the service_identity module installed. Please install it from <https://pypi.python.org/pypi/service_identity>. Without the service_identity module and a recent enough pyOpenSSL to support it, Twisted can perform only rudimentary TLS client hostname verification.  Many valid certificate/hostname mappings may be rejected.
  verifyHostname, VerificationError = _selectVerifyImplementation()
2014-11-21 11:05:10-0500 [scrapy] INFO: Scrapy 0.24.4 started (bot: rg_harvest_scrapy)
2014-11-21 11:05:10-0500 [scrapy] INFO: Optional features available: ssl, http11, django
2014-11-21 11:05:10-0500 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'rg_harvest_scrapy.spiders', 'SPIDER_MODULES': ['rg_harvest_scrapy.spiders'], 'BOT_NAME': 'rg_harvest_scrapy'}
2014-11-21 11:05:10-0500 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-11-21 11:05:10-0500 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-11-21 11:05:10-0500 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-11-21 11:05:10-0500 [scrapy] INFO: Enabled item pipelines: ValidateMandatory, TypeConversion, ValidateRange, ValidateLogic, RestegourmetImagesPipeline, SaveToDB
2014-11-21 11:05:10-0500 [mycrawler] INFO: Spider opened
2014-11-21 11:05:10-0500 [mycrawler] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-11-21 11:05:10-0500 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2014-11-21 11:05:10-0500 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2014-11-21 11:05:10-0500 [mycrawler] DEBUG: Crawled (200) <GET http://eatsmarter.de/suche/rezepte> (referer: None)
2014-11-21 11:05:10-0500 [mycrawler] DEBUG: Filtered duplicate request: <GET http://eatsmarter.de/suche/rezepte?page=1> - no more duplicates will be shown (see DUPEFILTER_DEBUG to show all duplicates)
2014-11-21 11:05:10-0500 [mycrawler] INFO: Closing spider (finished)
2014-11-21 11:05:10-0500 [mycrawler] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 225,
     'downloader/request_count': 1,
     'downloader/request_method_count/GET': 1,
     'downloader/response_bytes': 19242,
     'downloader/response_count': 1,
     'downloader/response_status_count/200': 1,
     'dupefilter/filtered': 29,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 11, 21, 16, 5, 10, 733196),
     'log_count/DEBUG': 4,
     'log_count/INFO': 7,
     'request_depth_max': 1,
     'response_received_count': 1,
     'scheduler/dequeued': 1,
     'scheduler/dequeued/disk': 1,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/disk': 1,
     'start_time': datetime.datetime(2014, 11, 21, 16, 5, 10, 528629)}

I have only one start URL; could this be the reason? My crawler uses a single start_url, follows the pagination through a Rule with a LinkExtractor, and calls parse_item for URLs matching a specific format:

My Spider code:

from datetime import datetime

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor

# MyItemLoader and RecipeItem are this project's item loader and item classes;
# their imports are omitted here.


class MyCrawlSpiderBase(CrawlSpider):
    name = 'test_spider'

    testmode = True
    crawl_start = datetime.utcnow().isoformat()

    def __init__(self, testmode=True, urls=None, *args, **kwargs):        
        self.testmode = bool(int(testmode))
        super(MyCrawlSpiderBase, self).__init__(*args, **kwargs)        

    def parse_item(self, response):
        # Item Values
        l = MyItemLoader(RecipeItem(), response=response)

        l.replace_value('url', response.url)
        l.replace_value('crawl_start', self.crawl_start)

        return l.load_item()


class MyCrawlSpider(MyCrawlSpiderBase):
    name = 'example_de'
    allowed_domains = ['example.de']
    start_urls = [
        "http://example.de",

    ]

    rules = (
        Rule( 
            LinkExtractor( 
                allow=('/search/entry\?page=', )
            )
        ), 


        Rule(
            LinkExtractor(
                allow=('/entry/[0-9A-z\-]{3,}$', ),
            ), 
            callback='parse_item'
        ),
    )

    def parse_item(self, response):
        item = super(MyCrawlSpider, self).parse_item(response)

        l = MyItemLoader(item=item, response=response)

        l.replace_xpath("name", "//h1[@class='fn title']/text()")         

        (...)

        return l.load_item()
Comment (Nima Soroush): Post your spider code please. Do you use cookies or request serialization?
Comment (user1383029): I added the spider code. I do not use cookies, and I'm not sure whether I use request serialization...

2 Answers

5 votes

Since your URL is always the same, the requests are most likely being filtered. You can solve this in two ways:

  1. In your settings.py file, add this line:
    DUPEFILTER_CLASS = 'scrapy.dupefilter.BaseDupeFilter'
    This replaces the default RFPDupeFilter with BaseDupeFilter, which does not filter any requests. This may not be what you want if you actually need duplicate filtering for other requests that are not relevant to this question.

  2. You can get more involved in creating the requests and construct them with dont_filter=True, which disables duplicate filtering on a per-request basis. To achieve this, remove start_urls and replace it with a start_requests() method that yields the requests to be parsed (see the sketch below). You can find more information in the official documentation.
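
As a minimal sketch of the second option, assuming the MyCrawlSpiderBase class and start URL from the question and Scrapy 0.24's scrapy.http.Request, the spider could look roughly like this:

    from scrapy.http import Request

    class MyCrawlSpider(MyCrawlSpiderBase):
        # name, allowed_domains, rules and parse_item stay exactly as in
        # the question; only start_urls is replaced by start_requests().

        def start_requests(self):
            # dont_filter=True lets this seed request through even if its
            # fingerprint was already recorded in a previous run sharing the
            # same JOBDIR; links extracted from the responses afterwards are
            # still deduplicated as usual.
            yield Request("http://example.de", dont_filter=True)

Option 1 requires no spider changes but turns duplicate filtering off globally, so option 2 is usually the more targeted fix.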

2 votes

If you press Ctrl+C twice (force stop), the crawl cannot be resumed. Press Ctrl+C just once and wait for Scrapy to shut down gracefully.