1 vote

I am new to Scrapy and am trying to scrape the title from the following website: https://www.mdcalc.com/heart-score-major-cardiac-events

I reviewed all the previous posts on this subject but am still getting the OpenSSL error below.

Here is my settings.py:

DOWNLOADER_CLIENTCONTEXTFACTORY ='scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'

Here is my spider:

import scrapy
from skitter.items import SkitterItem

class mdcalc(scrapy.Spider):
    name = "mdcalc"
    allowed_domains = ["mdcalc.com"]  # must be a list, not a bare string
    start_urls = ['https://www.mdcalc.com/heart-score-major-cardiac-events']

    def parse(self, response):  # indented so it is a method of the spider
        item = SkitterItem()
        item['title'] = response.xpath('//h1//text()').extract()[0]
        yield item
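The XPath `//h1//text()` collects the text nodes under the page's `<h1>` headings, and `extract()[0]` takes the first one. As a loose stdlib approximation of that extraction (hypothetical sample HTML, with `html.parser` standing in for Scrapy's selector):

```python
from html.parser import HTMLParser

class H1Extractor(HTMLParser):
    """Collects the text inside the first <h1>, roughly mimicking //h1//text()."""
    def __init__(self):
        super().__init__()
        self.in_h1 = False   # currently inside the first <h1>?
        self.done = False    # first <h1> already closed?
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "h1" and not self.done:
            self.in_h1 = True

    def handle_endtag(self, tag):
        if tag == "h1" and self.in_h1:
            self.in_h1 = False
            self.done = True

    def handle_data(self, data):
        if self.in_h1:
            self.parts.append(data)

def extract_title(html):
    parser = H1Extractor()
    parser.feed(html)
    return "".join(parser.parts)
```

This only illustrates what the XPath is asking for; in the spider itself Scrapy's selector does the work, and the error here happens before any response is ever parsed.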

When I run

curl localhost:6800/schedule.json -d project=skitter -d spider=mdcalc
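That curl call is just a form-encoded POST to scrapyd's `schedule.json` endpoint. A minimal sketch of building the same request in Python (hypothetical helper; the endpoint is assumed to be scrapyd's default `localhost:6800`):

```python
from urllib.parse import urlencode

def schedule_request(project, spider, base="http://localhost:6800"):
    # Build the scrapyd schedule.json POST target and form body,
    # equivalent to: curl localhost:6800/schedule.json -d project=... -d spider=...
    url = f"{base}/schedule.json"
    body = urlencode({"project": project, "spider": spider})
    return url, body
```

The request itself schedules fine either way; the SSL failure happens later, when the spider's downloader tries to reach the target site.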

Here is the error I get

2017-09-27 02:02:23+0000 [scrapy] INFO: Scrapy 0.24.6 started (bot: skitter)
2017-09-27 02:02:23+0000 [scrapy] INFO: Optional features available: ssl, 
http11
2017-09-27 02:02:23+0000 [scrapy] INFO: Overridden settings: 
{'NEWSPIDER_MODULE': 'skitter.spiders', 'ROBOTSTXT_OBEY': True, 
'SPIDER_MODULES': 
2017-09-27 02:02:23+0000 [scrapy] INFO: Enabled extensions: FeedExporter, 
LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-09-27 02:02:23+0000 [scrapy] INFO: Enabled downloader middlewares: 
RobotsTxtMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, 
UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, 
MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, 
CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-09-27 02:02:23+0000 [scrapy] INFO: Enabled spider middlewares: 
HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, 
UrlLengthMiddleware, DepthMiddleware
2017-09-27 02:02:23+0000 [scrapy] INFO: Enabled item pipelines: 
ElasticSearchPipeline
2017-09-27 02:02:23+0000 [mdcalc] INFO: Spider opened
2017-09-27 02:02:23+0000 [mdcalc] INFO: Crawled 0 pages (at 0 pages/min), 
scraped 0 items (at 0 items/min)
2017-09-27 02:02:23+0000 [scrapy] DEBUG: Telnet console listening on 
127.0.0.1:6024
2017-09-27 02:02:23+0000 [scrapy] DEBUG: Web service listening on 
127.0.0.1:6081
2017-09-27 02:02:23+0000 [mdcalc] DEBUG: Retrying <GET 
https://www.mdcalc.com/robots.txt> (failed 1 times): 
[<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2017-09-27 02:02:27+0000 [mdcalc] DEBUG: Retrying <GET 
https://www.mdcalc.com/heart-score-major-cardiac-events> (failed 1 times): 
[<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2017-09-27 02:02:32+0000 [mdcalc] DEBUG: Retrying <GET 
https://www.mdcalc.com/robots.txt> (failed 2 times): 
[<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2017-09-27 02:02:38+0000 [mdcalc] DEBUG: Retrying <GET 
https://www.mdcalc.com/heart-score-major-cardiac-events> (failed 2 times): 
[<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2017-09-27 02:02:45+0000 [mdcalc] DEBUG: Gave up retrying <GET 
https://www.mdcalc.com/robots.txt> (failed 3 times): 
[<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2017-09-27 02:02:45+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client] 
ERROR: Unhandled error in Deferred:
2017-09-27 02:02:45+0000 [HTTP11ClientProtocol (TLSMemoryBIOProtocol),client] 
Unhandled Error
    Traceback (most recent call last):
    Failure: twisted.web._newclient.ResponseNeverReceived: [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]

2017-09-27 02:02:52+0000 [mdcalc] DEBUG: Gave up retrying <GET https://www.mdcalc.com/heart-score-major-cardiac-events> (failed 3 times): [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2017-09-27 02:02:52+0000 [mdcalc] ERROR: Error downloading <GET https://www.mdcalc.com/heart-score-major-cardiac-events>: [<twisted.python.failure.Failure <class 'OpenSSL.SSL.Error'>>]
2017-09-27 02:02:52+0000 [mdcalc] INFO: Closing spider (finished)
2017-09-27 02:02:52+0000 [mdcalc] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 6,
 'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 6,
 'downloader/request_bytes': 1614,
 'downloader/request_count': 6,
 'downloader/request_method_count/GET': 6,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 9, 27, 2, 2, 52, 62313),
 'log_count/DEBUG': 8,
 'log_count/ERROR': 3,
 'log_count/INFO': 7,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2017, 9, 27, 2, 2, 23, 380740)}
2017-09-27 02:02:52+0000 [mdcalc] INFO: Spider closed (finished)

Thanks in advance for your help.

Is there a way that I can run my curl command to avoid this error? — affemann2

Could you post the whole log/trace output, not just that snippet? I suspect there is some important information missing. — Tomáš Linhart

Works fine for me, so no issues with the site for sure. Try removing DOWNLOADER_CLIENTCONTEXTFACTORY from your settings and see if that helps. Also check whether you have the latest Scrapy version on the scrapyd server; use pip install scrapy --force --upgrade to get the latest one. — Tarun Lalwani

@TomášLinhart I just updated the post with the full output. Thank you so much for your help. — affemann2

@TarunLalwani Hi, I just deleted DOWNLOADER_CLIENTCONTEXTFACTORY from my settings.py and updated Scrapy. I am still getting the same error when I run my curl command. — affemann2

1 Answer

0 votes

It's because of the Python version that Scrapinghub Cloud runs by default, which is 2.7. To fix that, you have to specify that your spider must use Python 3. This link explains how to do it: https://support.scrapinghub.com/support/solutions/articles/22000200387-deploying-python-3-spiders-to-scrapy-cloud
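Note that the question itself runs against a self-hosted scrapyd rather than Scrapy Cloud. There, a common workaround for this handshake error is upgrading Scrapy and, on versions that support it, pinning the downloader's TLS method. A settings.py sketch, assuming a Scrapy version (1.1 or later) where `DOWNLOADER_CLIENT_TLS_METHOD` exists:

```python
# settings.py -- configuration fragment (Scrapy >= 1.1 assumed)
# Force a specific TLS version for the downloader's client connections;
# older defaults can fail the handshake with OpenSSL.SSL.Error on some sites.
DOWNLOADER_CLIENT_TLS_METHOD = "TLSv1.2"
```

Whether this helps depends on which TLS versions the target site accepts, so treat it as something to try rather than a guaranteed fix.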