I am trying to crawl the forum section of craigslist.org (https://forums.craigslist.org/). My spider:
    import scrapy
    from scrapy import Request


    class CraigslistSpider(scrapy.Spider):
        name = "craigslist"
        allowed_domains = ["forums.craigslist.org"]
        start_urls = ['http://geo.craigslist.org/iso/us/']

        def error_handler(self, failure):
            # Errback: called when the forum request fails
            print(failure)

        def parse(self, response):
            # Once the start URL has loaded, request the forum home page
            yield Request('https://forums.craigslist.org/',
                          self.getForumPage,
                          dont_filter=True,
                          errback=self.error_handler)

        def getForumPage(self, response):
            print("forum page")
The error callback prints this message:
    [Failure instance: Traceback: :
    /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:455:callback
    /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:563:_startRunCallbacks
    /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649:_runCallbacks
    /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1316:gotResult
    --- <exception caught here> ---
    /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1258:_inlineCallbacks
    /usr/local/lib/python2.7/site-packages/twisted/python/failure.py:389:throwExceptionIntoGenerator
    /usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py:37:process_request
    /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649:_runCallbacks
    /usr/local/lib/python2.7/site-packages/scrapy/downloadermiddlewares/robotstxt.py:46:process_request_2
    ]
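The printed summary above does not show the exception type or message, which makes the failure hard to diagnose. A more verbose errback might help; the following is only a minimal sketch, and `describe_failure` is a hypothetical helper name that is not part of Scrapy or Twisted. It relies on the `type` attribute and the `getErrorMessage()` and `getTraceback()` methods of Twisted's `Failure` object.

```python
def describe_failure(failure):
    """Summarize a Twisted Failure as 'ExceptionType: message'.

    Hypothetical helper: works with any object that exposes a `type`
    attribute and a `getErrorMessage()` method, as Failure does.
    """
    # failure.type is the exception class that was raised
    type_name = getattr(failure.type, "__name__", str(failure.type))
    # getErrorMessage() returns the exception message as a string
    return "{0}: {1}".format(type_name, failure.getErrorMessage())


def error_handler(self, failure):
    # Intended as a replacement for the spider's error_handler method
    print(describe_failure(failure))
    # getTraceback() returns the full traceback as a string
    print(failure.getTraceback())
```

With this in place, the log should show which exception was raised instead of only the list of frames.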
I only have this problem with the forum section of Craigslist. It might be because the forum section uses HTTPS, unlike the rest of the website. As a result, I never get a response from it.
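The last frame in the traceback is in Scrapy's `robotstxt.py` middleware, so another possibility (an assumption, not a confirmed cause) is that the failure happens while fetching `robots.txt` for the forum host rather than the forum page itself. One way to test that would be to temporarily switch off the `ROBOTSTXT_OBEY` setting and see whether the error changes. A minimal sketch of the project's `settings.py` fragment, for diagnosis only:

```python
# settings.py (diagnostic only): skip the robots.txt download so that
# the forum request is sent directly. Restore the previous value afterwards.
ROBOTSTXT_OBEY = False
```

If the request then succeeds, or fails with a different traceback, the problem is more likely tied to the `robots.txt` request than to the forum page.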
Any ideas?