
I am trying to crawl the forums section of craigslist.org (https://forums.craigslist.org/). My spider:

import scrapy
from scrapy.http import Request

class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["forums.craigslist.org"]
    start_urls = ['http://geo.craigslist.org/iso/us/']

    def error_handler(self, failure):
        print failure

    def parse(self, response):
        yield Request('https://forums.craigslist.org/',
                  self.getForumPage,
                  dont_filter=True,
                  errback=self.error_handler)

    def getForumPage(self, response):
        print "forum page"

The error callback prints this message:

[Failure instance: Traceback: :
 /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:455:callback
 /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:563:_startRunCallbacks
 /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649:_runCallbacks
 /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1316:gotResult
 --- <exception caught here> ---
 /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:1258:_inlineCallbacks
 /usr/local/lib/python2.7/site-packages/twisted/python/failure.py:389:throwExceptionIntoGenerator
 /usr/local/lib/python2.7/site-packages/scrapy/core/downloader/middleware.py:37:process_request
 /usr/local/lib/python2.7/site-packages/twisted/internet/defer.py:649:_runCallbacks
 /usr/local/lib/python2.7/site-packages/scrapy/downloadermiddlewares/robotstxt.py:46:process_request_2
 ]

But I have this problem only with the forums section of Craigslist. It might be because the forums section uses HTTPS, unlike the rest of the site, so I cannot get a response.

Any ideas?


2 Answers


I am posting a solution that I found to get around the problem.

I used the urllib2 library. Look:

import scrapy
import urllib2
from scrapy.http import HtmlResponse

class CraigslistSpider(scrapy.Spider):
    name = "craigslist"
    allowed_domains = ["forums.craigslist.org"]
    start_urls = ['http://geo.craigslist.org/iso/us/']

    def error_handler(self, failure):
        print failure

    def parse(self, response):
        # Build the request with urllib2 instead of Scrapy's downloader
        req = urllib2.Request('https://forums.craigslist.org/')
        # Fetch the raw page content
        pageContent = urllib2.urlopen(req).read()
        # Wrap the content in an HtmlResponse so Scrapy selectors work on it
        response = HtmlResponse(url=response.url, body=pageContent)
        print response.css(".forumlistcolumns li").extract()

With this solution, you can fetch the page outside Scrapy, wrap it in a valid Scrapy response, and use it normally. There is probably a better method, but this one works.
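
If you want to continue the crawl from the wrapped response instead of just printing it, you can turn the extracted links back into normal Scrapy requests. A minimal sketch; the .forumlistcolumns a::attr(href) selector and the parse_forum callback are illustrative, not from the original code:

    def parse(self, response):
        req = urllib2.Request('https://forums.craigslist.org/')
        pageContent = urllib2.urlopen(req).read()
        forumResponse = HtmlResponse(url='https://forums.craigslist.org/',
                                     body=pageContent)
        # Hand each extracted link back to Scrapy as a normal request
        for href in forumResponse.css(".forumlistcolumns a::attr(href)").extract():
            yield scrapy.Request(forumResponse.urljoin(href),
                                 callback=self.parse_forum)

    def parse_forum(self, response):
        print response.url

Note that these follow-up requests go through Scrapy's downloader again, so they may hit the same problem unless it is fixed as in the other answer.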


I think you are dealing with robots.txt: the traceback ends in scrapy/downloadermiddlewares/robotstxt.py, which means the request is being filtered by the robots.txt middleware. Try running your spider with

custom_settings = {
    "ROBOTSTXT_OBEY": False
}
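
Note that custom_settings must be a class attribute on the spider, since Scrapy reads it before the spider is instantiated. A minimal sketch of where it goes, using the spider from the question:

    class CraigslistSpider(scrapy.Spider):
        name = "craigslist"
        # Per-spider override of the project settings
        custom_settings = {
            "ROBOTSTXT_OBEY": False,
        }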

You can also test it from the command line: scrapy crawl craigslist -s ROBOTSTXT_OBEY=False.