
I have written a crawl spider within a Scrapy project that correctly scrapes data from a URL and pipelines the response into a PostgreSQL table, but only when the scrapy crawl command is used. When the spider is run from a script in the root directory of the project, only the parse methods of the spider class appear to be called, since the table is never created when the script is run directly with the python command. I suspect the problem is that the crawl command has a specific protocol for locating and loading the modules in the directory above the spiders package (e.g. the models, pipelines, and settings modules), and that these aren't being loaded when the spider is run from a script.

I followed the directions included in the docs, but they don't seem to address pipelining data after it is scraped. This raises the question of whether I should even be trying to run the spider from a script, or whether I should somehow use the scrapy crawl command instead. The problem is, I planned to run the spider from a Django project when the user submits text in a form, which led me to this SO post, but the provided answer doesn't seem to address my problem. I would also need to pass the text from the form to the spider so it can be appended to the spider URL (I was previously just using raw_input to build the URL). How should I properly go about running the spider? The code for the script and the spider is below if it is needed. Any help/code provided would be appreciated, thanks.

script file

from ticket_city_scraper import *
from ticket_city_scraper.spiders import tc_spider 

tc_spider.spiderCrawl()

spider file

import scrapy
import re
import json
from scrapy.crawler import CrawlerProcess
from scrapy import Request
from scrapy.contrib.spiders import CrawlSpider , Rule
from scrapy.selector import HtmlXPathSelector
from scrapy.selector import Selector
from scrapy.contrib.loader import ItemLoader
from scrapy.contrib.loader import XPathItemLoader
from scrapy.contrib.loader.processor import Join, MapCompose
from ticket_city_scraper.items import ComparatorItem
from urlparse import urljoin

bandname = raw_input("Enter bandname\n")
tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"  

class MySpider3(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.ticketcity.com"]

    start_urls = [tc_url]
    tickets_list_xpath = './/div[@class = "vevent"]'
    def create_link(self, bandname):
        tc_url = "https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"  
        self.start_urls = [tc_url]
        #return tc_url      

    def parse_json(self, response):
        loader = response.meta['loader']
        jsonresponse = json.loads(response.body_as_unicode())
        ticket_info = jsonresponse.get('B')
        price_list = [i.get('P') for i in ticket_info]
        if len(price_list) > 0:
            str_Price = str(price_list[0])
            ticketPrice = unicode(str_Price, "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        else:
            ticketPrice = unicode("sold out", "utf-8")
            loader.add_value('ticketPrice', ticketPrice)
        return loader.load_item()

    def parse_price(self, response):
        print "parse price function entered \n"
        loader = response.meta['loader']
        event_City = response.xpath('.//span[@itemprop="addressLocality"]/text()').extract() 
        eventCity = ''.join(event_City) 
        loader.add_value('eventCity' , eventCity)
        event_State = response.xpath('.//span[@itemprop="addressRegion"]/text()').extract() 
        eventState = ''.join(event_State) 
        loader.add_value('eventState' , eventState) 
        event_Date = response.xpath('.//span[@class="event_datetime"]/text()').extract() 
        eventDate = ''.join(event_Date)  
        loader.add_value('eventDate' , eventDate)    
        ticketsLink = loader.get_output_value("ticketsLink")
        json_id_list= re.findall(r"(\d+)[^-]*$", ticketsLink)
        json_id=  "".join(json_id_list)
        json_url = "https://www.ticketcity.com/Catalog/public/v1/events/" + json_id + "/ticketblocks?P=0,99999999&q=0&per_page=250&page=1&sort=p.asc&f.t=s&_=1436642392938"
        yield scrapy.Request(json_url, meta={'loader': loader}, callback = self.parse_json, dont_filter = True) 

    def parse(self, response):
        """Iterate over the ticket listings on the results page and follow each event's details link."""
        selector = HtmlXPathSelector(response)
        # iterate over tickets
        for ticket in selector.select(self.tickets_list_xpath):
            loader = XPathItemLoader(ComparatorItem(), selector=ticket)
            # define loader
            loader.default_input_processor = MapCompose(unicode.strip)
            loader.default_output_processor = Join()
            # iterate over fields and add xpaths to the loader
            loader.add_xpath('eventName' , './/span[@class="summary listingEventName"]/text()')
            loader.add_xpath('eventLocation' , './/div[@class="divVenue location"]/text()')
            loader.add_xpath('ticketsLink' , './/a[@class="divEventDetails url"]/@href')
            #loader.add_xpath('eventDateTime' , '//div[@id="divEventDate"]/@title') #datetime type
            #loader.add_xpath('eventTime' , './/*[@class = "productionsTime"]/text()')

            print "Here is ticket link \n" + loader.get_output_value("ticketsLink")
            #sel.xpath("//span[@id='PractitionerDetails1_Label4']/text()").extract()
            ticketsURL = "https://www.ticketcity.com/" + loader.get_output_value("ticketsLink")
            ticketsURL = urljoin(response.url, ticketsURL)
            yield scrapy.Request(ticketsURL, meta={'loader': loader}, callback = self.parse_price, dont_filter = True)

def spiderCrawl():
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(MySpider3)
    process.start()

1 Answer


To Answer Your Question

  1. Scrapy does not differentiate between execution via the scrapy crawl command and execution from a script.

The only part (and difference) that you are missing is:

  1. The scrapy crawl command must always be executed from within the project directory, where the scrapy.cfg file is located. If you look closely, scrapy.cfg records where the settings file is located, and the settings file is the central place for all of your project-specific settings: cache policy, pipelines, header settings, proxy settings, etc. So when you use scrapy crawl, all of these settings are loaded internally (a minimal scrapy.cfg is sketched after this list).
  2. When executing Scrapy from a script, you are only providing the location of the spider and running it, without any of your custom settings from the settings.py file.
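
For reference, a minimal scrapy.cfg usually looks like the sketch below; assuming your project package is named ticket_city_scraper (as your imports suggest), it simply points Scrapy at the settings module:

[settings]
default = ticket_city_scraper.settings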

For these settings to take effect, create the CrawlerProcess object with the project settings:

from scrapy.utils.project import get_project_settings

settings = get_project_settings()
settings.set('USER_AGENT','Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)')
process = CrawlerProcess(settings)
process.crawl(MySpider3)
process.start()
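
As for passing the form text to the spider instead of using raw_input, one option (a minimal sketch, not tested against your project; the rest of MySpider3 stays exactly as it is in your question) is to accept the band name as a spider argument and build start_urls in __init__, since process.crawl() forwards keyword arguments to the spider's constructor:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy.contrib.spiders import CrawlSpider

class MySpider3(CrawlSpider):
    handle_httpstatus_list = [416]
    name = 'comparator'
    allowed_domains = ["www.ticketcity.com"]

    def __init__(self, bandname=None, *args, **kwargs):
        super(MySpider3, self).__init__(*args, **kwargs)
        # bandname arrives from process.crawl(...) below, or from
        # "scrapy crawl comparator -a bandname=..." on the command line
        self.start_urls = ["https://www.ticketcity.com/concerts/" + bandname + "-tickets.html"]

def spiderCrawl(bandname):
    settings = get_project_settings()
    process = CrawlerProcess(settings)
    process.crawl(MySpider3, bandname=bandname)  # keyword args are forwarded to the spider's __init__
    process.start()

Your Django view could then call something like spiderCrawl(form.cleaned_data['bandname']) (the form field name here is hypothetical). Two caveats: get_project_settings() only finds your settings.py if the script is run from the project directory (or SCRAPY_SETTINGS_MODULE is set), and process.start() blocks until the crawl finishes.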