I have found plenty of information on calling a function when a Scrapy spider quits (viz. "Call a function in Settings from spider Scrapy"), but I'm looking for how to call a function, just once, when the spider opens. I cannot find this in the Scrapy documentation.

I've got a project with multiple spiders that scrape event information and post it to different Google Calendars. The event information is updated often, so before each spider runs, I need to clear out the existing Google Calendar entries in order to refresh them entirely. I've got a working function that accomplishes this when passed a calendar ID. Each spider posts to a different Google Calendar, so I need to be able to pass the calendar ID from within the spider to the function that clears the calendar.

I've defined a base spider in __init__.py that looks like this:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
## import other stuff I need for the clear_calendar() function

class BaseSpider(CrawlSpider):

    def clear_calendar(self, CalId):
        ## working code to clear the calendar
        pass

Now I can call that function from within parse_item like:

from myproject import BaseSpider

class ExampleSpider(BaseSpider):

    def parse_item(self, response):

        calendarID = 'MycalendarID'
        self.clear_calendar(calendarID)

        ## other stuff to do

And of course that calls the function every single time an item is scraped, which is ridiculous. But if I move the function call outside of def parse_item, I get the error "self is not defined", or, if I remove "self", "clear_calendar is not defined."

How can I call a function that requires an argument just once from within a Scrapy spider? Or, is there a better way to go about this?

Off the top of my head... I think you need to declare a spider_opened method... – Jon Clements♦

1 Answer

There is totally a better way, with the spider_opened signal.

On newer versions of Scrapy, I think you can connect a spider_opened method to that signal from inside the spider itself, by overriding from_crawler:

from scrapy import Spider, signals

class MySpider(Spider):
    ...
    calendar_id = 'something'

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        spider = super(MySpider, cls).from_crawler(crawler, *args, **kwargs)
        # connect the handler to the spider_opened signal
        crawler.signals.connect(spider.spider_opened, signal=signals.spider_opened)
        return spider

    def spider_opened(self):
        calendar_id = self.calendar_id
        # use my calendar_id