0 votes

I am scraping a news website with Scrapy and saving the scraped items to a database with SQLAlchemy. The crawl job runs periodically, and I would like to ignore URLs that have not changed since the last crawl.

I am trying to subclass LinkExtractor and return an empty list when response.url was crawled more recently than it was modified.

But when I run 'scrapy crawl spider_name', I get:

TypeError: MyLinkExtractor() got an unexpected keyword argument 'allow'

The code:

def MyLinkExtractor(LinkExtractor):
    '''This class should redefine the method extract_links to
    filter out all links from pages which were not modified since
    the last crawling'''
    def __init__(self, *args, **kwargs):
        """
        Initializes database connection and sessionmaker.
        """
        engine = db_connect()
        self.Session = sessionmaker(bind=engine)
        super(MyLinkExtractor, self).__init__(*args, **kwargs)

    def extract_links(self, response):
        all_links = super(MyLinkExtractor, self).extract_links(response)

        # Return empty list if current url was recently crawled
        session = self.Session()
        url_in_db = session.query(Page).filter(Page.url==response.url).all()
        if url_in_db and url_in_db[0].last_crawled.replace(tzinfo=pytz.UTC) > item['header_last_modified']:
            return []

        return all_links

...

class MySpider(CrawlSpider):

    def __init__(self, *args, **kwargs):
        """
        Initializes database connection and sessionmaker.
        """
        engine = db_connect()
        self.Session = sessionmaker(bind=engine)
        super(MySpider, self).__init__(*args, **kwargs)

    ...

    # Define list of regex of links that should be followed
    links_regex_to_follow = [
        r'some_url_pattern',
        ]

    rules = (Rule(MyLinkExtractor(allow=links_regex_to_follow),
                  callback='handle_news',
                  follow=True),    
             )

    def handle_news(self, response):

        item = MyItem()
        item['url'] = response.url
        session = self.Session()

        # ... Process the item and extract meaningful info

        # Register when the item was crawled
        item['last_crawled'] = datetime.datetime.utcnow().replace(tzinfo=pytz.UTC)

        # Register when the page was last-modified
        date_string = response.headers.get('Last-Modified', None).decode('utf-8')
        item['header_last_modified'] = get_datetime_from_http_str(date_string)

        yield item

The weirdest thing is that if I replace MyLinkExtractor with LinkExtractor in the Rule definition, it runs.

But if I leave MyLinkExtractor in the Rule definition and redefine MyLinkExtractor to:

def MyLinkExtractor(LinkExtractor):
    '''This class should redefine the method extract_links to
    filter out all links from pages which were not modified since
    the last crawling'''
    pass

I get the same error.

1 Answer

2 votes

Your MyLinkExtractor is not a class but a function, since you've declared it with def instead of class. It's hard to spot, because Python allows declaring functions inside other functions, and none of those names are really reserved.
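
You can reproduce the error without Scrapy at all. With def, MyLinkExtractor becomes a plain function whose single positional parameter just happens to be named LinkExtractor, so calling it with the allow keyword fails exactly as in your traceback:

def MyLinkExtractor(LinkExtractor):
    '''Looks like a subclass, but def makes this a function whose
    one positional parameter happens to be named LinkExtractor.'''
    pass

MyLinkExtractor(allow=['some_url_pattern'])
# TypeError: MyLinkExtractor() got an unexpected keyword argument 'allow'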

Anyway, if it were a properly declared class that failed during instantiation, the stack trace would look a little different - you'd see the name of the last function that raised the error (MyLinkExtractor's __init__).
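
Declaring it with class makes the allow keyword reach LinkExtractor.__init__ through **kwargs. Here is a sketch of the corrected extractor, reusing the code from your question (it assumes db_connect, sessionmaker, Page, pytz and get_datetime_from_http_str are imported in the module, as in your spider). Note also that your extract_links compared against item['header_last_modified'], but item is not defined in that scope, so this sketch recomputes the timestamp from the response's Last-Modified header instead:

class MyLinkExtractor(LinkExtractor):
    '''Redefines extract_links to filter out all links from pages
    which were not modified since the last crawling.'''

    def __init__(self, *args, **kwargs):
        """Initializes database connection and sessionmaker."""
        engine = db_connect()
        self.Session = sessionmaker(bind=engine)
        # Forward 'allow', 'deny', etc. to the real LinkExtractor
        super(MyLinkExtractor, self).__init__(*args, **kwargs)

    def extract_links(self, response):
        all_links = super(MyLinkExtractor, self).extract_links(response)

        # Return an empty list if this url was crawled more recently
        # than the server says the page was modified.
        session = self.Session()
        url_in_db = session.query(Page).filter(Page.url == response.url).all()
        last_modified = response.headers.get('Last-Modified', None)
        if url_in_db and last_modified is not None:
            header_last_modified = get_datetime_from_http_str(
                last_modified.decode('utf-8'))
            if url_in_db[0].last_crawled.replace(tzinfo=pytz.UTC) > header_last_modified:
                return []

        return all_links

With that change, Rule(MyLinkExtractor(allow=links_regex_to_follow), ...) instantiates normally.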