Scrapy - Follow RSS links

Question

Has anyone managed to extract and follow RSS item links using SgmlLinkExtractor with CrawlSpider? I can't get it to work.

I am using the following rule:


   rules = (
       Rule(SgmlLinkExtractor(tags=('link',), attrs=False),
           follow=True,
           callback='parse_article'),
       )

(bearing in mind that RSS links are located in the text of the link tag).

I am not sure how to tell SgmlLinkExtractor to extract the text() of the link element rather than search its attributes.

Any help is welcome. Thanks in advance.

Pablo Hoffman · Accepted Answer · 2010-09-19T20:29:13

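CrawlSpider rules don't work that way: the link extractor reads URLs from tag attributes (such as href), not from an element's text, which is where RSS puts them. You'll probably need to subclass BaseSpider and implement your own link extraction in your spider callback. For example: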

from scrapy.spider import BaseSpider
from scrapy.http import Request
from scrapy.selector import XmlXPathSelector

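class MySpider(BaseSpider):
    name = 'myspider'
    # Feed to start from (the same example feed used in the shell session below)
    start_urls = ['http://blog.scrapy.org/rss.xml']

    def parse(self, response):
        # RSS keeps each URL in the text of a <link> element
        xxs = XmlXPathSelector(response)
        links = xxs.select("//link/text()").extract()
        return [Request(x, callback=self.parse_link) for x in links]

    def parse_link(self, response):
        # Handle each linked page here (the question calls this parse_article)
        pass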

You can also try the XPath in the shell, by running for example:

scrapy shell http://blog.scrapy.org/rss.xml

And then typing in the shell:

>>> xxs.select("//link/text()").extract()
[u'http://blog.scrapy.org',
 u'http://blog.scrapy.org/new-bugfix-release-0101',
 u'http://blog.scrapy.org/new-scrapy-blog-and-scrapy-010-release']
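To see what "//link/text()" selects without a Scrapy install, here is a minimal sketch using only Python's standard library. The sample feed and the helper name extract_link_texts are made up for illustration and are not part of Scrapy. It approximates the XPath rather than running it.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, not taken from the original post.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>http://example.com/</link>
    <item>
      <title>First post</title>
      <link>http://example.com/first-post</link>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second-post</link>
    </item>
  </channel>
</rss>"""


def extract_link_texts(xml_text):
    """Return the text of every <link> element, in document order."""
    root = ET.fromstring(xml_text)
    # root.iter("link") visits every <link> descendant, roughly like //link;
    # el.text is the element's text content, roughly like text().
    return [
        el.text.strip()
        for el in root.iter("link")
        if el.text and el.text.strip()
    ]


links = extract_link_texts(SAMPLE_RSS)
print(links)
```

As in the shell session above, the first result is the channel's own link, so you may want to restrict the selection to item links (for example "//item/link/text()") before issuing requests. Atom feeds are different: they carry the URL in the href attribute of link, so this text-based selection would not apply there.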
