1 vote

Trying to get Scrapy to crawl multiple domains. I had it working briefly, but something changed and I have no idea what. My understanding is that a CrawlSpider with rules about which links to follow should follow any allowed links until either the depth setting or the domain is exhausted.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BbcSpider(CrawlSpider):
    name = 'bbcnews'
    allowed_domains = [
        'www.bbc.com']

    start_urls = [
        'http://www.bbc.com/news']

    rules = (Rule(LinkExtractor(allow=()), callback='parse', follow=True),)

    def parse(self, response):
        print(response.url)

And if I wanted to crawl multiple domains, I would change the "allowed_domains" and "start_urls":

    allowed_domains = [
        'www.bbc.com',
        'www.voanews.com']

    start_urls = [
        'http://www.bbc.com/news',
        'https://www.voanews.com/p/6290.html']

However, in both cases, when I run "scrapy crawl bbcnews", the crawler only retrieves the start sites and then exits.

Edit:

OK, this code works as long as there is only one domain and one start URL. If I add more than one of either, the spider only crawls the start pages.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BbcfarsiSpider(CrawlSpider):
    name = 'bbcfarsi'
    allowed_domains = ['bbc.com']
    start_urls = ['http://bbc.com/persian']
    rules = (Rule(LinkExtractor(allow=('persian'),), callback='parse', follow=True), )

def parse(self, response):
    pass

Edit #2: If I move the parse function outside the class, it works no problem. The issue with that is that I cannot output anything that way. Having the parse function in the class (even if it is just filled with pass) results in only requesting the start pages and robots.txt.

2 Answers

0 votes

I guess when you are using callback='parse', it is falling back to the built-in parse method. Could you try using callback='self.parse' instead? That might trigger your parse method and not the default one.

0 votes

Not really sure why, but if I changed the rule callback to callback='parse_link' and renamed the function to match, everything worked just fine. The code should look something like this:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BbcSpider(CrawlSpider):
    name = 'bbcnews'
    allowed_domains = [
        'www.bbc.com',
        'www.voanews.com',]

    start_urls = [
        'http://www.bbc.com/news',
        'https://www.voanews.com/p/6290.html',]

    rules = (Rule(LinkExtractor(allow=()), callback='parse_link', follow=True),)

    def parse_link(self, response):
        yield {
            'url': response.url,
        }
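
As a quick sanity check, you can run the spider with Scrapy's feed export, e.g. scrapy crawl bbcnews -o urls.json (the output filename is just an example); every yielded url item is written to the file, which makes it easy to confirm that links from both domains are actually being followed.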

Edit: see Chetan mishra's comment below for the explanation. I clearly did not look at the documentation closely enough.
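
For anyone who hits the same thing later: the Scrapy documentation warns against using parse as the callback in CrawlSpider rules, because CrawlSpider implements its own parse method to apply the rules and schedule the follow-up requests. Naming your callback parse replaces that machinery, which is why only the start URLs (and robots.txt) were being requested. Below is a rough, simplified sketch of the idea, paraphrased from the Scrapy 1.x source (not a drop-in implementation; the real logic lives in scrapy/spiders/crawl.py):

from scrapy.spiders import Spider

# Simplified sketch of CrawlSpider's internals, paraphrased from Scrapy 1.x.
# The point: the rule-following entry point is itself named parse(), so a
# subclass that defines parse() replaces it and the rules never run.
class SketchedCrawlSpider(Spider):

    def parse(self, response):
        # In the real CrawlSpider this runs the rules' LinkExtractors over the
        # response, schedules requests for the extracted links, and dispatches
        # each response to the rule's callback.
        return self._parse_response(response, self.parse_start_url,
                                    cb_kwargs={}, follow=True)

That also explains why moving parse outside the class appeared to "work" in the question: the inherited rule logic was no longer being overridden, so links were followed again, but nothing was yielded.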