1 vote

Trying to get Scrapy to crawl multiple domains. I had it working briefly, but something changed and I have no idea what. My understanding is that a CrawlSpider with rules about which links to follow should follow any allowed links until either the depth setting or the domain is exhausted.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BbcSpider(CrawlSpider):
    name = 'bbcnews'
    allowed_domains = [
        'www.bbc.com']

    start_urls = [
        'http://www.bbc.com/news']

    rules = (Rule(LinkExtractor(allow=()), callback='parse', follow=True),)

    def parse(self, response):
        print(response.url)

And if I wanted to crawl multiple domains, I would change the "allowed_domains" and "start_urls":

    allowed_domains = [
        'www.bbc.com',
        'www.voanews.com']

    start_urls = [
        'http://www.bbc.com/news',
        'https://www.voanews.com/p/6290.html']

However, in both cases, when I run "scrapy crawl bbcnews", the crawler only retrieves the start sites and then exits.

Edit:

OK, this code works as long as there is only one domain and one start URL. If I add more than one of either, the spider only crawls the start pages.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BbcfarsiSpider(CrawlSpider):
    name = 'bbcfarsi'
    allowed_domains = ['bbc.com']
    start_urls = ['http://bbc.com/persian']
    rules = (Rule(LinkExtractor(allow=('persian'),), callback='parse', follow=True), )

def parse(self, response):
    pass

Edit #2: If I move the parse function outside the class, it works no problem. The issue with that is that I cannot output anything that way. Having the parse function in the class (even if it is just filled with pass) results in only requesting the start pages and robots.txt.

2 Answers

0 votes

I guess when you are using callback='parse', it is falling back to the built-in parse method. Could you try using callback='self.parse' instead? That might trigger your parse method and not the default one.

0 votes

Not really sure why, but if I changed the rule callback to callback='parse_link' and renamed the function to match, everything worked just fine. The code should look something like this:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class BbcSpider(CrawlSpider):
    name = 'bbcnews'
    allowed_domains = [
        'www.bbc.com',
        'www.voanews.com',]

    start_urls = [
        'http://www.bbc.com/news',
        'https://www.voanews.com/p/6290.html',]

    rules = (Rule(LinkExtractor(allow=()), callback='parse_link', follow=True),)

    def parse_link(self, response):
        yield {
            'url': response.url,
        }
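
As a quick sanity check, you can run the spider with Scrapy's feed export, e.g. scrapy crawl bbcnews -o urls.json (the output filename is just an example); every yielded url item is written to the file, which makes it easy to confirm that links from both domains are actually being followed.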

Edit: see Chetan mishra's comment below for the explanation. I clearly did not look at the documentation closely enough.
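
For anyone who hits the same thing later: the Scrapy documentation warns against using parse as the callback in CrawlSpider rules, because CrawlSpider implements its own parse method to apply the rules and schedule the follow-up requests. Naming your callback parse replaces that machinery, which is why only the start URLs (and robots.txt) were being requested. Below is a rough, simplified sketch of the idea, paraphrased from the Scrapy 1.x source (not a drop-in implementation; the real logic lives in scrapy/spiders/crawl.py):

from scrapy.spiders import Spider

# Simplified sketch of CrawlSpider's internals, paraphrased from Scrapy 1.x.
# The point: the rule-following entry point is itself named parse(), so a
# subclass that defines parse() replaces it and the rules never run.
class SketchedCrawlSpider(Spider):

    def parse(self, response):
        # In the real CrawlSpider this runs the rules' LinkExtractors over the
        # response, schedules requests for the extracted links, and dispatches
        # each response to the rule's callback.
        return self._parse_response(response, self.parse_start_url,
                                    cb_kwargs={}, follow=True)

That also explains why moving parse outside the class appeared to "work" in the question: the inherited rule logic was no longer being overridden, so links were followed again, but nothing was yielded.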