Trying to get Scrapy to crawl multiple domains. I had it working briefly, but something changed and I have no idea what. My understanding is that a "CrawlSpider" with rules about which links to follow should keep following any allowed links until either the depth setting or the domain is exhausted.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BbcSpider(CrawlSpider):
    name = 'bbcnews'
    allowed_domains = ['www.bbc.com']
    start_urls = ['http://www.bbc.com/news']

    rules = (Rule(LinkExtractor(allow=()), callback='parse', follow=True),)

    def parse(self, response):
        print(response.url)
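(The "depth setting" I mean above is Scrapy's DEPTH_LIMIT, which as I understand it can be set per spider through custom_settings; a quick sketch with an arbitrary value just for illustration:)

class BbcSpider(CrawlSpider):
    name = 'bbcnews'
    # DEPTH_LIMIT of 0 (the default) means no limit; 3 is just an example value.
    custom_settings = {
        'DEPTH_LIMIT': 3,
    }
    # ... allowed_domains, start_urls, and rules as above ...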
And if I wanted to crawl multiple domains, I would change "allowed_domains" and "start_urls":
allowed_domains = [
    'www.bbc.com',
    'www.voanews.com',
]
start_urls = [
    'http://www.bbc.com/news',
    'https://www.voanews.com/p/6290.html',
]
However, in both cases, when I run "scrapy crawl bbcnews", the crawler only retrieves the start pages and then exits.
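For debugging, I assume turning up the log level should at least show whether links are being extracted and then dropped (e.g. filtered by the offsite middleware); a minimal sketch of forcing debug logging from inside the spider via custom_settings:

class BbcSpider(CrawlSpider):
    name = 'bbcnews'
    # Verbose logging so filtered or dropped requests show up in the crawl output.
    custom_settings = {
        'LOG_LEVEL': 'DEBUG',
    }
    # ... allowed_domains, start_urls, and rules as above ...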
edit:
OK, this code works as long as there is only one domain and one start URL. If I add more than one of either, the spider only crawls the start pages.
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BbcfarsiSpider(CrawlSpider):
    name = 'bbcfarsi'
    allowed_domains = ['bbc.com']
    start_urls = ['http://bbc.com/persian']

    rules = (Rule(LinkExtractor(allow=('persian'),), callback='parse', follow=True),)

    def parse(self, response):
        pass
edit #2: if I move the parse function outside the class, it works with no problem. The issue with that is that I cannot output anything that way. Having the parse function in the class (even if it is just filled with pass) results in only requesting the start pages and robots.txt.
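For reference, the Scrapy docs warn against using "parse" as the callback in CrawlSpider rules, since CrawlSpider uses its own parse method internally to drive the rule/link-following logic. A minimal sketch of the same spider with the callback renamed (parse_item is just an arbitrary name), in case that's the direction the fix needs to go:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class BbcfarsiSpider(CrawlSpider):
    name = 'bbcfarsi'
    allowed_domains = ['bbc.com']
    start_urls = ['http://bbc.com/persian']

    # The callback is deliberately NOT named 'parse', so it does not
    # shadow CrawlSpider's built-in parse method.
    rules = (Rule(LinkExtractor(allow=('persian',)), callback='parse_item', follow=True),)

    def parse_item(self, response):
        print(response.url)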