2 votes

I have set up a CrawlSpider that aggregates all outbound links (crawling from the start_urls only to a certain depth, e.g. via DEPTH_LIMIT = 2).
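The depth limit is just a project-wide Scrapy setting, for example:

# settings.py of the Scrapy project
DEPTH_LIMIT = 2

The spider itself: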

from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
# LinkNetworkItem is the project's item class, defined in the project's items.py


class LinkNetworkSpider(CrawlSpider):

    name = "network"
    allowed_domains = ["exampleA.com"]

    start_urls = ["http://www.exampleA.com"]

    rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):

        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a/@href').extract()

        outgoing_links = []

        for link in links:
            if ("http://" in link):  # only consider absolute http links
                base_url = urlparse(link).hostname
                base_url = base_url.split(':')[0]  # drop ports
                base_url = '.'.join(base_url.split('.')[-2:])  # remove subdomains
                # count the allowed domains that do NOT match this link's domain
                url_hit = sum(1 for i in self.allowed_domains if base_url not in i)
                if url_hit != 0:  # the link points outside the allowed domain(s)
                    outgoing_links.append(link)

        if outgoing_links:
            item = LinkNetworkItem()
            item['internal_site'] = response.url
            item['out_links'] = outgoing_links
            return [item]
        else:
            return None

I want to extend this to multiple domains (exampleA.com, exampleB.com, exampleC.com, ...). At first I thought I could just add the whole list to start_urls as well as allowed_domains (sketched after the list below), but I think this causes the following problems:

  • Will the DEPTH_LIMIT setting be applied per start URL / allowed domain?
  • More importantly: if the sites link to each other, will the spider jump from exampleA.com to exampleB.com because both are in allowed_domains? I need to avoid this criss-crossing, because I later want to count the outbound links of each site to learn about the relationships between the websites.
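For clarity, the naive extension I had in mind (and which raises the concerns above) would simply be:

allowed_domains = ["exampleA.com", "exampleB.com", "exampleC.com"]
start_urls = ["http://www.exampleA.com",
              "http://www.exampleB.com",
              "http://www.exampleC.com"]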

So how can I scale this to more websites without running into the criss-crossing problem, while keeping the settings applied per website?

Additional image showing what I would like to realize: [diagram]


2 Answers

3 votes

I have now achieved it without rules. I attach a meta attribute to every start URL and then check myself whether the extracted links belong to the original domain, sending out new requests accordingly.

Therefore, override start_requests:

def start_requests(self):
    # start_domains is a list parallel to start_urls, holding each start URL's domain
    return [Request(url, meta={'domain': domain}, callback=self.parse_item)
            for url, domain in zip(self.start_urls, self.start_domains)]

In the subsequent parsing methods we grab the meta attribute via domain = response.request.meta['domain'], compare the domain of each extracted link against it, and send out new requests ourselves.
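For illustration, a minimal sketch of such a parsing method could look like this (the item and its fields are taken from the question; the exact link filtering, e.g. simply skipping relative links, is just one possible choice):

from urlparse import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse_item(self, response):
    # a method on the spider class; LinkNetworkItem comes from the project's items.py
    domain = response.request.meta['domain']  # start domain of this crawl branch

    hxs = HtmlXPathSelector(response)
    outgoing_links = []

    for link in hxs.select('//a/@href').extract():
        host = urlparse(link).hostname
        if not host:
            continue  # relative or malformed link, ignored in this sketch
        base = '.'.join(host.split('.')[-2:])  # reduce to the second-level domain

        if base == domain:
            # same site: follow the link and propagate the domain marker
            yield Request(link, meta={'domain': domain}, callback=self.parse_item)
        else:
            outgoing_links.append(link)

    if outgoing_links:
        item = LinkNetworkItem()
        item['internal_site'] = response.url
        item['out_links'] = outgoing_links
        yield item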

1 vote

You would probably need to keep a data structure (e.g. a hash map) of URLs that the crawler has already visited. Then it's just a matter of adding URLs to the map as you visit them, and skipping URLs that are already in it (since that means you have already visited them). There are probably more sophisticated ways of doing this that would give you better performance, but they would also be harder to implement.
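A minimal sketch of that idea with a plain Python set could look like this (the spider here is hypothetical; note that Scrapy's scheduler already filters duplicate requests by default, so manual bookkeeping is mostly useful if you need the visited set for your own accounting):

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class VisitedTrackingSpider(BaseSpider):
    name = "visited_tracking"
    start_urls = ["http://www.exampleA.com"]

    def __init__(self, *args, **kwargs):
        super(VisitedTrackingSpider, self).__init__(*args, **kwargs)
        self.visited = set()  # URLs already scheduled or visited

    def parse(self, response):
        self.visited.add(response.url)
        hxs = HtmlXPathSelector(response)
        for link in hxs.select('//a/@href').extract():
            if link.startswith('http') and link not in self.visited:
                self.visited.add(link)  # mark before requesting to avoid re-scheduling
                yield Request(link, callback=self.parse)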