2 votes

I have set up a CrawlSpider that aggregates all outbound links (crawling from the start_urls only to a certain depth, e.g. via DEPTH_LIMIT = 2).
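The depth limit is just a project-wide Scrapy setting, for example:

# settings.py of the Scrapy project
DEPTH_LIMIT = 2

The spider itself: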

from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
# LinkNetworkItem is the project's item class, defined in the project's items.py


class LinkNetworkSpider(CrawlSpider):

    name = "network"
    allowed_domains = ["exampleA.com"]

    start_urls = ["http://www.exampleA.com"]

    rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

    def parse_start_url(self, response):
        return self.parse_item(response)

    def parse_item(self, response):

        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a/@href').extract()

        outgoing_links = []

        for link in links:
            if ("http://" in link):  # only consider absolute http links
                base_url = urlparse(link).hostname
                base_url = base_url.split(':')[0]  # drop ports
                base_url = '.'.join(base_url.split('.')[-2:])  # remove subdomains
                # count the allowed domains that do NOT match this link's domain
                url_hit = sum(1 for i in self.allowed_domains if base_url not in i)
                if url_hit != 0:  # the link points outside the allowed domain(s)
                    outgoing_links.append(link)

        if outgoing_links:
            item = LinkNetworkItem()
            item['internal_site'] = response.url
            item['out_links'] = outgoing_links
            return [item]
        else:
            return None

I want to extend this to multiple domains (exampleA.com, exampleB.com, exampleC.com, ...). At first I thought I could just add the whole list to start_urls as well as allowed_domains (sketched after the list below), but I think this causes the following problems:

  • Will the DEPTH_LIMIT setting be applied per start URL / allowed domain?
  • More importantly: if the sites link to each other, will the spider jump from exampleA.com to exampleB.com because both are in allowed_domains? I need to avoid this criss-crossing, because I later want to count the outbound links of each site to learn about the relationships between the websites.
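For clarity, the naive extension I had in mind (and which raises the concerns above) would simply be:

allowed_domains = ["exampleA.com", "exampleB.com", "exampleC.com"]
start_urls = ["http://www.exampleA.com",
              "http://www.exampleB.com",
              "http://www.exampleC.com"]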

So how can I scale this to more websites without running into the criss-crossing problem, while keeping the settings applied per website?

Additional image showing what I would like to realize: [diagram]


2 Answers

3 votes

I have now achieved it without rules. I attach a meta attribute to every start URL and then check myself whether the extracted links belong to the original domain, sending out new requests accordingly.

Therefore, override start_requests:

def start_requests(self):
    # start_domains is a list parallel to start_urls, holding each start URL's domain
    return [Request(url, meta={'domain': domain}, callback=self.parse_item)
            for url, domain in zip(self.start_urls, self.start_domains)]

In the subsequent parsing methods we grab the meta attribute via domain = response.request.meta['domain'], compare the domain of each extracted link against it, and send out new requests ourselves.
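For illustration, a minimal sketch of such a parsing method could look like this (the item and its fields are taken from the question; the exact link filtering, e.g. simply skipping relative links, is just one possible choice):

from urlparse import urlparse

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector

def parse_item(self, response):
    # a method on the spider class; LinkNetworkItem comes from the project's items.py
    domain = response.request.meta['domain']  # start domain of this crawl branch

    hxs = HtmlXPathSelector(response)
    outgoing_links = []

    for link in hxs.select('//a/@href').extract():
        host = urlparse(link).hostname
        if not host:
            continue  # relative or malformed link, ignored in this sketch
        base = '.'.join(host.split('.')[-2:])  # reduce to the second-level domain

        if base == domain:
            # same site: follow the link and propagate the domain marker
            yield Request(link, meta={'domain': domain}, callback=self.parse_item)
        else:
            outgoing_links.append(link)

    if outgoing_links:
        item = LinkNetworkItem()
        item['internal_site'] = response.url
        item['out_links'] = outgoing_links
        yield item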

1 vote

You would probably need to keep a data structure (e.g. a hash map) of URLs that the crawler has already visited. Then it's just a matter of adding URLs to the map as you visit them, and skipping URLs that are already in it (since that means you have already visited them). There are probably more sophisticated ways of doing this that would give you better performance, but they would also be harder to implement.
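A minimal sketch of that idea with a plain Python set could look like this (the spider here is hypothetical; note that Scrapy's scheduler already filters duplicate requests by default, so manual bookkeeping is mostly useful if you need the visited set for your own accounting):

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider


class VisitedTrackingSpider(BaseSpider):
    name = "visited_tracking"
    start_urls = ["http://www.exampleA.com"]

    def __init__(self, *args, **kwargs):
        super(VisitedTrackingSpider, self).__init__(*args, **kwargs)
        self.visited = set()  # URLs already scheduled or visited

    def parse(self, response):
        self.visited.add(response.url)
        hxs = HtmlXPathSelector(response)
        for link in hxs.select('//a/@href').extract():
            if link.startswith('http') and link not in self.visited:
                self.visited.add(link)  # mark before requesting to avoid re-scheduling
                yield Request(link, callback=self.parse)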