I have set up a CrawlSpider that aggregates all outbound links (crawling from start_urls only to a certain depth, e.g. via DEPTH_LIMIT = 2).
from urlparse import urlparse

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from myproject.items import LinkNetworkItem  # adjust to your project's items module

class LinkNetworkSpider(CrawlSpider):
    name = "network"
    allowed_domains = ["exampleA.com"]
    start_urls = ["http://www.exampleA.com"]
    rules = (Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True),)

    def parse_start_url(self, response):
        # make sure the start page itself is parsed as well
        return self.parse_item(response)

    def parse_item(self, response):
        hxs = HtmlXPathSelector(response)
        links = hxs.select('//a/@href').extract()
        outgoing_links = []
        for link in links:
            if link.startswith('http://'):  # absolute links only
                base_url = urlparse(link).hostname  # lowercased, port already stripped
                base_url = '.'.join(base_url.split('.')[-2:])  # remove subdomains
                # a link is outgoing if its base domain is not one of the
                # allowed (internal) domains; compare case-insensitively
                if base_url not in [d.lower() for d in self.allowed_domains]:
                    outgoing_links.append(link)
        if outgoing_links:
            item = LinkNetworkItem()
            item['internal_site'] = response.url
            item['out_links'] = outgoing_links
            return [item]
        return None
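To make the domain normalization concrete, this is what the steps do to a sample link (illustrative hostname; note that urlparse's hostname attribute is already lowercased and port-free):

    >>> from urlparse import urlparse
    >>> urlparse('http://sub.exampleB.com:8080/page').hostname
    'sub.exampleb.com'
    >>> '.'.join('sub.exampleb.com'.split('.')[-2:])
    'exampleb.com'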
I want to extend this to multiple domains (exampleA.com, exampleB.com, exampleC.com, ...). At first I thought I could just add my list of sites to start_urls as well as allowed_domains.
- Will the DEPTH_LIMIT setting be applied per entry in start_urls/allowed_domains, or only once for the whole crawl?
- More importantly: if the sites link to each other, will the spider jump from exampleA.com to exampleB.com because both are in allowed_domains? I need to avoid this criss-crossing, as I later want to count the outbound links for each site to gain information about the relationships between the websites!
So how can I scale this to more sites without running into the criss-crossing problem, while still applying the settings per website?
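What I picture is one isolated crawl per site, each with a single-element allowed_domains, so the spider can never wander from one site to another and DEPTH_LIMIT clearly applies per site. A rough sketch of how I imagine launching that (untested; it assumes a Scrapy version that provides CrawlerProcess and a spider modified to accept the site as a constructor argument):

from scrapy.crawler import CrawlerProcess

from myproject.spiders.network import LinkNetworkSpider  # hypothetical module path

SITES = ["exampleA.com", "exampleB.com", "exampleC.com"]

process = CrawlerProcess(settings={"DEPTH_LIMIT": 2})
for site in SITES:
    # assumes LinkNetworkSpider.__init__ takes `site` and sets
    # self.allowed_domains = [site] and
    # self.start_urls = ["http://www.%s" % site]
    process.crawl(LinkNetworkSpider, site=site)
process.start()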
Here is an additional image showing what I would like to realize: