1 vote

I have tried 3 different variations of LinkExtractor, but it's still ignoring the 'deny' rule and crawling sub-domains in all 3 variations. I want to EXCLUDE the sub-domains from the crawl.

Tried with the 'allow' rule only, to allow only the main domain, i.e. example.edu.uk:

rules = [Rule(LinkExtractor(allow=(r'^example\.edu.uk(\/.*)?$',)))]  # Not working

Tried with the 'deny' rule only, to deny all sub-domains, i.e. sub.example.edu.uk:

rules = [Rule(LinkExtractor(deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # Not working

Tried with both 'allow' and 'deny' rules:

rules = [Rule(LinkExtractor(allow=(r'^http:\/\/example\.edu\.uk(\/.*)?$'),deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # Not working

Example:

Follow these links

  • example.edu.uk/fsdfs.htm
  • example.edu.uk/nkln.htm
  • example.edu.uk/vefr.htm
  • example.edu.uk/opji.htm

Discard sub-domain links

  • sub-domain.example.edu.uk/fsdfs.htm
  • sub-domain.example.edu.uk/nkln.htm
  • sub-domain.example.edu.uk/vefr.htm
  • sub-domain.example.edu.uk/opji.htm

Here is the complete code ...

from bs4 import BeautifulSoup
from scrapy import Request
from scrapy.item import Item, Field
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider, Rule

class NewsFields(Item):
    pagetype = Field()
    pagetitle = Field()
    pageurl = Field()
    pagedate = Field()
    pagedescription = Field()
    bodytext = Field()

class MySpider(CrawlSpider):
    name = 'profiles'
    start_urls = ['http://www.example.edu.uk/listing']
    allowed_domains = ['example.edu.uk']
    rules = (Rule(LinkExtractor(allow=(r'^https?://example.edu.uk/.*', ))), )

    def parse(self, response):
        hxs = Selector(response)
        soup = BeautifulSoup(response.body, 'lxml')
        nf = NewsFields()
        # page metadata is stored in <meta name="nkd..."> tags
        ptype = soup.find_all(attrs={"name": "nkdpagetype"})
        ptitle = soup.find_all(attrs={"name": "nkdpagetitle"})
        pturl = soup.find_all(attrs={"name": "nkdpageurl"})
        ptdate = soup.find_all(attrs={"name": "nkdpagedate"})
        ptdesc = soup.find_all(attrs={"name": "nkdpagedescription"})
        for node in soup.find_all("div", id="main-content__wrapper"):
            ptbody = ''.join(node.find_all(text=True))
            ptbody = ' '.join(ptbody.split())
            nf['pagetype'] = ptype[0]['content'].encode('ascii', 'ignore')
            nf['pagetitle'] = ptitle[0]['content'].encode('ascii', 'ignore')
            nf['pageurl'] = pturl[0]['content'].encode('ascii', 'ignore')
            nf['pagedate'] = ptdate[0]['content'].encode('ascii', 'ignore')
            nf['pagedescription'] = ptdesc[0]['content'].encode('ascii', 'ignore')
            nf['bodytext'] = ptbody.encode('ascii', 'ignore')
            yield nf
        for url in hxs.xpath('//p/a/@href').extract():
            yield Request(response.urljoin(url), callback=self.parse)
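
A quick way to check which links the extractor actually picks up from the listing page is the scrapy shell (a sketch, reusing the same allow pattern as in the rules above):

# Run:  scrapy shell http://www.example.edu.uk/listing
# then, inside the shell ('response' is already populated with the fetched page):
from scrapy.linkextractors import LinkExtractor

le = LinkExtractor(allow=(r'^https?://example.edu.uk/.*', ))
for link in le.extract_links(response):
    print(link.url)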

Can someone please help? Thanks

Please also post some sample links that you want processed and some that you don't want processed. – Tarun Lalwani
Also, when you say it's not working, please describe what is actually happening; post logs if possible. – Tarun Lalwani
Hi @TarunLalwani, what is it that you don't understand in my question? All links on the main domain must be crawled and all links under sub-domains must be discarded. Anyway, I have updated the question; see above. – Slyper

1 Answer

2 votes

Your first two rules are wrong:

rules = [Rule(LinkExtractor(allow=(r'^example\.edu.uk(\/.*)?$',)))]  # Not working
rules = [Rule(LinkExtractor(deny=(r'(?<=\.)[a-z0-9-]*\.edu\.uk',)))]  # Not working

The allow and deny patterns are matched against absolute URLs, not domains. The rule below should work for you:

rules = (Rule(LinkExtractor(allow=(r'^https?://example.edu.uk/.*', ))), )
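
To see the difference concretely, you can test both patterns against one of your sample links with plain re (a minimal sketch; the http:// scheme is assumed):

import re

url = 'http://example.edu.uk/fsdfs.htm'   # one of the links that should be followed

# Anchored at the bare domain: never matches an absolute URL
print(bool(re.search(r'^example\.edu.uk(\/.*)?$', url)))      # False

# Anchored at the full scheme + host: matches
print(bool(re.search(r'^https?://example.edu.uk/.*', url)))   # True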

Edit-1

First, you should change

allowed_domains = ['example.edu.uk']

to

allowed_domains = ['www.example.edu.uk']

Second, your rule for extracting URLs should be:

rules = (Rule(LinkExtractor(allow=(r'^https?://www.example.edu.uk/.*', ))), )
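
Put together, the relevant parts of the spider would look roughly like this (a sketch; the NewsFields item and the body of parse from the question are unchanged and omitted here):

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MySpider(CrawlSpider):
    name = 'profiles'
    start_urls = ['http://www.example.edu.uk/listing']

    # Limiting this to the www host drops sub-domain requests;
    # 'example.edu.uk' would also allow sub-domain.example.edu.uk
    allowed_domains = ['www.example.edu.uk']

    # allow is matched against the absolute URL, so anchor on scheme + host
    rules = (Rule(LinkExtractor(allow=(r'^https?://www.example.edu.uk/.*', ))), )

    # ... parse() and the item-building code stay as in the question ...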

Third, in the code below,

for url in hxs.xpath('//p/a/@href').extract():
    yield Request(response.urljoin(url), callback=self.parse)

the rules will not be applied. Requests you yield manually are not subject to the rules: a Rule inserts new requests automatically, but it will not stop you from yielding other links that the rule config does not allow. Setting allowed_domains, however, applies to both the rules and your manual yields.
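
If you want to keep that loop and filter it explicitly as well, here is a sketch of the tail of parse() with a host check (the www.example.edu.uk host comes from the allowed_domains change above; relying on allowed_domains alone would also drop the off-site requests):

from urlparse import urlparse   # Python 2, as in the question; on Python 3: from urllib.parse import urlparse
from scrapy import Request

def parse(self, response):
    # ... item extraction as in the question ...
    for url in response.xpath('//p/a/@href').extract():
        absolute_url = response.urljoin(url)
        # Manually yielded requests bypass the CrawlSpider rules,
        # so check the host before following the link
        if urlparse(absolute_url).netloc == 'www.example.edu.uk':
            yield Request(absolute_url, callback=self.parse)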