3
votes

Hi, I want to crawl all the pages of a website using the Scrapy CrawlSpider class (documentation here).

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor


class MySpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com']

    rules = (
        # trailing comma so that rules is a tuple
        Rule(SgmlLinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        ...

(1) So, this spider will start from the page www.abc.com defined in start_urls, parse it automatically, and then follow every link on www.abc.com that matches the rule, right? I am wondering whether there is a way to scrape only a certain number of layers, say only the first layer (links found directly on www.abc.com)?

(2) Since I have defined in allowed_domains that only abc.com URLs should be scraped, do I still need to repeat that in the rules, like this:

Rule(SgmlLinkExtractor(allow=('item\.php', )), allow_domains="www.abc.com", callback='parse_item')

(3) If I am using CrawlSpider, what happens if I don't define rules in the spider class? Will it follow all the pages, or will it not follow any at all because no rule has been 'met'?


1 Answer

3
votes
  1. Set the DEPTH_LIMIT setting (see the sketch after this list):

    DEPTH_LIMIT

    Default: 0

    The maximum depth that will be allowed to crawl for any site. If zero, no limit will be imposed.

  2. No, you don't need to add an additional URL check. If you don't specify allow_domains on the Rule level, it will extract only URLs from the abc.com domain.

  3. If you don't define rules, it won't extract any URLs (it will work like a BaseSpider).
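
For point 1, a minimal sketch (assuming the spider is named 'abc.com' as in your code and lives in a standard Scrapy project): you can put DEPTH_LIMIT in your project's settings.py:

# settings.py
# Start URLs are at depth 0, so a limit of 1 follows only the links
# found directly on www.abc.com
DEPTH_LIMIT = 1

or override it just for one run with the -s option:

scrapy crawl abc.com -s DEPTH_LIMIT=1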

Hope that helps.