Hi, I want to crawl all the pages of a website using Scrapy's CrawlSpider class (documentation here).
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = 'abc.com'
    allowed_domains = ['abc.com']
    start_urls = ['http://www.abc.com']

    rules = (
        # trailing comma so rules is a tuple of Rule objects
        Rule(SgmlLinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        ...
(1) So this spider will start from the page www.abc.com defined in start_urls, parse it automatically, and then follow every link on www.abc.com that matches the rule, right? I am wondering whether there is a way to scrape only a certain number of layers, say only the first layer (links found directly on www.abc.com)?
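If it matters, what I have in mind is something like Scrapy's DEPTH_LIMIT setting in settings.py (just a sketch, assuming that setting is the right knob for this):

# settings.py -- sketch only, assuming DEPTH_LIMIT controls how deep the crawl goes.
# 0 means unlimited; 1 should mean: only follow links found directly on the start_urls pages.
DEPTH_LIMIT = 1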
(2) Since I have defined in allowed_domains that only abc.com URLs should be scraped, do I still need to redefine that in the rules, i.e. do something like this:
Rule(SgmlLinkExtractor(allow=(r'item\.php', ), allow_domains=('www.abc.com', )), callback='parse_item')
(3) If I am using CrawlSpider, what will happen if I don't define any rules in the spider class? Will it follow all the pages, or will it not follow any at all because no rule has been 'met'?
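In case rules are required for any link following at all, I imagine I would need an explicit catch-all rule like this sketch (follow=True is my assumption for continuing past the matched pages):

rules = (
    # hypothetical catch-all rule: no allow pattern, so every link on the allowed domain is extracted;
    # follow=True keeps crawling from the pages passed to the callback
    Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
)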