11 votes

I'm unable to crawl a whole website; Scrapy only crawls at the surface and I want to crawl deeper. I've been googling for the last 5-6 hours with no help. My code is below:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item
from scrapy.spider import BaseSpider
from scrapy import log

class ExampleSpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]
    rules = [Rule(SgmlLinkExtractor(allow=()), 
                  follow=True),
             Rule(SgmlLinkExtractor(allow=()), callback='parse_item')
    ]
    def parse_item(self,response):
        self.log('A response from %s just arrived!' % response.url)
Just tried your code against stackoverflow - my IP got banned. It definitely works! :) – alecxe
@Alexander - Sounds encouraging for me to debug more :) :) ... Sorry about the IP ban, mate! – Abhiram Sampath
Are you really trying to crawl example.com? You know that's not a real website. – Steven Almeroth
Which website are you trying to crawl? – Talvalin
"example.com" was used for representative purposes only. I'm trying to crawl landmarkshops.com – Abhiram Sampath

2 Answers

6 votes

Rules short-circuit: the first rule a link satisfies is the one that gets applied, so your second Rule (the one with the callback) will never be called.

Change your rules to this:

rules = [Rule(SgmlLinkExtractor(), callback='parse_item', follow=True)]
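For reference, a minimal sketch of the whole spider with that single combined rule, keeping the same imports and class names as in the question:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class ExampleSpider(CrawlSpider):
    name = "example.com"
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    # One rule: follow every extracted link AND run parse_item on each response
    rules = [Rule(SgmlLinkExtractor(allow=()), callback='parse_item', follow=True)]

    def parse_item(self, response):
        self.log('A response from %s just arrived!' % response.url)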
2 votes

When parsing the start_urls, deeper URLs can be extracted from the href attributes of the page's links, and further requests for those URLs can then be yielded from parse(). Here is a simple example; the most important part of the source code is shown below:

from scrapy.spiders import Spider
from scrapy.http import Request
from tutsplus.items import TutsplusItem
import re

class MySpider(Spider):
    name = "tutsplus"
    allowed_domains = ["code.tutsplus.com"]
    start_urls = ["http://code.tutsplus.com/"]

    def parse(self, response):
        # Collect every href on the page
        links = response.xpath('//a/@href').extract()

        # Links already queued from this response
        crawledLinks = []

        # Pattern for the links we care about:
        # only paginated tutorial listings like /tutorials?page=2
        linkPattern = re.compile(r"^/tutorials\?page=\d+")

        for link in links:
            # If it is a proper link and not yet queued, yield a new Request
            if linkPattern.match(link) and link not in crawledLinks:
                link = "http://code.tutsplus.com" + link
                crawledLinks.append(link)
                yield Request(link, callback=self.parse)

        # Extract the post titles on the current page and yield them as items
        titles = response.xpath('//a[contains(@class, "posts__post-title")]/h1/text()').extract()
        for title in titles:
            item = TutsplusItem()
            item["title"] = title
            yield item
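For completeness, the TutsplusItem imported above would live in the project's items.py; a minimal sketch, assuming the single title field implied by how the spider populates the item:

import scrapy

class TutsplusItem(scrapy.Item):
    # Only field the spider above fills in
    title = scrapy.Field()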