
I am trying to crawl all pages of a domain except those starting with /go.php, but I am at a loss as to how to make Scrapy understand that. I have tried the following rule (the only rule defined in my CrawlSpider), but it still crawls URLs like domain.tld/go.php?key=value:

rules = [
    Rule(SgmlLinkExtractor(allow=(
        '.*'
    ), deny=(
        '\\/go\\.php(.*)',
        'go.php',
        'go\.php',
        'go\\.php',
        'go.php(.*)',
        'go\.php(.*)',
        'go\\.php(.*)'
    )))
]

The rule does seem to be applied, because starting the spider with an obviously invalid regex (such as one with unbalanced parentheses) raises an exception.


Update

I am afraid I found the solution to my problem elsewhere. After rereading the documentation, I noticed this warning: "When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work." - which, unfortunately, was exactly what I had done. Renaming the parse method to something else made Scrapy respect the rules. Sorry for that, and thanks for all your answers, which pointed me in the right direction.
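The failure mode is easy to reproduce outside Scrapy. The sketch below is a toy analogy, not Scrapy's actual code (BaseCrawler, BrokenSpider, and FixedSpider are made-up names): a base class drives its rule logic through a method named parse, so a subclass that overrides parse silently disables that logic, while using a differently-named callback keeps it intact:

```python
class BaseCrawler:
    """Toy stand-in for CrawlSpider: its internal logic lives in parse()."""

    def crawl(self, url):
        # The framework always enters through parse(); rule handling
        # happens inside it before any user callback runs.
        return self.parse(url)

    def parse(self, url):
        if 'go.php' in url:          # stand-in for the deny rules
            return 'skipped'
        return self.handle(url)      # dispatch to the user callback

    def handle(self, url):
        return 'crawled'


class BrokenSpider(BaseCrawler):
    # Overriding parse() replaces the rule handling entirely, so the
    # deny rules are never consulted -- the bug from the question.
    def parse(self, url):
        return 'crawled'


class FixedSpider(BaseCrawler):
    # Overriding a differently-named callback leaves the rules intact.
    def handle(self, url):
        return 'crawled'


print(BrokenSpider().crawl('http://domain.tld/go.php?key=value'))  # crawled (rules bypassed)
print(FixedSpider().crawl('http://domain.tld/go.php?key=value'))   # skipped (rules applied)
```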

Maybe this helps someone else: the correct regular expression turned out to be go\.php, without a leading slash.

1 Answer


Are you sure the actual href value is that one? It looks like it might be JavaScript-generated.

You can run scrapy shell "http://website/page?foo&bar" to inspect the page and experiment with the allow/deny parameters. You can also test the link extractor against arbitrary HTML to see how it behaves:

In [1]: html = """
  ...: <a href="http://domain.tld/go.php?key=value">go</a>
  ...: <a href="/go.php?key=value2">go2</a>
  ...: <a href="/index.html">index</a>
  ...: """

In [2]: from scrapy.http import HtmlResponse

In [3]: response = HtmlResponse('http://example.com/', body=html)

In [4]: from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

In [5]: lx = SgmlLinkExtractor()

In [6]: lx.extract_links(response)
Out[6]: 
[Link(url='http://domain.tld/go.php?key=value', text=u'go', fragment='', nofollow=False),
Link(url='http://example.com/go.php?key=value2', text=u'go2', fragment='', nofollow=False),
Link(url='http://example.com/index.html', text=u'index', fragment='', nofollow=False)]

In [8]: SgmlLinkExtractor(allow='go\.php').extract_links(response)
Out[8]: 
[Link(url='http://domain.tld/go.php?key=value', text=u'go', fragment='', nofollow=False),
Link(url='http://example.com/go.php?key=value2', text=u'go2', fragment='', nofollow=False)]

In [9]: SgmlLinkExtractor(deny='go\.php').extract_links(response)
Out[9]: [Link(url='http://example.com/index.html', text=u'index', fragment='', nofollow=False)]

In [10]: SgmlLinkExtractor(allow=('key=', 'index'), deny=('value2', )).extract_links(response)
Out[10]: 
[Link(url='http://domain.tld/go.php?key=value', text=u'go', fragment='', nofollow=False),
Link(url='http://example.com/index.html', text=u'index', fragment='', nofollow=False)]

In [11]: SgmlLinkExtractor(allow='domain\.tld').extract_links(response)
Out[11]: [Link(url='http://domain.tld/go.php?key=value', text=u'go', fragment='', nofollow=False)]

In [12]: SgmlLinkExtractor(allow='example.com').extract_links(response)
Out[12]: 
[Link(url='http://example.com/go.php?key=value2', text=u'go2', fragment='', nofollow=False),
Link(url='http://example.com/index.html', text=u'index', fragment='', nofollow=False)]