I am trying to crawl all pages of a domain except those starting with /go.php, but I am at a loss as to how to get Scrapy to understand that. I have tried this rule (the only rule defined in my CrawlSpider), but it still crawls URLs like domain.tld/go.php?key=value:
rules = [
    Rule(SgmlLinkExtractor(
        allow=('.*'),
        deny=(
            '\\/go\\.php(.*)',
            'go.php',
            'go\.php',
            'go\\.php',
            'go.php(.*)',
            'go\.php(.*)',
            'go\\.php(.*)',
        )
    ))
]
The rule seems to get applied, because I get an exception when starting the spider with an obviously invalid regex (such as one with unbalanced parentheses).
Update
I am afraid I found the solution to my problem elsewhere. After rereading the documentation I noticed this warning: "When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work." - which, unfortunately, was exactly what I had done. Renaming my parse method to something else made Scrapy respect the rules. Sorry for that, and thanks for all your answers, which pointed me in the right direction.
Maybe this helps someone else: the right regular expression turned out to be go\.php, without a slash in front.
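For anyone debugging their own deny patterns: a quick way to see why go\.php is enough is to test the pattern with Python's re module directly, since (as far as I can tell) Scrapy's link extractors apply allow/deny patterns with an unanchored search against the full URL. A minimal sketch; the URLs here are made up for illustration:

```python
import re

# The deny pattern the question settled on. No leading slash is needed,
# because the pattern is searched anywhere in the URL, not anchored to
# the start of the path.
deny_pattern = re.compile(r'go\.php')

urls = [
    'http://domain.tld/go.php?key=value',  # should be denied
    'http://domain.tld/about.html',        # should be crawled
]

for url in urls:
    denied = bool(deny_pattern.search(url))
    print(url, '-> denied' if denied else '-> allowed')
```

Of course, none of this matters if the rules are never applied in the first place, which is what overriding parse causes.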