I put together a spider, and it was running as intended until I added the deny keyword to the rules.
This is my spider:
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
from bhg.items import BhgItem


class BhgSpider(CrawlSpider):
    name = 'bhg'
    start_urls = ['http://www.bhg.com/holidays/st-patricks-day/']

    rules = (Rule(LinkExtractor(allow=[r'/*'], ),
                  deny=('blogs/*', 'videos/*', ),
                  callback='parse_html'), )

    def parse_html(self, response):
        hxs = Selector(response)
        item = BhgItem()
        item['title'] = hxs.xpath('//title/text()').extract()
        item['h1'] = hxs.xpath('//h1/text()').extract()
        item['canonical'] = hxs.xpath('//link[@rel="canonical"]/@href').extract()
        item['meta_desc'] = hxs.xpath('//meta[@name="description"]/@content').extract()
        item['url'] = response.request.url
        item['status_code'] = response.status
        return item
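In case it matters, BhgItem lives in bhg/items.py and just declares the fields used above, roughly like this:

import scrapy


class BhgItem(scrapy.Item):
    title = scrapy.Field()
    h1 = scrapy.Field()
    canonical = scrapy.Field()
    meta_desc = scrapy.Field()
    url = scrapy.Field()
    status_code = scrapy.Field()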
When I run this code I get:
deny=('blogs/', 'videos/', ),), )
TypeError: __init__() got an unexpected keyword argument 'deny'
What am I doing wrong? I guess some function was not expecting the extra deny argument, but which one? parse_html()? I did not define any other spiders, and my spider has no __init__() of its own.
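Re-reading the LinkExtractor documentation, I suspect deny may actually be a LinkExtractor argument rather than a Rule argument. This is an untested sketch that only moves deny inside LinkExtractor, with the patterns and callback left exactly as I had them:

from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class BhgSpider(CrawlSpider):
    name = 'bhg'
    start_urls = ['http://www.bhg.com/holidays/st-patricks-day/']

    # deny moved inside LinkExtractor; everything else unchanged
    rules = (Rule(LinkExtractor(allow=[r'/*'],
                                deny=('blogs/*', 'videos/*', )),
                  callback='parse_html'), )

    # parse_html as above

Is that where deny belongs, or is the problem somewhere else?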