I am trying to pass a user-defined argument to a Scrapy spider. Can anyone suggest how to do that?
I read about a parameter -a somewhere but have no idea how to use it.
Spider arguments are passed in the crawl command using the -a option. For example:
scrapy crawl myspider -a category=electronics -a domain=system
Spiders can access arguments as attributes:
import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category='', **kwargs):
        self.start_urls = [f'http://www.example.com/{category}']  # py36
        super().__init__(**kwargs)  # python3

    def parse(self, response):
        self.log(self.domain)  # system
Taken from the Scrapy doc: http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments
Update 2013: Add second argument
Update 2015: Adjust wording
Update 2016: Use newer base class and add super, thanks @Birla
Update 2017: Use Python3 super
# previously
super(MySpider, self).__init__(**kwargs) # python2
Update 2018: As @eLRuLL points out, spiders can access arguments as attributes
Previous answers were correct, but you don't have to declare the constructor (__init__) every time you want to code a Scrapy spider; you can just specify the parameters as before:
scrapy crawl myspider -a parameter1=value1 -a parameter2=value2
and in your spider code you can just use them as spider arguments:
from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    ...

    def parse(self, response):
        ...
        if self.parameter1 == value1:
            # this is True
            ...

        # or also
        if getattr(self, 'parameter2') == value2:
            # this is also True
            ...
And it just works.
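For a complete, runnable version of that idea, a minimal sketch could look like this (the spider name, site, and default category are placeholders, not part of the original answers):
import scrapy

class ProductsSpider(scrapy.Spider):
    # hypothetical spider; run with: scrapy crawl products -a category=electronics
    name = 'products'

    def start_requests(self):
        # -a arguments arrive as string attributes; getattr supplies a default
        # when the argument was not passed on the command line
        category = getattr(self, 'category', 'all')
        yield scrapy.Request(f'http://www.example.com/{category}')

    def parse(self, response):
        self.log(f'Crawled {response.url}')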
To pass arguments with the crawl command:
scrapy crawl myspider -a category='mycategory' -a domain='example.com'
To pass arguments when running on scrapyd, replace -a with -d:
curl http://your.ip.address.here:port/schedule.json -d spider=myspider -d category='mycategory' -d domain='example.com'
The spider will receive arguments in its constructor.
from scrapy import Spider

class MySpider(Spider):
    name = "myspider"

    def __init__(self, category='', domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.category = category
        self.domain = domain
Scrapy sets all the arguments as spider attributes, so you can skip the init method completely. Beware: use the getattr method to read those attributes so your code does not break when an argument is missing.
class MySpider(Spider):
    name = "myspider"
    start_urls = ('https://httpbin.org/ip',)

    def parse(self, response):
        print(getattr(self, 'category', ''))
        print(getattr(self, 'domain', ''))
Spider arguments are passed while running the crawl command using the -a option. For example, if I want to pass a domain name as an argument to my spider, I do this:
scrapy crawl myspider -a domain="http://www.example.com"
And receive the arguments in the spider's constructor:
class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [domain]
        # ...
it will work :)
Alternatively, we can use ScrapyD, which exposes an API through which we can pass the start_url and the spider name. ScrapyD also has APIs to stop/start/check the status of/list the spiders; see the endpoint examples at the end of this answer.
pip install scrapyd scrapyd-client
scrapyd
scrapyd-deploy local -p default
scrapyd-deploy will deploy the spider in the form of an egg into the daemon, and it even maintains the version of the spider. While starting the spider, you can mention which version of the spider to use.
from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'testspider'

    def __init__(self, start_urls, *args, **kwargs):
        self.start_urls = start_urls.split('|')
        super().__init__(*args, **kwargs)
curl http://localhost:6800/schedule.json -d project=default -d spider=testspider -d start_urls="https://www.anyurl...|https://www.anyurl2"
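If several versions of the project have been deployed, schedule.json also accepts a _version parameter to pick one (per the scrapyd documentation; the version value below is a placeholder for whatever scrapyd-deploy assigned):
curl http://localhost:6800/schedule.json -d project=default -d spider=testspider -d _version=<version> -d start_urls="https://www.anyurl..."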
An added advantage is that you can build your own UI to accept the URL and other params from the user and schedule a task using the above scrapyd schedule API.
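As a rough sketch of that idea, scheduling a job from your own code with the requests library might look like this (host, project, and spider names are the placeholder values used above):
import requests

def schedule_crawl(start_urls):
    # any extra keys in the payload are passed through to the spider as arguments,
    # just like the -d flags in the curl example above
    payload = {
        'project': 'default',
        'spider': 'testspider',
        'start_urls': '|'.join(start_urls),  # the spider splits on '|' in __init__
    }
    response = requests.post('http://localhost:6800/schedule.json', data=payload)
    return response.json()  # contains the job id when scheduling succeeds

# usage
print(schedule_crawl(['https://www.anyurl1', 'https://www.anyurl2']))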
Refer to the scrapyd API documentation for more details.
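For instance, the stop/status/list operations mentioned at the start of this answer map to endpoints like these (endpoint names follow the scrapyd documentation; the port, project name, and job id are placeholders):
curl http://localhost:6800/daemonstatus.json
curl http://localhost:6800/listspiders.json?project=default
curl http://localhost:6800/listjobs.json?project=default
curl http://localhost:6800/cancel.json -d project=default -d job=<job_id>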