7
votes

I can't change the spider settings in the parse method, but surely there must be a way.

For example:

from scrapy.conf import settings
from scrapy.spider import BaseSpider

from myproject.items import Myitem


class SomeSpider(BaseSpider):
    name = 'mySpider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    settings.overrides['ITEM_PIPELINES'] = ['myproject.pipelines.FirstPipeline']
    print settings['ITEM_PIPELINES'][0]
    # printed 'myproject.pipelines.FirstPipeline'

    def parse(self, response):
        # ...some code
        settings.overrides['ITEM_PIPELINES'] = ['myproject.pipelines.SecondPipeline']
        print settings['ITEM_PIPELINES'][0]
        # printed 'myproject.pipelines.SecondPipeline'
        item = Myitem()
        item['name'] = 'Name for SecondPipeline'

But the item is still processed by FirstPipeline; the new ITEM_PIPELINES value has no effect. How can I change settings after crawling has started? Thanks in advance!

2
Pipelines are initialized and activated on engine start. I am not sure you can change this during execution. You could, however, activate both pipelines at start and add some logic to each pipeline so that it only processes an item if it meets a certain condition. Sjaak Trekhaak
Yep, this is my last-resort option. Thank you for the reply. I think something like spider signals could help, but it is rather difficult. fcmax
For sure, you can attach various functions to various spider signals. You'd want to attach the handlers to signals in an extension, though. See also: doc.scrapy.org/en/latest/topics/… Sjaak Trekhaak
Finally I've added item['flag'] = 'some_flag' in the spider and a condition in the pipelines. It looks like a better approach this way. Thank you for the updates. fcmax
It would probably be far more efficient to use a property of the spider as the flag. The spider is passed along with the item to the pipeline: def process_item(self, item, spider). So you can use the spider to set the flag; if you use the item, you will end up using far more memory. mand
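
A minimal sketch of that spider-attribute approach (names other than process_item are my own, not from the thread): the spider flips a flag on itself, and the pipeline reads the flag from the spider argument instead of storing it on every item.

from scrapy.spider import BaseSpider


class FlaggedSpider(BaseSpider):
    name = 'mySpider'
    use_second_pipeline = False  # flag stored once on the spider

    def parse(self, response):
        # ...some code...
        self.use_second_pipeline = True  # flip the flag when conditions change
        # ...yield items as usual; nothing extra is stored on them...


class SecondPipeline(object):
    def process_item(self, item, spider):
        # The spider is passed in alongside the item, so read the flag from it
        if not getattr(spider, 'use_second_pipeline', False):
            return item
        # ...SecondPipeline-specific processing...
        return item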

2 Answers

3
votes

If you want different spiders to have different pipelines, you can define a pipelines attribute on each spider listing the pipelines that should apply to it. Then, in each pipeline, check whether it is listed:

class MyPipeline(object):

    def process_item(self, item, spider):
        # Skip this pipeline if the spider did not opt in to it
        if self.__class__.__name__ not in getattr(spider, 'pipelines', []):
            return item
        ...
        return item

class MySpider(CrawlSpider):
    # Names of the pipelines that should process this spider's items
    pipelines = set([
        'MyPipeline',
        'MyPipeline3',
    ])
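
Note that for this to work every candidate pipeline still has to be enabled project-wide; the spider attribute only controls which of them actually act on its items. A sketch of the corresponding setting (module paths are placeholders), using the list form of ITEM_PIPELINES from the question:

# settings.py -- enable all pipelines globally; each one opts out per spider
ITEM_PIPELINES = [
    'myproject.pipelines.MyPipeline',
    'myproject.pipelines.MyPipeline2',
    'myproject.pipelines.MyPipeline3',
]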

If you want different items to be processed by different pipelines, you can do this:

class MyPipeline2(object):
    def process_item(self, item, spider):
        # Only act on items of a specific class; pass everything else through
        if isinstance(item, MyItem):
            ...
            return item
        return item
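
This assumes you define separate item classes so isinstance() can tell them apart; a minimal sketch (class and field names are illustrative):

# items.py
from scrapy.item import Item, Field


class MyItem(Item):
    name = Field()


class MyOtherItem(Item):
    name = Field()
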
0
votes

Based on this informative issue#4196, combined with the telnet console, it is possible to do this even after the crawl has started.

Attach a telnet client to the port (6023 by default) and use the password logged when the scrapy crawl command is launched, then issue the following interactive Python statements to modify the currently running downloader:

$ telnet  127.0.0.1  6023  # Read the actual port from logs.
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
Username: scrapy
Password: <copy-from-logs>
>>> engine.downloader.total_concurrency
8
>>> engine.downloader.total_concurrency = 32
>>> est()
Execution engine status

time()-engine.start_time                        : 14226.62803554535
engine.has_capacity()                           : False
len(engine.downloader.active)                   : 28
engine.scraper.is_idle()                        : False
engine.spider.name                              : <foo>
engine.spider_is_idle(engine.spider)            : False
engine.slot.closing                             : False
len(engine.slot.inprogress)                     : 32
len(engine.slot.scheduler.dqs or [])            : 531
len(engine.slot.scheduler.mqs)                  : 0
len(engine.scraper.slot.queue)                  : 0
len(engine.scraper.slot.active)                 : 0
engine.scraper.slot.active_size                 : 0
engine.scraper.slot.itemproc_size               : 0
engine.scraper.slot.needs_backout()             : False
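
For reference, the telnet console itself can be pinned down via the project settings if you want a fixed port and credentials instead of reading them from the logs each run; the values below are examples only:

# settings.py
TELNETCONSOLE_ENABLED = True
TELNETCONSOLE_PORT = [6023, 6073]    # port range Scrapy binds the console to
TELNETCONSOLE_USERNAME = 'scrapy'
TELNETCONSOLE_PASSWORD = 'changeme'  # if unset, a random password is logged at startup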