7
votes

I can't change the spider settings in the parse method, but surely there must be a way.

For example:

from scrapy.conf import settings
from scrapy.spider import BaseSpider

from myproject.items import Myitem


class SomeSpider(BaseSpider):
    name = 'mySpider'
    allowed_domains = ['example.com']
    start_urls = ['http://example.com']

    settings.overrides['ITEM_PIPELINES'] = ['myproject.pipelines.FirstPipeline']
    print settings['ITEM_PIPELINES'][0]
    # printed 'myproject.pipelines.FirstPipeline'

    def parse(self, response):
        # ...some code
        settings.overrides['ITEM_PIPELINES'] = ['myproject.pipelines.SecondPipeline']
        print settings['ITEM_PIPELINES'][0]
        # printed 'myproject.pipelines.SecondPipeline'
        item = Myitem()
        item['name'] = 'Name for SecondPipeline'

But the item is still processed by FirstPipeline; the new ITEM_PIPELINES value has no effect. How can I change settings after crawling has started? Thanks in advance!

2
Pipelines are initialized and activated on engine start. I am not sure you can change this during execution. You could, however, activate both pipelines at start and add some logic to each pipeline so that it only processes an item if it meets a certain condition. Sjaak Trekhaak
Yep, this is my last-resort option. Thank you for the reply. I think something like spider signals could help, but it is rather difficult. fcmax
For sure, you can attach various functions to various spider signals. You'd want to attach the handlers to signals in an extension, though. See also: doc.scrapy.org/en/latest/topics/… Sjaak Trekhaak
Finally I've added item['flag'] = 'some_flag' in the spider and a condition in the pipelines. It looks like a better approach this way. Thank you for the updates. fcmax
It would probably be far more efficient to use a property of the spider as the flag. The spider is passed along with the item to the pipeline: def process_item(self, item, spider). So you can use the spider to set the flag; if you use the item, you will end up using far more memory. mand
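
A minimal sketch of that spider-attribute approach (names other than process_item are my own, not from the thread): the spider flips a flag on itself, and the pipeline reads the flag from the spider argument instead of storing it on every item.

from scrapy.spider import BaseSpider


class FlaggedSpider(BaseSpider):
    name = 'mySpider'
    use_second_pipeline = False  # flag stored once on the spider

    def parse(self, response):
        # ...some code...
        self.use_second_pipeline = True  # flip the flag when conditions change
        # ...yield items as usual; nothing extra is stored on them...


class SecondPipeline(object):
    def process_item(self, item, spider):
        # The spider is passed in alongside the item, so read the flag from it
        if not getattr(spider, 'use_second_pipeline', False):
            return item
        # ...SecondPipeline-specific processing...
        return item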

2 Answers

3
votes

If you want different spiders to have different pipelines, you can define a pipelines attribute on each spider listing the pipelines that should apply to it. Then, in each pipeline, check whether it is listed:

class MyPipeline(object):

    def process_item(self, item, spider):
        # Skip this pipeline if the spider did not opt in to it
        if self.__class__.__name__ not in getattr(spider, 'pipelines', []):
            return item
        ...
        return item

class MySpider(CrawlSpider):
    # Names of the pipelines that should process this spider's items
    pipelines = set([
        'MyPipeline',
        'MyPipeline3',
    ])
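
Note that for this to work every candidate pipeline still has to be enabled project-wide; the spider attribute only controls which of them actually act on its items. A sketch of the corresponding setting (module paths are placeholders), using the list form of ITEM_PIPELINES from the question:

# settings.py -- enable all pipelines globally; each one opts out per spider
ITEM_PIPELINES = [
    'myproject.pipelines.MyPipeline',
    'myproject.pipelines.MyPipeline2',
    'myproject.pipelines.MyPipeline3',
]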

If you want different items to be processed by different pipelines, you can do this:

class MyPipeline2(object):
    def process_item(self, item, spider):
        # Only act on items of a specific class; pass everything else through
        if isinstance(item, MyItem):
            ...
            return item
        return item
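
This assumes you define separate item classes so isinstance() can tell them apart; a minimal sketch (class and field names are illustrative):

# items.py
from scrapy.item import Item, Field


class MyItem(Item):
    name = Field()


class MyOtherItem(Item):
    name = Field()
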
0
votes

Based on this informative issue#4196, combined with the telnet console, it is possible to do this even after the crawl has started.

Attach a telnet client to the port (6023 by default) and use the password logged when the scrapy crawl command is launched, then issue the following interactive Python statements to modify the currently running downloader:

$ telnet  127.0.0.1  6023  # Read the actual port from logs.
Trying 127.0.0.1...
Connected to 127.0.0.1.
Escape character is '^]'.
Username: scrapy
Password: <copy-from-logs>
>>> engine.downloader.total_concurrency
8
>>> engine.downloader.total_concurrency = 32
>>> est()
Execution engine status

time()-engine.start_time                        : 14226.62803554535
engine.has_capacity()                           : False
len(engine.downloader.active)                   : 28
engine.scraper.is_idle()                        : False
engine.spider.name                              : <foo>
engine.spider_is_idle(engine.spider)            : False
engine.slot.closing                             : False
len(engine.slot.inprogress)                     : 32
len(engine.slot.scheduler.dqs or [])            : 531
len(engine.slot.scheduler.mqs)                  : 0
len(engine.scraper.slot.queue)                  : 0
len(engine.scraper.slot.active)                 : 0
engine.scraper.slot.active_size                 : 0
engine.scraper.slot.itemproc_size               : 0
engine.scraper.slot.needs_backout()             : False
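
For reference, the telnet console itself can be pinned down via the project settings if you want a fixed port and credentials instead of reading them from the logs each run; the values below are examples only:

# settings.py
TELNETCONSOLE_ENABLED = True
TELNETCONSOLE_PORT = [6023, 6073]    # port range Scrapy binds the console to
TELNETCONSOLE_USERNAME = 'scrapy'
TELNETCONSOLE_PASSWORD = 'changeme'  # if unset, a random password is logged at startup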