I need to download a group of CSV files from an FTP server using Scrapy. But first I need to scrape a website (https://www.douglas.co.us/assessor/data-downloads/) to get the URLs of the CSV files on the FTP server. I read about how to download files in the documentation (Downloading and processing files and images).

settings

custom_settings = {
    'ITEM_PIPELINES': {
        'scrapy.pipelines.files.FilesPipeline': 1,
    },
    'FILES_STORE': os.path.dirname(os.path.abspath(__file__)),
}

parse

def parse(self, response):
        self.logger.info("In parse method!!!")
        # Property Ownership        
        property_ownership = response.xpath("//a[contains(., 'Property Ownership')]/@href").extract_first()

        # Property Location
        property_location = response.xpath("//a[contains(., 'Property Location')]/@href").extract_first()

        # Property Improvements
        property_improvements = response.xpath("//a[contains(., 'Property Improvements')]/@href").extract_first()

        # Property Value
        property_value = response.xpath("//a[contains(., 'Property Value')]/@href").extract_first()

        item = FiledownloadItem()
        self.insert_keyvalue(item,"file_urls",[property_ownership, property_location, property_improvements, property_value])

        yield item

But I got the following error

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/pipelines/media.py", line 79, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/pipelines/files.py", line 382, in get_media_requests
    return [Request(x) for x in item.get(self.files_urls_field, [])]
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: [

The best explanation I have found for my problem is the answer to the question "scrapy error :exceptions.ValueError: Missing scheme in request url:", which explains that the URLs to download are missing the "http://" scheme.

What should I do in my case? Can I use FilesPipeline, or do I need to do something different?

Thanks in advance.


1 Answer


ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: [

According to the traceback, Scrapy thinks your file URL is '['.
My best guess is that there is a bug in the insert_keyvalue() method.
Also, why have a method for this? A simple assignment should work.
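
As a rough illustration of the simple-assignment approach, here is a minimal sketch. It assumes Python 3 (the traceback above is from Python 2.7, where `urljoin` and `urlparse` live in the `urlparse` module instead), and `build_file_urls` is a hypothetical helper name, not part of Scrapy. It also shows one plausible way to end up with '[' as the URL: if `file_urls` holds the string form of a list rather than a list, iterating over it yields single characters, and the first one is '['. Whether that is what insert_keyvalue() does is only a guess, since its code is not shown.

```python
from urllib.parse import urljoin, urlparse


def build_file_urls(page_url, hrefs):
    """Return absolute URLs for the hrefs that were actually found.

    Hypothetical helper, not part of Scrapy.
    """
    urls = []
    for href in hrefs:
        # extract_first() returns None when the XPath matches nothing.
        if not href:
            continue
        # Resolve relative links against the page; absolute ones pass through.
        absolute = urljoin(page_url, href.strip())
        # Keep only URLs with a scheme, which is what Request() insists on.
        if urlparse(absolute).scheme:
            urls.append(absolute)
    return urls


# Iterating a *string* yields characters, so a stringified list starts with "[".
# This is one way the pipeline could see "[" as a URL (an assumption, not a diagnosis).
stringified = str(["ftp://example.com/a.csv"])
first_seen_by_pipeline = [x for x in stringified][0]
```

Inside `parse`, the item would then be filled by plain assignment, for example `item["file_urls"] = build_file_urls(response.url, [property_ownership, property_location, property_improvements, property_value])`, assuming `FiledownloadItem` declares a `file_urls` field.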