I need to download a group of CSV files from an FTP server using Scrapy. But first I need to scrape a website (https://www.douglas.co.us/assessor/data-downloads/) to get the URLs of the CSV files on the FTP server. I read about how to download files in the documentation (Downloading and processing files and images).
settings
custom_settings = {
    'ITEM_PIPELINES': {
        'scrapy.pipelines.files.FilesPipeline': 1,
    },
    'FILES_STORE': os.path.dirname(os.path.abspath(__file__)),
}
parse
def parse(self, response):
    self.logger.info("In parse method!!!")
    # Property Ownership
    property_ownership = response.xpath("//a[contains(., 'Property Ownership')]/@href").extract_first()
    # Property Location
    property_location = response.xpath("//a[contains(., 'Property Location')]/@href").extract_first()
    # Property Improvements
    property_improvements = response.xpath("//a[contains(., 'Property Improvements')]/@href").extract_first()
    # Property Value
    property_value = response.xpath("//a[contains(., 'Property Value')]/@href").extract_first()
    item = FiledownloadItem()
    self.insert_keyvalue(item, "file_urls", [
        property_ownership,
        property_location,
        property_improvements,
        property_value,
    ])
    yield item
But I get the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/pipelines/media.py", line 79, in process_item
    requests = arg_to_iter(self.get_media_requests(item, info))
  File "/usr/local/lib/python2.7/dist-packages/scrapy/pipelines/files.py", line 382, in get_media_requests
    return [Request(x) for x in item.get(self.files_urls_field, [])]
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 25, in __init__
    self._set_url(url)
  File "/usr/local/lib/python2.7/dist-packages/scrapy/http/request/__init__.py", line 58, in _set_url
    raise ValueError('Missing scheme in request url: %s' % self._url)
ValueError: Missing scheme in request url: [
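One detail worth noting: the message stops at `[`, which looks like a single character rather than a URL. Below is a minimal sketch of how that could happen, assuming (this is not confirmed, since `insert_keyvalue` is not shown) that the `file_urls` field ends up holding a string instead of a list. The `get_media_requests` line in the traceback iterates over whatever is stored in that field, so a string would be walked one character at a time. The function name and the FTP URL are hypothetical.

```python
def urls_seen_by_pipeline(file_urls):
    """Mimic the iteration in the traceback's get_media_requests line.

    The real pipeline builds Request(x) for each x; here we only collect x.
    """
    return [x for x in file_urls]


# Hypothetical URL, for illustration only.
as_list = ["ftp://example.com/data/ownership.csv"]
# Assumption: what a helper might store if it converts values to strings.
as_string = str(as_list)

print(urls_seen_by_pipeline(as_list)[0])    # ftp://example.com/data/ownership.csv
print(urls_seen_by_pipeline(as_string)[0])  # [
```

If the second case matches what is stored in the item, the fix would be in how `file_urls` is assigned rather than in the URLs themselves.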
The best explanation I have found is an answer to the question "scrapy error :exceptions.ValueError: Missing scheme in request url:", which explains that the problem is that the URLs to download are missing the "http://" scheme.
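To check whether that explanation applies here, the following is a minimal sketch using only the standard library (`urllib.parse` in Python 3; on Python 2.7 the same functions live in the `urlparse` module). It skips hrefs that are `None`, leaves URLs that already carry a scheme such as `ftp://` untouched, and joins relative ones against the page URL. The helper name `absolute_urls` and the example hrefs are hypothetical; inside a spider, `response.urljoin(href)` performs the same kind of join.

```python
from urllib.parse import urljoin, urlparse

PAGE_URL = "https://www.douglas.co.us/assessor/data-downloads/"


def absolute_urls(page_url, hrefs):
    """Return hrefs as absolute URLs, skipping None (XPath matched nothing)."""
    result = []
    for href in hrefs:
        if href is None:
            continue
        if urlparse(href).scheme:
            # Already absolute (http, https, ftp, ...): keep as-is.
            result.append(href)
        else:
            # Relative link: resolve it against the page it came from.
            result.append(urljoin(page_url, href))
    return result


# Hypothetical hrefs, for illustration only.
hrefs = [
    "ftp://ftp.example.com/Property_Ownership.csv",
    "/files/Property_Value.csv",
    None,
]
print(absolute_urls(PAGE_URL, hrefs))
```

Filtering out `None` matters because `extract_first()` returns `None` when the XPath matches nothing, and that value would otherwise be passed on to the pipeline.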
What should I do in my case? Can I use FilesPipeline, or do I need to do something different?
Thanks in advance.