Scrapy Spider Xpath Image Url

Question

I have a scrapy spider that receives the input of a desired keyword and then yields a search result url. It then crawls that URL to scrape desired values about each car result within 'item'. I am trying to add within my yielded items the url for each full sized car image link that accompanies each car on the vehicle list of results.

The specific URL that is crawled when I enter the keyword "honda" is the following: Honda search results example

I have been having trouble figuring out the correct way to write the XPath and then include the list of image URLs I acquire in the spider's 'item' that I yield at the end of my code. Right now, when the items are saved to a .csv file by running the lkq.py spider below with the command "scrapy crawl lkq -o items.csv -t csv", the Picture column of items.csv is all zeros instead of the image URLs.

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
import scrapy
from scrapy.shell import inspect_response
from scrapy.utils.response import open_in_browser

keyword = raw_input('Keyword: ')
url = 'http://www.lkqpickyourpart.com/DesktopModules/pyp_vehicleInventory/getVehicleInventory.aspx?store=224&page=0&filter=%s&sp=&cl=&carbuyYardCode=1224&pageSize=1000&language=en-US' % (keyword,)


class Cars(scrapy.Item):
    Make = scrapy.Field()
    Model = scrapy.Field()
    Year = scrapy.Field()
    Entered_Yard = scrapy.Field()
    Section = scrapy.Field()
    Color = scrapy.Field()
    Picture = scrapy.Field()


class LkqSpider(scrapy.Spider):
    name = "lkq"
    allowed_domains = ["lkqpickyourpart.com"]
    start_urls = (
        url,
    )

    def parse(self, response):
        picture = response.xpath(
            '//href=/text()').extract()
        section_color = response.xpath(
            '//div[@class="pypvi_notes"]/p/text()').extract()
        info = response.xpath('//td["pypvi_make"]/text()').extract()
        for element in range(0, len(info), 4):
            item = Cars()
            item["Make"] = info[element]
            item["Model"] = info[element + 1]
            item["Year"] = info[element + 2]
            item["Entered_Yard"] = info[element + 3]
            item["Section"] = section_color.pop(
                0).replace("Section:", "").strip()
            item["Color"] = section_color.pop(0).replace("Color:", "").strip()
            item["Picture"] = picture.pop(0).strip()
            yield item

eLRuLL · Accepted Answer · 2016-05-20T17:39:42

I don't really understand why you were using an XPath like '//href=/text()'. I would recommend reading an XPath tutorial first; here is a very good one.

If you want to get all the image URLs, I think this is what you want:

pictures = response.xpath('//img/@src').extract()

Now picture.pop(0).strip() will only get you the first of the URLs and strip it. Remember that .extract() returns a list, so pictures now contains all the image links; just choose from there the ones you need.
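
To illustrate the point about .extract() returning plain lists, here is a minimal sketch of how the three lists could be paired into one record per vehicle. It assumes, without having checked the real page, that every vehicle contributes exactly four info cells, two notes and one image URL, all in the same order. The helper name build_items and the sample values are made up for the example; in the spider the three arguments would be the results of the three response.xpath(...).extract() calls.

```python
def build_items(info, section_color, pictures):
    """Pair flat lists (as returned by .extract()) into one dict per vehicle.

    Assumption: 4 info cells, 2 notes and 1 image URL per vehicle,
    all listed in the same order. This is not verified against the site.
    """
    # Split the flat lists into per-vehicle chunks.
    rows = [info[i:i + 4] for i in range(0, len(info), 4)]
    notes = [section_color[i:i + 2] for i in range(0, len(section_color), 2)]

    items = []
    # zip() stops at the shortest list, so a missing picture drops that row
    # instead of raising IndexError the way repeated pop(0) calls would.
    for (make, model, year, entered), (section, color), picture in zip(
            rows, notes, pictures):
        items.append({
            "Make": make,
            "Model": model,
            "Year": year,
            "Entered_Yard": entered,
            "Section": section.replace("Section:", "").strip(),
            "Color": color.replace("Color:", "").strip(),
            "Picture": picture.strip(),
        })
    return items


# Made-up sample values standing in for the three .extract() results.
info = ["HONDA", "CIVIC", "2001", "5/1/2016",
        "HONDA", "ACCORD", "1998", "5/3/2016"]
section_color = ["Section: A1", "Color: Red", "Section: B2", "Color: Blue"]
pictures = [" http://example.com/civic.jpg ", "http://example.com/accord.jpg"]

items = build_items(info, section_color, pictures)
for item in items:
    print(item["Model"] + " " + item["Picture"])
```

If the page also contains images that do not belong to a vehicle (logos, icons), '//img/@src' will return those too, so the list would need to be filtered or the XPath narrowed before pairing.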
