Firstly, thanks for any and all help!
...new to Stack Overflow (and Python), so I apologize if I use the wrong terminology :)
I'm using Scrapy to pull data from an HTML source, filling the dict fields defined in its items.py by using Scrapy's selectors:
def parse_item(self, response):
    item = SiennaautoItem()  # instantiating the dict-like item
    item['attributes'] = response.css('p.attrgroup').extract()
    yield item
This returns a dict that holds a list with multiple values:
> ['<p class="attrgroup">\n\n\n\n <span><b>2014 honda odyssey
> touring elite</b></span>\n <br>\n\n </p>', '<p
> class="attrgroup">\n\n\n\n <span>VIN:
> <b>5FNRL5H66EB107700</b></span>\n <br>\n\n\n\n\n
> <span>condition: <b>like new</b></span>\n <br>\n\n\n\n\n
> <span>cylinders: <b>6 cylinders</b></span>\n <br>\n\n\n\n\n
> <span>drive: <b>fwd</b></span>\n <br>\n\n\n\n\n
> <span>fuel: <b>gas</b></span>\n <br>\n\n\n\n\n
> <span>odometer: <b>99000</b></span>\n <br>\n\n\n\n\n
> <span>paint color: <b>white</b></span>\n <br>\n\n\n\n\n
> <span>size: <b>full-size</b></span>\n <br>\n\n\n\n\n
> <span>title status: <b>clean</b></span>\n <br>\n\n\n\n\n
> <span>transmission: <b>automatic</b></span>\n
> <br>\n\n\n\n\n <span>type: <b>mini-van</b></span>\n
> <br>\n\n </p>']
Here's the rendered HTML:
['\n\n\n\n 2014 honda odyssey touring elite\n
', '\n\n\n\n VIN: 5FNRL5H66EB107700\n
\n\n
\n\n\n\n\n
condition: like new\n
\n\n\n\n\n
cylinders: 6 cylinders\n
\n\n\n\n\n drive: fwd\n
\n\n\n\n\n
fuel: gas\n
\n\n\n\n\n
odometer: 99000\n
\n\n\n\n\n
paint color: white\n
\n\n\n\n\n
size: full-size\n
\n\n\n\n\n
title status: clean\n
\n\n\n\n\n
transmission: automatic\n
\n\n\n\n\n type: mini-van\n
\n\n ']
My questions are: how can I remove the HTML tags, and how can I create keys from the labels in the span tags, which are:
condition, drive, odometer, etc.
I'd like the values returned in item['attributes'] to become their own dict entries, such as:
item['odometer'], item['condition'], etc.
Thanks so much for the help, as I have been stuck on this for a while!
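For reference, here is a minimal sketch of one way to turn the extracted strings into label/value pairs using only the standard library. It assumes every attribute looks like `<span>label: <b>value</b></span>`, as in the sample output above. The names `AttrGroupParser` and `parse_attributes`, the lowercased/underscored keys, and the `title` key for the unlabeled first span are all hypothetical choices, not anything Scrapy provides.

```python
from html.parser import HTMLParser


class AttrGroupParser(HTMLParser):
    """Collect label/value pairs from <span>label: <b>value</b></span> markup."""

    def __init__(self):
        super().__init__()
        self.attributes = {}
        self._in_span = False
        self._in_b = False
        self._label = ""
        self._value = ""

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            # Start a fresh label/value pair for every span.
            self._in_span, self._label, self._value = True, "", ""
        elif tag == "b" and self._in_span:
            self._in_b = True

    def handle_endtag(self, tag):
        if tag == "b":
            self._in_b = False
        elif tag == "span" and self._in_span:
            self._in_span = False
            label = self._label.strip().rstrip(":").strip().lower()
            # Assumption: a span with no label (the listing title) is stored as "title".
            key = label.replace(" ", "_") if label else "title"
            # Collapse newlines and repeated spaces inside the value.
            self.attributes[key] = " ".join(self._value.split())

    def handle_data(self, data):
        if self._in_b:
            self._value += data      # text inside <b> is the value
        elif self._in_span:
            self._label += data      # text directly inside <span> is the label


def parse_attributes(html_fragments):
    """Turn a list of extracted HTML strings into one {label: value} dict."""
    parser = AttrGroupParser()
    for fragment in html_fragments:
        parser.feed(fragment)
    parser.close()
    return parser.attributes
```

In the spider this could be called as `parse_attributes(response.css('p.attrgroup').extract())`. Note that a Scrapy `Item` only accepts keys that were declared as fields, so copying the result into `item['odometer']` and so on requires those fields to exist in items.py.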
item['attributes'] = response.css('p.attrgroup::text').extract()
will return only the text inside the element. To extract the values the way you want, you should loop over the elements and extract each one with its own XPath/CSS selector plus the ::text pseudo-element. Read more here. – renatodvc
item['MAKE'] = response.xpath('/html/body/section/section/section/div[1]/p[1]/span/b').extract()
but some users don't fill in every attribute, so the returned value changes to a different <span>, like 'condition'. How can I make sure that 'make' will always be 'make' if the elements within p.attrgroup vary? I hope that makes sense, and please let me know if I need to clarify! – Biilal Akhtar
response.xpath("//ul[@class='clg-info']/li[contains(.,'Ownership')]/span/text()").extract()
– Biilal Akhtar
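On the "some users don't fill every attribute" problem: one option is to look values up by label instead of by position, so a missing attribute becomes an empty value rather than shifting everything else. A minimal sketch, assuming the attributes have already been collected into a dict keyed by label; `EXPECTED_FIELDS` and `select_fields` are hypothetical names, and the field list is only an example.

```python
# Hypothetical field names; adjust to the labels the listings actually use.
EXPECTED_FIELDS = ("condition", "drive", "odometer", "transmission")


def select_fields(parsed, fields=EXPECTED_FIELDS):
    """Return one entry per expected field, using None when a listing omits it."""
    return {name: parsed.get(name) for name in fields}


# A listing whose author left out "drive" and "transmission".
partial = {"condition": "like new", "odometer": "99000"}
row = select_fields(partial)
```

Here `row['condition']` is always the condition, whatever other spans are present, and the omitted fields come back as `None`.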