Scrapy: removing html from an array returned to items.py dictionary

Question

Firstly, thanks for any and all help!

I'm new to Stack Overflow (and Python), so I apologize if I use the wrong terminology :)

I'm using Scrapy to pull data from an HTML source and populate the dict-like fields defined in items.py using Scrapy's selectors:

def parse_item(self, response):

    item = SiennaautoItem()  # instantiate the dict-like item
    item['attributes'] = response.css('p.attrgroup').extract()

    yield item

This returns a dict that holds a list with multiple values:

> ['<p class="attrgroup">\n\n\n\n            <span><b>2014 honda odyssey
> touring elite</b></span>\n            <br>\n\n    </p>', '<p
> class="attrgroup">\n\n\n\n            <span>VIN:
> <b>5FNRL5H66EB107700</b></span>\n            <br>\n\n\n\n\n           
> <span>condition: <b>like new</b></span>\n            <br>\n\n\n\n\n   
> <span>cylinders: <b>6 cylinders</b></span>\n            <br>\n\n\n\n\n
> <span>drive: <b>fwd</b></span>\n            <br>\n\n\n\n\n           
> <span>fuel: <b>gas</b></span>\n            <br>\n\n\n\n\n           
> <span>odometer: <b>99000</b></span>\n            <br>\n\n\n\n\n       
> <span>paint color: <b>white</b></span>\n            <br>\n\n\n\n\n    
> <span>size: <b>full-size</b></span>\n            <br>\n\n\n\n\n       
> <span>title status: <b>clean</b></span>\n            <br>\n\n\n\n\n   
> <span>transmission: <b>automatic</b></span>\n           
> <br>\n\n\n\n\n            <span>type: <b>mini-van</b></span>\n        
> <br>\n\n    </p>']

Here's the rendered HTML:

['\n\n\n\n 2014 honda odyssey touring elite\n
\n\n
', '\n\n\n\n VIN: 5FNRL5H66EB107700\n
\n\n\n\n\n
condition: like new\n
\n\n\n\n\n
cylinders: 6 cylinders\n
\n\n\n\n\n drive: fwd\n
\n\n\n\n\n
fuel: gas\n
\n\n\n\n\n
odometer: 99000\n
\n\n\n\n\n
paint color: white\n
\n\n\n\n\n
size: full-size\n
\n\n\n\n\n
title status: clean\n
\n\n\n\n\n
transmission: automatic\n

\n\n\n\n\n type: mini-van\n

\n\n ']

My questions are: how can I remove the HTML tags, and how can I create keys from the labels in the span tags, which are:

condition, drive, odometer, etc.

I'd like the values returned in item['attributes'] to become their own dict entries, such as:

item['odometer'], item['condition'], etc.

Thanks so much for the help as I have been stuck on this for a while!

item['attributes'] = response.css('p.attrgroup::text').extract() will return only the text inside the element. To extract the values the way you want, you should loop over the elements and extract each one with its own XPath/CSS selector plus the ::text pseudo-element. Read more here. — renatodvc
So I tried that, for example item['MAKE'] = response.xpath('/html/body/section/section/section/div[1]/p[1]/span/b').extract(), but some users don't fill in every attribute, so the returned value shifts to a different <span>, such as 'condition'. How can I make sure that 'make' will always be 'make' if the elements within p.attrgroup vary? I hope that makes sense; please let me know if I need to clarify! — Biilal Akhtar
I think this link might have the answer, but I'd really appreciate it if you could explain how to use the 'contains' approach. Thanks. The part I'm talking about is: response.xpath("//ul[@class='clg-info']/li[contains(.,'Ownership')]/span/text()").extract() — Biilal Akhtar
It's hard to help without seeing the HTML and/or your code. Generally speaking, you need to find an XPath/CSS selector that selects a list of similar elements which have your fields as their children. Then you can iterate over the list and extract the data inside the loop in a more generic way. — renatodvc
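The loop described in the last comment can be sketched as follows. This is only an illustration: `spans_to_dict` is a hypothetical helper, and the selector shown in the comment is an assumption based on the markup quoted in the question, not something tested against the real page. Because each value is looked up by its label rather than by its position, a listing that omits an attribute simply produces no key for it.

```python
def spans_to_dict(span_texts):
    """Turn strings such as 'odometer: 99000' into {'odometer': '99000'}."""
    data = {}
    for raw in span_texts:
        text = " ".join(raw.split())  # collapse newlines and runs of spaces
        label, sep, value = text.partition(":")
        if not sep:
            continue  # no label, e.g. the title span of the listing
        key = label.strip().lower().replace(" ", "_")
        data[key] = value.strip()
    return data


# In a spider, the input might come from something like this (assumption):
#   span_texts = response.css("p.attrgroup span").xpath("string(.)").extract()
span_texts = [
    "2014 honda odyssey touring elite",
    "VIN: 5FNRL5H66EB107700",
    "condition: like new",
    "paint color: white",
]

attrs = spans_to_dict(span_texts)
print(attrs.get("condition"))  # like new
print(attrs.get("odometer"))   # None, because this listing has no odometer span
```

Each entry could then be copied into the item, for example `item['condition'] = attrs.get('condition')`, provided the item class declares a matching field.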

dimitris_r · Accepted Answer · 2020-09-13T05:02:37

My XPath is a little bit rusty, but here is a way to do it without XPath, using the w3lib library:

from w3lib.html import remove_tags, replace_escape_chars

html_array=['<p class="attrgroup">\n\n\n\n            <span><b>2014 honda odyssey > touring elite</b></span>\n            <br>\n\n    </p>', '<p > class="attrgroup">\n\n\n\n            <span>VIN: > <b>5FNRL5H66EB107700</b></span>\n            <br>\n\n\n\n\n            > <span>condition: <b>like new</b></span>\n            <br>\n\n\n\n\n    > <span>cylinders: <b>6 cylinders</b></span>\n            <br>\n\n\n\n\n > <span>drive: <b>fwd</b></span>\n            <br>\n\n\n\n\n            > <span>fuel: <b>gas</b></span>\n            <br>\n\n\n\n\n            > <span>odometer: <b>99000</b></span>\n            <br>\n\n\n\n\n        > <span>paint color: <b>white</b></span>\n            <br>\n\n\n\n\n     > <span>size: <b>full-size</b></span>\n            <br>\n\n\n\n\n        > <span>title status: <b>clean</b></span>\n            <br>\n\n\n\n\n    > <span>transmission: <b>automatic</b></span>\n            > <br>\n\n\n\n\n            <span>type: <b>mini-van</b></span>\n         > <br>\n\n    </p>']
# strip the tags from every fragment, join them, then drop \n, \t and \r
html = replace_escape_chars(' '.join(remove_tags(fragment) for fragment in html_array))


data = {}
for chunk in html.split('>'):
    # "condition: like new" -> ['condition', 'like new']
    parts = [part.strip() for part in chunk.split(':')]
    if parts[0] in ['condition', 'cylinders', 'drive', 'fuel']:  # list the attributes you need here
        data[parts[0]] = parts[1]

print(data)

Output:

   
{'condition': 'like new', 'cylinders': '6 cylinders', 'drive': 'fwd', 'fuel': 'gas'}
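One caveat: the split on '>' above appears to rely on the quote markers that were pasted into the sample list along with the output, and raw spider output would not contain them. As a hedged alternative that does not depend on those markers, the sketch below collects the text of each `<span>` with the standard library's `html.parser` and splits it on the first colon. `SpanTextCollector` and `attributes_from_fragments` are hypothetical names, and the sample fragments are shortened from the question.

```python
from html.parser import HTMLParser


class SpanTextCollector(HTMLParser):
    """Collect the whitespace-normalized text of every <span> element."""

    def __init__(self):
        super().__init__()
        self.spans = []
        self._buffer = None  # list of text pieces while inside a <span>

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._buffer is not None:
            self.spans.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


def attributes_from_fragments(fragments, wanted=None):
    """Build {label: value} from HTML fragments; keep only `wanted` labels if given."""
    collector = SpanTextCollector()
    for fragment in fragments:
        collector.feed(fragment)
    collector.close()

    data = {}
    for text in collector.spans:
        label, sep, value = text.partition(":")
        if not sep:
            continue  # span without a "label: value" shape
        label = label.strip()
        if wanted is None or label in wanted:
            data[label] = value.strip()
    return data


fragments = [
    '<p class="attrgroup">\n\n    <span><b>2014 honda odyssey touring elite</b></span>\n    <br>\n\n    </p>',
    '<p class="attrgroup">\n\n    <span>condition: <b>like new</b></span>\n    <br>\n'
    '    <span>cylinders: <b>6 cylinders</b></span>\n    <br>\n'
    '    <span>paint color: <b>white</b></span>\n    <br>\n    </p>',
]

data = attributes_from_fragments(fragments, wanted={"condition", "cylinders"})
print(data)  # {'condition': 'like new', 'cylinders': '6 cylinders'}
```

Leaving out `wanted` returns every labelled attribute found, which may be more convenient when the set of attributes varies from listing to listing.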
