I'm trying to extract all the text in this HTML that is inside the elements with itemprop="ingredients".

I saw this answer, and it does exactly what I want, but it selects specific child elements, and part of my text sits directly inside the li rather than in a nested element.

This is the HTML:

<li itemprop="ingredients">Beginning of ingredient
     <a href="some-link" data-ct-category="Other"
     data-ct-action="Site Search"
     data-ct-information="Recipe Search - Hellmann's® or Best Foods® Real Mayonnaise"
     data-ct-attr="some_attr">Rest of Ingredient</a>
</li>   
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>

What I need is to get the text back as a list. The first element of the list should be "Beginning of ingredient Rest of Ingredient" (the two pieces joined with a space, via join or something similar), and the other elements should be "Another ingredient".

I got close with:

>>> for row in response.xpath('//*[@itemprop="ingredients"]/descendant-or-self::*/text()'):
...     print row.extract()
...
Beginning of ingredient
Rest of Ingredient

    Another ingredient
    Another ingredient
    Another ingredient
    Another ingredient
    Another ingredient

So when I put it in a list by using extract_first() on each row, I get this:

 ['Beginning of ingredient', 'Rest of Ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient']

But I want this:

 ['Beginning of ingredient Rest of Ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient']

1 Answer

You are close. Iterate over every li element, then run a context-specific descendant-or-self query on each one:

In [1]: [" ".join(map(unicode.strip, item.xpath("descendant-or-self::text()").extract())) 
         for item in response.xpath('//li[@itemprop="ingredients"]')]
Out[1]: 
[u'Beginning of ingredient Rest of Ingredient ',
 u'Another ingredient',
 u'Another ingredient',
 u'Another ingredient',
 u'Another ingredient',
 u'Another ingredient']
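
The one-liner above relies on Python 2's `unicode` type, and its first result keeps a trailing space: the whitespace-only text node after the closing `</a>` tag is stripped to an empty string that still takes part in the join. Here is a minimal Python 3 sketch of the same per-item idea. It uses the standard library's `xml.etree.ElementTree` as a stand-in for the Scrapy response, purely so that it runs without Scrapy. `join_fragments` and the shortened `SNIPPET` are hypothetical names introduced for illustration, not part of the original answer.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the question's markup, wrapped in a <ul> root so that
# ElementTree (an XML parser, used here only as a stand-in) can parse it.
SNIPPET = """<ul>
<li itemprop="ingredients">Beginning of ingredient
     <a href="some-link" data-ct-attr="some_attr">Rest of Ingredient</a>
</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
</ul>"""


def join_fragments(fragments):
    """Join text fragments with single spaces, dropping surplus whitespace."""
    # split() with no arguments discards empty and whitespace-only pieces,
    # so no leading, trailing, or doubled spaces survive.
    return " ".join(" ".join(fragments).split())


root = ET.fromstring(SNIPPET)

# One result per <li>: select each item first, then gather all text below it.
ingredients = [
    join_fragments(li.itertext())
    for li in root.iterfind(".//li[@itemprop='ingredients']")
]

print(ingredients)
```

In a Scrapy callback the same helper should work on the per-item text nodes, for example `join_fragments(li.xpath('.//text()').getall())` for each `li` in `response.xpath('//li[@itemprop="ingredients"]')`. This assumes a Scrapy version whose selectors provide `getall()`; older releases use `extract()` instead.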