I'm trying to extract all the text in this HTML that is inside the elements with itemprop="ingredients".
I saw this answer, and it's exactly what I want, but it targets specific child elements, and some of my text is not nested inside any child element.
This is the html:
<li itemprop="ingredients">Beginning of ingredient
<a href="some-link" data-ct-category="Other"
data-ct-action="Site Search"
data-ct-information="Recipe Search - Hellmann's® or Best Foods® Real Mayonnaise"
data-ct-attr="some_attr">Rest of Ingredient</a>
</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
<li itemprop="ingredients">Another ingredient</li>
What I need is to get the text back as a list, where the first element is "Beginning of ingredient" and "Rest of Ingredient" joined with a space (using join or something similar), and the other elements are "Another ingredient".
I got close with:
>>> for row in response.xpath('//*[@itemprop="ingredients"]/descendant-or-self::*/text()'):
...     print(row.extract())
...
Beginning of ingredient
Rest of Ingredient
Another ingredient
Another ingredient
Another ingredient
Another ingredient
Another ingredient
So when I put it in a list by using extract_first() on each row, I get this:
['Beginning of ingredient', 'Rest of Ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient']
But I want this:
['Beginning of ingredient Rest of Ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient', 'Another ingredient']
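For reference, one way the desired list could probably be built is to select each itemprop="ingredients" element first and then join the text nodes found inside that one element. This is only a minimal sketch, not a confirmed solution: the helper names join_fragments and extract_ingredients are hypothetical, and it assumes response is a Scrapy response (or any selector-like object) whose xpath() results support extract().

```python
def join_fragments(fragments):
    """Join text fragments with single spaces, collapsing extra whitespace."""
    # str.split() with no argument splits on any run of whitespace and drops
    # empty pieces, so stray newlines and indentation between tags disappear.
    return " ".join(" ".join(fragments).split())


def extract_ingredients(response):
    """Return one joined string per element with itemprop="ingredients".

    Hypothetical helper: `response` is assumed to be a Scrapy response or
    selector exposing .xpath(), as in the question.
    """
    ingredients = []
    for item in response.xpath('//*[@itemprop="ingredients"]'):
        # The leading dot keeps the query relative to the current item, so
        # only text nodes inside this element (at any depth) are collected.
        fragments = item.xpath('.//text()').extract()
        ingredients.append(join_fragments(fragments))
    return ingredients
```

Calling extract_ingredients(response) on the HTML above should then give one string per li element. An XPath-only variant might be item.xpath('normalize-space(.)').extract_first(), though that relies on whitespace already being present between the text and the nested link, whereas the join always inserts a space.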