I'm trying to extract the ingredients from this page, using only Python, Scrapy and XPath:
http://www.myrecipes.com/recipe/gin-orange-juice-braised-endives
I use the following XPath:
//*[@itemprop="recipeIngredient"]/descendant-or-self::*/text()
I need the ingredients as a list like this:
["3 tablespoons extra-virgin olive oil",
"10 medium Belgian endives, halved lengthwise",
"1/2 cup gin",
"Salt and freshly ground black pepper"
...]
But it returns every text node separately, and most of them are whitespace-only:
[u'\n ', u'3 tablespoons', u'\n ', u' \n extra-virgin olive oil\n ', u'\n ', u' ', u'\n', u'\n ', u'10 ', u'\n ', u' \n medium Belgian endives, halved lengthwise\n ', u'\n ', u' ', u'\n', u'\n ', u'1/2 cup', u'\n ', u' \n gin\n ', u'\n ', u' ', u'\n', u'\n ', u' ', u'\n ', u' \n Salt and freshly ground black pepper\n ', u'\n ', u' ', u'\n', u'\n ', u'1 cup', u'\n ', u' \n fresh orange juice\n ', u'\n ', u' ', u'\n', u'\n ', u'4 tablespoons', u'\n ', u' \n unsalted butter\n ', u'\n ', u' ', u'\n', u'\n ', u'2 tablespoons', u'\n ', u' \n honey\n ', u'\n ', u' ', u'\n', u'\n ', u'2 ', u'\n ', u' \n scallions, white and pale green parts only, thinly sliced\n ', u'\n ', u' ', u'\n', u'\n ', u'2 tablespoons', u'\n ', u' \n salted roasted pumpkin seeds\n ', u'\n ', u' ', u'\n', u'\n ', u' ', u'\n ', u' \n Balsamic vinegar, for drizzling\n ', u'\n ', u' ', u'\n']
After stripping each item in Python (2.7), the quantity and the name of each ingredient end up as separate items:
["3 tablespoons",
"extra-virgin olive oil",
"10",
"medium Belgian endives, halved lengthwise",
"1/2 cup",
"gin",
"Salt and freshly ground black pepper",
...]
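For reference, a minimal sketch of that stripping step, run on a shortened copy of the output above. The variable names `raw` and `stripped` are only illustrative. It shows why stripping alone is not enough: each text node is cleaned on its own, so nothing joins the quantity back to the ingredient name.

```python
# A shortened copy of the text nodes returned by the XPath above.
raw = [u'\n ', u'3 tablespoons', u'\n ', u' \n extra-virgin olive oil\n ',
       u'\n ', u' ', u'\n',
       u'\n ', u'10 ', u'\n ',
       u' \n medium Belgian endives, halved lengthwise\n ',
       u'\n ', u' ', u'\n']

# Strip each text node and drop the ones that were whitespace-only.
stripped = [s.strip() for s in raw if s.strip()]

# The pieces are clean, but they are still one item per text node,
# not one item per ingredient.
```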
Each of the ingredients is inside a div, like this:
<div itemprop="recipeIngredient" >
<span>3 tablespoons</span>
<span>
extra-virgin olive oil
</span>
<span> </span>
</div>
If I use normalize-space, like this:
normalize-space(//*[@itemprop="recipeIngredient"])
I get only this:
3 tablespoons extra-virgin olive oil
which is exactly the format I want, but I need it for every div, not only the first one.
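This follows from XPath 1.0: when a string function such as normalize-space() is given a node-set, it uses the string value of the first node only. So the normalization has to be applied to each div separately. Below is a minimal sketch of that per-node idea. It uses the standard library's ElementTree instead of Scrapy so that it runs on its own. The `SAMPLE` markup and the `ingredient_list` helper are hypothetical, and `" ".join(text.split())` only approximates normalize-space().

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup modeled on the div shown above.
SAMPLE = """<html><body>
<div itemprop="recipeIngredient">
<span>3 tablespoons</span>
<span>
extra-virgin olive oil
</span>
<span> </span>
</div>
<div itemprop="recipeIngredient">
<span>1/2 cup</span>
<span>
gin
</span>
<span> </span>
</div>
</body></html>"""


def ingredient_list(markup):
    """Return one whitespace-normalized string per recipeIngredient node."""
    root = ET.fromstring(markup)
    result = []
    # Select the ingredient nodes first, then normalize each one on its own.
    for node in root.iterfind(".//*[@itemprop='recipeIngredient']"):
        # Concatenate all descendant text, like the XPath string value of the node.
        text = "".join(node.itertext())
        # Collapse whitespace runs and trim the ends (approximates normalize-space).
        result.append(" ".join(text.split()))
    return result


ingredients = ingredient_list(SAMPLE)
```

In Scrapy, the analogous step would be to select the divs with //*[@itemprop="recipeIngredient"] and then evaluate normalize-space(.) relative to each resulting selector, for example by calling .xpath('normalize-space(.)') on each one in a loop.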
Any help would be appreciated.