0
votes

I'm trying to extract ingredients from this site (using Python, Scrapy and XPath only):
http://www.myrecipes.com/recipe/gin-orange-juice-braised-endives

I use the following XPath:

//*[@itemprop="recipeIngredient"]/descendant-or-self::*/text()

I need the ingredients as a list like this:

 ["3 tablespoons extra-virgin olive oil",
 "10 medium Belgian endives, halved lengthwise",
 "1/2 cup gin",
 "Salt and freshly ground black pepper"
 ...]

But it gives me a lot of spaces inside:

[u'\n  ', u'3 tablespoons', u'\n  ', u' \n                extra-virgin olive oil\n             ', u'\n  ', u' ', u'\n', u'\n  ', u'10 ', u'\n  ', u' \n                medium Belgian endives, halved lengthwise\n             ', u'\n  ', u' ', u'\n', u'\n  ', u'1/2 cup', u'\n  ', u' \n                gin\n             ', u'\n  ', u' ', u'\n', u'\n  ', u' ', u'\n  ', u' \n                Salt and freshly ground black pepper\n             ', u'\n  ', u' ', u'\n', u'\n  ', u'1 cup', u'\n  ', u' \n                fresh orange juice\n             ', u'\n  ', u' ', u'\n', u'\n  ', u'4 tablespoons', u'\n  ', u' \n                unsalted butter\n             ', u'\n  ', u' ', u'\n', u'\n  ', u'2 tablespoons', u'\n  ', u' \n                honey\n             ', u'\n  ', u' ', u'\n', u'\n  ', u'2 ', u'\n  ', u' \n                scallions, white and pale green parts only, thinly sliced\n             ', u'\n  ', u' ', u'\n', u'\n  ', u'2 tablespoons', u'\n  ', u' \n                salted roasted pumpkin seeds\n             ', u'\n  ', u' ', u'\n', u'\n  ', u' ', u'\n  ', u' \n                Balsamic vinegar, for drizzling\n             ', u'\n  ', u' ', u'\n']

After stripping each item with Python (2.7), I get:

 ["3 tablespoons",
 "extra-virgin olive oil",
 "10",
 "medium Belgian endives, halved lengthwise",
 "1/2 cup",
 "gin",
 "Salt and freshly ground black pepper",
 ...]

Each of the ingredients is inside a div, like this:

<div itemprop="recipeIngredient"  >
  <span>3 tablespoons</span>
  <span> 
                extra-virgin olive oil
             </span>
  <span> </span>
</div>

If I use normalize-space, like this:

normalize-space(//*[@itemprop="recipeIngredient"])

I get only this:

3 tablespoons extra-virgin olive oil

which is amazing, but I need all the divs and not only the first one.

Any help would be appreciated.

3 Answers

0
votes

Try the XPath expression below (it uses a function call as a path step, which requires an XPath 2.0 processor):

//div[@itemprop="recipeIngredient"]/string(normalize-space())
0
votes

In the end I had to use Python: loop over the divs and run a second XPath on each one:

if response.xpath('//*[@itemprop="recipeIngredient"]'):
    ingredients = []
    for item in response.xpath('//div[@itemprop="recipeIngredient"]'):
        item = item.xpath("span/text()").extract()
        item = " ".join([" ".join(elem.split()) for elem in item])
        ingredients.append(item)

    raw_recipe["ingredients"] = ingredients

Result (with an extra space, but I don't mind):

["3 tablespoons extra-virgin olive oil ", "10 medium Belgian endives, halved lengthwise ", "1/2 cup gin ", " Salt and freshly ground black pepper ", "1 cup fresh orange juice ", "4 tablespoons unsalted butter ", "2 tablespoons honey ", "2 scallions, white and pale green parts only, thinly sliced ", "2 tablespoons salted roasted pumpkin seeds ", " Balsamic vinegar, for drizzling "]
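
The extra spaces come from spans that contain only whitespace: each one collapses to an empty string, and joining with " " still puts a separator next to it. A minimal sketch of one way to avoid that is to join the fragments first and collapse whitespace once at the end. join_fragments is a hypothetical helper name, and the sample lists mimic what span/text() returns for a single div.

```python
def join_fragments(fragments):
    """Join text fragments and collapse every whitespace run to one space."""
    # str.split() with no arguments splits on any whitespace run and drops
    # empty pieces, so leading and trailing whitespace disappears as well.
    return " ".join(" ".join(fragments).split())


# Fragments as span/text() would return them for two of the divs.
olive_oil = [u"3 tablespoons", u" \n                extra-virgin olive oil\n             ", u" "]
salt = [u" ", u" \n                Salt and freshly ground black pepper\n             ", u" "]

print(join_fragments(olive_oil))
print(join_fragments(salt))
```

In the loop above, this would replace the nested join on the line that builds each item.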
-1
votes

Use the jQuery script below:

var $=jQuery;
var list=[];
$('.field-ingredients').each(function () {
  var ingredient=[];
  $(this).find('span').each(function () {
    ingredient.push($(this).text().trim());
  });
  list.push(ingredient.join(" ").trim());
});

console.log(list);