0
votes

I am trying to get the value of an element attribute from this site via importXML in Google Spreadsheet using XPath.

The attribute value i seek is content found in the <span> with itemprop="price".

<div class="left" style="margin-top: 10px;">
    <meta itemprop="currency" content="RON">
        <span class="pret" itemprop="price" content="698,31 RON">
            <p class="pret">Pretul tau:</p>
            698,31 RON
        </span>
...
</div>

I can access <div class="left"> but i can't get to the <span> element.

Tried using:

  • //span[@class='pret']/@content i get #N/A;
  • //span[@itemprop='price']/@content i get #N/A;
  • //div[@class='left']/span[@class='pret' and @itemprop='price']/@content i get #N/A;
  • //div[@class='left']/span[1]/@content i get #N/A;
  • //div[@class='left']/span/text() to get the text node of <span> i get #N/A;
  • //div[@class='left']//span/text() i get the text node of a <span> lower in div.left.

To get the text node of <span> i have to use //div[@class='left']/text(). But i can't use that text node because the layout of the span changes if a product is on sale, so i need the attribute.

It's like the span i'm looking for does not exist, although it appears in the development view of Chrome and in the page source and all XPath work in the console using $x("").

I tried to generate the XPath directly form the development tool by right clicking and i get //*[@id='produs']/div[4]/div[4]/div[1]/span which does not work. I also tried to generate the XPath with Firefox and plugins for FF and Chrome to no avail. The XPath generated in these ways did not even work on sites i managed to scrape with "hand coded XPath".

Now, the strangest thing is that on this other site with apparently similar code structure the XPath //span[@itemprop='price']/@content works.

I struggled with this for 4 days now. I'm starting to think it's something to do with the auto-closing meta tag, but why doesn't this happen on the other site?

3

3 Answers

2
votes

Perhaps the following formulas can help you:

=ImportXML("http://...","//div[@class='product-info-price']//div[@class='left']/text()")

Or

=INDEX(ImportXML("http://...","//div[@class='product-info-price']//div[@class='left']"), 1, 2)

UPDATE

It seems that not properly parse the entire document, it fails. A document extraction, something like:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html>
<div class="product-info-price">
    <div class="left" style="margin-top: 10px;">
        <meta itemprop="currency" content="RON">
        <span class="pret" itemprop="price" content="698,31 RON">
            <p class="pret">Pretul tau:</p>
            698,31 RON
        </span>
        <div class="resealed-info">
            <a href="/resigilate/componente-pc/placi-de-baza/" rel="nofollow">» Vezi 1 resigilat din aceasta categorie</a>
        </div>
        <ul style="margin-left: auto;margin-right: auto;width: 200px;text-align: center;margin-top: 20px;">
            <li style="color: #000000; font-size: 11px;">Rata de la <b>28,18 RON</b> prin <a href="http://www.marketonline.ro/rate-sapte-stele?amount=698.31#brdfinance" title="BRD Finance" target="_blank" class="rate" rel="nofollow">BRD</a></li>
            <li style="color: #5F5F5F;text-align: center;">Pretul include TVA</li>
            <li style="color: #5F5F5F;">Cod produs: <span style="margin-left: 0;text-align: center;font-weight: bold;" itemprop="identifier" content="mol:GA-Z87X-UD3H">GA-Z87X-UD3H</span> </li>
        </ul>
    </div>
    <div class="right" style="height: 103px;line-height: 103px;">
        <form action="/?a=shopping&amp;sa=addtocart" method="post" id="add_to_cart_form">
            <input type="hidden" name="product-183641" value="on"/>
            <a href="/adaugaincos-183641" rel="nofollow"><img src="/templates/marketonline/images/pag-prod/buton_cumpara.jpg"/></a>
        </form>
    </div>
</div>
</html>

works with the following XPath query:

"//div[@class='product-info-price']//div[@class='left']//span[@itemprop='price']/@content"

UPDATE

It occurs to me that one option is that you can use Apps Script to create your own ImportXML function, something like:

/* CODE FOR DEMONSTRATION PURPOSES */
function MyImportXML(url) {
  var found, html, content = '';
  var response = UrlFetchApp.fetch(url);
  if (response) {
    html = response.getContentText();
    if (html) content = html.match(/<span class="pret" itemprop="price" content="(.*)">/gi)[0].match(/content="(.*)"/i)[1];
  }
  return content;
}

Then you can use as follows:

=MyImportXML("http://...")
1
votes

At this time, the referred web page in the first link doesn't include a span tag with itemprop="price", but the following XPath returns 639

//b[@itemprop='price']

Looks to me that the problem was that the meta tag was not XHTML compliant but now all the meta tags are properly closed.

Before:

<meta itemprop="currency" content="RON">

Now

<meta itemprop="priceCurrency" content="RON" />

For web pages that are not XHTML compliant, instead of IMPORTXML another solution should be used, like using IMPORTDATA and REGEXEXTRACT or Google Apps Script, the UrlFetch Service and the match JavasScript function, among other alternatives.

0
votes

Try smth like this:

print 'content by key',tree.xpath('//*[@itemprop="price"]')[0].get('content')

or

nodes = tree.xpath('//div/meta/span')
for node in nodes:
    print 'content =',node.get('content')

But i haven't tried that.