ImportXML and Google Spreadsheet issue

Question

I'm 'scraping' a few product descriptions from a website and bringing them into a Google spreadsheet using importXML.

It has gone fairly smoothly, but there is one major snag that I would love to correct, and I need your help!

The website in question prohibits those posting products from including contact information (usually email addresses) in the product description. Sometimes people ignore the rule and include the contact information anyway. When this occurs, the website automatically hides the contact information in the product description, replacing it with [obscured], as in "...please feel free to contact me at [obscured]" or something close to that. The [obscured] appears in a different colour, and is obviously treated differently by the website.

When these product descriptions are imported into my spreadsheet, the [obscured] causes the scraped text to be 'bumped': the description text stops just before [obscured], the word [obscured] appears in an adjacent cell all by itself, and the description text that follows [obscured] then continues in a third cell.

This separation ruins the alignment and logic in my spreadsheet, as product descriptions having an [obscured] word become broken up and misaligned from those that do not.

I would love to have my importXML or XPath accommodate this, and essentially 'ignore' the [obscured]. I don't mind it being included in the scraped description, but I want to stop the breaking-up into 3 separate adjacent cells.

The [obscured] is part of a 'span' that occasionally appears inside the element with the description class 'desc' that I am calling.

Is there a way to do this? That is, can I instruct importXML to import the 'desc' class but ignore or omit the span that might sometimes appear within it?

I've included the source code (inspect element in Safari) below:

<div class="desc descFull collapsed">
<span class="obscureText">[obscured]</span>

As mentioned, this span only occurs in some of the product descriptions, not all of them. Does anyone know what kind of language I would use in the importXML to call the 'desc' but ignore the 'span', or prevent the splitting into 3 cells when the [obscured] is encountered??

My current call is

=ImportXML(A1,"//div[@class='desc']")

which works fine, unless the [obscured] span is encountered.

Thank you for any help you can give!

Unknown Unknown · Accepted Answer · 2014-03-19T01:08:04

Unless Google Drive deviates from the XPath specification, XPath has no notion of CSS classes the way CSS selectors do. It treats class as an ordinary attribute whose value is a plain string.

The XPath //div[@class='desc'] will only match a div element whose class attribute is literally "desc". It won't match "desc descFull collapsed", because that is a different string.
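
The exact-match behaviour described above can be reproduced outside the spreadsheet. The following is only a sketch in Python's standard library, not what ImportXML runs internally: the page fragment is a hypothetical one modeled on the markup quoted in the question, and ElementTree implements only a subset of XPath. In full XPath 1.0 the usual idiom for matching one class token is contains(concat(' ', normalize-space(@class), ' '), ' desc '); whether ImportXML accepts that expression is an assumption worth testing in the sheet.

```python
import xml.etree.ElementTree as ET

# Hypothetical page fragment modeled on the markup quoted in the question.
PAGE = (
    "<html><body>"
    '<div class="desc descFull collapsed">Contact me at '
    '<span class="obscureText">[obscured]</span> for details.</div>'
    "</body></html>"
)
root = ET.fromstring(PAGE)

# The predicate compares the whole attribute string, so 'desc' alone finds nothing.
exact = root.findall(".//div[@class='desc']")

# Spelling out the full literal attribute value does match.
full = root.findall(".//div[@class='desc descFull collapsed']")

# Token-wise check done in Python, because ElementTree's XPath subset
# has no contains() function.
by_token = [d for d in root.iter("div") if "desc" in d.get("class", "").split()]

print(len(exact), len(full), len(by_token))
```

Here the first query returns no elements, while the other two each return the single div.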

As for excluding the text of the obscured node, that would require selecting the text nodes and excluding one of them, which would return a node-set, not a string, and you wouldn't be able to concatenate the nodes back together using XPath 1.0. If Google Drive uses XPath 2.0 it might be possible, using the techniques in that linked question.
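
To illustrate why the description lands in three cells, here is a hedged Python sketch using the same hypothetical fragment. It shows that the div holds three separate text nodes, and how they could be joined (with or without the span) once the data is outside XPath. The helper text_without is an illustrative name, not part of any library, and none of this is claimed to mirror how ImportXML works internally.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment modeled on the markup quoted in the question.
DIV = (
    '<div class="desc descFull collapsed">Contact me at '
    '<span class="obscureText">[obscured]</span> for details.</div>'
)
div = ET.fromstring(DIV)

# itertext() yields every text node in document order: three pieces here,
# matching the three cells described in the question.
pieces = list(div.itertext())

# Keeping [obscured] but producing one string is a plain join.
joined = "".join(pieces)


def text_without(element, skipped_class):
    """Concatenate an element's text, skipping children that carry skipped_class."""
    parts = [element.text or ""]
    for child in element:
        if skipped_class not in child.get("class", "").split():
            parts.append(text_without(child, skipped_class))
        # The tail is the text that follows the child, so it is always kept.
        parts.append(child.tail or "")
    return "".join(parts)


# Dropping the span leaves a double space behind, so collapse whitespace.
cleaned = " ".join(text_without(div, "obscureText").split())

print(pieces)
print(joined)
print(cleaned)
```

The join step is the part that a node-set in XPath 1.0 cannot express, which is why the sketch does it in the host language instead.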
