Scraping TripAdvisor link with Google Sheets' importxml function not working

Question

I'm trying to scrape a link from TripAdvisor using the IMPORTXML function in Google Sheets. Here is an example page:

http://www.tripadvisor.com/Restaurant_Review-g34127-d491231-Reviews-Celebration_Town_Tavern-Celebration_Florida.html

The link I want is the one behind the "Great Vibe, Great Food" review title. Its href is:

/ShowUserReviews-g34127-d491231-r257722735-Celebration_Town_Tavern-Celebration_Florida.html#REVIEWS

The challenge is that I would like to pull the equivalent link (just the latest review) from multiple TripAdvisor pages, and the id within the tag changes from page to page.

I have tried using this XPath:

"//*[@class='wrap']/@href"

This is not working.

Welcome to Stack Overflow! Please edit your question to add sample HTML snippets for various pages you want to scrape properly. — Nathan Tuggy
You're going to have to give us some examples of the inputs, the expected output, and the actual output, in the question itself. "This is not working" is not enough information for us to help you. — Jeffrey Bosboom

bjimba · Accepted Answer · 2015-03-08T00:25:11

I grabbed a bit of the source:

<div class="wrap">
  <div class="quote isNew">
    <a href="/ShowUserReviews-g34127-d491231-r257722735-Celebration_Town_Tavern-Celebration_Florida.html#REVIEWS" onclick="ta.setEvtCookie('Reviews','title','',0,this.href); ta.util.cookie.setPIDCookie('4442')" id="rn257722735">&#x201c;<span class='noQuotes'>Great Vibe, Great Food</span>&#x201d;</a>
  </div>

You tried //*[@class='wrap']/@href which says "find any element with a class attribute = 'wrap', and give me that element's href attribute's value". It finds <div class="wrap"> which has no href attribute.

You need to find the anchor (element <a>) and get its href. Since there is another div level, you need something like:

//div[@class='wrap']/div[@class='quote isNew']/a/@href

I'll leave it to you to analyze the input source for the specific rules you need. The important part is to end up selecting the <a> element and getting the @href from there.
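
If you want to check the two expressions outside of Sheets, here is a minimal sketch using Python's standard-library ElementTree. A few assumptions to flag: the snippet above is closed off with a final </div> and wrapped in a hypothetical <body> root so that it parses as XML, the onclick attribute is dropped for brevity, and ElementTree's limited XPath subset does not accept a trailing /@href, so the code selects the element and then reads its href attribute. In Sheets itself, the corrected expression would simply go in as the second argument to IMPORTXML.

```python
import xml.etree.ElementTree as ET

# Snippet from the answer, with the outer <div> closed so it parses as XML.
# The onclick attribute is omitted here for brevity.
SNIPPET = """
<div class="wrap">
  <div class="quote isNew">
    <a href="/ShowUserReviews-g34127-d491231-r257722735-Celebration_Town_Tavern-Celebration_Florida.html#REVIEWS" id="rn257722735">&#x201c;<span class='noQuotes'>Great Vibe, Great Food</span>&#x201d;</a>
  </div>
</div>
"""

# Hypothetical <body> wrapper so the descendant searches have a root to start from.
root = ET.fromstring("<body>" + SNIPPET + "</body>")

# Original attempt: this matches <div class="wrap">, which carries no href,
# so reading the attribute yields None for that element.
wrap_hrefs = [el.get("href") for el in root.findall(".//*[@class='wrap']")]

# Corrected path: walk down to the <a> element, then read its href.
anchors = root.findall(".//div[@class='wrap']/div[@class='quote isNew']/a")
review_hrefs = [a.get("href") for a in anchors]

print(wrap_hrefs)
print(review_hrefs)
```

The first list has one entry with no href value, which mirrors why the original expression returns nothing useful; the second list holds the review link. Note that [@class='quote isNew'] is an exact string match on the whole class attribute, so pages where that attribute differs would need a different predicate.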
