2
votes

So I'm trying to import a movie title in Google Sheets using the ImportXML-function, which wants me to give it an Xpath-query. The problem is BoxOfficeMojo, which is structured in a very odd manner, which makes it impossible for me to figure out how to query.

Everything is working all fine and dandy in Google Sheets, but I'm simply unable to generate the correct Xpath-query, that gives me just the title of the movie, when providing the ImportXML-cell with a link.

This is what I have to work with:

I've tried a number of different queries, including the one Chrome will generate for me when I inspect the site, but BoxOfficeMojo is structured in a weird way, which means I can't use any logic to create my query. I've been at it for a few hours now, and the closest I've come to getting the right result is this query:

//*[//table[@border = '0']]/td/font/b

Which gives me this result:

The A-Team

4

I cannot for the life of me figure out, how I filter out the title of the movie ("The A-Team" in this case).

For good measure, this is what Chrome suggests as the Xpath, when I inspect the site:

//*[@id="body"]/table[2]/tbody/tr/td/table[1]/tbody/tr/td[2]/font/b

The above query doesn't work, and just throws an error, as if the query returns nothing...

Possible Solution I've seen some people who have successfully made an Xpath-query, where they search for a certain phrase, and return the content after this phrase (look at this: python: xpath returns empty list from boxofficemojo.com).

However, I cannot seem to figure out how I would search for the title of the movie, since it's not surrounded by any text.

The whole point of scraping something as simple as this title is, that the titles on BoxOfficeMojo changes throughout the year, and I need the exact title in order for me to scrape and compare data in Google Sheets.

Hopefully I've provided an Xpath-wizard in the crowd with enough information to be able to come up with a suggestion for me.

Thanks in advance!

2
Can't help with the fineries for XPATH, but xpath_emitting_cmd ... args... | sed -n '2p' will print just the 2nd line of your output (assuming a linux shell or terminal). (Just an idea if you really get stuck). Good luck! - shellter
Thanks. I don't think that will go over well in Google Sheets unless I script something out. I'll keep it in mind. I have a Scrapy-project going, so it might come in handy, if I decide to scrape the data I need around that way. - TheRecruit
Ok, that makes sense. Didn't quite get the limitation about google-sheets. Otherwise an excellent Q. I hope you get an XPath guru to help you. Good luck. - shellter
Try (//table//font/b)[1] or //font[@size="6"]/b or //font[@face="Verdana"]/b - Andersson
I've checked one page only. If The font size can be, for instance 5 or 6, you can try //font[@size > 5]/b. Elements on page have too little unique attributes, so it might take some time do define generic XPath. You can try to combine predicates as //font[@size > 5 and @face="Verdana"]/b... - Andersson

2 Answers

2
votes

I tried this xpaths for a few movies and worked fine

//font[@face="Verdana" and @size="6"]/b
2
votes

As Andersson kindly suggested, using this query:

//font[@size="6"]/b

Did the trick :)

EDIT:

It turned out some movies would have their titles formatted in a way that caused the result to be split among several cells, and thus break my spreadsheet.

This solution however, seems to be working for all movies, and will return a single cell with the titles only:

=JOIN(" ";IMPORTXML(H81;"(//font[@face='Verdana']/b)[2]"))