I'm having issues scraping Twitter pages using IMPORTXML in Google Sheets.

The setup below was working fine last week, but it now returns the error "Imported XML content cannot be parsed."

The URL is https://twitter.com/search?q=anyone%20recommend%20restaurant%20london%20since%3A2015-03-16%20until%3A2015-09-16

and the XPath is "//span[@class='username js-action-profile-name']"

Thanks so much. So I'm guessing it's for Twitter to fix. Very annoying, though, as it was working fine last week. - Andy Allen

1 Answer

The message is correct: the data at that URL is not valid XML. For instance, the line:

<noscript><meta http-equiv="refresh" content="0; URL=https://mobile.twitter.com/i/nojs_router?path=%2Fsearch&amp;q=anyone%20recommend%20restaurant%20london%20since%3A2015-03-16%20until%3A2015-09-16"></noscript>

is not valid XML, because the meta element is never closed. Likewise, the script element contains a lot of reserved characters that are not escaped.
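To see the difference for yourself, here is a minimal sketch in Python using the standard library's strict XML parser. The two fragments and the `is_well_formed` helper are made up for illustration (they use a shortened example.com URL, not the real Twitter markup); the point is only that an unclosed meta element is fine as HTML but is rejected by an XML parser.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: <meta> is left open, which HTML allows but XML does not.
html_fragment = (
    '<noscript><meta http-equiv="refresh" '
    'content="0; URL=https://example.com/"></noscript>'
)

# The same fragment with <meta/> self-closed, as XML requires.
xml_fragment = (
    '<noscript><meta http-equiv="refresh" '
    'content="0; URL=https://example.com/"/></noscript>'
)


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(html_fragment))  # False
print(is_well_formed(xml_fragment))   # True
```

An XML-based importer will presumably hit the same kind of parse error on the live page, which would explain the message you are seeing.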

Unless you use some kind of tool that turns HTML into a DOM tree, there is not much you can do with that document. One option is a tool like Selenium, which gives you access to the DOM tree that a browser builds.
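As a rough illustration of the "tool that tolerates HTML" route, here is a sketch using Python's built-in `html.parser`, which does not require well-formed XML. The `UsernameExtractor` class and the `sample` markup are hypothetical; the real page structure may differ, and a page that is rendered by JavaScript would still need something like Selenium. It collects the text of every span whose class attribute matches the one in the question's XPath.

```python
from html.parser import HTMLParser


class UsernameExtractor(HTMLParser):
    """Collect the text inside spans with the class used in the question."""

    TARGET_CLASS = "username js-action-profile-name"

    def __init__(self):
        super().__init__()
        self.depth = 0        # > 0 while inside a matching span
        self.usernames = []   # extracted text, one entry per matching span
        self._buffer = []     # text pieces of the span being read

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        if self.depth > 0:
            # Nested span inside a match: track it so we close correctly.
            self.depth += 1
        elif dict(attrs).get("class") == self.TARGET_CLASS:
            self.depth = 1
            self._buffer = []

    def handle_endtag(self, tag):
        if tag == "span" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.usernames.append("".join(self._buffer).strip())

    def handle_data(self, data):
        if self.depth > 0:
            self._buffer.append(data)


# Made-up sample with an unclosed <meta>, which would break an XML parser.
sample = """<html><head><meta charset="utf-8"></head><body>
<span class="username js-action-profile-name"><s>@</s><b>example_user</b></span>
<span class="other">ignored</span>
</body></html>"""

parser = UsernameExtractor()
parser.feed(sample)
print(parser.usernames)  # ['@example_user']
```

The exact-match test on the class attribute mirrors what the XPath `@class='...'` comparison does; a looser match on individual class names would need a small change.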

Since you are scraping Twitter, you would probably be better off using the Twitter REST API. It is much easier and more robust.