I am attempting to scrape (possibly dynamic?) content from a webpage using the rvest package. I understand that dynamic content normally requires tools such as Selenium or PhantomJS.
However, my experimentation leads me to believe I should still be able to find the content I want using only standard web-scraping R packages (rvest, httr, xml2).
For this example I will be using a google maps webpage. Here is the example url...
https://www.google.com/maps/dir/920+nc-16-br,+denver,+nc,+28037/2114+hwy+16,+denver,+nc,+28037/
If you follow the hyperlink above, it will take you to an example webpage. The content I want in this example is the pair of addresses "920 NC-16, Crumpler, NC 28617" and "2114 NC-16, Newton, NC 28658" shown in the top-left corner of the page.
Standard techniques using a CSS selector or XPath did not work, which initially made sense, as I assumed this content was dynamic.
library(rvest)

url <- "https://www.google.com/maps/dir/920+nc-16-br,+denver,+nc,+28037/2114+hwy+16,+denver,+nc,+28037/"
page <- read_html(url)
# The commands below all return {xml nodeset 0}
html_nodes(page, css = ".tactile-searchbox-input")
html_nodes(page, css = "#sb_ifc50 > input")
html_nodes(page, xpath = '//*[contains(concat( " ", @class, " " ), concat( " ", "tactile-searchbox-input", " " ))]')
The commands above all return "{xml_nodeset (0)}", which I thought was a result of this content being generated dynamically. But here's where my confusion lies: if I convert the whole page to text using html_text(), I can find the addresses in the value returned.
x <- html_text(read_html(url))
substring <- substr(x, 33561 - 100, 33561 + 300)

Executing the commands above results in a substring with the following value:
"null,null,null,null,[null,null,null,null,null,null,null,[[[\"920 NC-16, Crumpler, NC 28617\",null,null,null,null,null,null,null,null,null,null,\"Nzm5FTtId895YoaYC4wZqUnMsBJ2rlGI\"]\n,[\"2114 NC-16, Newton, NC 28658\",null,null,null,null,null,null,null,null,null,null,\"RIU-FSdWnM8f-IiOQhDwLoMoaMWYNVGI\"]\n]\n,null,null,0,null,[[null,null,null,null,null,null,null,3]\n,[null,null,null,null,[null,null,null,null,nu"
The substring is very messy but contains the content I need. I've heard that parsing webpages with regex is frowned upon, but I cannot think of any other way to obtain this content that also avoids the use of dynamic scraping tools.
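For what it's worth, this is the kind of regex extraction I have in mind. It is only a sketch: the pattern (a quoted string starting with a street number and ending in ", NC " plus a five-digit ZIP) is an assumption about how these addresses happen to look, and the `raw_text` value is a hard-coded snippet standing in for the real `html_text()` output.

```r
# Stand-in for html_text(read_html(url)); a simplified snippet of the
# substring shown above, hard-coded here for illustration
raw_text <- '[[["920 NC-16, Crumpler, NC 28617",null,null],["2114 NC-16, Newton, NC 28658",null,null]]'

# Hypothetical pattern: a quoted string beginning with a street number
# and ending with ", NC " followed by a five-digit ZIP code
pattern <- '"\\d+ [^"]*, NC \\d{5}"'

# Extract every match, then strip the surrounding quotes
matches <- regmatches(raw_text, gregexpr(pattern, raw_text, perl = TRUE))[[1]]
addresses <- gsub('"', '', matches)
addresses
```

This pulls out both addresses from the snippet, but it obviously depends on Google not changing the embedded data format, which is part of why I'd prefer a more robust approach.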
If anyone has suggestions for parsing the returned HTML, or can explain why I am unable to find the content using XPath or CSS selectors but can find it by simply searching the raw page text, it would be greatly appreciated.
Thanks for your time.