
I am currently writing a POC script for a news-site web scraper. I am new to scraping but have basic familiarity with CSS selectors and XPaths after completing an API usage course on DataCamp. I went to the Bloomberg Europe homepage (I know they have an API; I just wanted a larger news website to test the code on) armed with SelectorGadget and Google Chrome's "select an element in the page to inspect it" tool, copied what I thought were the relevant CSS selectors and/or XPaths, and promptly received an empty list when I fed any of them to rvest::html_nodes().

The code I was using is here:

    library(rvest)

    url <- "https://www.bloomberg.com/europe"

    webpage <- read_html(url)

    xpath_id <- '//*[contains(concat( " ", @class, " " ), concat( " ", "story-package-module__story__headline-link", " " ))]'
    titles_html_xpath <- html_nodes(webpage, xpath = xpath_id)
    # xpath returns empty list, try css

    titles_html_selectorgadget <- html_nodes(webpage, css = ".story-package-module__story__headline")
    # also empty, try alternative class tag

    titles_html_selectorgadget2 <- html_nodes(webpage, css = ".story-package-module__story mod-story")
    # still empty!

Any advice as to what the correct selector is (to get article titles in this case), and more importantly how I should go about working out which CSS selector I need in future cases, especially when there are so many CSS classes layered on top of each other and the selector recommended by SelectorGadget is wrong?


1 Answer


Your problem is not the selectors you are using. The problem is that when you send the HTTP request to www.bloomberg.com, the server detects that you are not a standard web browser and blocks you, because it doesn't want to be scraped. Look:

    library(rvest)
    url <- "https://www.bloomberg.com/europe"
    webpage <- read_html(url)
    html_text(webpage)

    # [1] "Bloomberg - Are you a robot?\n     ... <truncated>

So the html that you are getting from rvest is not the same as the html you are seeing in the developer panel in Chrome.

There may be some workarounds involving changing your user-agent string in httr, using RSelenium to scrape the page, or even starting a headless Firefox browser in RSelenium and copying its cookies over to httr. As a rough illustration, something like the sketch below sends a browser-like User-Agent header with httr, though there is no guarantee it gets past Bloomberg's bot detection:
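    library(httr)

    # Sketch only: attach a browser-like User-Agent header to the request.
    # The UA string here is arbitrary; Bloomberg may still serve the
    # "Are you a robot?" page regardless.
    ua <- user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.88 Safari/537.36")
    resp <- GET("https://www.bloomberg.com/europe", ua)
    webpage <- read_html(content(resp, as = "text", encoding = "UTF-8"))

It is probably easier to use the API, though, or to try parsing the headlines from the news sitemap: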

    library(xml2)

    # the news: prefix is declared in the sitemap itself, so xml2 can resolve it
    node_set <- read_xml("https://www.bloomberg.com/feeds/bbiz/sitemap_news.xml")
    print(head(xml_text(xml_nodes(node_set, xpath = "//news:title"))))

    # [1] "Partners In Health Co-Founder Dr. Paul Farmer on U.S. Healthcare"                      
    # [2] "Partners In Health Co-Founder Dr. Paul Farmer on private, public intersection of funds"
    # [3] "Canada's Trudeau on Losing the Majority in Parliament"                                 
    # [4] "Icehotel Back In Business"                                                             
    # [5] "Can Nostalgia Revive Star Wars?"     

However, for the purposes you describe, it would be better to just pick a different news site on which to practice. The BBC News site should be fine:

    library(rvest)
    url <- "https://www.bbc.co.uk/news"
    webpage <- read_html(url)
    headline_nodes <- html_nodes(webpage, "h3")
    headlines <- html_text(headline_nodes)
    print(head(headlines))

    # [1] "Washing machine danger revealed as recall launched"
    # [2] "Washing machine danger revealed as recall launched"
    # [3] "Black cab rapist 'might never cease to be risk'"   
    # [4] "Brexit bill to rule out extension"                 
    # [5] "'I was getting beaten up while I was asleep'"      
    # [6] "Trump pens irate impeachment letter to Pelosi" 

A good tip here: if you run into problems parsing HTML, make sure you are actually getting the HTML you think you are. Lots of pages are dynamically loaded through JavaScript, which can cause chunks of the page you see in the browser to be missing. Or, as in this case, you may be given an unexpected page by the server. You can check whether you have the right page by doing

    library(httr)

    # writeClipboard() is Windows-only; it copies the raw response text to the clipboard
    writeClipboard(content(GET(url), "text"))

and inspecting the HTML you are actually getting by pasting it into your favourite text editor.
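If you are not on Windows, writeClipboard() won't be available; here is a small sketch of the same check that writes the response to a temporary file instead (reusing the url variable from above):

    library(httr)

    # Cross-platform alternative: dump the raw HTML to a temp file and open it
    # in a text editor or browser to see what the server actually returned
    raw_html <- content(GET(url), as = "text", encoding = "UTF-8")
    out_file <- tempfile(fileext = ".html")
    writeLines(raw_html, out_file)
    out_file  # path of the file to inspect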