R: Issues when scraping tables from FIFA using rvest

Question

I'm trying to scrape data for every team that has participated in the World Cup at least once in the past 30 years.

My knowledge of how to use the R package rvest to scrape tables and whatnot from the web is rudimentary at best.

Currently, my code looks like this:

library(rvest)
library(dplyr)
fifadata <- read_html("http://www.fifa.com/fifa-tournaments/teams/association=BRA/index.html")
fifa_data_html <-  
  html_nodes(fifadata, 
         xpath='/html/body/div[1]/div[5]/div/div[4]/div/div[2]/div/div/div[1]/div/table') %>%
  html_table(header=FALSE, fill=TRUE)
fifa_data_html

The first table on the webpage is what I want to scrape, but when I run the above code, html_nodes() returns {xml_nodeset (0)}.

Any input into how to go about scraping the table in question properly would be much appreciated.

So you are trying to extract the part where Brazil 2014, South Africa 2010 is listed? — amrrs

MichaelChirico · Accepted Answer · 2017-11-22T06:07:23

Here's something. It's quite a mess:

xp = paste0('//li[@class="tbl-cupname"]/',
            'div[@class="label-data"]/',
            'span[@class="text"][text()="FIFA World Cup™"]/../../',
            'following-sibling::li[@class="tbl-appearances"]/',
            'div[@class="label-data"]/',
            'span[@class="text"]')
fifadata %>% html_nodes(xpath = xp) %>% html_text %>% as.integer
# [1] 20

Let's break down the logic.

The naive query:

fifadata %>% html_nodes(
    xpath = '//li[@class="tbl-appearances"]/div[@class="label-data"]/span'
)

is sufficient to get us the four rows giving the number of appearances in each of the four tournaments listed on this page. If the web designers are merciful, this is all you need -- just select the first of these from each page you'd like to scrape, and you'll have what you're after.

This is not robust, however -- it will give incorrect results whenever the row order changes, or if the row you want is absent.
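To see how the positional shortcut can go wrong, here's a minimal sketch on made-up markup. The HTML is an assumption modelled on the structure described in this answer (the real page may differ), the Confederations Cup count is invented, and make_page() and first_count() are hypothetical helpers:

```r
library(rvest)

# Hypothetical helper: build a tiny page with one <ul> per tournament row.
# The markup is an assumption based on the structure described in this answer.
make_page <- function(counts) {
  rows <- sprintf(
    paste0('<ul><li class="tbl-cupname"><div class="label-data">',
           '<span class="text">%s</span></div></li>',
           '<li class="tbl-appearances"><div class="label-data">',
           '<span class="text">%d</span></div></li></ul>'),
    names(counts), counts
  )
  read_html(paste0("<html><body>", paste(rows, collapse = ""), "</body></html>"))
}

# The naive approach: take the first appearances count found on the page.
first_count <- function(page) {
  counts <- page %>%
    html_nodes(xpath = '//li[@class="tbl-appearances"]/div[@class="label-data"]/span') %>%
    html_text() %>%
    as.integer()
  counts[1]
}

wc_first <- make_page(c("FIFA World Cup™" = 20L, "FIFA Confederations Cup" = 7L))
wc_later <- make_page(c("FIFA Confederations Cup" = 7L, "FIFA World Cup™" = 20L))

first_count(wc_first)  # the World Cup count
first_count(wc_later)  # silently returns the Confederations Cup count instead
```

Nothing errors in the second case, which is what makes the positional approach risky across many pages.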

The query presented takes care of this.

First, we identify the rows associated with the FIFA World Cup. The essential structure there is:

<li class="tbl-cupname">
  <div class="label-data">
    <span class="text"> tournament name </span>
  </div>
</li>

We use the class attributes since there are other li and div nearby that we want to be sure to exclude. So, we can select the four rows corresponding to the tournaments (FIFA World Cup, FIFA Confederations Cup, FIFA Women's World Cup, and Women's Olympic Football Tournament) with:

fifadata %>% html_nodes(xpath = '//li[@class="tbl-cupname"]')

Eliminating the three tournaments that are irrelevant to your pursuit requires a condition on the <span> element, hence the rest of the first part:

xp_part_1 = paste0('//li[@class="tbl-cupname"]/',
                   'div[@class="label-data"]/',
                   'span[@class="text"][text()="FIFA World Cup™"]')
fifadata %>% html_nodes(xpath = xp_part_1)
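As a small illustration of what the text() condition does, the sketch below runs the selector with and without it on made-up markup (an assumption, not the real page). It also tries normalize-space(), a standard XPath 1.0 function that may help if the real span text turns out to carry stray whitespace, since the exact text()= comparison would then find nothing:

```r
library(rvest)

# Made-up markup: two tournament names, the second padded with whitespace.
page <- read_html(paste0(
  '<html><body>',
  '<ul><li class="tbl-cupname"><div class="label-data">',
  '<span class="text">FIFA Confederations Cup</span></div></li></ul>',
  '<ul><li class="tbl-cupname"><div class="label-data">',
  '<span class="text"> FIFA World Cup™ </span></div></li></ul>',
  '</body></html>'
))

base <- '//li[@class="tbl-cupname"]/div[@class="label-data"]/span[@class="text"]'

# No condition: every tournament name is selected.
all_names <- page %>% html_nodes(xpath = base) %>% html_text()

# Exact comparison: fails here because of the padding.
exact <- page %>%
  html_nodes(xpath = paste0(base, '[text()="FIFA World Cup™"]'))

# Whitespace-tolerant comparison.
trimmed <- page %>%
  html_nodes(xpath = paste0(base, '[normalize-space(text())="FIFA World Cup™"]'))

length(all_names)  # 2
length(exact)      # 0
length(trimmed)    # 1
```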

This selects the tournament; however, we want the subsequent li, which contains the number of appearances. The core structure we're touching here is:

<li class="tbl-cupname"> </li>
<li class="tbl-appearances"> </li>

However, part 1 of the XPath has taken us two levels below this li, so we need to "ascend" twice with ../.. (each .. works exactly like cd .. in the Linux terminal to go up a level, so hopefully that's familiar).

We then use the following-sibling axis to select nodes that are at the same level as the current node but come after it.
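Here's a quick sketch of those two moves, checking which element we land on at each step. The markup is made up for illustration (an assumption about the page, with a single tournament row), and span_xp is just a name I picked:

```r
library(rvest)

# Made-up markup for a single tournament row (an assumption about the page).
page <- read_html(paste0(
  '<html><body><ul>',
  '<li class="tbl-cupname"><div class="label-data">',
  '<span class="text">FIFA World Cup™</span></div></li>',
  '<li class="tbl-appearances"><div class="label-data">',
  '<span class="text">20</span></div></li>',
  '</ul></body></html>'
))

span_xp <- paste0('//li[@class="tbl-cupname"]/div[@class="label-data"]/',
                  'span[@class="text"][text()="FIFA World Cup™"]')

# Element name reached at each step on the way up
at_span <- page %>% html_nodes(xpath = span_xp) %>% html_name()
up_one  <- page %>% html_nodes(xpath = paste0(span_xp, '/..')) %>% html_name()
up_two  <- page %>% html_nodes(xpath = paste0(span_xp, '/../..')) %>% html_name()
c(at_span, up_one, up_two)  # "span" "div" "li"

# From that li, step sideways to the later sibling
sibling_class <- page %>%
  html_nodes(xpath = paste0(span_xp, '/../../following-sibling::li')) %>%
  html_attr("class")
sibling_class  # "tbl-appearances"
```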

Once we're back on the same level as the li naming the tournament, we can continue with the "naive" query to drill down to the number of appearances.
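Since the goal is to do this for many teams, one way to package the query is sketched below. Both helpers are hypothetical: world_cup_appearances() returns NA when a page has no World Cup row, and scrape_team() assumes every team page follows the URL pattern from the question with a different association code, which I haven't verified. The offline check uses made-up markup with an invented Confederations Cup count:

```r
library(rvest)

xp <- paste0('//li[@class="tbl-cupname"]/',
             'div[@class="label-data"]/',
             'span[@class="text"][text()="FIFA World Cup™"]/../../',
             'following-sibling::li[@class="tbl-appearances"]/',
             'div[@class="label-data"]/',
             'span[@class="text"]')

# Hypothetical helper: World Cup appearances from a parsed page,
# or NA when the page has no World Cup row.
world_cup_appearances <- function(page) {
  n <- page %>% html_nodes(xpath = xp) %>% html_text() %>% as.integer()
  if (length(n) == 0) NA_integer_ else n[1]
}

# Hypothetical helper, not run here: assumes each team page uses the URL
# pattern from the question with its own association code.
scrape_team <- function(code) {
  url <- sprintf("http://www.fifa.com/fifa-tournaments/teams/association=%s/index.html",
                 code)
  world_cup_appearances(read_html(url))
}

# Offline check on made-up markup (one <ul> per tournament row is an assumption).
row <- function(name, n) {
  sprintf(paste0('<ul><li class="tbl-cupname"><div class="label-data">',
                 '<span class="text">%s</span></div></li>',
                 '<li class="tbl-appearances"><div class="label-data">',
                 '<span class="text">%d</span></div></li></ul>'), name, n)
}
with_wc <- read_html(paste0('<html><body>',
                            row("FIFA Confederations Cup", 7L),
                            row("FIFA World Cup™", 20L),
                            '</body></html>'))
without_wc <- read_html(paste0('<html><body>',
                               row("FIFA Confederations Cup", 7L),
                               '</body></html>'))

world_cup_appearances(with_wc)     # 20, even though the row is not first
world_cup_appearances(without_wc)  # NA
```

With a character vector of association codes in hand, something like vapply(codes, scrape_team, integer(1)) would then collect one count per team.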
