Here's something. It's quite a mess:
xp = paste0('//li[@class="tbl-cupname"]/',
'div[@class="label-data"]/',
'span[@class="text"][text()="FIFA World Cup™"]/../../',
'following-sibling::li[@class="tbl-appearances"]/',
'div[@class="label-data"]/',
'span[@class="text"]')
fifadata %>% html_nodes(xpath = xp) %>% html_text %>% as.integer
# [1] 20
Let's break down the logic.
The naive query:
fifadata %>% html_nodes(
xpath = '//li[@class="tbl-appearances"]/div[@class="label-data"]/span'
)
is sufficient to get the four rows giving the number of appearances in each of the four tournaments listed on this page. If the web designers are merciful, that is all you need: just select the first of these from each page you'd like to scrape, and you'll have what you're after.
This is not robust, however: it gives incorrect results whenever the row order changes or the row you want is absent.
The query presented at the top takes care of this.
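To see how the naive query goes wrong, here is a minimal sketch on a made-up page. The markup is an assumption (each tournament in its own ul, World Cup row deliberately second), the helper toy_row is hypothetical, and the 7 is an invented number:

```r
library(rvest)

# Hypothetical helper: builds one tournament row in the assumed markup.
toy_row <- function(cup, n) sprintf(paste0(
  '<ul><li class="tbl-cupname"><div class="label-data">',
  '<span class="text">%s</span></div></li>',
  '<li class="tbl-appearances"><div class="label-data">',
  '<span class="text">%d</span></div></li></ul>'), cup, n)

# Toy page where the World Cup row is NOT the first one.
toy <- read_html(paste0(
  toy_row("FIFA Confederations Cup", 7L),
  toy_row("FIFA World Cup\u2122", 20L)
))

naive <- toy %>%
  html_nodes(
    xpath = '//li[@class="tbl-appearances"]/div[@class="label-data"]/span'
  ) %>%
  html_text() %>%
  as.integer()

naive[1]  # the Confederations Cup count, not the World Cup one
```

Taking the first element silently returns the wrong tournament here, which is the failure the longer query avoids.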
First, we identify the rows associated with FIFA World Cup. The essential structure there is:
<li class="tbl-cupname">
<div class="label-data">
<span class="text"> tournament_name </span>
</div>
</li>
We use the class attributes since there are other li and div elements nearby that we want to be sure to exclude. So, we can select the four rows corresponding to the tournaments (FIFA World Cup, FIFA Confederations Cup, FIFA Women's World Cup, and Women's Olympic Football Tournament) with:
fifadata %>% html_nodes(xpath = '//li[@class="tbl-cupname"]')
Eliminating the three tournaments that are irrelevant to your pursuit requires a condition on the <span> element, hence the rest of the first part:
xp_part_1 = paste0('//li[@class="tbl-cupname"]/',
'div[@class="label-data"]/',
'span[@class="text"][text()="FIFA World Cup™"]')
fifadata %>% html_nodes(xpath = xp_part_1)
This selects the span naming the tournament, but we want the subsequent li, which contains the number of appearances. The core structure we're touching here is:
<li class="tbl-cupname"> </li>
<li class="tbl-appearances"> </li>
Part 1 of the xpath has navigated us two levels below this li, however, so we need to "ascend" the nodes with .. (this works exactly like cd .. in the Linux terminal to go up a level, so hopefully it's familiar). The query uses ../../ to climb from the span through the div and back up to the li.
We then use the following-sibling axis to select nodes that are at the same level as the current node but come after it.
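The ascent and the sibling step can be checked one hop at a time. This is only a sketch on a one-row toy page whose markup is assumed to match the structure described above; the variable names are illustrative:

```r
library(rvest)

# Toy page with a single tournament row (assumed markup).
toy <- read_html(paste0(
  '<ul><li class="tbl-cupname"><div class="label-data">',
  '<span class="text">FIFA World Cup\u2122</span></div></li>',
  '<li class="tbl-appearances"><div class="label-data">',
  '<span class="text">20</span></div></li></ul>'
))

# Start from the span that names the tournament.
span <- toy %>%
  html_node(xpath = '//li[@class="tbl-cupname"]//span[@class="text"]')

# One level up: the enclosing div.
up_one <- span %>% html_nodes(xpath = '..') %>% html_name()

# Two levels up: the li naming the tournament.
up_two <- span %>% html_nodes(xpath = '../..') %>% html_attr("class")

# From that li, step sideways to the li elements that come after it.
sibling <- span %>%
  html_nodes(xpath = '../../following-sibling::li') %>%
  html_attr("class")

up_one   # "div"
up_two   # "tbl-cupname"
sibling  # "tbl-appearances"
```

Note that following-sibling returns every later sibling that matches, so the [@class="tbl-appearances"] condition in the full query matters on pages where the row holds more than two li elements.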
Once we're back on the same level as the li naming the tournament, we can continue with the "naive" query to drill down to the number of appearances.
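If you need this for several pages or tournaments, the query can be wrapped in a small function. The sketch below is one possible way to do it, not part of the original answer: appearances_for is a hypothetical name, the toy page and the count of 7 are invented, and the function assumes the tournament name contains no double quote (it is pasted straight into the xpath):

```r
library(rvest)

# Hypothetical helper: appearances for one tournament, NA if the row is absent.
appearances_for <- function(page, cup) {
  xp <- sprintf(paste0(
    '//li[@class="tbl-cupname"]/div[@class="label-data"]/',
    'span[@class="text"][text()="%s"]/../../',
    'following-sibling::li[@class="tbl-appearances"]/',
    'div[@class="label-data"]/span[@class="text"]'), cup)
  out <- page %>% html_nodes(xpath = xp) %>% html_text() %>% as.integer()
  if (length(out) == 0L) NA_integer_ else out[1]
}

# Toy page in the assumed markup, with the World Cup row second.
toy <- read_html(paste0(
  '<ul><li class="tbl-cupname"><div class="label-data">',
  '<span class="text">FIFA Confederations Cup</span></div></li>',
  '<li class="tbl-appearances"><div class="label-data">',
  '<span class="text">7</span></div></li></ul>',
  '<ul><li class="tbl-cupname"><div class="label-data">',
  '<span class="text">FIFA World Cup\u2122</span></div></li>',
  '<li class="tbl-appearances"><div class="label-data">',
  '<span class="text">20</span></div></li></ul>'
))

wc <- appearances_for(toy, "FIFA World Cup\u2122")
missing_cup <- appearances_for(toy, "No Such Cup")
```

Returning NA for an absent row keeps the output length fixed at one per page, which makes it easier to spot teams that never appeared when the results are combined.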