rvest: getting links from css node error: no applicable method for 'xml_find_all'

Question

I would like to determine the number of pages from pagination on the page: https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=07&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false

============
Table
============
     Pagination: Link1, Link2, Link3, Link4, LinkNext,Link Last

Using SelectorGadget, I identified that the pagination is in ".pagination-container, a"

I would like to:

1. dump all the links in the pagination to a vector or data.frame
2. get the last number in the URL strings
3. determine the max number, indicating how many pages there are in the pagination, to use it later in a scraping loop

Following http://francojc.github.io/web-scraping-with-rvest/

I started with

library(tidyverse)
library(rvest)

url <- "https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=07&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false"

urls <- url %>% # feed `main.page` to the next step
  html_nodes(".pagination-container, a") %>% # get the CSS nodes
  html_text("href")

The html_nodes() call throws an error:

Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "character"

What am I doing wrong?

where's the read_html()? And, you likely want html_attr("href") vs html_text("href"). — hrbrmstr
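
To make the comment concrete, here is a minimal sketch of the asker's pipeline with both fixes applied. The HTML string is a hypothetical stand-in for the site's pagination markup (an assumption, not copied from the real page) so that the snippet runs offline. Note also that the comma in ".pagination-container, a" means "either selector", so it would match every link on the page; a space selects only links inside the container.

```r
library(rvest)

# Hypothetical pagination markup (an assumption), used instead of the live page
html <- '<div class="pagination-container"><ul class="pagination">
  <li><a href="/umowy/Provider/Index?ROK=2017&amp;page=1">1</a></li>
  <li><a href="/umowy/Provider/Index?ROK=2017&amp;page=2">2</a></li>
  <li class="PagedList-skipToLast"><a href="/umowy/Provider/Index?ROK=2017&amp;page=13">&gt;&gt;</a></li>
</ul></div>'

# Parse first: html_nodes() needs a parsed document, not a character string
pg <- read_html(html)

links <- pg %>%
  html_nodes(".pagination-container a") %>%  # space = descendant combinator
  html_attr("href")                          # read the href attribute, not the link text

# Pull the number after "page=" out of each link and take the largest
page_numbers <- as.integer(sub(".*page=([0-9]+).*", "\\1", links))
max_page <- max(page_numbers)
```

With a real page, replace the `html` string with the URL; the rest of the pipeline should stay the same as long as every pagination link carries a `page=` parameter.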

hrbrmstr · Accepted Answer · 2017-04-13T16:18:06

Beyond the "typo" (i.e. the missing call to read_html()), there's an easier way to get the total number of pages: just target the [>>] link in the paginator.

library(rvest)
library(stringi)
library(tidyverse)

url <- "https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=07&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false"

pg <- read_html(url)

html_nodes(pg, "li.PagedList-skipToLast > a") %>% 
  html_attr("href") %>% 
  stri_match_last_regex("page=([[:digit:]]+)") %>% 
  .[,2]
## [1] "13"
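
Since the question mentions using the page count in a scraping loop later, here is a rough sketch of that step. It assumes the site accepts a `page` query parameter (suggested by the `page=` in the link matched above) and that each page holds its data in the first `<table>`; `scrape_page()` is a hypothetical helper, not part of rvest.

```r
library(rvest)

base_url <- "https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=07&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false"

# The regex match returns a character value, so convert it before counting
n_pages <- as.integer("13")

# One URL per page; appending "&page=" is an assumption about the site
page_urls <- paste0(base_url, "&page=", seq_len(n_pages))

# Hypothetical helper: fetch one page and return its first table
scrape_page <- function(u) {
  Sys.sleep(1)                          # pause between requests to be polite
  pg <- read_html(u)
  html_table(html_node(pg, "table"))    # "table" selector is an assumption
}

# Uncomment to run against the live site:
# results <- lapply(page_urls, scrape_page)
```

The fetch is left commented out so the URL construction can be checked without hitting the server.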
