0
votes

I would like to determine the number of pages from pagination on the page: https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=07&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false

============
Table
============
     Pagination: Link1, Link2, Link3, Link4, LinkNext, LinkLast

Using SelectorGadget, I identified the selector for the pagination as ".pagination-container, a"

I would like to:

  1. dump all the links in the pagination into a vector or data.frame
  2. extract the last number from each URL string
  3. take the maximum of those numbers, i.e. the number of pages in the pagination, to use later in a scraping loop

Following http://francojc.github.io/web-scraping-with-rvest/

I started with

library(tidyverse)
library(rvest)

url <- "https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=07&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false"

urls <- url %>% # feed `main.page` to the next step
  html_nodes(".pagination-container, a") %>% # get the CSS nodes
  html_text("href")  

The html_nodes() call throws an error:

Error in UseMethod("xml_find_all") : 
  no applicable method for 'xml_find_all' applied to an object of class "character"

What am I doing wrong?

1
Where's the read_html()? Also, you likely want html_attr("href") rather than html_text("href"). - hrbrmstr
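
The two fixes from the comment can be tried offline. The sketch below runs against a small HTML fragment that is made up for illustration (the real page's markup may differ) and assumes the pagination links are descendants of `.pagination-container`. Note that the comma in ".pagination-container, a" makes it a selector list, matching the container itself plus every `a` on the page, so the sketch uses the descendant form without the comma.

```r
library(rvest)

# Hypothetical pagination markup, invented for this example only.
html <- '<div class="pagination-container"><ul class="pagination">
  <li><a href="/umowy/Provider/Index?ROK=2017&amp;page=2">2</a></li>
  <li><a href="/umowy/Provider/Index?ROK=2017&amp;page=3">3</a></li>
  <li class="PagedList-skipToLast"><a href="/umowy/Provider/Index?ROK=2017&amp;page=13">Last</a></li>
</ul></div>'

# Parse first: html_nodes() works on a parsed document, not on a character string.
pg <- read_html(html)

hrefs <- pg %>%
  html_nodes(".pagination-container a") %>%  # descendant selector, no comma
  html_attr("href")                          # the href attribute, not the link text

print(hrefs)
```

With the real site, `read_html(url)` would take the place of `read_html(html)`.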

1 Answer

4
votes

Beyond the "typo" (i.e. the missing call to read_html()), there's an easier way to get the total number of pages. Just target the [>>] link in the paginator:

library(rvest)
library(stringi)
library(tidyverse)

url <- "https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=07&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false"

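# Parse the page first; html_nodes() needs a parsed document, not a URL string
pg <- read_html(url)

# The [>>] link's href carries the last page number; capture the digits after "page="
html_nodes(pg, "li.PagedList-skipToLast > a") %>% 
  html_attr("href") %>% 
  stri_match_last_regex("page=([[:digit:]]+)") %>% 
  .[,2]  # column 2 holds the first capture group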
## [1] "13"
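
If you would rather follow the three steps from the question (collect all links, pull out the numbers, take the maximum), a minimal base-R sketch could look like the one below. The `hrefs` vector is made up for illustration, and the idea that the site accepts the page number as an extra `page` query parameter is an assumption based on the `page=` pattern above, not something verified here.

```r
# Hypothetical hrefs, shaped like what html_attr("href") might return.
hrefs <- c(
  "/umowy/Provider/Index?ROK=2017&OW=07&page=2",
  "/umowy/Provider/Index?ROK=2017&OW=07&page=3",
  "/umowy/Provider/Index?ROK=2017&OW=07&page=13"
)

# Keep only links that carry a page parameter, then extract the number.
has_page <- grepl("page=[0-9]+", hrefs)
page_no  <- as.integer(sub(".*page=([0-9]+).*", "\\1", hrefs[has_page]))

# The largest page number is the page count.
n_pages <- max(page_no)

# Assumption: appending "&page=N" to the original URL requests page N.
base_url  <- "https://aplikacje.nfz.gov.pl/umowy/Provider/Index?ROK=2017&OW=07&ServiceType=03&Code=&Name=&City=&Nip=&Regon=&Product=&OrthopedicSupply=false"
page_urls <- paste0(base_url, "&page=", seq_len(n_pages))

print(n_pages)
```

Converting with as.integer() matters here: the regex match is a character string, and max() on strings compares them alphabetically rather than numerically. The resulting `page_urls` vector can then be looped over with read_html().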