0 votes

I am working on web scraping with rvest in R. I am trying to fetch data from a search that returns 12 pages of results, and I wrote code to iterate over the pages and collect data from each one. But my code collects only the 1st page repeatedly. Here is a sample of my code.

# New method for Pagination
url_base <- "https://www.nhs.uk/service-search/Hospital/LocationSearch/7/ConsultantResults?SortBy=1&Distance=400&ResultsPerPage=10&Name=e.g.%20Singh%20or%20John%20Smith&Specialty=230&Location.Id=0&Location.Name=e.g.%20postcode%20or%20town&Location.Longitude=0&Location.Latitude=0&CurrentPage=1&OnlyViewConsultantsWithOutcomeData=False"
map_df(1:12, function(i) {
  cat(".")
  pg <- read_html(sprintf(url_base,i))
  data.frame(consultant_name = html_text(html_nodes(pg,".consultants-list h2 a")))
  
}) -> names

dplyr::glimpse(names)

Edited version of the code:

# New method for Pagination
url_base  <-  "https://www.nhs.uk/service-search/Hospital/LocationSearch/7/ConsultantResults?ResultsPerPage=100&defaultConsultantName=e.g.+Singh+or+John+Smith&DefaultLocationText=e.g.+postcode+or+town&DefaultSearchDistance=25&Name=e.g.+Singh+or+John+Smith&Specialty=230&Location.Name=e.g.+postcode+or+town&Location.Id=0&CurrentPage=%d"
map_df(1:12, function(i) {
  cat(".")
  pg <- read_html(sprintf(url_base,i))
  data.frame(consultant_name = html_text(html_nodes(pg,".consultants-list h2 a")),
             gmc_no = gsub("GMC membership number: ","",html_text(html_nodes(pg,".consultants-list .name-number p"))),
             Speciality = html_text(html_nodes(pg,".consultants-list .specialties ul li")),
             location = html_text(html_nodes(pg,".consultants-list .consultant-services ul li")),stringsAsFactors=FALSE)
  
}) -> names

dplyr::glimpse(names)

The above code runs for 8 iterations, fetching 800 rows (i.e. 100 per page), but then it throws an error.

.........Error in data.frame(consultant_name = html_text(html_nodes(pg, ".consultants-list h2 a")), :
  arguments imply differing number of rows: 100, 101
Called from: data.frame(consultant_name = html_text(html_nodes(pg, ".consultants-list h2 a")),
    gmc_no = gsub("GMC membership number: ", "", html_text(html_nodes(pg, ".consultants-list .name-number p"))),
    Speciality = html_text(html_nodes(pg, ".consultants-list .specialties ul li")),
    location = html_text(html_nodes(pg, ".consultants-list .consultant-services ul li")),
    stringsAsFactors = FALSE)
Browse[1]>

I tried changing the loop bounds, but no luck.

Please help me solve this!
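The "differing number of rows: 100, 101" error happens because the four columns are scraped as independent flat vectors: if one consultant on a page has two specialty entries, that column gets 101 values for 100 names and the columns no longer line up. A more defensive pattern (a sketch only; the per-consultant container selector ".consultants-list > li" is an assumption, so adjust it to the page's actual markup) is to loop over each consultant's node and build one row per node:

```r
library(rvest)
library(purrr)

# Sketch: extract fields per consultant block so columns cannot misalign.
# ".consultants-list > li" is a guessed container selector -- inspect the page.
scrape_page <- function(url) {
  pg <- read_html(url)
  cards <- html_nodes(pg, ".consultants-list > li")
  map_df(cards, function(card) {
    data.frame(
      consultant_name = html_text(html_node(card, "h2 a")),
      gmc_no = gsub("GMC membership number: ", "",
                    html_text(html_node(card, ".name-number p"))),
      # collapse multi-valued fields into one cell instead of extra rows
      speciality = paste(html_text(html_nodes(card, ".specialties ul li")),
                         collapse = "; "),
      location = paste(html_text(html_nodes(card, ".consultant-services ul li")),
                       collapse = "; "),
      stringsAsFactors = FALSE
    )
  })
}
```

Because every row is assembled from a single card, a consultant with extra specialties widens one cell rather than shifting a whole column.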

You don't specify in the URL which page you're requesting, so how will the website know that you need the next one? Look at the pagination URLs and you'll see what's changing. Follow that, or grab the URL under the 'Next' anchor until you hit the end (no 'Next' anchor). – LukasS
@LukasS I have mentioned it in my URL; it starts from page 1. URL link: "nhs.uk/service-search/Hospital/LocationSearch/7/…" – Sharmi
I can see that clearly, but you have the wrong URL there. Look at the pagination links. – LukasS
@LukasS I have updated the URL link in my question, but the pages are still not being read. This is the URL I am getting; I don't understand which link you are referring to. – Sharmi
You need to increment the appropriate values after each successful response, or, more simply, extract the URL from the 'Next' anchor; that satisfies two conditions at once (calculating the next value and checking whether there are more pages to crawl). – LukasS
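The follow-the-'Next'-anchor approach from the comments can be sketched like this (a sketch only: "a.next" is a placeholder selector and the starting URL is elided, so inspect the page's pagination markup before using it):

```r
library(rvest)
library(xml2)

# Sketch: crawl by following the 'Next' pagination link until it disappears.
url <- "https://www.nhs.uk/service-search/..."  # placeholder starting search URL
pages <- list()
repeat {
  pg <- read_html(url)
  pages[[length(pages) + 1]] <- pg
  # "a.next" is a guessed selector for the 'Next' anchor
  next_link <- html_attr(html_node(pg, "a.next"), "href")
  if (is.na(next_link)) break           # no 'Next' anchor: last page reached
  url <- xml2::url_absolute(next_link, url)  # resolve a relative href
}
```

This avoids hard-coding the page count: the loop stops exactly when the site stops offering a 'Next' link, which covers both of LukasS's conditions.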

1 Answer

1 vote

Here's what I came up with after looking at the pattern of the URL.

library(tidyverse)
library(rvest)

base_url <- "https://www.nhs.uk/service-search/Hospital/LocationSearch/7/ConsultantResults?Specialty="

# change the code to pull other specialities
specialty_code <- 230 # i.e. Anaesthesia services = 230

# show 100 per page    
tgt_url <- str_c(base_url,specialty_code,"&ResultsPerPage=100&CurrentPage=")

pg <- read_html(tgt_url)

# count the total results and set the page count
res_cnt <- pg %>% html_nodes('.fcresultsinfo li:nth-child(1)') %>% html_text() %>% str_remove('.* of ') %>% as.numeric()
pg_cnt <- ceiling(res_cnt / 100)

res_all <- NULL
for (i in 1:pg_cnt) {

  pg <- read_html(str_c(tgt_url, i))
  res_pg <- tibble(
    consultant_name = pg %>% html_nodes(".consultants-list h2 a") %>% html_text(),
    gmc_no = pg %>% html_nodes(".consultants-list .name-number p") %>% html_text() %>%
      str_remove("GMC membership number: "),
    speciality = pg %>% html_nodes(".consultants-list .specialties ul") %>%
      html_text() %>% str_replace_all(', \r\n\\s+', ', ') %>% str_trim(),
    location = pg %>% html_nodes(".consultants-list .consultant-services ul") %>%
      html_text() %>% str_replace_all(', \r\n\\s+', ', ') %>% str_trim(),
    src_link = pg %>% html_nodes(".consultants-list h2 a") %>% html_attr('href')
  )

  res_all <- res_all %>% bind_rows(res_pg)

}
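When looping over a dozen pages it is also worth being polite to the server and resilient to transient failures. A small wrapper (a sketch only, not part of the answer's code; the function name and defaults are my own) could replace the bare read_html() call inside the loop:

```r
library(rvest)

# Sketch: read a page with a short delay and a few retries on failure.
read_html_politely <- function(url, tries = 3, pause = 1) {
  for (attempt in seq_len(tries)) {
    Sys.sleep(pause)  # don't hammer the server between requests
    result <- tryCatch(read_html(url), error = function(e) NULL)
    if (!is.null(result)) return(result)
  }
  stop("Failed to read ", url, " after ", tries, " attempts")
}
```

Usage inside the loop would be pg <- read_html_politely(str_c(tgt_url, i)), so one flaky response no longer aborts the whole crawl.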

This is what I get:

> nrow(res_all)
## [1] 1141
> res_all %>% select(1:4) %>% tail()
## # A tibble: 6 x 4
##  consultant_name      gmc_no  speciality           location                                        
##  <chr>                <chr>   <chr>                <chr>                                           
## 1 Mark Yeates          4716345 Anaesthesia services The Great Western Hospital                      
## 2 Steven Yentis        2939700 Anaesthesia services Chelsea and Westminster Hospital                
## 3 Louise Young         6139457 Anaesthesia services Southampton General Hospital                    
## 4 Andreas Zafiropoulos 6075484 Anaesthesia services Shrewsbury and Telford Hospital NHS Trust       
## 5 Suhail Zaidi         4239598 Anaesthesia services Luton and Dunstable Hospital                    
## 6 Cezary Zugaj         4751331 Anaesthesia services Oxford University Hospitals NHS Foundation Trust