I am trying to collect every individual lawyer's URL from this website: https://www.linklaters.com/en/find-a-lawyer. I can't find a way to extract the URLs; when I use a CSS selector, it doesn't work. Could you suggest another way to find a specific element in a web page? Also, to collect all the data I need to click the "Load More" button, and I am using RSelenium for that. I think I am doing something wrong when running RSelenium through Docker, because I get this error: Error in checkError(res) : Undefined error in httr call. httr output: Failed to connect to localhost port 4445: Connection refused

library(dplyr)
library(rvest)
library(stringr)
library(RSelenium)

link = "https://www.linklaters.com/en/find-a-lawyer"
hlink = read_html(link)
urls <- hlink %>%
        html_nodes(".listCta__subtitle--top") %>%
        html_attr("href")
urls <- as.data.frame(urls, stringsAsFactors = FALSE)
names(urls) <- "urls"

remDr <- RSelenium::remoteDriver(remoteServerAddr = "localhost",
                                 port = 4445L,
                                 browserName = "chrome")
remDr$open()

replicate(20,
          {       # scroll down
                  webElem <- remDr$findElement("css", "body")
                  webElem$sendKeysToElement(list(key = "end"))
                  # find button
                  allURL <- remDr$findElement(using = "css selector", ".listCta__subtitle--top")
                  # click button
                  allURL$clickElement()
                  Sys.sleep(6)
          })

allURL <- xml2::read_html(remDr$getPageSource()[[1]]) %>%
        rvest::html_nodes(".field--type-ds a") %>%
        html_attr("href")

1 Answer

It's just loading dynamic data over XHR requests. Just grab the lovely JSON:

jsonlite::fromJSON("https://www.linklaters.com/en/api/lawyers/getlawyers")
jsonlite::fromJSON("https://www.linklaters.com/en/api/lawyers/getlawyers?searchTerm=&sort=asc&showing=30")
jsonlite::fromJSON("https://www.linklaters.com/en/api/lawyers/getlawyers?searchTerm=&sort=asc&showing=60")
jsonlite::fromJSON("https://www.linklaters.com/en/api/lawyers/getlawyers?searchTerm=&sort=asc&showing=90")

Keep incrementing by 30 until an errant result comes back, preferably with a 5s sleep delay between requests so as not to come off as a jerk.
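
The incrementing loop could be sketched roughly like this. The helper names `lawyers_url` and `collect_pages`, the `max_requests` cap, and the stopping rule (stop on an error or an empty result) are assumptions made for illustration; the shape of the JSON the endpoint returns is not shown here, so you may need a different test for an "errant" result.

```r
# Hypothetical helper: build the request URL for a given `showing` value.
# With no argument it returns the bare endpoint, as in the first call above.
lawyers_url <- function(showing = NULL) {
  base <- "https://www.linklaters.com/en/api/lawyers/getlawyers"
  if (is.null(showing)) return(base)
  sprintf("%s?searchTerm=&sort=asc&showing=%d", base, as.integer(showing))
}

# Hypothetical helper: request the bare endpoint, then showing = 30, 60, 90, ...
# Assumption: a failed request or an empty result means there is nothing more.
collect_pages <- function(fetch = jsonlite::fromJSON, step = 30L,
                          delay = 5, max_requests = 100L) {
  pages <- list()
  for (i in seq_len(max_requests)) {
    url <- if (i == 1L) lawyers_url() else lawyers_url((i - 1L) * step)
    # Turn any request or parse error into NULL so the loop can stop cleanly
    res <- tryCatch(fetch(url), error = function(e) NULL)
    if (is.null(res) || length(res) == 0L) break
    pages[[length(pages) + 1L]] <- res
    Sys.sleep(delay)  # pause between requests to go easy on the server
  }
  pages
}

# Example call (performs real HTTP requests, so it is left commented out):
# pages <- collect_pages()
```

The `fetch` argument is only there so the loop can be tried with a stand-in function before pointing it at the live site; `max_requests` is a safety cap in case the endpoint never signals the end.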