2
votes

I'm trying to 'scrape' some data from a website (names). I know how to get the first name on the list -- but there are a few thousand names that I need to save in the same way.

Here's my code:


library(rvest)
library(tidyverse)

url <- ("https://www.advokatsamfundet.se/Advokatsamfundet-engelska/Find-a-lawyer/Search-result/?firstname=&lastname=&companyname=&postalcity=&country=4401&spokenlanguage=&sortingcity=&positions=102001")


names <- url %>%
  read_html() %>%
  html_elements(xpath = '/html/body/div[3]/div/div/main/div[2]/div[2]/div[1]/a') %>%
  html_text()

This gives me the first name on the list as it is in the table.

The names follow this simple structure:

'/html/body/div[3]/div/div/main/div[2]/div[2]/div[1]/a'
'/html/body/div[3]/div/div/main/div[2]/div[3]/div[1]/a'
'/html/body/div[3]/div/div/main/div[2]/div[4]/div[1]/a'

Notice that the second-to-last div index increases by 1 for each name. The last one is 6212.

I started working on a function, but I'm not getting anywhere. Here it is anyway, but it doesn't work and I think it may be a dead end.

scrape_fun <- function(.x){
  names %>% 
  html_elements(xpath = '/html/body/div[3]/div/div/main/div[2]/div[.x]/div[1]/a') %>% 
  html_text()
}

Any advice on how to get it to work for all 6212 names?

2
You could use a for loop. But you may find a useful function in the NHSdatadictionaRy package, although it is mainly for <table> rather than <a> tags. - CALUM Polwart

2 Answers

2
votes

You can use the following css pattern to select them

library(magrittr)
library(rvest)

people <- read_html("https://www.advokatsamfundet.se/Advokatsamfundet-engelska/Find-a-lawyer/Search-result/?firstname=&lastname=&companyname=&postalcity=&country=4401&spokenlanguage=&sortingcity=&positions=102001") %>%
  html_nodes(".c-list .o-flex__item:nth-child(1) > [href]") %>%
  html_text()

This selects elements that have an href attribute and are direct children of first-child elements (the leftmost column) with class o-flex__item, which in turn sit inside an ancestor with class c-list. The > is a child combinator: whatever is on its right must be a direct child of whatever is on its left. It is more specific than a descendant combinator (a space), which matches at any nesting depth, and it is generally cheaper to evaluate. Class selectors are the second-fastest kind of CSS selector after id selectors.
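To see the difference between the child and descendant combinators without hitting the site, here is a small sketch run against made-up markup. The HTML below, including the names in it, is an assumption that only imitates the class names used in the selector; it is not copied from the real page.

```r
library(rvest)

# Hypothetical markup: two rows, each with a name column and a firm column.
# The second row's first column also holds a link nested inside <em>.
page <- minimal_html('
  <div class="c-list">
    <div class="o-flex">
      <div class="o-flex__item"><a href="/p?personid=1">Anna Berg</a></div>
      <div class="o-flex__item"><a href="/firm?id=9">Berg Law</a></div>
    </div>
    <div class="o-flex">
      <div class="o-flex__item"><a href="/p?personid=2">Carl Dahl</a><em><a href="/x">details</a></em></div>
      <div class="o-flex__item"><a href="/firm?id=8">Dahl Legal</a></div>
    </div>
  </div>
')

# Child combinator: only links that are direct children of the first column.
people <- page %>%
  html_elements(".c-list .o-flex__item:nth-child(1) > [href]") %>%
  html_text()

# Descendant combinator: also picks up the link nested inside <em>.
loose <- page %>%
  html_elements(".c-list .o-flex__item:nth-child(1) [href]") %>%
  html_text()

print(people)
print(length(loose))
```

With this markup the child-combinator version should return only the two names, while the descendant version returns three links.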

Another pattern might have been html_nodes("[href*=personid]"), which selects every element whose href attribute contains the string personid.
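A minimal sketch of the attribute-substring selector, again on invented markup (the href values and names are assumptions for illustration, not the site's real links):

```r
library(rvest)

# Hypothetical links: two point at a person, one at a company.
page <- minimal_html('
  <a href="/Search-result/?personid=101">Anna Berg</a>
  <a href="/firm/?companyid=7">Berg Law</a>
  <a href="/Search-result/?personid=102">Carl Dahl</a>
')

# [href*=personid] keeps elements whose href contains "personid" anywhere.
person_links <- page %>%
  html_elements("[href*=personid]") %>%
  html_text()

print(person_links)
```

This approach does not depend on the page layout at all, only on how the links are built, so it may survive a redesign better than a positional selector.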

0
votes

Try this?

library(rvest)
library(tidyverse)

url <- ("https://www.advokatsamfundet.se/Advokatsamfundet-engelska/Find-a-lawyer/Search-result/?firstname=&lastname=&companyname=&postalcity=&country=4401&spokenlanguage=&sortingcity=&positions=102001")

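page <- read_html(url)  # download the page once, not once per name

names <- rep(NA_character_, 6212)
for (i in 2:6212) {  # per the question, the first name sits at div[2]
  hit <- page %>%
    # no extra quotes inside the string: the XPath itself must start with "/"
    html_elements(xpath = paste0("/html/body/div[3]/div/div/main/div[2]/div[", i, "]/div[1]/a")) %>%
    html_text()
  if (length(hit) == 1) names[i] <- hit  # leave NA where nothing matches
}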
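The loop may not be needed at all: if the index on the repeating div is simply left out of the XPath, html_elements() returns every match in one call. A sketch on made-up markup, where the id "list", the row layout, and the names are all assumptions standing in for the real page:

```r
library(rvest)

# Hypothetical table-like layout: a header row, then one row per person.
page <- minimal_html('
  <div id="list">
    <div><div>Name</div><div>Firm</div></div>
    <div><div><a href="#">Anna Berg</a></div><div>Berg Law</div></div>
    <div><div><a href="#">Carl Dahl</a></div><div>Dahl Legal</div></div>
  </div>
')

# "div" without [i] matches every row; div[1]/a picks the link in column one.
all_names <- page %>%
  html_elements(xpath = "//div[@id='list']/div/div[1]/a") %>%
  html_text()

print(all_names)
```

Applied to the question's path, the same idea would mean writing .../main/div[2]/div/div[1]/a instead of .../div[2]/div[i]/div[1]/a, assuming every row follows the structure described in the question.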