I am looking to scrape all the codes, including the codes under each level of the hierarchy shown in the left panel of this website, using the R package rvest.

URL: http://apps.who.int/classifications/icd10/browse/2016/en/

To begin with, I tried this code:

library(rvest)

url <- "http://apps.who.int/classifications/icd10/browse/2016/en/"
src <- read_html(url)
ATC <- src %>% html_node("a.ygtvlabel") %>% html_text()

a.ygtvlabel is the class I see when hovering over the text in the web page.

However, it just returns NA_character_. I see that the HTML source for the page does not directly contain these codes (e.g. parasitic diseases); instead, it probably has an href pointing to the contents.

How can I go about scraping such a page? Kindly advise.

Because using the actual API would be bad? cran.r-project.org/web/packages/WHO/index.html - hrbrmstr
Thank you @hrbrmstr. The API actually gave me a new direction of thought. Following the hint, I used the R package icd and got the main chapters and subchapters from the package-defined variables, as I'm looking specifically for ICD-10 codes. I could not get the lowest-level codes (I mean A00.0 Cholera due to Vibrio cholerae 01, biovar cholerae). But I'm wondering whether I confused the API with the package; I will explore more. - Meenakshi Vikram
'icd' is currently limited to the US ICD-9-CM and ICD-10-CM, which are mostly supersets of the corresponding WHO schemes. There are, however, some areas with more detail in WHO, notably HIV, which is more limited in the US versions. WHO astonishingly asserts copyright over its versions of ICD-9 and ICD-10, so I am currently unable to distribute them as part of the 'icd' or 'icd.data' packages. - Jack Wasey
Unfortunately the WHO package deals with WHO data files, but not classifications. It is possible to obtain machine readable definitions of WHO ICD codes from the WHO, if you electronically sign an agreement and do not redistribute. - Jack Wasey

1 Answer

As with many of these kinds of pages, this page makes a background request for a JSON file that contains the data, which is why the codes are missing from the static HTML that read_html() sees. This can be discovered by opening the browser's developer tools and looking at the network requests. Using an API, as noted in the comments, is a better choice.

library(httr)
library(jsonlite)

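## original url <- "http://apps.who.int/classifications/icd10/browse/2016/en/"

# Endpoint the page requests in the background to fill the left-hand tree
json_url <- "http://apps.who.int/classifications/icd10/browse/2016/en/JsonGetRootConcepts?useHtml=false"
resp <- GET(json_url)
stop_for_status(resp)  # fail early on an HTTP error
json_data <- content(resp, as = "text", encoding = "UTF-8")

# fromJSON() turns the JSON array of objects into a data frame
categories <- fromJSON(json_data)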
categories$label
# [1] "I Certain infectious and parasitic diseases"                                                            
# [2] "II Neoplasms"                                                                                           
# [3] "III Diseases of the blood and blood-forming organs and certain disorders involving the immune mechanism"
# [4] "IV Endocrine, nutritional and metabolic diseases"                                                       
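
The root request only returns the chapters. To get the codes under each level, the same idea could be repeated per node. The following is a rough sketch, not something confirmed against the site: the `JsonGetChildrenConcepts` endpoint, its `ConceptId` parameter, and the `ID` and `isLeaf` fields are assumptions that should be checked in the network tab first. `fetch_children` and `walk_tree` are hypothetical helper names.

```r
library(httr)
library(jsonlite)

base_url <- "http://apps.who.int/classifications/icd10/browse/2016/en/"

# ASSUMPTION: endpoint name and `ConceptId` parameter; verify in the network tab.
fetch_children <- function(concept_id) {
  resp <- GET(paste0(base_url, "JsonGetChildrenConcepts"),
              query = list(ConceptId = concept_id, useHtml = "false"))
  stop_for_status(resp)
  fromJSON(content(resp, as = "text", encoding = "UTF-8"))
}

# Depth-first walk over the tree, one row per node.
# ASSUMPTION: each node has `ID`, `label`, and a logical `isLeaf` field.
# `fetch` is an argument so the traversal can be tried offline with a stub.
walk_tree <- function(nodes, fetch = fetch_children, depth = 1) {
  out <- list()
  for (i in seq_len(nrow(nodes))) {
    out[[length(out) + 1]] <- data.frame(depth = depth,
                                         id = nodes$ID[i],
                                         label = nodes$label[i],
                                         stringsAsFactors = FALSE)
    if (!isTRUE(nodes$isLeaf[i])) {
      kids <- fetch(nodes$ID[i])
      # An empty JSON array parses to an empty list, not a data frame
      if (is.data.frame(kids) && nrow(kids) > 0) {
        out[[length(out) + 1]] <- walk_tree(kids, fetch, depth + 1)
      }
    }
  }
  do.call(rbind, out)
}
```

If the assumptions hold, `all_codes <- walk_tree(categories)` would collect every level into one data frame. This makes one request per non-leaf node, so adding a short `Sys.sleep()` inside `fetch_children` would be polite to the server.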