How to search PubMed or other databases using R

Question

I have recently been using the excellent rplos package, which makes it very easy to search through papers hosted on the Public Library of Science (PLOS) API. I've hit a snag, in that the API itself seems to have some missing information - a major one being that there are at least 2012 papers on the API for which there is no information in the "journal" field. I have the DOIs of each paper, so it is simple to Google the DOI and show that these are real papers published in real journals, usually PLoS ONE. Obviously it would be silly to do that 2000 times.

I was wondering if anyone knows how to find the source journal, if I have the list of DOIs? I looked into the RISmed package, which can apparently search PubMed from within R, but I could not work out how to make it give useful information (just the number of search hits, and some PubMed IDs that probably lead to the info I want).

Anyone know how to turn the list of DOIs into source journal names?

EDIT: I just thought of another easy solution. DOIs contain an abbreviation of the journal name, and for a case like this where there are only a handful of journals, one can just use regular expressions to read the DOIs and pick which journal they are from. Example: 10.1371/journal.pone.0046711 is from PLoS ONE.
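For what it's worth, here is a minimal sketch of that regular-expression idea. The `plos_journals` lookup table and the `doi_to_journal` helper are names made up for illustration (not part of rplos), and the table only covers journal codes I am aware of, so treat it as an assumption and extend it to match your data.

```r
# Lookup table from the journal code embedded in a PLOS DOI to a journal name.
# Assumption: these codes cover the journals in the data; add more as needed.
plos_journals <- c(
  pone = "PLoS ONE",
  pbio = "PLoS Biology",
  pmed = "PLoS Medicine",
  pgen = "PLoS Genetics",
  pcbi = "PLoS Computational Biology",
  ppat = "PLoS Pathogens",
  pntd = "PLoS Neglected Tropical Diseases"
)

# Hypothetical helper: extract the code between "journal." and the next "."
# and translate it via the table. Anything that does not match gives NA.
doi_to_journal <- function(dois) {
  pattern <- "^10\\.1371/journal\\.([a-z]+)\\..*$"
  code <- ifelse(grepl(pattern, dois), sub(pattern, "\\1", dois), NA_character_)
  unname(plos_journals[code])
}

# One DOI from the question plus a non-PLOS placeholder DOI
doi_to_journal(c("10.1371/journal.pone.0046711", "10.1000/xyz123"))
```

DOIs that do not follow the `10.1371/journal.<code>.<number>` pattern, or whose code is missing from the table, come back as `NA`, so they are easy to spot and look up another way.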

There's rpubmed but you might also be interested in rmetadata. — Thomas
Thanks a lot! I have written an answer using rpubmed. Likely not the easiest way, but seems to work. — lukeholman

lukeholman · Accepted Answer · 2014-03-09T05:34:32

Here's an answer based on Thomas' suggestion to try rpubmed. It starts with a list of the problematic DOIs, finds the matching PubMed IDs using the EUtilsSummary function in RISmed, and then gets the journal data associated with them using code modified from the rpubmed GitHub repository, reproduced below. Sorry for editing the rpubmed code, but the objects on line 44 do not seem to be defined or essential, so I took them out.

# parallel provides mclapply (it superseded the old multicore package)
library(RCurl); library(XML); library(RISmed); library(parallel)

# dummy list of 5 DOIs. I actually have 2012, hence all the multicoring below
dois <- c("10.1371/journal.pone.0046711", "10.1371/journal.pone.0046681", "10.1371/journal.pone.0046643", "10.1371/journal.pone.0041465", "10.1371/journal.pone.0044562")

# Get the PubMed IDs (one EUtilsSummary query per DOI)
res <- mclapply(dois, EUtilsSummary)
ids <- sapply(res, QueryId)


######## rpubmed functions from https://github.com/rOpenHealth/rpubmed/blob/master/R/rpubmed_fetch.R
fetch_in_chunks <- function(ids, chunk_size = 500, delay = 0, ...){
  Sys.sleep(delay * 3600) # Wait for appropriate time for the server.
  chunks <- chunker(ids, chunk_size)
  Reduce(append, lapply(chunks, function(x) pubmed_fetch(x, ...)))
}

pubmed_fetch <- function(ids, file_format = "xml", as_r_object = TRUE, ...){

  args <- c(id = paste(ids, collapse = ","), db = "pubmed", rettype = file_format, ...)

  url_args <- paste(paste(names(args), args, sep="="), collapse = "&")
  base_url <- "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?retmode=full"
  url_string <- paste(base_url, url_args, sep = "&")
  records <- getURL(url_string)
  #NCBI limits requests to three per second
  Sys.sleep(0.33)
  if(as_r_object){
    return(xmlToList(xmlTreeParse(records, useInternalNodes = TRUE)))
  } else return(records)
}

chunker <- function(v, chunk_size){
  split(v, ceiling(seq_along(v)/chunk_size))
}
###### End of rpubmed functions

d <- fetch_in_chunks(ids)
j <- character(length(d))
# the tortuous path to the journal name; loop over however many records came back
for (i in seq_along(d)) j[i] <- as.character(d[[i]][[1]][[5]][[1]][[3]])
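The positional indexing in that last step depends on every record having its fields in the same order, which may not always hold. A possibly sturdier alternative is to keep the XML and pull the journal title out by path. The sketch below uses a toy, heavily trimmed document that I wrote by hand to imitate the shape of an efetch response (it is not real PubMed output), and `journal_titles` is a hypothetical helper name. The assumption is that the title lives at MedlineCitation/Article/Journal/Title inside each PubmedArticle.

```r
library(XML)

# Toy stand-in for an efetch response (hand-written, not real PubMed output).
# The second record deliberately has no journal title.
xml_text <- '<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <Article><Journal><Title>PLoS ONE</Title></Journal></Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <Article><ArticleTitle>No journal here</ArticleTitle></Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>'

# Hypothetical helper: one journal title per article, NA where it is missing,
# so the result stays aligned with the list of records.
journal_titles <- function(xml_string) {
  doc <- xmlParse(xml_string, asText = TRUE)
  articles <- getNodeSet(doc, "//PubmedArticle")
  vapply(articles, function(a) {
    title <- xpathSApply(a, "./MedlineCitation/Article/Journal/Title", xmlValue)
    if (length(title) > 0) title[[1]] else NA_character_
  }, character(1))
}

journal_titles(xml_text)
```

With the functions above, calling `pubmed_fetch(ids, as_r_object = FALSE)` should return the raw XML as a string that could be passed to such a helper, though I have not checked this against the live service.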
