I am converting the output of an API call to a bibliography database, which returns content in RIS format. I would like to end up with a data.table object with one row per database item and one column per field of the RIS output.

I will explain more about RIS later, but I am stuck on the following:

I would like to get a data.table using something like:

PubDB <- as.data.table(list(TY = "txtTY",TI = "txtTI"))

which returns:

PubDB

      TY    TI
1: txtTY txtTI

However, what I have is a string (actually a vector of strings returned from the API call; PubStr is one element):

PubStr

## [1] "TY = \"txtTY\",TI = \"txtTI\" "

How can I convert this string to the list needed inside the as.data.table command above?

More specifically, the first steps of my code are resp <- GET(url), then rawToChar(resp$content), then as.data.table() after some string manipulation. At that point I have a data.table with one row per publication and a single column called PubStr that holds a string like the one above. How can I convert this string into many columns, for each row of the data.table? Note: some rows have more or fewer fields than others.

A quick google doesn't show up any relevant open source libraries, but I'm finding it difficult to believe no one has written a parser for this format. It would be worth checking with whoever has given you this task. Otherwise you're going to have to define a suitable data table structure and parse the data into it, but it probably shouldn't be that hard to code. I'd suggest you code it against the RIS format rules and use your data for testing, rather than vice versa, and that once you have it working you publish it as open source so that others don't have to write their own parsers. - MandyShaw
A quick google actually did show a few relevant open source libraries. Here's one example of RIS parser code from github.com/cran/ris/blob/master/R/read.ris.R and another from rdrr.io/github/agoldst/mlaibr/src/R/read_ris.R, and the RefManageR package (github.com/ropensci/RefManageR) also has some code for RIS files. Modifying it to use API results should be fairly straightforward. - hrbrmstr

1 Answer

I am not familiar with the RIS format, but if the fields in each string are separated by commas, and within each field the column name is separated from its value by an equals sign, then here is a quick and dirty function that uses base R and data.table:

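library(data.table)

RIS_parser_fn <- function(x) {

  # Split each string on "," into fields, then each field on "=" into
  # a one-row matrix of names and a one-row matrix of values.
  # gsub("\\W", "", k) strips every non-word character, including spaces.
  string_parse_list <- lapply(lapply(x,
                                     function(i) tstrsplit(i, ",")),
                              function(j) lapply(tstrsplit(j, "="),
                                                 function(k) t(gsub("\\W", "", k))))

  # Stack names over values, promote the first row to column names,
  # drop that row, and bind all records, filling missing fields with NA.
  datatable_format <- rbindlist(lapply(lapply(string_parse_list,
                                              function(i) data.table(Reduce("rbind", i))),
                                       function(j) setnames(j, unlist(j[1, ]))[-1]),
                                fill = TRUE)

  return(datatable_format)
}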

The first statement builds a list with one element per input string. Each element is itself a list of exactly two one-row matrices, each with as many columns as the string has comma-separated fields: the first matrix holds the column headers (the text before each '=') and the second holds the corresponding values. The innermost gsub removes every non-word character (anything other than letters, digits, and underscore), which is what strips the quotes and spaces here. You will need to change that pattern if the values should keep spaces or punctuation; there were none in your example.
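If the values do need to keep spaces, commas, or punctuation (real titles usually do), one option is to match whole KEY = "value" pairs instead of splitting on ',' and then deleting characters with gsub("\\W", ...). The following is only a sketch under assumptions: parse_one is a hypothetical helper name, the sample string is made up, and values are assumed to be wrapped in double quotes with no escaped quotes inside them.

```r
library(data.table)

# Hypothetical helper (not part of the function above): turn one string of
# KEY = "value" pairs into a named list, leaving the value text untouched.
parse_one <- function(s) {
  # A tag name, "=", then a double-quoted value. The value may contain
  # spaces, commas, or punctuation, but is assumed to hold no escaped quotes.
  pattern <- '[A-Za-z0-9]+\\s*=\\s*"[^"]*"'
  pairs <- regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
  keys  <- sub('\\s*=.*$', '', pairs, perl = TRUE)         # text before "="
  vals  <- sub('^[^"]*"([^"]*)"$', '\\1', pairs, perl = TRUE) # text in quotes
  setNames(as.list(vals), keys)
}

# Made-up example whose value contains a comma, spaces, and a colon.
PubStr <- "TY = \"txtTY\",TI = \"A title, with a comma: and spaces\" "
parsed <- parse_one(PubStr)

# The named list is the shape that as.data.table() expects.
PubDB <- as.data.table(parsed)
PubDB
```

The result of parse_one has the same shape as the list(TY = ..., TI = ...) in the question, so it can be passed straight to as.data.table().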

The second statement converts these lists into a single data.table. Reduce rbinds each pair of matrices into a two-row matrix, which is then converted to a data.table, so there is now one data.table per input string. The "j" lapply function sets the column names from the first row and then drops that row. The final rbindlist call combines the data.tables, which may have different numbers of columns; fill = TRUE allows this, and cells for fields that a record lacks are set to NA.
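The role of the fill argument is easiest to see in isolation. This small sketch binds two one-row data.tables with different fields; rec1, rec2, and combined are illustrative names, and the values reuse the placeholder text from the question.

```r
library(data.table)

# Two records with different sets of fields (placeholder values).
rec1 <- data.table(TY = "txtTY1", TI = "txtTI1")
rec2 <- data.table(TY = "txtTY2", TI = "txtTI2", TF = "txtTF2")

# Without fill = TRUE this call stops with an error, because the two
# tables have different numbers of columns. With it, columns are matched
# by name and the missing TF value in the first record becomes NA.
combined <- rbindlist(list(rec1, rec2), fill = TRUE)
combined
```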

I added a second string element with one more field to test the code:

 PubStr<-c("TY = \"txtTY1\",TI = \"txtTI1\"","TY = \"txtTY2\",TI = \"txtTI2\" ,TF = \"txtTF2\"")

 RIS_parser_fn(PubStr)

Returns this:

       TY     TI     TF
1: txtTY1 txtTI1   <NA>
2: txtTY2 txtTI2 txtTF2
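For the situation described in the question, where PubStr is a column of an existing data.table, the parsed fields can be bound back next to the other columns. This is a sketch only: PubDT, its id column, split_fields, and PubWide are hypothetical names, and the helper assumes that neither ',' nor '=' occurs inside a value.

```r
library(data.table)

# Illustrative table: one row per publication, raw fields in PubStr.
# The id column stands in for whatever other columns already exist.
PubDT <- data.table(
  id     = 1:2,
  PubStr = c("TY = \"txtTY1\",TI = \"txtTI1\"",
             "TY = \"txtTY2\",TI = \"txtTI2\" ,TF = \"txtTF2\"")
)

# Hypothetical helper: split one string into a one-row data.table.
split_fields <- function(s) {
  # Fields are separated by ",", and names from values by "=".
  parts <- strsplit(strsplit(s, ",", fixed = TRUE)[[1]], "=", fixed = TRUE)
  keys  <- trimws(vapply(parts, `[`, character(1), 1))
  vals  <- vapply(parts, `[`, character(1), 2)
  # Strip only the surrounding whitespace and double quotes from values.
  vals  <- gsub('^[[:space:]"]+|[[:space:]"]+$', "", vals)
  as.data.table(setNames(as.list(vals), keys))
}

# One row per publication; fields missing from a record become NA.
fields  <- rbindlist(lapply(PubDT$PubStr, split_fields), fill = TRUE)

# Row order is preserved, so the parsed columns line up with the originals.
PubWide <- cbind(PubDT[, .(id)], fields)
PubWide
```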

Hopefully this will help you out and/or stimulate some ideas for more efficient code. Best of luck!