
I'm looking for some help. I am trying to grab an author's publications from PubMed and populate a Google Sheet with the data using Apps Script. I've gotten as far as the code below and am now stuck.

Basically, what I did first was pull all the PubMed IDs for a particular author, whose name comes from the name of the sheet. Then I tried creating a loop to go through each PubMed ID's JSON summary and pull each field I want. I have been able to pull the pub date. I set it up with the idea that I would loop over each field of that PMID that I want, store it in an array, and then return it to my sheet. However, I'm now stuck trying to get the second field (title) and all the subsequent fields (e.g. authors, last author, first author, etc.).

Any help would be greatly appreciated.

function IMPORTPMID(){
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];                   
  var author = sheet.getSheetName();

  var url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=" + author + "[author]&retmode=json&retmax=1000");
  var response = UrlFetchApp.fetch(url);
  var AllAuthorPMID = JSON.parse(response.getContentText());

  var xpath = "esearchresult/idlist";
  var patharray = xpath.split("/");
  for (var i = 0; i < patharray.length; i++) {
     AllAuthorPMID = AllAuthorPMID[patharray[i]];
  }

  var PMID = AllAuthorPMID;
  var PDparsearray = [PMID.length];
  var titleparsearray = [PMID.length];
  for (var x = 0; x < PMID.length; x++) {
    var urlsum = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=pubmed&retmode=json&rettype=abstract&id=" + PMID[x]);
    var ressum = UrlFetchApp.fetch(urlsum);
    var contentsum = ressum.getContentText();
    var jsonsum = JSON.parse(contentsum);

    var PDpath = "result/" + PMID[x] + "/pubdate";
    var titlepath = "result/" + PMID[x] + "/title";
    var PDpatharray = PDpath.split("/");
    var titlepatharray = titlepath.split("/");

    for (var j = 0; j < PDpatharray.length; j++) {
      var jsonsum = jsonsum[PDpatharray[j]];
    }  

    PDparsearray[x] = jsonsum;
  }

  var tempArr = [];
  for (var obj in AllAuthorPMID) {
    tempArr.push([obj, AllAuthorPMID[obj], PDparsearray[obj]]);
  }

  return tempArr;
}

1 Answer


From a PubMed JSON response for a given PubMed ID, you should be able to determine the fieldnames (and paths to them) that you want to include in your summary report. Reading them all is simpler to implement if they are all at the same level, but if some are properties of a sub-field, you can still access them if you give the right path in your setup.

Consider the "source JSON":

[
  { "pubMedId": "1234",
    "name": "Jay Sahn",
    "publications": [
      { "pubId": "abcd",
        "issn": "A1B2C3",
        "title": "Dynamic JSON Parsing: A Journey into Madness",
        "authors": [
          { "pubMedId": "1234" },
          { "pubMedId": "2345" }
        ]
      },
      { "pubId": "efgh",
        ...
      },
      ...
    ],
    ...
  },
  ...
]

The pubId and issn fields would be at the same level, while the publications and authors would not.

You can retrieve both the pubMedId and publications fields (and any others you desire) in the same loop by either 1) hard-coding the field access, or 2) writing code that walks a field path you supply.

Option 1 is likely to be faster, but much less flexible if you suddenly want to get a new field, since you have to remember how to write the code to access that field, along with where to insert it, etc. God save you if the API changes.

Option 2 is harder to get right, but once right, will (should) work for any field you (properly) specify. Getting a new field is as easy as writing the path to it in the relevant config variable. There are possibly libraries that will do this for you.
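To make the difference concrete, here is a minimal sketch of what option 1 might look like against the imaginary "source JSON" above (the field names come from that made-up example, not from the real PubMed API, and only the first publication is taken to keep it short):

function hardCoded_(data) {
  // Option 1: every field is accessed by hand; adding a field means editing this function.
  return data.map(function (author) {
    var firstPub = author.publications[0];
    return [
      author.pubMedId,
      author.name,
      firstPub.pubId,
      firstPub.title,
      firstPub.authors.map(function (a) { return a.pubMedId; }).join(', ')
    ];
  });
}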

To convert the above into spreadsheet rows (one per pubMedId in the outer array, e.g. the IDs you queried their API for), consider this example code:

function foo() {
  const sheet = /* get a sheet reference somehow */;
  const resp = UrlFetchApp.fetch(...).getContentText();
  const data = JSON.parse(resp);
  // paths relative to the outermost field, which for the imaginary source is an array of "author" objects
  const fields = ['pubMedId', 'name', 'publications/pubId', 'publications/title', 'publications/authors/pubMedId'];

  const output = data.map(function (author) {
    var row = fields.map(function (f) {
      var desiredField = f.split('/').reduce(delve_, author);
      return JSON.stringify(desiredField);
    });
    return row;
  });
  sheet.getRange(1, 1, output.length, output[0].length).setValues(output);
}

function delve_(parentObj, property, i, fullPath) {
  // Dive into the given object to get the path. If the parent is an array, access its elements.
  if (parentObj === undefined)
    return;

  // Simple case: parentObj is an Object, and property exists.
  const child = parentObj[property];
  if (child)
    return child;

  // Not a direct property / index, so perhaps a property on an object in an Array.
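  // Note: splice(i) both hands the remaining path segments to collate_ and shortens
  // fullPath, so the caller's reduce stops after this element.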
  if (parentObj.constructor === Array)
    return collate_(parentObj, fullPath.splice(i));

  console.warn({message: "Unhandled case / missing property", 
                args: {parent: parentObj, prop: property, index: i, pathArray: fullPath}});
  return; // property didn't exist, user error.
}
function collate_(arr, fields) {
  // Obtain the given property from all elements of the array.
  const results = arr.map(function (element) {
    return fields.slice().reduce(delve_, element);
  });
  return results;
}

Executing this logs the resulting rows to Stackdriver.

Obviously you probably want some different (aka real) fields, and probably have other ideas for how to report them, so I leave that portion up to the reader.
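As a starting point, here is a rough sketch of how the same delve_ helper might be pointed at the ESummary JSON from the question. The field paths used here (title, pubdate, lastauthor, authors/name) are my best guesses at that response's shape, so verify them against an actual response before relying on them:

function pubmedRows_(jsonsum) {
  // Assumes an ESummary response shaped like { result: { uids: [...], "<pmid>": { ... } } }.
  var fields = ['title', 'pubdate', 'lastauthor', 'authors/name'];
  return jsonsum.result.uids.map(function (pmid) {
    var record = jsonsum.result[pmid];
    var row = fields.map(function (f) {
      return JSON.stringify(f.split('/').reduce(delve_, record));
    });
    return [pmid].concat(row); // one spreadsheet row per PubMed ID
  });
}

The returned 2D array could then be written to the sheet with setValues, just as in foo above.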

Anyone with improvements to the above is welcome to submit a PR.

Recommended Reading: