1
votes

In a google spreadsheet I would like to filter the importxml() function, because the xml source document isn't so well formed (some sub-items are variably present) so that my columns alignment is not good (different types of data are mixed). see picture The query syntax I'm using has no effet. I've tried to select or to exclude, and both. Maybe you could help me. Thank you very much.

=importxml(A1;"*/*[name()='NameId'] | */*[name()='ScientificName'] | */*[name()!='DisplayDate'] | */*[name()!='Family'] | */*[name()!='RankAbbreviation'] | */*[name()!='NomenclatureStatusID'] | */*[name()!='NomenclatureStatusName'] | */*[name()!='TotalRows']")

Here is the sheet (french parameters) : https://docs.google.com/spreadsheets/d/1wY_rt9ZRIMesXDFX_DoN-FTwcBkhNaugGgWd0X2bE4g/edit?usp=sharing

3

3 Answers

1
votes

Issue and workaround:

When I saw the XML data of the URL, it seems that there is the case that the names in each child are different. I think that the reason of your issue is due to this. Unfortunately, it seems that at IMPORTXML and XPath, when the name is not existing, it cannot be directly replaced with the empty value. So, in this case, as a workaround for achieving your goal, I would like to propose to use a custom function created by XmlService of Google Apps Script instead of IMPORTXML.

The flow of the sample script is as follows.

  1. Retrieve XML data from URL.
  2. Parse XML data.
  3. Retrieve header row.
  4. Create an result array, and put it.

When above flow is reflected to a sample script, it becomes as follows.

Sample script:

Please copy and paste the following script to the script editor of Spreadsheet. And, please put a custom formula of =SAMPLE("http://services.tropicos.org/Name/Search?name=adonis&type=wildcard&apikey=7602bfa6-cd59-4029-a28d-3aeb0ff8836e&format=xml"). By this, all values are parsed and shown in the cells.

function SAMPLE(url, ignoreHeaders, orderedHeaders) {
  // 1. Retrieve XML data from URL.
  const res = UrlFetchApp.fetch(url);
  
  // 2. Parse XML data.
  const xmlObj = XmlService.parse(res.getContentText());
  const root = xmlObj.getRootElement();
  const names = root.getChildren();

  // 3. Retrieve header row.
  let h = [];
  if (orderedHeaders) {
    h = orderedHeaders.split(",").map(e => e.trim());
  } else {
    const hObj = names.reduce((o, e) => {
      e.getChildren().forEach(f => {
        o[f.getName()] = true
      });
      return o;
    }, {});
    if (ignoreHeaders) {
      ignoreHeaders.split(",").forEach(e => {
        if (hObj[e.trim()]) delete hObj[e.trim()];
      });
    }
    h = Object.keys(hObj);
  }

  // 4. Create an result array.
  const ns = root.getNamespace();
  const result = names.reduce((ar, e) => {
    const temp = h.map(h => {
      const t = e.getChild(h, ns);
      if (t) {
        const v = t.getValue();
        return isNaN(v) ? v : Number(v);
      }
      return "";
    });
    ar.push(temp);
    return ar;
  }, [h]);
  return result;
}
  • When =SAMPLE("URL") is put to a cell, all values are retrieved.
  • ignoreHeaders is used as the header you want to ignore. This is from */*[name()!='DisplayDate'] | */*[name()!='Family'] | */*[name()!='RankAbbreviation'] | */*[name()!='NomenclatureStatusID'] | */*[name()!='NomenclatureStatusName'] | */*[name()!='TotalRows'] in your xpath.
    • When you want to ignore the headers of DisplayDate,Family,RankAbbreviation,NomenclatureStatusID,NomenclatureStatusName,TotalRows, please put a custom function like =SAMPLE("URL";"DisplayDate,Family,RankAbbreviation,NomenclatureStatusID,NomenclatureStatusName,TotalRows"). By this, these headers are ignored from the retrieved values.
  • orderedHeaders is used as the custom ordered headers.
    • When you want to retrieve the values with the custom ordered headers like NameId,ScientificName,ScientificNameWithAuthors,Family,RankAbbreviation,NomenclatureStatusName,Author,DisplayReference,DisplayDate,TotalRows, please put a custom function like =SAMPLE("URL";;"NameId,ScientificName,ScientificNameWithAuthors,Family,RankAbbreviation,NomenclatureStatusName,Author,DisplayReference,DisplayDate,TotalRows")

Result:

When the URL of http://services.tropicos.org/Name/Search?name=adonis&type=wildcard&apikey=7602bfa6-cd59-4029-a28d-3aeb0ff8836e&format=xml is put to a cell "A1",

Pattern 1:

When =SAMPLE(A1) to a cell "B1", the following result is obtained.

enter image description here

Pattern 2:

When =SAMPLE(A1;"DisplayDate,Family,RankAbbreviation,NomenclatureStatusID,NomenclatureStatusName,TotalRows") to a cell "B1", the following result is obtained.

enter image description here

Pattern 3:

When =SAMPLE(A1;;"NameId,ScientificName,ScientificNameWithAuthors,Family,RankAbbreviation,NomenclatureStatusName,Author,DisplayReference,DisplayDate,TotalRows") to a cell "B1", the following result is obtained.

enter image description here

References:

1
votes

try:

=QUERY(IMPORTXML(A1;
 "*/*[name()='NameId'] | 
  */*[name()='ScientificName'] |
  */*[name()!='DisplayDate'] |
  */*[name()!='Family'] |
  */*[name()!='RankAbbreviation'] |
  */*[name()!='NomenclatureStatusID'] |
  */*[name()!='NomenclatureStatusName'] |
  */*[name()!='TotalRows']");
 "where Col10 matches '^\d+$'"; 0)

enter image description here

0
votes

In order to overpass the 30 seconds limitation time allowed by google sheet for the nice solution given by Tanaike, I have added a call from function onOpen() where the url is given by cell A1. Thank you again

function onOpen(){
    var classeur = SpreadsheetApp.getActive().getActiveSheet();
    var xmlOK = classeur.getRange(2,1);
  if (xmlOK.isBlank()){ 
    var url = classeur.getRange(1,1).getValue();
    var ignoreHeaders = "RankAbbreviation,NomenclatureStatusID,NomenclatureStatusName,TotalRows,Symbol";
    var formule = '=FILTERXML("'+url+'";"'+ignoreHeaders+'")';
    classeur.getRange(2,1).setFormula(formule);
  }
}