1
votes

In NiFi, I'm trying to utilise the SplitRecord processor to change incoming XML files to Json.

The structure of the XML file is as follows:

<?xml version="1.0" encoding="UTF-8"?>
<docs>
   <doc boost="1.0">
      <field name="id">12345</field>
      <field name="name">john smith</field>
      <field name="age">34</field>
   </doc>
   <doc boost="1.0">
      <field name="id">74832</field>
      <field name="name">jane doe</field>
      <field name="age">16</field>
   </doc>
   <doc boost="1.0">
      <field name="id">08423</field>
      <field name="name">henry jones</field>
      <field name="age">29</field>
   </doc>
</docs>

I want to split each record but I also was to have the attributes of the field represented instead of just field itself being one giant array with no identifier.

Right now, in the SplitRecord processor I'm using XMLreader controller and JSONsetrecordWriter controller. If I just let the XML reader infer the schema, I get the following result

{ "doc": [ {
    "field" : ["12345", "john smith", "34"]
    },
    {
    "field" : ["74832", "jane doe", "16"]
    },
    {
    "field" : ["08423", "henry jones", "29"]
    } ]
}

You can see all the fields have lost their metadata, I want something closer to this:

{ 
"id":"12345",
"name":"john smith",
"age":"34"
},
{ 
"id":"74832",
"name":"jane doe",
"age":"16"
},
{ 
"id":"08423",
"name":"henry jones",
"age":"29"
}

In the XMLReader controller, if I switch to use "Schema Text" property, and in the schema text property field I put in this schema:

{
  "type" : "record",
  "name" : "docs",
  "namespace" : "doc",
  "fields" : [ {
    "name" : "boost", "type" : "string"
  }, 
  {
    "name" : "field",
    "type" : {
        "type" : "array",
            "items" : {
              "type" : "record",
              "name" : "field",
              "namespace" : "name",
              "fields" : [ 
              {"name" : "id","type": "string"},
              {"name":"name","type":"string"},
              {"name":"age","type":"string"}
              ]
            }
    }
  }
  ]
}

I get no result, in fact the output is now this:

{"boost":null,"field":[]}

I would really appreciate any help in where my schema is wrong in trying to convert this XML to JSON.

Update

I've determined my working schema

{
  "name": "docs",
  "namespace": "doc",
  "type": "record",
  "fields": [
      {
         "name" : "field",
         "type" : {
         "type" : "array",
            "items" : {
              "type" : "record",
              "name" : "field",
              "namespace" : "name",
              "fields" : [ 
                          {"name": "name", "type": "string"},
                          {"name": "value", "type": "string"}
                         ]
                      }
                  }
       }
     ]
}

in the xmlreader controller, also set the following properties:

Schema Access Strategy: "Use 'Schema Text' Property"

Schema Text: the above schema

Expect Records as Array: true

Field Name for Content: value

1

1 Answers

0
votes

One way you can accomplish this is to create a second schema that matches your desired output. Then use JoltTransformRecord to run a Jolt transformation. You would want to do a two-step for that like this:

Convert/SplitRecord -> JoltTransformRecord

Also, for your use case described, use ConvertRecord instead of SplitRecord unless you have A LOT of records and just want to split things up. If you're splitting records down to 1 record a piece, that's a very inefficient flow. The Record API can handle anywhere from 1 record at a time to several million records easily.