In NiFi, I'm trying to utilise the SplitRecord processor to change incoming XML files to Json.
The structure of the XML file is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<docs>
<doc boost="1.0">
<field name="id">12345</field>
<field name="name">john smith</field>
<field name="age">34</field>
</doc>
<doc boost="1.0">
<field name="id">74832</field>
<field name="name">jane doe</field>
<field name="age">16</field>
</doc>
<doc boost="1.0">
<field name="id">08423</field>
<field name="name">henry jones</field>
<field name="age">29</field>
</doc>
</docs>
I want to split each record but I also was to have the attributes of the field represented instead of just field itself being one giant array with no identifier.
Right now, in the SplitRecord processor I'm using XMLreader controller and JSONsetrecordWriter controller. If I just let the XML reader infer the schema, I get the following result
{ "doc": [ {
"field" : ["12345", "john smith", "34"]
},
{
"field" : ["74832", "jane doe", "16"]
},
{
"field" : ["08423", "henry jones", "29"]
} ]
}
You can see all the fields have lost their metadata, I want something closer to this:
{
"id":"12345",
"name":"john smith",
"age":"34"
},
{
"id":"74832",
"name":"jane doe",
"age":"16"
},
{
"id":"08423",
"name":"henry jones",
"age":"29"
}
In the XMLReader controller, if I switch to use "Schema Text" property, and in the schema text property field I put in this schema:
{
"type" : "record",
"name" : "docs",
"namespace" : "doc",
"fields" : [ {
"name" : "boost", "type" : "string"
},
{
"name" : "field",
"type" : {
"type" : "array",
"items" : {
"type" : "record",
"name" : "field",
"namespace" : "name",
"fields" : [
{"name" : "id","type": "string"},
{"name":"name","type":"string"},
{"name":"age","type":"string"}
]
}
}
}
]
}
I get no result, in fact the output is now this:
{"boost":null,"field":[]}
I would really appreciate any help in where my schema is wrong in trying to convert this XML to JSON.
Update
I've determined my working schema
{
"name": "docs",
"namespace": "doc",
"type": "record",
"fields": [
{
"name" : "field",
"type" : {
"type" : "array",
"items" : {
"type" : "record",
"name" : "field",
"namespace" : "name",
"fields" : [
{"name": "name", "type": "string"},
{"name": "value", "type": "string"}
]
}
}
}
]
}
in the xmlreader controller, also set the following properties:
Schema Access Strategy: "Use 'Schema Text' Property"
Schema Text: the above schema
Expect Records as Array: true
Field Name for Content: value