
I'm trying to write a fairly simple XML file stored in HDFS to HBase. I'd like to transform the XML file into JSON and create one HBase row for each element of the JSON array. This is the XML structure:

<?xml version="1.0" encoding="UTF-8"?>
<customers>
<customer customerid="1" name="John Doe"></customer>
<customer customerid="2" name="Tommy Mels"></customer>
</customers>

And these are the desired HBase output rows:

1    {"customerid":"1","name":"John Doe"}
2    {"customerid":"2","name":"Tommy Mels"}

I've tried many different processors for my flow, but this is what I have now: GetHDFS -> ConvertRecord -> SplitJson -> PutHBaseCell. The ConvertRecord step works fine and converts the XML file to JSON properly, but I can't manage to split the JSON into two records. This is what I've managed to write to HBase so far (with a different processor combination):

c5927a55-d217-4dc1-af04-0aff743 column=person:rowkey, timestamp=1574329272237, value={"customerid":"1","name":"John Doe"}\x0A{
 cfe4e                           "customerid":"2","name":"Tommy Mels"}

For the SplitJson processor I'm using the following JsonPath expression: $.*

As of now, I'm getting an IllegalArgumentException in the PutHBaseCell processor stating that the row length is 0. These are the PutHBaseCell processor settings:

[screenshot: PutHBaseCell processor settings]

Any hints?


1 Answer


I think the issue is that SplitJson isn't working because technically the content of your flow file is multiple JSON documents, one per line. SplitJson would expect them to be inside an array, like:

[
    {"customerid":"1","name":"John Doe"},
    {"customerid":"2","name":"Tommy Mels"}
]
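You can see the difference with plain Python as a stand-in (this is just an illustration of the two content shapes, not NiFi itself): the JSON-per-line content is not a single parseable document, so `$.*` has nothing to iterate over, while the array form parses as one document whose elements can each become a split flow file.

```python
import json

# Content as produced by ConvertRecord: one JSON object per line.
# This is NOT a single valid JSON document.
json_per_line = (
    '{"customerid":"1","name":"John Doe"}\n'
    '{"customerid":"2","name":"Tommy Mels"}'
)

try:
    json.loads(json_per_line)
except json.JSONDecodeError:
    # Parsing stops after the first object and fails on the "extra data",
    # which is why SplitJson with $.* can't split this content.
    print("not a single JSON document")

# The array form that SplitJson expects parses as one document...
as_array = (
    '[{"customerid":"1","name":"John Doe"},'
    '{"customerid":"2","name":"Tommy Mels"}]'
)
records = json.loads(as_array)

# ...and $.* would then yield one flow file per element:
for record in records:
    print(json.dumps(record))
```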

One option is to use SplitRecord with a JsonTreeReader which should be able to understand the json-per-line format.

Another option is to avoid splitting altogether and go from ConvertRecord -> PutHBaseRecord with a JsonTreeReader.
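To make the record-per-row mapping concrete, here is a plain-Python sketch (not NiFi) of what the record-oriented approach ends up writing, assuming the row key is taken from the `customerid` field as in the question and the whole record JSON is stored as the cell value:

```python
import json

# Flow file content after ConvertRecord: one JSON record per line.
flowfile = (
    '{"customerid":"1","name":"John Doe"}\n'
    '{"customerid":"2","name":"Tommy Mels"}'
)

# One HBase row per record: row key from customerid,
# the serialized record as the cell value.
rows = {}
for line in flowfile.splitlines():
    record = json.loads(line)
    rows[record["customerid"]] = json.dumps(record)

for key, value in rows.items():
    print(key, value)
```

This mirrors the desired output rows in the question: keys `1` and `2`, each holding the corresponding customer record.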