I'm using Apache NiFi 1.5.0 and I need to split incoming files based on their content rather than on byte or line count. Some rows may even need to be discarded.
Suppose this is the incoming file (START is the known split point; the lines that follow may start with different words):
GARBAGE LINE 1234
START 53534
HIGHDATA 22
LOWDATA 885
START 1563632
HIGHDATA 252
HIGHDATA 20548
LOWDATA 240240
The first line must be discarded, then the content should be divided into two FlowFiles.
First FlowFile:
START 53534
HIGHDATA 22
LOWDATA 885
Second FlowFile:
START 1563632
HIGHDATA 252
HIGHDATA 20548
LOWDATA 240240
My idea is to apply a regular expression to each line of the FlowFile content. When a line matches, the current FlowFile is closed and sent out of the processor, and a new FlowFile is started for the rest of the content.
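To make the intended behaviour concrete, here is a minimal sketch of that logic in plain Python, outside NiFi. The function name `split_on_marker` and the `^START\b` pattern are only my illustration, not a NiFi API, and the sketch assumes the whole content fits in memory as a string.

```python
import re


def split_on_marker(text, marker=r"^START\b"):
    """Split text into chunks that each begin at a line matching `marker`.

    Lines that appear before the first matching line are discarded.
    """
    pattern = re.compile(marker)
    chunks = []
    current = None  # stays None until the first marker line is seen

    for line in text.splitlines():
        if pattern.match(line):
            # A marker line closes the chunk in progress and opens a new one.
            if current is not None:
                chunks.append("\n".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
        # Otherwise the line precedes the first marker and is dropped.

    if current is not None:
        chunks.append("\n".join(current))
    return chunks


sample = "\n".join([
    "GARBAGE LINE 1234",
    "START 53534",
    "HIGHDATA 22",
    "LOWDATA 885",
    "START 1563632",
    "HIGHDATA 252",
    "HIGHDATA 20548",
    "LOWDATA 240240",
])

chunks = split_on_marker(sample)
```

Each element of `chunks` corresponds to one of the FlowFiles I expect as output.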
But I can't set up any component that would let me achieve that:
RouteText works line by line, so it splits too much.
SplitText splits based on row count or content length.
The only processor that seems to do something is SplitRecord with a GrokExtract as reader, but then I can't get any writer to work by simply outputting plain text.