I'm using Apache NiFi 1.5.0 and I need to split incoming files based on their content rather than on byte or line count. Some rows may also need to be discarded.

Suppose this is the incoming file (START is the known split point; the lines that follow it may start with different words):

GARBAGE LINE 1234
START 53534
HIGHDATA 22
LOWDATA 885
START 1563632
HIGHDATA 252
HIGHDATA 20548
LOWDATA 240240

The first line must be discarded, then the remaining content should be divided into two FlowFiles.

First FlowFile:

START 53534
HIGHDATA 22
LOWDATA 885

Second FlowFile:

START 1563632
HIGHDATA 252
HIGHDATA 20548
LOWDATA 240240

My idea is to apply a regular expression to each line of the FlowFile content. When a line matches, the current FlowFile is closed and sent out of the processor, and a new FlowFile is started for the content that follows.
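To make the intended behaviour concrete, here is a minimal Python sketch of that logic, run outside NiFi on the sample file above. The function name `split_on_marker` and the `^START\b` pattern are assumptions chosen for illustration; nothing here is a NiFi API.

```python
import re


def split_on_marker(text, marker_pattern=r"^START\b"):
    """Group lines into chunks, opening a new chunk at each marker line.

    Lines that appear before the first marker line are discarded.
    """
    marker = re.compile(marker_pattern)
    chunks = []
    current = None  # stays None until the first marker line is seen

    for line in text.splitlines():
        if marker.match(line):
            # A marker closes the chunk in progress and opens a new one.
            if current is not None:
                chunks.append("\n".join(current))
            current = [line]
        elif current is not None:
            current.append(line)
        # Otherwise the line precedes the first marker and is dropped.

    if current is not None:
        chunks.append("\n".join(current))
    return chunks


sample = """GARBAGE LINE 1234
START 53534
HIGHDATA 22
LOWDATA 885
START 1563632
HIGHDATA 252
HIGHDATA 20548
LOWDATA 240240"""

chunks = split_on_marker(sample)
for number, chunk in enumerate(chunks, start=1):
    print(f"--- chunk {number} ---")
    print(chunk)
```

Each element of `chunks` corresponds to the content of one outgoing FlowFile.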

But I can't set up any component that would let me achieve that:

  • RouteText works line by line, so it splits too much
  • SplitText splits based on row count or content length

The only processor that seems to come close is SplitRecord with a Grok-based reader (GrokReader), but then I can't find a writer that simply outputs plain text.

1 Answer

The SplitContent processor does what I'm looking for. It takes a byte sequence, which can be entered as text, and splits the content at every occurrence of that sequence, producing a new FlowFile for each resulting segment.

Garbage rows can be excluded with a RouteText processor, configured with a custom property that holds a regular expression matching the desired lines.
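
As a rough sketch, the two processors could be configured like this for the sample file in the question. The property names are as I recall them from the NiFi 1.x documentation and should be checked against your version. The dynamic property name `keep` and its regular expression are assumptions made for illustration.

```
RouteText (runs first and drops the garbage lines)
  Routing Strategy       : Route to each matching Property Name
  Matching Strategy      : Matches Regular Expression
  keep                   : ^(START|HIGHDATA|LOWDATA)\s.*    (hypothetical dynamic property)

SplitContent (fed from the "keep" relationship of RouteText)
  Byte Sequence Format   : Text
  Byte Sequence          : START
  Keep Byte Sequence     : true
  Byte Sequence Location : Leading
```

Filtering before splitting matters here: if the garbage line reached SplitContent, the text before the first START would come out as a FlowFile of its own. Also note that SplitContent matches raw bytes rather than whole lines, so this setup assumes START never appears anywhere other than at the beginning of a record.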