3
votes

I'm doing a small project for university using Apache NiFi and Apache Spark. I want to create a NiFi workflow that reads TSV files from HDFS, processes them with Spark Streaming, and stores the information I need in MySQL. I've already created my workflow in NiFi, and the storage part is working. The problem is that I can't parse the NiFi data packets so I can use them.

The files contain rows like this:

linea1File1 TheReceptionist 653 Entertainment   424 13021   4.34    1305    744 DjdA-5oKYFQ NxTDlnOuybo c-8VuICzXtU

Each space shown above is actually a tab ("\t").

This is my code in Spark using Scala:

 // Streaming context with a 10-second batch interval
 val ssc = new StreamingContext(config, Seconds(10))
 // Receive flow files from NiFi via site-to-site (conf is the site-to-site client config)
 val packet = ssc.receiverStream(new NiFiReceiver(conf, StorageLevel.MEMORY_ONLY))
 // Decode each data packet's content into a single string
 val file = packet.map(dataPacket => new String(dataPacket.getContent, StandardCharsets.UTF_8))

Up to this point I can obtain my entire file (7000+ rows) as a single string, but unfortunately I can't split that string into rows. I need the file as individual rows so I can parse each one into an object, apply some operations, and store what I want.
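For reference, the splitting itself is plain string work, independent of Spark: split the packet's decoded content on newlines, then each row on tabs. The sample string and the `Video` case class below are hypothetical stand-ins for the real TSV schema; in the streaming job the same logic would run inside a `flatMap`/`map` over the DStream.

```scala
// Hypothetical case class for a parsed row (field names are assumptions)
case class Video(id: String, category: String, views: Int)

// One NiFi data packet decoded to a string: many tab-separated rows, one per line
val content = "v1\tEntertainment\t424\nv2\tMusic\t908"

// Split the single string into rows on newlines, then each row into fields on tabs
val videos = content.split("\n").toSeq.map { line =>
  val f = line.split("\t")
  Video(f(0), f(1), f(2).toInt)
}
```

In the Spark code above this would correspond to something like `file.flatMap(_.split("\n"))` followed by a `map` that builds the object.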

Can anyone help me with this?


1 Answer

4
votes

Each data packet is going to be the content of one flow file from NiFi, so if NiFi picks up one TSV file from HDFS that has a lot of rows, all those rows will be in one data packet.

It is hard to say without seeing your NiFi flow, but you could probably use SplitText with a Line Split Count of 1 to split your TSV into individual rows in NiFi before it reaches Spark Streaming.
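With that split in place, each data packet arriving in Spark carries exactly one row, so the receiver-side code only needs to decode and tab-split each packet. A minimal sketch of that per-packet logic, with the packet contents simulated as byte arrays (which is what `dataPacket.getContent` returns); the sample values are hypothetical:

```scala
import java.nio.charset.StandardCharsets

// After SplitText (Line Split Count = 1), each NiFi data packet holds one TSV row.
// Simulate two packets' contents as raw bytes:
val packets: Seq[Array[Byte]] =
  Seq("v1\tTheReceptionist\t653", "v2\tEvolutionOfDance\t908")
    .map(_.getBytes(StandardCharsets.UTF_8))

// Decode each packet and split it into its tab-separated fields
val rows: Seq[Array[String]] =
  packets.map(bytes => new String(bytes, StandardCharsets.UTF_8).split("\t"))
```

In the streaming job the same `map` would be applied to the DStream of data packets instead of a local `Seq`.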