I am trying to read a CSV file into an RDD in Spark (using Scala). I wrote a function to filter the data first so that the header line is excluded:

def isHeader(line: String): Boolean = {
  line.contains("id_1")
}

Then I run the following command:

val noheader = rawblocks.filter(x => !isHeader(x))

The rawblocks RDD reads data from a CSV file that is 26 MB in size.

I am getting a "Task not serializable" error. What could the solution be?

As above. TaskNotSerializable implies that something else in the class where your function is called is not serializable. If you provide more of the outer code, we can help. Also, your stack trace should say which class is not serializable. - A Spoty Spot

1 Answer

Most probably, you have defined your isHeader method inside a class which is not serializable. As a consequence, isHeader is tied to a non-serializable instance of said class, which is then shipped to executors via the closure.
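
To illustrate the capture described above, here is a minimal sketch. The question does not show the enclosing code, so the class name CsvJob and the method dropHeader are hypothetical, and Spark is assumed to be on the classpath.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical enclosing class; note that it does NOT extend Serializable.
class CsvJob {
  def isHeader(line: String): Boolean = line.contains("id_1")

  // isHeader(x) is shorthand for this.isHeader(x), so the function passed
  // to filter captures `this`, i.e. the whole CsvJob instance.
  def dropHeader(rawblocks: RDD[String]): RDD[String] =
    rawblocks.filter(x => !isHeader(x))
}
```

With this layout, calling dropHeader should fail with "Task not serializable" as soon as Spark checks the closure, because the captured CsvJob instance cannot be serialized.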

You can either define isHeader in a separate object or make the enclosing class serializable. The latter is not good practice: you would still ship the entire class instance with your job, which is rarely what you want.
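
A minimal sketch of the separate-object option, assuming a local SparkContext and a small made-up in-memory sample in place of the 26 MB CSV file. The names CsvHelpers, NoHeaderExample and dropHeader are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Standalone object: isHeader is reached through the singleton rather than
// through an enclosing instance, so the closure has nothing extra to capture.
object CsvHelpers {
  def isHeader(line: String): Boolean = line.contains("id_1")
}

object NoHeaderExample {
  // Keeps every line for which isHeader returns false.
  def dropHeader(rawblocks: RDD[String]): RDD[String] =
    rawblocks.filter(x => !CsvHelpers.isHeader(x))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("no-header-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up sample rows standing in for the real CSV file.
      val rawblocks = sc.parallelize(Seq("id_1,id_2", "1,2", "3,4"))
      val noheader = dropHeader(rawblocks)
      noheader.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Note that isHeader matches any line containing "id_1", not only the first line, so a data row that happens to contain that substring would also be dropped.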