I have a CSV file like this:
0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
My goal is to use decision trees to predict the last column (either normal or something else).
As you can see, the fields in my CSV file are not all the same type: there are strings, ints and doubles.
At first I wanted to create an RDD and use it like this:
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// parse each line into a 42-element tuple: this is what does not compile
def load_part1(file: String): RDD[(Int, String, String, String, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Double, Double, Double, Double, Double, Double, Double, Int, Int, Double, Double, Double, Double, Double, Double, Double, Double, String)] = {
  val data = context.textFile(file)
  val res = data.map { x =>
    val s = x.split(",")
    (s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt, s(6).toInt, s(7).toInt, s(8).toInt, s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt, s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt, s(19).toInt, s(20).toInt, s(21).toInt, s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble, s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble, s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble, s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
  }.persist(StorageLevel.MEMORY_AND_DISK)
  res
}
But it won't compile, because a tuple cannot have more than 22 fields in Scala.
And now I am stuck, because I don't know how to load and parse my CSV file to use it as training and test data for the decision tree.
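Would something along these lines be a better direction, keeping each row as an array and splitting off the label? Just a sketch of the idea, untested:

import org.apache.spark.rdd.RDD

// sketch: avoid the 22-field tuple limit by keeping the features in an array
val rows: RDD[(Array[String], String)] = context.textFile(file).map { line =>
  val s = line.split(",")
  (s.init, s.last) // every column except the last is a feature, the last is the label
}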
When I look at the decision tree examples in the Spark docs, they use the LIBSVM format: is this the only format I can use? Because the thing is:
- not all my features have the same type: do I need to convert all the features to the same type?
- my labels are not integers but strings, so do I need to convert them to integers in order to use the decision tree classifier? (see my sketch right after this list)
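If the answer to both is yes, here is roughly what I imagine the conversion would look like, assuming MLlib's LabeledPoint, with file being the path to my CSV and a little indexOf helper I made up to number the distinct values of each categorical column (I have not tested this):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val rows = context.textFile(file).map(_.split(","))

// helper I made up: map every distinct string value of a column to a numeric index
def indexOf(col: Int): Map[String, Double] =
  rows.map(_(col)).distinct().collect()
    .zipWithIndex.map { case (v, i) => v -> i.toDouble }.toMap

val protoIdx   = indexOf(1)  // protocol_type (tcp, udp, ...)
val serviceIdx = indexOf(2)  // service (http, ...)
val flagIdx    = indexOf(3)  // flag (SF, ...)
val labelIdx   = indexOf(41) // label (normal., ...)

// columns 0 and 4-40 are numeric, 1-3 are categorical, 41 is the label
val labeled = rows.map { s =>
  val features = Array(s(0).toDouble, protoIdx(s(1)), serviceIdx(s(2)), flagIdx(s(3))) ++
    s.slice(4, 41).map(_.toDouble)
  LabeledPoint(labelIdx(s(41)), Vectors.dense(features))
}

Is that a sensible way to do it, or is there a built-in way I am missing?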
I tried to look at some topics like this one or this one, but they are quite different: in the first link all the features have the same type (double), and following the second I tried to load and parse my data like this:
val csv = context.textFile("/home/hvfd8529/Datasets/KDDCup99/kddcup.data_10_percent_corrected") // original file
val data = csv.map(line => line.split(",").map(elem => elem.trim)) // split on commas and trim each field
But it took my computer almost 2 minutes, and on top of that it crashed?!
I am thinking about writing a little Python script to convert all the string fields into integers, so that I could apply a CSV2LibSVM Python script and then use the decision tree classifier like in the example from the Spark documentation, but is that really necessary? Can't I use my CSV file directly?
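In other words, could I feed the parsed RDD straight into the trainer and skip the LIBSVM round-trip entirely? Something like this, adapted from the docs example, with labeled and the index maps from my sketch above (the maxBins value of 100 is just my guess):

import org.apache.spark.mllib.tree.DecisionTree

val Array(training, test) = labeled.randomSplit(Array(0.7, 0.3))

// tell the tree which feature positions are categorical: feature index -> number of categories
val categoricalFeaturesInfo = Map(1 -> protoIdx.size, 2 -> serviceIdx.size, 3 -> flagIdx.size)

val model = DecisionTree.trainClassifier(
  training,
  labelIdx.size,           // numClasses
  categoricalFeaturesInfo,
  "gini",                  // impurity
  5,                       // maxDepth
  100)                     // maxBins, must be at least the largest category count

// fraction of test rows the model gets wrong
val testErr = test.map(p => if (model.predict(p.features) != p.label) 1.0 else 0.0).mean()

If something like that works, I am wondering whether the Python/CSV2LibSVM detour is needed at all.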
I am a newbie at Scala and Spark :) Thank you