I have a CSV file like this:
0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal.
My goal is to use decision trees to predict the last column (either normal or something else).
As you can see, the fields in my CSV file are not all the same type: there are strings, ints and doubles.
At first I wanted to create an RDD and use it like this:
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// parse each line into a 42-element tuple: this is what does not compile
def load_part1(file: String): RDD[(Int, String, String, String, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Int, Double, Double, Double, Double, Double, Double, Double, Int, Int, Double, Double, Double, Double, Double, Double, Double, Double, String)] = {
  val data = context.textFile(file)
  val res = data.map { x =>
    val s = x.split(",")
    (s(0).toInt, s(1), s(2), s(3), s(4).toInt, s(5).toInt, s(6).toInt, s(7).toInt, s(8).toInt, s(9).toInt, s(10).toInt, s(11).toInt, s(12).toInt, s(13).toInt, s(14).toInt, s(15).toInt, s(16).toInt, s(17).toInt, s(18).toInt, s(19).toInt, s(20).toInt, s(21).toInt, s(22).toInt, s(23).toInt, s(24).toDouble, s(25).toDouble, s(26).toDouble, s(27).toDouble, s(28).toDouble, s(29).toDouble, s(30).toDouble, s(31).toInt, s(32).toInt, s(33).toDouble, s(34).toDouble, s(35).toDouble, s(36).toDouble, s(37).toDouble, s(38).toDouble, s(39).toDouble, s(40).toDouble, s(41))
  }.persist(StorageLevel.MEMORY_AND_DISK)
  res
}
But it won't compile, because a tuple cannot have more than 22 fields in Scala.
And now I am stuck, because I don't know how to load and parse my CSV file to use it as training and test data for the decision tree.
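Would something along these lines be a better direction, keeping each row as an array and splitting off the label? Just a sketch of the idea, untested:

import org.apache.spark.rdd.RDD

// sketch: avoid the 22-field tuple limit by keeping the features in an array
val rows: RDD[(Array[String], String)] = context.textFile(file).map { line =>
  val s = line.split(",")
  (s.init, s.last) // every column except the last is a feature, the last is the label
}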
When I look at the decision tree examples in the Spark docs, they use the LIBSVM format: is this the only format I can use? Because the thing is:
- not all my features have the same type: do I need to convert all the features to the same type?
- my labels are not integers but strings, so do I need to convert them to integers in order to use the decision tree classifier? (see my sketch right after this list)
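If the answer to both is yes, here is roughly what I imagine the conversion would look like, assuming MLlib's LabeledPoint, with file being the path to my CSV and a little indexOf helper I made up to number the distinct values of each categorical column (I have not tested this):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val rows = context.textFile(file).map(_.split(","))

// helper I made up: map every distinct string value of a column to a numeric index
def indexOf(col: Int): Map[String, Double] =
  rows.map(_(col)).distinct().collect()
    .zipWithIndex.map { case (v, i) => v -> i.toDouble }.toMap

val protoIdx   = indexOf(1)  // protocol_type (tcp, udp, ...)
val serviceIdx = indexOf(2)  // service (http, ...)
val flagIdx    = indexOf(3)  // flag (SF, ...)
val labelIdx   = indexOf(41) // label (normal., ...)

// columns 0 and 4-40 are numeric, 1-3 are categorical, 41 is the label
val labeled = rows.map { s =>
  val features = Array(s(0).toDouble, protoIdx(s(1)), serviceIdx(s(2)), flagIdx(s(3))) ++
    s.slice(4, 41).map(_.toDouble)
  LabeledPoint(labelIdx(s(41)), Vectors.dense(features))
}

Is that a sensible way to do it, or is there a built-in way I am missing?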
I tried to look at some topics like this one or this one, but they are quite different: in the first link all the features have the same type (double), and following the second I tried to load and parse my data like this:
val csv = context.textFile("/home/hvfd8529/Datasets/KDDCup99/kddcup.data_10_percent_corrected") // original file
val data = csv.map(line => line.split(",").map(elem => elem.trim)) // split on commas and trim each field
But it took my computer almost 2 minutes, and on top of that it crashed?!
I am thinking about writing a little Python script to convert all the string fields into integers, so that I could apply a CSV2LibSVM Python script and then use the decision tree classifier like in the example from the Spark documentation, but is that really necessary? Can't I use my CSV file directly?
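In other words, could I feed the parsed RDD straight into the trainer and skip the LIBSVM round-trip entirely? Something like this, adapted from the docs example, with labeled and the index maps from my sketch above (the maxBins value of 100 is just my guess):

import org.apache.spark.mllib.tree.DecisionTree

val Array(training, test) = labeled.randomSplit(Array(0.7, 0.3))

// tell the tree which feature positions are categorical: feature index -> number of categories
val categoricalFeaturesInfo = Map(1 -> protoIdx.size, 2 -> serviceIdx.size, 3 -> flagIdx.size)

val model = DecisionTree.trainClassifier(
  training,
  labelIdx.size,           // numClasses
  categoricalFeaturesInfo,
  "gini",                  // impurity
  5,                       // maxDepth
  100)                     // maxBins, must be at least the largest category count

// fraction of test rows the model gets wrong
val testErr = test.map(p => if (model.predict(p.features) != p.label) 1.0 else 0.0).mean()

If something like that works, I am wondering whether the Python/CSV2LibSVM detour is needed at all.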
I am a newbie at Scala and Spark :) Thank you