
I am trying to read a text file (.gz) with Spark 2.0 / SparkSession.

The field separator is ';'. The first few fields are loaded properly, but the last few fields, where the data doesn't exist, are not being read by Spark.

For example, everything up to ...h;7 is read by Spark, but nothing after that. Null fields are handled correctly when they appear before h;7;.

Why is Spark ignoring the last fields?

File Format:
1;2;6;;;;;h;7;;;;;;;;;

Code:

JavaRDD<mySchema> peopleRDD = spark.read()
        .textFile("file:///app/home/emm/zipfiles/myzips/")
        .javaRDD()
        .map(new Function<String, mySchema>() {
            @Override
            public mySchema call(String line) throws Exception {
                String[] parts = line.split(";");
                mySchema mySchema = new mySchema();

                mySchema.setCFIELD1(parts[0]);
                mySchema.setCFIELD2(parts[1]);
                mySchema.setCFIELD3(parts[2]);
                mySchema.setCFIELD4(parts[3]);
                mySchema.setCFIELD5(parts[4]);
                // ... remaining setters ...

                return mySchema;
            }
        });

1 Answer


The issue is with my Java code.

Passing -1 as the second argument to the split method takes care of this: with the default limit, Java's String.split discards trailing empty strings, so the empty fields after the last non-empty value never make it into the array.

                String[] parts = line.split(";",-1);
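For reference, here is a minimal standalone sketch (the class name SplitDemo is just for illustration, not part of the original code) showing the difference between the two limits on the sample record from the question. It can be compiled and run without Spark:

    public class SplitDemo {
        public static void main(String[] args) {
            // Sample record from the question: everything after "7" is empty.
            String line = "1;2;6;;;;;h;7;;;;;;;;;";

            // Default limit (0): trailing empty strings are dropped,
            // so the array ends at the last non-empty field ("7").
            String[] truncated = line.split(";");

            // Limit -1: the pattern is applied as many times as possible
            // and trailing empty strings are kept as "".
            String[] full = line.split(";", -1);

            System.out.println("split(\";\")     -> " + truncated.length + " fields");
            System.out.println("split(\";\", -1) -> " + full.length + " fields");
        }
    }

With -1, every position up to the last ';' is present in the array (as an empty string), so each parts[i] index used in the mapper exists even when the trailing fields are empty.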