1 vote

I created a dataset in Spark using Java by reading a csv file. Following is my initial dataset:

+---+----------+-----+---+
|_c0|       _c1|  _c2|_c3|
+---+----------+-----+---+
|  1|9090999999|NANDU| 22|
|  2|9999999999| SANU| 21|
|  3|9999909090| MANU| 22|
|  4|9090909090|VEENA| 23|
+---+----------+-----+---+

I want to create a dataframe as follows (one column holding null values):

+---+----+--------+
|_c0| _c1|     _c2|
+---+----+--------+
|  1|null|   NANDU|
|  2|null|    SANU|
|  3|null|    MANU|
|  4|null|   VEENA|
+---+----+--------+

Following is my existing code:

Dataset<Row> ds = spark.read().format("csv").option("header", "false").load("/home/nandu/Data.txt");
Column[] selectedColumns = new Column[2];
selectedColumns[0] = new Column("_c0");
selectedColumns[1] = new Column("_c2");
Dataset<Row> ds2 = ds.select(selectedColumns);

which creates a dataset as follows:

+---+-----+
|_c0|  _c2|
+---+-----+
|  1|NANDU|
|  2| SANU|
|  3| MANU|
|  4|VEENA|
+---+-----+

3 Answers

2 votes

To select the two columns you want and add a new one filled with nulls, you can use the following:

import static org.apache.spark.sql.functions.*;
import org.apache.spark.sql.types.DataTypes;

Dataset<Row> ds2 = ds.select(col("_c0"), lit(null).cast(DataTypes.StringType).as("_c1"), col("_c2"));
1 vote

Try the following code (in Scala):

import org.apache.spark.sql.functions.{ lit => flit}
import org.apache.spark.sql.types._
val ds = spark.range(100).withColumn("c2",$"id")
ds.withColumn("new_col",flit(null: String)).selectExpr("id","new_col","c2").show(5)

Hope this helps.

Cheers :)

1 vote

Adding a new column with a typed null value may solve the problem. Try the following code; although it's written in Scala, you'll get the idea:

import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType
val ds2 = ds.withColumn("new_col", lit(null).cast(StringType)).selectExpr("_c0", "new_col as _c1", "_c2")
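Since the question is in Java, here is a minimal sketch of the same approach translated to Java (the local-mode SparkSession setup and class name are assumptions for illustration; the file path is the one from the question):

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;

public class NullColumnExample {
    public static void main(String[] args) {
        // Assumption: local mode, just for trying this out.
        SparkSession spark = SparkSession.builder()
                .appName("NullColumnExample")
                .master("local[*]")
                .getOrCreate();

        Dataset<Row> ds = spark.read()
                .format("csv")
                .option("header", "false")
                .load("/home/nandu/Data.txt");

        // Add a null column cast to StringType, then select the
        // columns in the desired order, renaming the new one to _c1.
        Dataset<Row> ds2 = ds
                .withColumn("new_col", lit(null).cast(DataTypes.StringType))
                .select(col("_c0"), col("new_col").as("_c1"), col("_c2"));

        ds2.show();
        spark.stop();
    }
}
```

Note that the cast matters: `lit(null)` alone yields a column of NullType, while casting to StringType gives the schema the question's original `_c1` column would naturally have.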