Round Spark DataFrame in-place

Question

I read a .csv file into a Spark DataFrame. For a DoubleType column, is there a way to specify at read time that the column should be rounded to 2 decimal places? I'm also supplying a custom schema to the DataFrameReader API call. Here are my schema and API calls:

import org.apache.spark.sql.types._

val customSchema = StructType(Array(StructField("id_1", IntegerType, true),
            StructField("id_2", IntegerType, true),
            StructField("id_3", DoubleType, true)))

// Using Spark's CSV reader with a custom schema
// `spark` is a SparkSession
val parsedSchema = spark.read.format("csv").schema(customSchema).option("header", "true").option("nullvalue", "?").load("C:\\Scala\\SparkAnalytics\\block_1.csv")

After reading the file into a DataFrame, I can round the decimals like this:

parsedSchema.withColumn("cmp_fname_c1", round($"cmp_fname_c1", 3))

But this creates a new DataFrame, so I'd also like to know if it can be done in-place instead of creating a new DataFrame.

Thanks

In-place changes are not allowed on Spark DataFrames. They are immutable. - philantrovert
Is there any specific reason why creating a new DataFrame from an existing DataFrame is an issue for you? - wandermonk
Spark DataFrames are immutable, and any operation that transforms an existing DataFrame creates a new DataFrame. - wandermonk
Spend some time understanding Spark rather than asking questions. - wandermonk
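
To make the immutability point in the comments above concrete, here is a minimal sketch. It assumes a local SparkSession and a small in-memory DataFrame standing in for the CSV; the names `RoundAfterRead` and `roundColumn` are hypothetical helpers introduced only for illustration. The sketch shows that `withColumn` returns a new DataFrame and leaves the original untouched, while the column stays a DoubleType.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, round}

object RoundAfterRead {
  // Hypothetical helper: returns a NEW DataFrame with `name` rounded to `scale`
  // decimal places. The input DataFrame is not modified.
  def roundColumn(df: DataFrame, name: String, scale: Int): DataFrame =
    df.withColumn(name, round(col(name), scale))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here only for the demonstration.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("round-after-read")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical stand-in for the data loaded from the CSV file.
    val raw = Seq((1, 10, 5.555), (2, 20, 6.0), (3, 30, 7.444))
      .toDF("id_1", "id_2", "id_3")

    val rounded = roundColumn(raw, "id_3", 2)

    raw.show()      // still holds the unrounded values
    rounded.show()  // holds the rounded values; id_3 is still a double column

    spark.stop()
  }
}
```

Since a DataFrame is a lazily evaluated query plan, the "new" DataFrame is just a new plan over the same source, not a second copy of the data. If only the rounded version is needed, the read and the `withColumn` call can be chained so that only one value is ever bound.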

Leo C · Accepted Answer · 2018-05-01T05:52:36

You can specify, say, DecimalType(10, 2) instead of DoubleType for that column in your customSchema when loading your CSV file. Suppose you have a CSV file with the following content:

id_1,id_2,id_3
1,10,5.555
2,20,6.0
3,30,7.444

Sample code below:

import org.apache.spark.sql.types._

val customSchema = StructType(Array(
  StructField("id_1", IntegerType, true),
  StructField("id_2", IntegerType, true), 
  StructField("id_3", DecimalType(10, 2), true)
))

spark.read.format("csv").schema(customSchema).
  option("header", "true").option("nullvalue", "?").
  load("/path/to/csvfile").
  show
// +----+----+----+
// |id_1|id_2|id_3|
// +----+----+----+
// |   1|  10|5.56|
// |   2|  20|6.00|
// |   3|  30|7.44|
// +----+----+----+
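
As a rough illustration of what a scale of 2 does to the sample values, the plain-Scala sketch below (no Spark required) rounds the same strings with `scala.math.BigDecimal`. It assumes HALF_UP rounding, which is consistent with the output shown above; it is not Spark's own parsing code, and `DecimalScaleSketch` and `toScale` are hypothetical names used only for this example.

```scala
import scala.math.BigDecimal.RoundingMode

object DecimalScaleSketch {
  // Parse a decimal string and round it to `scale` fractional digits.
  // Assumption: HALF_UP rounding, matching the sample output above.
  def toScale(raw: String, scale: Int = 2): BigDecimal =
    BigDecimal(raw).setScale(scale, RoundingMode.HALF_UP)

  def main(args: Array[String]): Unit = {
    Seq("5.555", "6.0", "7.444").foreach { s =>
      println(s"$s -> ${toScale(s)}")
    }
    // prints:
    // 5.555 -> 5.56
    // 6.0 -> 6.00
    // 7.444 -> 7.44
  }
}
```

One thing to keep in mind with this approach: the column's type in the resulting DataFrame is DecimalType(10, 2), not DoubleType. If downstream code expects a double, rounding after the read with `round` keeps the column as a double.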
