
My requirement is to cast all Decimal data types in a DataFrame to String. The logic works fine for simple types but not for ArrayType. Here is the logic:

var df = spark.sql("select * from test_1")
for(dt <- df.dtypes) {
  if(dt._2.substring(0,7) == "Decimal"){
    df = df.withColumn(dt._1,df(dt._1).cast("String"))  
  }
}

But the columns within the ArrayType remain unchanged even though they are of Decimal type. Please help: how can I loop through the nested elements and cast them to String? This is the schema of my DataFrame:

scala> df.schema
res77: org.apache.spark.sql.types.StructType = StructType(
  StructField(mstr_prov_id,StringType,true),
  StructField(prov_ctgry_cd,StringType,true),
  StructField(prov_orgnl_efctv_dt,TimestampType,true),
  StructField(prov_trmntn_dt,TimestampType,true),
  StructField(prov_trmntn_rsn_cd,StringType,true),
  StructField(npi_rqrd_ind,StringType,true),
  StructField(prov_stts_aray_txt,ArrayType(StructType(
    StructField(PROV_STTS_KEY,DecimalType(22,0),true),
    StructField(PROV_STTS_EFCTV_DT,TimestampType,true),
    StructField(PROV_STTS_CD,StringType,true),
    StructField(PROV_STTS_TRMNTN_DT,TimestampType,true),
    StructField(PROV_STTS_TRMNTN_RSN_CD,StringType,true)),true),true))

You have to write a UDF which converts each decimal in the array to a string, because the type of the column is an array of decimals. – sai pradeep kumar kotha
Can you share sample code for this requirement? I am looking for a UDF which implements this task. – Vinitkumar
I know that I will require a UDF which iterates through the array elements and casts all Decimal types to String, but I don't know how to write that code. Hence, I am looking for sample code (a UDF) for the same. – Vinitkumar
Do you want to change all the data in the prov_stts_aray_txt column to string? Is that so? – Ramesh Maharjan
@RameshMaharjan Only the fields with decimal type in prov_stts_aray_txt have to be converted to String. I tried a lot, but couldn't find any clue. – Vinitkumar
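
For reference, a minimal, hypothetical sketch of the UDF approach suggested in these comments (field order and names are taken from the schema shown above; the rebuilt structs come back with generic field names _1 ... _5, so rename them afterwards if the original names matter):

import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Hypothetical sketch: rebuild each struct in the array with the decimal
// PROV_STTS_KEY rendered as a plain string. Positions follow the schema above.
val stringifyStatusKey = udf((arr: Seq[Row]) =>
  if (arr == null) null
  else arr.map { r =>
    (
      Option(r.getDecimal(0)).map(_.toPlainString).orNull, // PROV_STTS_KEY -> String
      r.getTimestamp(1),                                   // PROV_STTS_EFCTV_DT
      r.getString(2),                                      // PROV_STTS_CD
      r.getTimestamp(3),                                   // PROV_STTS_TRMNTN_DT
      r.getString(4)                                       // PROV_STTS_TRMNTN_RSN_CD
    )
  })

val converted = df.withColumn("prov_stts_aray_txt", stringifyStatusKey(df("prov_stts_aray_txt")))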

3 Answers

1 vote

If you are using Spark 2.1 and above, then the following casting should work for you:

import org.apache.spark.sql.types.{DataType, StructType}
import org.apache.spark.sql.functions.col

val newSchema = DataType.fromJson(df.schema.json.replaceAll("(decimal\\(\\d+,\\d+\\))", "string")).asInstanceOf[StructType]
df.select(newSchema.map(field => col(field.name).cast(field.dataType)): _*)

This should cast all the decimal types to string type.

But if you are using a Spark version lower than the one mentioned then, since there is a timestamp datatype in the struct column, you will encounter:

scala.MatchError: TimestampType (of class org.apache.spark.sql.types.TimestampType$)

This is the known issue "casting structs fails on Timestamp fields", tracked and resolved as "cast struct with timestamp field fails".

0 votes

You can also cast complex types. For example, if you have a DataFrame with this schema:

root
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- i: decimal(22,0) (nullable = true)
 |    |    |-- j: double (nullable = false)

you can cast all array elements of type decimal (field i in this example) by doing:

df
  .select($"arr".cast("array<struct<i:string,j:double>>"))
  .printSchema()

root
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- i: string (nullable = true)
 |    |    |-- j: double (nullable = true)

EDIT: If you don't know the schema in advance, you can just replace the decimal in the original schema with string:

val arraySchema = df.schema.fields(0).dataType.simpleString
val castedSchema = arraySchema.replaceAll("decimal\\(.*\\)","string")

df
  .select($"arr".cast(castedSchema))
  .show()
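
If you need the same treatment for every column rather than a single known array column, a rough sketch of the same idea, assuming a Spark version where such nested casts are supported (see the note on the Timestamp issue in the first answer), would be:

import org.apache.spark.sql.functions.col

// Rewrite each column's type string, replacing decimal(p,s) with string,
// then cast the column to the rewritten type (a no-op for columns without decimals).
val castedCols = df.schema.fields.map { f =>
  val newType = f.dataType.simpleString.replaceAll("decimal\\(\\d+,\\d+\\)", "string")
  col(f.name).cast(newType).as(f.name)
}

df.select(castedCols: _*).printSchema()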

0 votes

Try this (your comparison with == is probably not what you want):

var df = spark.sql("select * from test_1")
for (dt <- df.dtypes) {
  // Math.min guards against type names shorter than 7 characters;
  // equalsIgnoreCase avoids case mismatches
  if ("Decimal".equalsIgnoreCase(dt._2.substring(0, Math.min(7, dt._2.length)))) {
    df = df.withColumn(dt._1, df(dt._1).cast("String"))
  }
}