Casting string to ArrayType(DoubleType) pyspark dataframe

Question

I have a dataframe in spark with the following schema: schema:

StructType(List(StructField(id,StringType,true),
StructField(daily_id,StringType,true),
StructField(activity,StringType,true)))

Column activity is a String, sample content:

{1.33,0.567,1.897,0,0.78}

I need to cast column Activity to a ArrayType(DoubleType)

In order to get that done i have run the following command:

df = df.withColumn("activity",split(col("activity"),",\s*").cast(ArrayType(DoubleType())))

The new schema of the dataframe changed accordingly:

StructType(List(StructField(id,StringType,true),
StructField(daily_id,StringType,true),
StructField(activity,ArrayType(DoubleType,true),true)))

However, the data now looks like this: [NULL,0.567,1.897,0,NULL]

It changed the first and last element of the array of strings to NULL. I can't figure out why Spark is doing this with the dataframe.

Can please help here on what is the issue?

Many Thanks

Does this answer your question? Spark: Convert column of string to an array — mazaneicha

Srinivas Srinivas · Accepted Answer · 2020-06-12T11:43:36

Because

Below code is not replacing { & }

df.withColumn("activity",F.split(F.col("activity"),",\s*")).show(truncate=False)
+-------------------------------+
|activity                       |
+-------------------------------+
|[{1.33, 0.567, 1.897, 0, 0.78}]|
+-------------------------------+

When you try to cast these {1.33 & 0.78} string values to DoubleType you will get null as output.

df.withColumn("activity",F.split(F.col("activity"),",\s*").cast(ArrayType(DoubleType()))).show(truncate=False)
+----------------------+
|activity              |
+----------------------+
|[, 0.567, 1.897, 0.0,]|
+----------------------+

Change this

df.withColumn("activity",split(col("activity"),",\s*").cast(ArrayType(DoubleType())))

to

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType
from pyspark.sql.types import DoubleType

df.select(F.split(F.regexp_replace(F.col("activity"),"[{ }]",""),",").cast("array<double>").alias("activity"))

Casting string to ArrayType(DoubleType) pyspark dataframe

4 Answers