I'm attempting to cast multiple String columns to integers in a dataframe using PySpark 2.1.0. The data set is a rdd to begin, when created as a dataframe it generates the following error:
TypeError: StructType can not accept object 3 in type <class 'int'>
A sample of what I'm trying to do:
import pyspark.sql.types as typ
from pyspark.sql.functions import *
labels = [
('A', typ.StringType()),
('B', typ.IntegerType()),
('C', typ.IntegerType()),
('D', typ.IntegerType()),
('E', typ.StringType()),
('F', typ.IntegerType())
]
rdd = sc.parallelize(["1", 2, 3, 4, "5", 6])
schema = typ.StructType([typ.StructField(e[0], e[1], False) for e in labels])
df = spark.createDataFrame(rdd, schema)
df.show()
cols_to_cast = [dt[0] for dt in df.dtypes if dt[1]=='string']
#df2 = df.select(*(c.cast("integer").alias(c) for c in cols_to_cast))
df2 = df.select(*( df[dt[0]].cast("integer").alias(dt[0])
for dt in df.dtypes if dt[1]=='string'))
df2.show()
The problem to begin with is the dataframe is not being created based on the RDD. Thereafter, I have tried two ways to cast (df2), the first is commented out.
Any suggestions? Alternatively is there anyway I could use the .withColumn functions for casting all columns in 1 go, instead of specifying each column? The actual dataset, although not large, has many columns.
map
to a new RDD with your casted columns – OneCricketeer