
If I read data from a CSV, all the columns are of "String" type by default. I generally inspect the data using the following functions, which give an overview of the data and its types (a quick sketch follows the list):

  • df.dtypes
  • df.show()
  • df.printSchema()
  • df.distinct().count()
  • df.describe().show()
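A minimal sketch of that inspection workflow, assuming an existing SparkSession and a hypothetical file data.csv:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Without inferSchema every column is read as string
df = spark.read.csv("data.csv", header=True)

print(df.dtypes)              # list of (column, type) pairs, all 'string' here
df.show(5)                    # first few rows
df.printSchema()              # tree view of the schema
print(df.distinct().count())  # number of distinct rows
df.describe().show()          # basic stats per column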

But if there is a column that I believe is of a particular type, e.g. Double, I cannot be sure that all its values really are doubles without business knowledge, because:

1- I cannot see all the values (there are millions of unique values).
2- If I explicitly cast it to double type, Spark quietly converts the column without throwing any exception, and the values which are not doubles are converted to "null". For example:

from pyspark.sql.types import DoubleType

# cast the 'id' column to double; non-numeric values silently become null
changedTypedf = df_original.withColumn('label', df_original['id'].cast(DoubleType()))
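A small sketch of that silent behaviour, using a hypothetical column that mixes numeric and non-numeric strings; the cast yields null exactly where the original value was not null but could not be parsed, which is one way to spot the bad rows:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
mixed = spark.createDataFrame([("1.5",), ("2.0",), ("oops",)], ["id"])

casted = mixed.withColumn("label", mixed["id"].cast(DoubleType()))
casted.show()  # the "oops" row gets label = null, no exception is raised

# Rows whose original value is not null but whose cast result is null failed the cast
bad = casted.filter(F.col("id").isNotNull() & F.col("label").isNull())
print(bad.count())  # 1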

What would be the best way to confirm the type of a column, then?

When reading from CSV you can set inferSchema=True and it will try to figure out the types for each column. – pault
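A minimal sketch of that suggestion, again assuming an existing SparkSession named spark and the hypothetical data.csv:

# With inferSchema Spark samples the data and guesses a type per column
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.printSchema()  # columns that parse cleanly as numbers come back as int/double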

1 Answer


In Scala, a DataFrame has a "schema" field; I would guess Python has the same:

df.schema.fields.find(_.name == "label").get.dataType
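For completeness, a PySpark sketch of the same lookup, assuming a column named label:

from pyspark.sql.types import DoubleType  # only needed for the comparison below

# Look up a single field's type by name in the schema
print(df.schema["label"].dataType)                  # e.g. DoubleType()
print(df.schema["label"].dataType == DoubleType())  # True/False check

# Or via the (name, type string) pairs exposed by dtypes
print(dict(df.dtypes)["label"])                     # e.g. 'double'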