If I read data from a CSV, all the columns will be of "String" type by default. I generally inspect the data using the following functions, which give an overview of the data and its types (a minimal sketch follows the list):
- df.dtypes
- df.show()
- df.printSchema()
- df.distinct().count()
- df.describe().show()
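
Here is a minimal sketch of what I mean (the SparkSession setup and the file name data.csv are just placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("csv-type-check").getOrCreate()

    # Without inferSchema, every column is read as StringType
    df = spark.read.csv("data.csv", header=True)

    df.printSchema()              # column names and their (all string) types
    print(df.dtypes)              # same information as a list of (name, type) tuples
    df.show(5)                    # first few rows
    print(df.distinct().count())  # number of distinct rows
    df.describe().show()          # basic summary statistics per column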
But if there is a column that I believe is of a particular type, e.g. Double, I cannot be sure that all its values really are doubles, because I don't have business knowledge of the data and because:

1. I cannot see all the values (there are millions of unique values).
2. If I explicitly cast it to double type, Spark quietly converts the type without throwing any exception, and any value that is not a double becomes "null" - for example:
    from pyspark.sql.types import DoubleType

    changedTypedf = df_original.withColumn('label', df_original['id'].cast(DoubleType()))
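
One way I can at least surface the offending values is to compare nulls before and after the cast (a sketch, reusing the df_original frame and its id column from above):

    from pyspark.sql import functions as F
    from pyspark.sql.types import DoubleType

    # The cast itself never fails; unparseable strings silently become null
    changedTypedf = df_original.withColumn('label', F.col('id').cast(DoubleType()))

    # Rows where the original value was present but the cast produced null
    # are exactly the values that are not valid doubles
    bad_rows = changedTypedf.filter(F.col('label').isNull() & F.col('id').isNotNull())
    print(bad_rows.count())
    bad_rows.select('id').distinct().show(20, truncate=False)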
What would be the best way to confirm the type of a column, then?
Pass inferSchema=True when you read the CSV and it will try to figure out the types for each column. – pault
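
A sketch of what that suggestion looks like (same placeholder file name as above); if a column still comes back as string after inference, it most likely contains at least some values that do not parse as a number:

    # Let Spark sample the file and infer a type for each column
    df = spark.read.csv("data.csv", header=True, inferSchema=True)

    df.printSchema()    # columns that parse cleanly now show up as int/double instead of string
    print(df.dtypes)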