
If I read data from a CSV, all the columns are of "String" type by default. I generally inspect the data using the following functions, which give an overview of the data and its types (a quick sketch follows the list):

  • df.dtypes
  • df.show()
  • df.printSchema()
  • df.distinct().count()
  • df.describe().show()
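A minimal sketch of that inspection workflow, assuming an existing SparkSession and a hypothetical file data.csv:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Without inferSchema every column is read as string
df = spark.read.csv("data.csv", header=True)

print(df.dtypes)              # list of (column, type) pairs, all 'string' here
df.show(5)                    # first few rows
df.printSchema()              # tree view of the schema
print(df.distinct().count())  # number of distinct rows
df.describe().show()          # basic stats per column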

But if there is a column that I believe is of a particular type, e.g. Double, I cannot be sure that all its values really are doubles without business knowledge, because:

1- I cannot see all the values (there are millions of unique values).
2- If I explicitly cast it to double type, Spark quietly converts the column without throwing any exception, and the values which are not doubles are converted to "null". For example:

from pyspark.sql.types import DoubleType

# cast the 'id' column to double; non-numeric values silently become null
changedTypedf = df_original.withColumn('label', df_original['id'].cast(DoubleType()))
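A small sketch of that silent behaviour, using a hypothetical column that mixes numeric and non-numeric strings; the cast yields null exactly where the original value was not null but could not be parsed, which is one way to spot the bad rows:

from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()
mixed = spark.createDataFrame([("1.5",), ("2.0",), ("oops",)], ["id"])

casted = mixed.withColumn("label", mixed["id"].cast(DoubleType()))
casted.show()  # the "oops" row gets label = null, no exception is raised

# Rows whose original value is not null but whose cast result is null failed the cast
bad = casted.filter(F.col("id").isNotNull() & F.col("label").isNull())
print(bad.count())  # 1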

What would be the best way to confirm the type of a column, then?

When reading from CSV you can set inferSchema=True and it will try to figure out the types for each column. – pault
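A minimal sketch of that suggestion, again assuming an existing SparkSession named spark and the hypothetical data.csv:

# With inferSchema Spark samples the data and guesses a type per column
df = spark.read.csv("data.csv", header=True, inferSchema=True)
df.printSchema()  # columns that parse cleanly as numbers come back as int/double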

1 Answer


In Scala, a DataFrame has a "schema" field; I would guess Python has the same:

df.schema.fields.find(_.name == "label").get.dataType
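For completeness, a PySpark sketch of the same lookup, assuming a column named label:

from pyspark.sql.types import DoubleType  # only needed for the comparison below

# Look up a single field's type by name in the schema
print(df.schema["label"].dataType)                  # e.g. DoubleType()
print(df.schema["label"].dataType == DoubleType())  # True/False check

# Or via the (name, type string) pairs exposed by dtypes
print(dict(df.dtypes)["label"])                     # e.g. 'double'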