55
votes

I have dataframe in pyspark. Some of its numerical columns contain 'nan' so when I am reading the data and checking for the schema of dataframe, those columns will have 'string' type. How I can change them to int type.I replaced the 'nan' values with 0 and again checked the schema, but then also it's showing the string type for those columns.I am following the below code:

data_df = sqlContext.read.format("csv").load('data.csv',header=True, inferSchema="true")
data_df.printSchema()
data_df = data_df.fillna(0)
data_df.printSchema()

my data looks like this: enter image description here

here columns 'Plays' and 'drafts' containing integer values but because of nan present in these columns,they are treated as string type.

3
Is there any way by which while reading the data only we can replace the nan so that in schema those columns will be treated as int type. - neha
you will have to make the whole column to be a integer in your case I believe. - Chetan_Vasudevan

3 Answers

107
votes
from pyspark.sql.types import IntegerType
data_df = data_df.withColumn("Plays", data_df["Plays"].cast(IntegerType()))
data_df = data_df.withColumn("drafts", data_df["drafts"].cast(IntegerType()))

You can run loop for each column but this is the simplest way to convert string column into integer.

15
votes

You could use cast(as int) after replacing NaN with 0,

data_df = df.withColumn("Plays", df.call_time.cast('float'))
4
votes

Another way to do it is using the StructField if you have multiple fields that needs to be modified.

Ex:

from pyspark.sql.types import StructField,IntegerType, StructType,StringType
newDF=[StructField('CLICK_FLG',IntegerType(),True),
       StructField('OPEN_FLG',IntegerType(),True),
       StructField('I1_GNDR_CODE',StringType(),True),
       StructField('TRW_INCOME_CD_V4',StringType(),True),
       StructField('ASIAN_CD',IntegerType(),True),
       StructField('I1_INDIV_HHLD_STATUS_CODE',IntegerType(),True)
       ]
finalStruct=StructType(fields=newDF)
df=spark.read.csv('ctor.csv',schema=finalStruct)

Output:

Before

root
 |-- CLICK_FLG: string (nullable = true)
 |-- OPEN_FLG: string (nullable = true)
 |-- I1_GNDR_CODE: string (nullable = true)
 |-- TRW_INCOME_CD_V4: string (nullable = true)
 |-- ASIAN_CD: integer (nullable = true)
 |-- I1_INDIV_HHLD_STATUS_CODE: string (nullable = true)

After:

root
 |-- CLICK_FLG: integer (nullable = true)
 |-- OPEN_FLG: integer (nullable = true)
 |-- I1_GNDR_CODE: string (nullable = true)
 |-- TRW_INCOME_CD_V4: string (nullable = true)
 |-- ASIAN_CD: integer (nullable = true)
 |-- I1_INDIV_HHLD_STATUS_CODE: integer (nullable = true)

This is slightly a long procedure to cast , but the advantage is that all the required fields can be done.

It is to be noted that if only the required fields are assigned the data type, then the resultant dataframe will contain only those fields which are changed.