5 votes

In PySpark 1.6, there is currently no built-in DataFrame function to convert a column from string to float/double.

Assume we have an RDD of ('house_name', 'price') pairs, with both values as strings, and we would like to convert price from string to float. In PySpark, we can apply map with Python's float function to achieve this:

New_RDD = RawDataRDD.map(lambda (house_name, price): (house_name, float(price)))  # this works (Python 2 tuple unpacking)
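For instance, a minimal runnable sketch (the sample rows and the SparkContext sc are assumptions for illustration, not from the original question):

RawDataRDD = sc.parallelize([('villa', '350000.0'), ('cottage', '120000.0')])  # hypothetical sample data
New_RDD = RawDataRDD.map(lambda hp: (hp[0], float(hp[1])))  # index-based form works in both Python 2 and 3
print(New_RDD.collect())  # [('villa', 350000.0), ('cottage', 120000.0)]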

In a PySpark 1.6 DataFrame, the same approach does not work:

New_DF = rawdataDF.select('house name', float('price'))  # does not work: float('price') raises a ValueError before Spark is even involved

Until a built-in PySpark function is available, how can this conversion be achieved with a UDF? I developed the following conversion UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def string_to_float(x):
    return float(x)

udfstring_to_float = udf(string_to_float, FloatType())  # return type must match what the function returns
rawdata = rawdata.withColumn("price", udfstring_to_float("price"))
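Note that float(x) raises an error on None or malformed strings, which would fail the whole job. A slightly more defensive variant of the same UDF (a sketch, not part of the original question) returns None instead:

def string_to_float_safe(x):
    # Hypothetical variant: yield None for missing or malformed values
    try:
        return float(x)
    except (TypeError, ValueError):
        return None

udfstring_to_float_safe = udf(string_to_float_safe, FloatType())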

Is there a better and simpler way to achieve the same result?


2 Answers

5 votes

According to the documentation, you can use the cast function on a column like this:

rawdata.withColumn("house name", rawdata["price"].cast(DoubleType()).alias("price"))
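For example, a self-contained sketch (the sample rows are hypothetical, and sqlContext is assumed to be a Spark 1.6 SQLContext):

from pyspark.sql.types import DoubleType

rawdata = sqlContext.createDataFrame(
    [('villa', '350000.0'), ('cottage', '120000.0')],  # hypothetical sample data
    ['house name', 'price'])

rawdata = rawdata.withColumn("price", rawdata["price"].cast(DoubleType()))
rawdata.printSchema()
# root
#  |-- house name: string (nullable = true)
#  |-- price: double (nullable = true)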
2 votes

The answer should be as follows:

>>> rawdata.printSchema()
root
 |-- house name: string (nullable = true)
 |-- price: string (nullable = true)

>>> rawdata=rawdata.withColumn('price',rawdata['price'].cast("float").alias('price'))

>>> rawdata.printSchema()
root
 |-- house name: string (nullable = true)
 |-- price: float (nullable = true)

This is the shortest one-liner and does not require a user-defined function, and you can confirm that the conversion worked by calling printSchema().