5 votes

In PySpark 1.6, there is currently no built-in DataFrame function to convert a column from string to float/double.

Assume we have an RDD of ('house_name', 'price') pairs, with both values as strings, and we would like to convert price from string to float. In PySpark, we can apply map with Python's float function to achieve this:

New_RDD = RawDataRDD.map(lambda (house_name, price): (house_name, float(price)))  # this works (Python 2 tuple unpacking)
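For instance, a minimal runnable sketch (the sample rows and the SparkContext sc are assumptions for illustration, not from the original question):

RawDataRDD = sc.parallelize([('villa', '350000.0'), ('cottage', '120000.0')])  # hypothetical sample data
New_RDD = RawDataRDD.map(lambda hp: (hp[0], float(hp[1])))  # index-based form works in both Python 2 and 3
print(New_RDD.collect())  # [('villa', 350000.0), ('cottage', 120000.0)]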

In a PySpark 1.6 DataFrame, the same approach does not work:

New_DF = rawdataDF.select('house name', float('price'))  # does not work: float('price') raises a ValueError before Spark is even involved

Until a built-in PySpark function is available, how can this conversion be achieved with a UDF? I developed the following conversion UDF:

from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def string_to_float(x):
    return float(x)

udfstring_to_float = udf(string_to_float, FloatType())  # return type must match what the function returns
rawdata = rawdata.withColumn("price", udfstring_to_float("price"))
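Note that float(x) raises an error on None or malformed strings, which would fail the whole job. A slightly more defensive variant of the same UDF (a sketch, not part of the original question) returns None instead:

def string_to_float_safe(x):
    # Hypothetical variant: yield None for missing or malformed values
    try:
        return float(x)
    except (TypeError, ValueError):
        return None

udfstring_to_float_safe = udf(string_to_float_safe, FloatType())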

Is there a better and simpler way to achieve the same result?


2 Answers

5 votes

According to the documentation, you can use the cast function on a column like this:

rawdata.withColumn("house name", rawdata["price"].cast(DoubleType()).alias("price"))
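For example, a self-contained sketch (the sample rows are hypothetical, and sqlContext is assumed to be a Spark 1.6 SQLContext):

from pyspark.sql.types import DoubleType

rawdata = sqlContext.createDataFrame(
    [('villa', '350000.0'), ('cottage', '120000.0')],  # hypothetical sample data
    ['house name', 'price'])

rawdata = rawdata.withColumn("price", rawdata["price"].cast(DoubleType()))
rawdata.printSchema()
# root
#  |-- house name: string (nullable = true)
#  |-- price: double (nullable = true)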
2 votes

The answer should be as follows:

>>> rawdata.printSchema()
root
 |-- house name: string (nullable = true)
 |-- price: string (nullable = true)

>>> rawdata=rawdata.withColumn('price',rawdata['price'].cast("float").alias('price'))

>>> rawdata.printSchema()
root
 |-- house name: string (nullable = true)
 |-- price: float (nullable = true)

This is the shortest one-liner and does not require a user-defined function, and you can confirm that the conversion worked by calling printSchema().