In PySpark 1.6, there is currently no built-in Spark function to convert a DataFrame column from string to float/double.
Assume we have an RDD of ('house_name', 'price') tuples, with both values stored as strings, and we want to convert price from string to float. With a plain PySpark RDD, we can apply map together with Python's float function:
New_RDD = RawDataRDD.map(lambda (house_name, price): (house_name, float(price)))  # this works
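(Tuple unpacking in a lambda is Python 2 only, so for reference here is a sketch of the same conversion that also runs under Python 3, using the same RawDataRDD:)

# index into the (house_name, price) tuple instead of unpacking it
New_RDD = RawDataRDD.map(lambda row: (row[0], float(row[1])))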
With a PySpark 1.6 DataFrame, the equivalent does not work:
New_DF = rawdataDF.select('house name', float('price')) # did not work
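As far as I can tell, this fails on the Python side before Spark is even involved: the built-in float() is applied eagerly to the literal string 'price', not lazily to the column values. A minimal illustration:

float('price')  # raises ValueError: could not convert string to float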
Until a built-in PySpark function is available, how can I achieve this conversion with a UDF? I developed the following conversion UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType  # FloatType, since the UDF returns a float

def string_to_float(x):
    return float(x)

udfstring_to_float = udf(string_to_float, FloatType())
converted_df = rawdata.withColumn("price", udfstring_to_float("price"))  # replace the string price column with the float version
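For reference, this is how I sanity-check the result (using the converted_df name from above):

converted_df.printSchema()  # price should now be reported as float
converted_df.show(5)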
Is there a better and simpler way to achieve the same result?