Starting from version 1.5, Spark SQL provides two specific functions for trimming white space, ltrim
and rtrim
(search for "trim" in the DataFrame documentation); you'll need to import pyspark.sql.functions
first. Here is an example:
from pyspark.sql import SQLContext
from pyspark.sql.functions import *
sqlContext = SQLContext(sc)
df = sqlContext.createDataFrame([(' 2015-04-08 ',' 2015-05-10 ')], ['d1', 'd2']) # create a dataframe - notice the extra whitespaces in the date strings
df.collect()
# [Row(d1=u' 2015-04-08 ', d2=u' 2015-05-10 ')]
df = df.withColumn('d1', ltrim(df.d1)) # trim left whitespace from column d1
df.collect()
# [Row(d1=u'2015-04-08 ', d2=u' 2015-05-10 ')]
df = df.withColumn('d1', rtrim(df.d1)) # trim right whitespace from d1
df.collect()
# [Row(d1=u'2015-04-08', d2=u' 2015-05-10 ')]