Trim string column in PySpark dataframe

29

votes

After creating a Spark DataFrame from a CSV file, I would like to trim a column. I've tried:

df = df.withColumn("Product", df.Product.strip())

df is my data frame, Product is a column in my table

But I get the error:

Column object is not callable

Any suggestions?

apache-sparkpysparkapache-spark-sqltrim

29

votes

from pyspark.sql.functions import trim

df = df.withColumn("Product", trim(col("Product")))

19

votes

Starting from version 1.5, Spark SQL provides two specific functions for trimming white space, ltrim and rtrim (search for "trim" in the DataFrame documentation); you'll need to import pyspark.sql.functions first. Here is an example:

 from pyspark.sql import SQLContext
 from pyspark.sql.functions import *
 sqlContext = SQLContext(sc)

 df = sqlContext.createDataFrame([(' 2015-04-08 ',' 2015-05-10 ')], ['d1', 'd2']) # create a dataframe - notice the extra whitespaces in the date strings
 df.collect()
 # [Row(d1=u' 2015-04-08 ', d2=u' 2015-05-10 ')]
 df = df.withColumn('d1', ltrim(df.d1)) # trim left whitespace from column d1
 df.collect()
 # [Row(d1=u'2015-04-08 ', d2=u' 2015-05-10 ')]
 df = df.withColumn('d1', rtrim(df.d1))  # trim right whitespace from d1
 df.collect()
 # [Row(d1=u'2015-04-08', d2=u' 2015-05-10 ')]

16

votes

The pyspark version of the strip function is called trim; it will

Trim the spaces from both ends for the specified string column.

Make sure to import the function first and to put the column you are trimming inside your function.

The following should work:

from pyspark.sql.functions import trim
df = df.withColumn("Product", trim(df.Product))

5

votes

I did that with the udf like this:

from pyspark.sql.functions import udf

def trim(string):
    return string.strip()
trim=udf(trim)

df = sqlContext.createDataFrame([(' 2015-04-08 ',' 2015-05-10 ')], ['d1', 'd2'])

df2 = df.select(trim(df['d1']).alias('d1'),trim(df['d2']).alias('d2'))

output looks like this:

df.show()
df2.show()
+------------+------------+
|          d1|          d2|
+------------+------------+
| 2015-04-08 | 2015-05-10 |
+------------+------------+

+----------+----------+
|        d1|        d2|
+----------+----------+
|2015-04-08|2015-05-10|
+----------+----------+

5

votes

If you need to do it for all columns in the dataframe.

from pyspark.sql import functions as f

for colname in df.columns:
    df = df.withColumn(colname, f.trim(f.col(colname)))

Trim string column in PySpark dataframe

5 Answers