
Coming from a Pandas background, I am struggling with a simple operation in Spark 2.0.2 (PySpark): how do I add a new column to an existing DataFrame that contains the first two characters of an existing column? Ideally, what I want is essentially

df.withColumn("new_column", col('FirstName')[:2])

which obviously Spark 2.0.2 cannot understand.
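
For comparison, this is the sort of one-liner I would write in Pandas (pdf here is a hypothetical Pandas DataFrame holding the same data):

import pandas as pd

pdf = pd.DataFrame({"FirstName": ["Ranjeet", "Anthony", "Diana"]})
pdf['new_column'] = pdf['FirstName'].str[:2]  # vectorized two-character slice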

Data Source

df = sqlContext.createDataFrame([("Ranjeet", 4, "California", 2), ("Anthony", 5, "Hawaii", 3), ("Diana", 6, "Singapore", 4)], ["FirstName", "Sales", "State", "Cost"])
df.show()
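
For reference, df.show() should print something like:

+---------+-----+----------+----+
|FirstName|Sales|     State|Cost|
+---------+-----+----------+----+
|  Ranjeet|    4|California|   2|
|  Anthony|    5|    Hawaii|   3|
|    Diana|    6| Singapore|   4|
+---------+-----+----------+----+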

Attempt #1

from pyspark.sql.functions import col
df.withColumn("new_column", col('FirstName').map(lambda x: x[0][:2]))

Error: TypeError: 'Column' object is not callable
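
As far as I can tell, the error occurs because map is an RDD method, not a Column method, so a Column cannot be fed an arbitrary Python lambda directly. One general workaround is to wrap the slice in a UDF (a sketch only; first_two is just a name I made up, and a UDF is heavier than a built-in function):

from pyspark.sql.functions import col, udf
from pyspark.sql.types import StringType

# the UDF applies the Python slice row by row, guarding against nulls
first_two = udf(lambda s: s[:2] if s is not None else None, StringType())
df.withColumn("new_column", first_two(col('FirstName')))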

Attempt #2 - this creates a new DataFrame containing the first two characters of FirstName, but how do I merge it back into df?

from pyspark.sql import Row
row = Row("block_x")
df.select('FirstName').rdd.map(lambda x: x[0][:2]).map(row).toDF().show()
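
For completeness, one way I can imagine merging such a frame back (a sketch only: it assumes both RDDs preserve the same row order, and it is much clumsier than a direct withColumn) is to zip both sides with an index and join on it:

from pyspark.sql import Row

# index the original rows and the derived values, then join on the index
left = df.rdd.zipWithIndex().map(lambda r: Row(idx=r[1], **r[0].asDict())).toDF()
right = (df.select('FirstName').rdd.map(lambda x: x[0][:2])
         .zipWithIndex().map(lambda r: Row(idx=r[1], block_x=r[0])).toDF())
left.join(right, 'idx').drop('idx').show()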

I am looking for a simple and elegant solution; maybe I am missing something here. Thank you for your time.


1 Answer


You can use the substring function together with withColumn:

from pyspark.sql.functions import substring

# substring uses 1-based positions (pos=0 happens to behave like 1)
df.withColumn("new_column", substring('FirstName', 1, 2))
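
If you would rather stay close to the column-slicing syntax in the question, the Column method substr should be equivalent, as far as I know (like substring, it uses 1-based positions):

from pyspark.sql.functions import col

df.withColumn("new_column", col('FirstName').substr(1, 2))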