Coming from a Pandas background, I am struggling with a simple operation in Spark 2.0.2 (PySpark): how do I add a new column to an existing DataFrame that contains the first two characters of an existing column? Ideally, what I want is
df.withColumn("new_column", col('FirstName')[:2])
which Spark 2.0.2 obviously cannot understand.
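My guess is that the slice has to be spelled out as a column expression rather than Python slicing. Something like substring from pyspark.sql.functions (or the Column.substr method) may be what I am missing, though I have not confirmed either against 2.0.2:

from pyspark.sql.functions import substring
# hoped-for equivalent of the Pandas-style slice above
df.withColumn("new_column", substring(df['FirstName'], 1, 2))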
Data Source
df = sqlContext.createDataFrame(
    [("Ranjeet", 4, "California", 2),
     ("Anthony", 5, "Hawaii", 3),
     ("Diana", 6, "Singapore", 4)],
    ["FirstName", "Sales", "State", "Cost"])
df.show()
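For reference, df.show() prints:

+---------+-----+----------+----+
|FirstName|Sales|     State|Cost|
+---------+-----+----------+----+
|  Ranjeet|    4|California|   2|
|  Anthony|    5|    Hawaii|   3|
|    Diana|    6| Singapore|   4|
+---------+-----+----------+----+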
Attempt #1
from pyspark.sql.functions import col
df.withColumn("new_column", col('FirstName').map(lambda x: x[0][:2]))
Error: TypeError: 'Column' object is not callable
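I can get the result by wrapping the slice in a UDF (sketch below), but registering a Python UDF for a two-character substring feels heavy-handed:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# wrap the Python slice in a UDF and apply it column-wise
first_two = udf(lambda s: s[:2], StringType())
df.withColumn("new_column", first_two(df['FirstName'])).show()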
Attempt #2 - this creates a new DataFrame containing the first two characters of FirstName, but how do I merge it back into df?
from pyspark.sql import Row
row = Row("block_x")
df.select('FirstName').rdd.map(lambda x: x[0][:2]).map(row).toDF().show()
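The only merge I can come up with is indexing both sides with zipWithIndex and joining on the index, which seems far too convoluted for such a small task (and assumes zipWithIndex keeps the two sides aligned):

from pyspark.sql import Row

# index the original rows and the computed prefixes, then join on the index
with_idx = df.rdd.zipWithIndex().map(lambda r: Row(idx=r[1], **r[0].asDict())).toDF()
prefixes = (df.select('FirstName').rdd
              .map(lambda x: x[0][:2])
              .zipWithIndex()
              .map(lambda r: Row(idx=r[1], new_column=r[0]))
              .toDF())
with_idx.join(prefixes, 'idx').drop('idx').show()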
Thank you for your time.
Looking for a simple and elegant solution; maybe I am missing something here.