
I would like to create a Spark DataFrame in PySpark from a text file that has a varying number of rows and columns, and map it to a key/value pair, where the key is the first 4 characters of the first column of the text file. I want to do that in order to remove the redundant rows and to be able to group them later by the key value. I know how to do this in pandas, but I'm still confused about where to start in PySpark.

My input is a text file that has the following:

  1234567,micheal,male,usa
  891011,sara,femal,germany

I want to be able to group every row by the first six characters in the first column.


1 Answer


Create a new column that contains only the first six characters of the first column, and then group by that:

from pyspark.sql.functions import substring

# substring is 1-based: take characters 1 through 6 of the first column
df2 = df.withColumn("key", substring("first_col", 1, 6))
df2.groupBy("key").agg(...)
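For completeness, here is a minimal end-to-end sketch under some assumptions: the input is a plain comma-separated file with no header (so the column names "id", "name", "gender", "country" and the path "people.txt" are hypothetical placeholders), and the aggregation is just a row count per key:

from pyspark.sql import SparkSession
from pyspark.sql.functions import substring, count

spark = SparkSession.builder.getOrCreate()

# read the comma-separated text file; the file has no header, so assign assumed column names
df = (spark.read.csv("people.txt")
      .toDF("id", "name", "gender", "country"))

# key = first six characters of the first column (substring is 1-based)
keyed = df.withColumn("key", substring("id", 1, 6))

# drop redundant rows, then count how many rows fall under each key
deduped = keyed.dropDuplicates()
deduped.groupBy("key").agg(count("*").alias("rows")).show()

You can swap the count for whatever aggregation you actually need, and dropDuplicates can also take a subset of columns if "redundant" means duplicated on the key rather than on the whole row.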