
I have a pyspark dataframe like this:

spark.createDataFrame(
    [
        (1, '1234ESPNnonzodiac'), 
        (2, '1234ESPNzodiac'),
        (3, '963CNNnonzodiac'), 
        (4, '963CNNzodiac'),
    ],
    ['id', 'col1'] 
)

I would like to create a new column where I split col1 on the words zodiac or nonzodiac, so that I can eventually groupby this new column.

I would like the final output to be like this:

spark.createDataFrame(
    [
        (1, '1234ESPNnonzodiac', '1234ESPN'), 
        (2, '1234ESPNzodiac', '1234ESPN'),
        (3, '963CNNnonzodiac', '963CNN'), 
        (4, '963CNNzodiac', '963CNN'),
    ],
    ['id', 'col1', 'col2'] 
)

1 Answer


I would use regexp_extract from pyspark.sql.functions to capture everything before the optional "non" and the "zodiac" suffix:

from pyspark.sql.functions import regexp_extract

# Group 1 lazily captures everything before an optional "non" followed by "zodiac"
df.withColumn("col2", regexp_extract(df.col1, r"([\s\S]+?)(?:non)?zodiac", 1)).show()
+---+-----------------+--------+
| id|             col1|    col2|
+---+-----------------+--------+
|  1|1234ESPNnonzodiac|1234ESPN|
|  2|   1234ESPNzodiac|1234ESPN|
|  3|  963CNNnonzodiac|  963CNN|
|  4|     963CNNzodiac|  963CNN|
+---+-----------------+--------+
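If you want to sanity-check the pattern without spinning up Spark, Python's built-in `re` module interprets this regex the same way (a minimal sketch; the DataFrame and Spark session are not needed here):

```python
import re

# Same pattern as in regexp_extract: group 1 lazily captures everything
# before an optional "non" followed by "zodiac".
pattern = re.compile(r"([\s\S]+?)(?:non)?zodiac")

for s in ["1234ESPNnonzodiac", "1234ESPNzodiac", "963CNNnonzodiac", "963CNNzodiac"]:
    print(pattern.match(s).group(1))
# Prints: 1234ESPN, 1234ESPN, 963CNN, 963CNN (one per line)
```

From there you can group on the new column as intended, e.g. `df.groupBy("col2").count()`.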