1 vote

Experts, I have a simple requirement but am not able to find the function to achieve the goal.

I am using PySpark (Spark 1.6 & Python 2.7) and have a simple PySpark DataFrame column with values like:

1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC
1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234
1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678

The common pattern in these values is that there is a single underscore, followed by some characters (there can be any number of them). Those trailing characters are what I want in the output. I want to use a substring or regex function that finds the position of the underscore in the column value and selects everything from the position after the underscore to the end. The output would be a DataFrame with values like:

ABC
1234
12345678

I tried using substring, but couldn't find a way to get the index of the underscore.

Thanks!

Comment: So you want the string from the first underscore to the end of the line? – Piyush Patel

2 Answers

2 votes

You can use regexp_extract to extract the data after the underscore.

Example:

from pyspark.sql.functions import col, regexp_extract

# build a sample DataFrame with one row per input value
df = spark.sql("""select stack(3,
    "1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC",
    "1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234",
    "1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678") as (txt)""")

# group 1 of "_(.*)" captures everything after the first underscore
df.withColumn("extract", regexp_extract(col("txt"), "_(.*)", 1)).show(10, False)

Result:

+-------------------------------------------------------------------+--------+
|txt                                                                |extract |
+-------------------------------------------------------------------+--------+
|1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC     |ABC     |
|1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234    |1234    |
|1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678|12345678|
+-------------------------------------------------------------------+--------+
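As a quick local sanity check (plain Python outside Spark; the pattern is the same one passed to regexp_extract), the `_(.*)` regex behaves like this:

```python
import re

samples = [
    "1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC",
    "1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234",
    "1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678",
]

# group 1 of "_(.*)" is everything after the first underscore,
# matching what regexp_extract(col, "_(.*)", 1) returns per row
extracted = [re.search(r"_(.*)", s).group(1) for s in samples]
print(extracted)  # ['ABC', '1234', '12345678']
```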
4 votes

No need to use any regexp.

Please try the below. Basically it splits on the _ character and gets the second item via getItem():

>>> from pyspark.sql import functions as F
>>> my_list = [("1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC",),("1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234",),("1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678",)]
>>> my_df = spark.createDataFrame(my_list, schema=["input_v"])
>>> my_df.withColumn("get_val", F.split(F.col("input_v"),"_").getItem(1)).show(20,False)

Result

+-------------------------------------------------------------------+--------+
|input_v                                                            |get_val |
+-------------------------------------------------------------------+--------+
|1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC     |ABC     |
|1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234    |1234    |
|1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678|12345678|
+-------------------------------------------------------------------+--------+
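The same split-and-take-second-item logic can be checked locally in plain Python (a sketch outside Spark; str.partition mirrors splitting on the first underscore):

```python
samples = [
    "1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC",
    "1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234",
    "1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678",
]

# partition("_") splits on the first underscore; index 2 is everything after it
results = [s.partition("_")[2] for s in samples]
print(results)  # ['ABC', '1234', '12345678']
```

One caveat: if a value ever contained a second underscore, split(...).getItem(1) would return only the segment between the first two underscores, whereas the regexp_extract answer would return everything after the first one.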