1 vote

Experts, I have a simple requirement but am not able to find the function to achieve the goal.

I am using PySpark (Spark 1.6 & Python 2.7) and have a simple PySpark DataFrame column with values like:

1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC
1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234
1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678

The common pattern in these values is that there is a single underscore, followed by some characters (there can be any number of them). Those trailing characters are what I want in the output. I want to use a substring or regex function that finds the position of the underscore in the column value and selects everything from the position after the underscore to the end. The output would be a DataFrame with values like:

ABC
1234
12345678

I tried using substring, but couldn't find a way to get the index of the underscore.

Thanks!

Comment: So you want the string from the first underscore to the end of the line? – Piyush Patel

2 Answers

2 votes

You can use regexp_extract to extract the data after the underscore.

Example:

from pyspark.sql.functions import col, regexp_extract

# build a sample DataFrame with one row per input value
df = spark.sql("""select stack(3,
    "1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC",
    "1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234",
    "1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678") as (txt)""")

# group 1 of "_(.*)" captures everything after the first underscore
df.withColumn("extract", regexp_extract(col("txt"), "_(.*)", 1)).show(10, False)

Result:

+-------------------------------------------------------------------+--------+
|txt                                                                |extract |
+-------------------------------------------------------------------+--------+
|1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC     |ABC     |
|1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234    |1234    |
|1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678|12345678|
+-------------------------------------------------------------------+--------+
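As a quick local sanity check (plain Python outside Spark; the pattern is the same one passed to regexp_extract), the `_(.*)` regex behaves like this:

```python
import re

samples = [
    "1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC",
    "1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234",
    "1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678",
]

# group 1 of "_(.*)" is everything after the first underscore,
# matching what regexp_extract(col, "_(.*)", 1) returns per row
extracted = [re.search(r"_(.*)", s).group(1) for s in samples]
print(extracted)  # ['ABC', '1234', '12345678']
```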
4 votes

No need to use any regexp.

Please try the below. Basically it splits on the _ character and gets the second item via getItem():

>>> from pyspark.sql import functions as F
>>> my_list = [("1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC",),("1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234",),("1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678",)]
>>> my_df = spark.createDataFrame(my_list, schema=["input_v"])
>>> my_df.withColumn("get_val", F.split(F.col("input_v"),"_").getItem(1)).show(20,False)

Result

+-------------------------------------------------------------------+--------+
|input_v                                                            |get_val |
+-------------------------------------------------------------------+--------+
|1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC     |ABC     |
|1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234    |1234    |
|1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678|12345678|
+-------------------------------------------------------------------+--------+
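The same split-and-take-second-item logic can be checked locally in plain Python (a sketch outside Spark; str.partition mirrors splitting on the first underscore):

```python
samples = [
    "1849adb0-gfhe6543-bduyre763ryi-hjdsgf87qwefdb-78a9f4811265_ABC",
    "1849adb0-rdty4545y4-657u5h556-zsdcafdqwddqdas-78a9f4811265_1234",
    "1849adb0-89o8iulk89o89-89876h5-432rebm787rrer-78a9f4811265_12345678",
]

# partition("_") splits on the first underscore; index 2 is everything after it
results = [s.partition("_")[2] for s in samples]
print(results)  # ['ABC', '1234', '12345678']
```

One caveat: if a value ever contained a second underscore, split(...).getItem(1) would return only the segment between the first two underscores, whereas the regexp_extract answer would return everything after the first one.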