
I have a scenario where I need to convert an Informatica mapping (source and target: SQL Server) into PySpark code (source: blob file, target: Hive). In an Expression transformation, one column uses the REG_EXTRACT function, and I need to convert it to a PySpark DataFrame expression. My final goal is to create the same table in Hive as it exists in SQL Server.

What is the PySpark replacement for the REG_EXTRACT function? I am using PySpark 2.

Below is the code from the Informatica Expression transformation (for one variable port):

LTRIM(RTRIM(IIF(instr(v_DATE,'AMENDED')>0,
reg_Extract(DATE,'.*(^\w+\s+[0-9]{2}[,]\s+[0-9]{4}|^\w+\s+[0-9]{1}[,]\s+[0-9]{4}).*'),
reg_Extract(DATE,'.*((\s0?[1-9]|1[012])[./-](0?[1-9]|[12][0-9]|3[01])[./-][0-9]{2,4}|(^0?[1-9]|1[012])[./-](0?[1-9]|[12][0-9]|3[01])[./-][0-9]{2,4}|(0[1-9]|[12][0-9]|3[01])[./-](0?[1-9]|1[012])[./-][0-9]{2,4}|\s\w+\s+(0?[1-9]|[12][0-9]|3[01])[.,](\s+)?[0-9]{4}|^\w+\s+(0?[1-9]|[12][0-9]|3[01])[.,](\s+)?[0-9]{4}|^(19|20)[0-9]{2}|^[0-9]{2}\s+\w+\s+[0-9]{4}|^[0-9]{6}|^(0?[1-9]|[12][0-9]|3[01])\s+\w+[.,]?\s+(19|20)[0-9]{2}|^[0-9]{1,2}[-,/]\w+[-,/][0-9]{2,4}).*'))))

In PySpark, I have loaded the source file into a DataFrame and selected the required columns, but after that I am unable to proceed.

# Read the source blob file into a DataFrame and register it as a temp view
input_data = spark.read.csv(file_path, header=True)
input_data.createOrReplaceTempView("input_data")

# Select the required columns (via spark.sql rather than the older sqlContext)
df_test = "select ACCESSION_NUMBER, DATE, REPORTING_PERSON from input_data"
df = spark.sql(df_test)

I am new to PySpark/Spark SQL. Please help.

Comment: You may want to look here: stackoverflow.com/questions/46410887/… – Rafaël

1 Answer


You can use regexp_extract from pyspark.sql.functions:

from pyspark.sql.functions import col, regexp_extract

# Extract capture group 1 of the pattern; use a raw string so the backslashes survive
df = df.withColumn('New_Column_Name',
                   regexp_extract(col('DATE'),
                                  r'.*(^\w+\s+[0-9]{2}[,]\s+[0-9]{4}|^\w+\s+[0-9]{1}[,]\s+[0-9]{4}).*',
                                  1))
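
To reproduce the rest of the Informatica expression as well, here is a minimal sketch, assuming the DataFrame df from the question: when/otherwise, instr and trim from pyspark.sql.functions stand in for IIF, instr and LTRIM(RTRIM(...)). Note two assumptions: the original IIF tests the variable port v_DATE, while this sketch tests the raw DATE column (adjust if v_DATE is derived differently); and the output column name v_DATE plus the Hive table name my_db.my_table are placeholders. The final saveAsTable call assumes a Hive-enabled SparkSession.

from pyspark.sql.functions import col, instr, regexp_extract, trim, when

# Patterns copied verbatim from the two reg_Extract calls
# (raw strings so the backslashes reach Spark's Java regex engine intact)
amended_pattern = r'.*(^\w+\s+[0-9]{2}[,]\s+[0-9]{4}|^\w+\s+[0-9]{1}[,]\s+[0-9]{4}).*'
default_pattern = (
    r'.*((\s0?[1-9]|1[012])[./-](0?[1-9]|[12][0-9]|3[01])[./-][0-9]{2,4}'
    r'|(^0?[1-9]|1[012])[./-](0?[1-9]|[12][0-9]|3[01])[./-][0-9]{2,4}'
    r'|(0[1-9]|[12][0-9]|3[01])[./-](0?[1-9]|1[012])[./-][0-9]{2,4}'
    r'|\s\w+\s+(0?[1-9]|[12][0-9]|3[01])[.,](\s+)?[0-9]{4}'
    r'|^\w+\s+(0?[1-9]|[12][0-9]|3[01])[.,](\s+)?[0-9]{4}'
    r'|^(19|20)[0-9]{2}'
    r'|^[0-9]{2}\s+\w+\s+[0-9]{4}'
    r'|^[0-9]{6}'
    r'|^(0?[1-9]|[12][0-9]|3[01])\s+\w+[.,]?\s+(19|20)[0-9]{2}'
    r'|^[0-9]{1,2}[-,/]\w+[-,/][0-9]{2,4}).*'
)

# IIF(instr(...,'AMENDED')>0, ..., ...)  ->  when(...).otherwise(...)
# LTRIM(RTRIM(...))                      ->  trim(...)
# reg_Extract(DATE, pattern)             ->  regexp_extract(col('DATE'), pattern, 1)
df = df.withColumn(
    'v_DATE',  # placeholder output column name
    trim(
        when(instr(col('DATE'), 'AMENDED') > 0,
             regexp_extract(col('DATE'), amended_pattern, 1))
        .otherwise(regexp_extract(col('DATE'), default_pattern, 1))
    )
)

# Write the result to the Hive target (placeholder table name)
df.write.mode('overwrite').saveAsTable('my_db.my_table')

One difference to be aware of: Spark's regexp_extract returns an empty string for rows where the pattern does not match, which may differ from REG_EXTRACT's NULL handling, so add a when/nullif step if you need to preserve NULLs in the Hive table.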

Related question