Replace words in Data frame using List of words in another Data frame in Spark Scala

Question

I have two dataframes, lets say df1 and df2 in Spark Scala

df1 has two fields, 'ID' and 'Text' where 'Text' has some description (Multiple words). I have already removed all special characters and numeric characters from field 'Text' leaving only alphabets and spaces.

df1 Sample

+--------------++--------------------+
|ID            ||Text                |     
+--------------++--------------------+
| 1            ||helo how are you    |
| 2            ||hai haiden          |
| 3            ||hw are u uma        |
--------------------------------------

df2 contains a list of words and corresponding replacement words

df2 Sample

+--------------++--------------------+
|Word          ||Replace             |     
+--------------++--------------------+
| helo         ||hello               |
| hai          ||hi                  |
| hw           ||how                 |
| u            ||you                 |
--------------------------------------

I would need to find all occurrence of words in df2("Word") from df1("Text") and replace it with df2("Replace")

With the sample dataframes above, I would expect a resulting dataframe, DF3 as given below

df3 Sample

+--------------++--------------------+
|ID            ||Text                |     
+--------------++--------------------+
| 1            ||hello how are you   |
| 2            ||hi haiden           |
| 3            ||how are you uma     |
--------------------------------------

Your help is greatly appreciated in doing the same in Spark using Scala.

philantrovert philantrovert · Accepted Answer · 2017-07-14T09:06:46

It'd be easier to accomplish this if you convert your df2 to a Map. Assuming it's not a huge table, you can do the following :

val keyVal = df2.map( r =>( r(0).toString, r(1).toString ) ).collect.toMap

This will give you a Map to refer to :

scala.collection.immutable.Map[String,String] = Map(helo -> hello, hai -> hi, hw -> how, u -> you)

Now you can use UDF to create a function that will utilize keyVal Map to replace values :

val getVal = udf[String, String] (x => x.split(" ").map(x => res18.get(x).getOrElse(x) ).mkString( " " ) )

Now, you can call the udf getVal on your dataframe to get the desired result.

df1.withColumn("text" , getVal(df1("text")) ).show


+---+-----------------+
| id|             text|
+---+-----------------+
|  1|hello how are you|
|  2|        hi haiden|
|  3|  how are you uma|
+---+-----------------+

Replace words in Data frame using List of words in another Data frame in Spark Scala

3 Answers