1
votes

I am new to Scala and Spark. I want to derive a new column from existing columns of a DataFrame by computing the edit distance between them. For example, given two columns FNAME and LNAME, I want to add a new column called NAMESCORE that holds the edit distance between FNAME and LNAME. Please advise with working code or pseudocode.

Here is a link where I found a partial answer:

Derive multiple columns from a single column in a Spark DataFrame

2 Answers

0
votes

You can use a UDF. First define a function that computes the edit distance:

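import org.apache.commons.lang3.StringUtils  // assumption: commons-lang3, which Spark depends on, is on the classpath

def udfToFindEditDistance(col1: String, col2: String): Int = {
  // Levenshtein edit distance between col1 and col2 (throws on null input)
  StringUtils.getLevenshteinDistance(col1, col2)
}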

Wrap the function as a UDF:

import org.apache.spark.sql.functions.udf
val newUDF = udf(udfToFindEditDistance(_: String, _: String))

Add the new column:

val newDf = df.withColumn("newColumnName", newUDF(df("FNAME"), df("LNAME")))
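
For reference, the edit distance itself can be computed in plain Scala with a standard dynamic-programming table. The following is a minimal sketch, assuming the usual Levenshtein definition (insertions, deletions, and substitutions each cost 1); the names EditDistance and distance are illustrative and not part of Spark or of the answer above.

```scala
object EditDistance {
  /** Minimum number of single-character insertions, deletions, and
    * substitutions needed to turn `a` into `b`. Does not handle null input. */
  def distance(a: String, b: String): Int = {
    // prev(j) = distance between the first (i - 1) chars of a and the first j chars of b
    var prev: Array[Int] = (0 to b.length).toArray
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i // turning the first i chars of a into "" takes i deletions
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(
          math.min(curr(j - 1) + 1, prev(j) + 1), // insertion, deletion
          prev(j - 1) + cost                      // substitution or match
        )
      }
      prev = curr
    }
    prev(b.length)
  }
}
```

A function like this can be passed to udf in the same way as above, for example udf(EditDistance.distance _). Rows with null names would need to be filtered or handled separately, since the sketch does not check for null.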
0
votes

You can use the built-in levenshtein function from org.apache.spark.sql.functions to compute the edit distance, so no UDF is needed.

https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html#levenshtein(org.apache.spark.sql.Column,%20org.apache.spark.sql.Column)

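import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.levenshtein

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._  // enables the $"column" syntax
val df = sqlContext.read
    .format("com.crealytics.spark.excel")
    ...
    .load(...)
// levenshtein returns an integer column holding the edit distance between the two inputs
df.withColumn("NAME_DISTANCE", levenshtein($"Name_Left", $"Name_Right"))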
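
Applied to the column names from the question, the same call might look like the sketch below. It assumes Spark 2.x or later (SparkSession rather than SQLContext); the local session, the sample rows, and the names NameScore and withNameScore are illustrative assumptions, not part of the answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, levenshtein}

object NameScore {
  // Adds NAMESCORE = edit distance between FNAME and LNAME (column names from the question).
  def withNameScore(df: DataFrame): DataFrame =
    df.withColumn("NAMESCORE", levenshtein(col("FNAME"), col("LNAME")))

  def main(args: Array[String]): Unit = {
    // Local session and sample data are assumptions made for illustration only.
    val spark = SparkSession.builder().master("local[*]").appName("NameScore").getOrCreate()
    import spark.implicits._

    val df = Seq(("John", "Jon"), ("kitten", "sitting")).toDF("FNAME", "LNAME")
    withNameScore(df).show()
    spark.stop()
  }
}
```

Using the built-in function keeps the computation inside Spark's own expression engine, and it returns null when either input is null instead of throwing.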