0
votes

Hi I'm new to spark and I wanted to know how I can do string manipulation so Column1 - Column2 to get column3.

Note: My data is in data frame

So basically I have two different column string and I wanted to get only the string that is existed in column2 but not in column 1 to I can produce it as column3

Column1
SAMPLE_OUT_3_APPLE|BANANA|GUAVA|ORANGE

Column2
SAMPLE_OUT_3_APPLE|BANANA|GUAVA|GRAPES|ORANGE|BERRY

Then Column3 should be...

Column3
GRAPES,BERRY

but for column1 and column2 I also wanted to show

APPLE,BANANA,ORANGE 

Just removing the SAMPLE_OUT_3 and doing having comma delimited

3
can you share the schema of dataframe ?koiralo
@koiralo |-- column1: string (nullable = true) |-- column2: string (nullable = true)Potatooo

3 Answers

1
votes

you can split your column by '|' like below import spark.implicits._

val df = mainDf.select("Column1","Column2").map(x => {
   val s1 = x.getAsString(0).replaceAll("^.*3_","").split("|");
   val s2 = x.getAsString(1).replaceAll("^.*3_","").split("|");
   (x.getAsString(0),x.getAsString(1),s2.diff(s1).union(s1.diff(s2)))
}
).toDF("Column1","Column2","Column3")
1
votes

You can also achieve your purpose through regexp_replace and udf.

  1. regexp_replace to replace "|" with "," and ".*3_" with ""
  2. udf to get the value of column3 from column2 and column1
val df1 = Seq(("SAMPLE_OUT_3_APPLE|BANANA|GUAVA|ORANGE" ,"SAMPLE_OUT_3_APPLE|BANANA|GUAVA|GRAPES|ORANGE|BERRY")).toDF("column1","column2")           
val df2 =df1.columns.foldLeft(df) { (memoDF, colName) =>
            memoDF.withColumn(
            colName,
            regexp_replace(regexp_replace(col(colName), "\\|", ","),".*3_",""))}

val diff_udf = udf { ( a:  String, b:  String) => (a.split(",") diff b.split(",")).mkString(",") }

df2.withColumn("column3", diff_udf(col("column2"), col("column1"))).show(false)

Output:

+-------------------------+--------------------------------------+------------+
|column1                  |column2                               |column3     |
+-------------------------+--------------------------------------+------------+
|APPLE,BANANA,GUAVA,ORANGE|APPLE,BANANA,GUAVA,GRAPES,ORANGE,BERRY|GRAPES,BERRY|
+-------------------------+--------------------------------------+------------+
0
votes

For Spark >= 2.4

You can use array_except

import spark.implicits._

val df = Seq(
  ("SAMPLE_OUT_3_APPLE|BANANA|GUAVA|ORANGE" ,"SAMPLE_OUT_3_APPLE|BANANA|GUAVA|GRAPES|ORANGE|BERRY")
).toDF("column1", "column2")

val remove = df.columns.map(column => split(col(column), "3_").getItem(1).as(column))

val resultDF = df.select(remove: _*)
  .withColumn("column1", split($"column1", "\\|"))
  .withColumn("column2", split($"column2", "\\|"))
  .withColumn("column3", array_except($"column2", $"column1"))
  .withColumn("column1", array_except($"column1", $"column3"))
  .withColumn("column2", array_except($"column2", $"column3"))

val convertToString = resultDF.columns.map(column => concat_ws("|", col(column)).as(column))
resultDF.select(convertToString: _*).show(false)

Output:

+-------------------------+-------------------------+------------+
|column1                  |column2                  |column3     |
+-------------------------+-------------------------+------------+
|APPLE|BANANA|GUAVA|ORANGE|APPLE|BANANA|GUAVA|ORANGE|GRAPES|BERRY|
+-------------------------+-------------------------+------------+