3
votes

I'm new to Scala and Spark, and I'm having a problem while trying to learn from some toy dataframes.

I have a dataframe having the following two columns:

 Name_Description        Grade

Name_Description is an array, and Grade is just a letter. It's Name_Description that I'm having trouble with: I'm trying to transform this column using Scala on Spark.

Name_Description is not a fixed-size array. It could be something like

['asdf_ Brandon', 'Ca%abc%rd'] ['fthhhhChris', 'Rock', 'is the %abc%man']

The only problems are the following:

 1. The first element of the array ALWAYS has 6 garbage characters, so the real content starts at the 7th character.

 2. %abc% randomly pops up in elements, so I want to erase it.

Is there any way to achieve these two things in Scala? For instance, I just want

 ['asdf_ Brandon', 'Ca%abc%rd'], ['fthhhhChris', 'Rock', 'is the %abc%man']

to change to

['Brandon', 'Card'], ['Chris', 'Rock', 'is the man']
So is Name_Description an array of arrays of strings? – Krzysztof Atłasik
@KrzysztofAtłasik Each row of Name_Description is an array of strings. ['asdf_ Brandon', 'Ca%abc%rd'] and ['fthhhhChris', 'Rock', 'is the %abc%man'] represent two rows of Name_Description. – user98235

2 Answers

2
votes

What you're trying to do might be hard to achieve with standard Spark functions, but you can define a UDF for it:

import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf

val removeGarbage = udf { arr: WrappedArray[String] =>
  // In case the array is empty, map over the Option
  arr.headOption
    // Drop the first 6 characters from the first element, then remove %abc% from the rest
    .map(head => head.drop(6) +: arr.tail.map(_.replace("%abc%", "")))
    .getOrElse(arr)
}
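The cleaning logic inside the UDF body can be checked without a Spark session at all; here is a minimal plain-Scala sketch of the same function (the `removeGarbage` name mirrors the UDF above, but this version operates on an ordinary `Seq[String]`):

```scala
// Plain-Scala version of the UDF body, for verifying the logic locally.
// Rules from the question: strip the first 6 garbage characters from the
// first element, and delete every "%abc%" occurrence from the remaining ones.
def removeGarbage(arr: Seq[String]): Seq[String] =
  arr.headOption
    .map(head => head.drop(6) +: arr.tail.map(_.replace("%abc%", "")))
    .getOrElse(arr) // empty array: nothing to clean
```

Because `headOption` is `None` for an empty array, the function simply returns empty input unchanged instead of throwing.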

Then you just need to use this UDF on your Name_Description column:

// Requires: import spark.implicits._ (for toDF)
val df = List(
    (1, Array("asdf_ Brandon", "Ca%abc%rd")),
    (2, Array("fthhhhChris", "Rock", "is the %abc%man"))
).toDF("Grade", "Name_Description")

df.withColumn("Name_Description", removeGarbage($"Name_Description")).show(false)

`show` prints:

+-----+-------------------------+
|Grade|Name_Description         |
+-----+-------------------------+
|1    |[Brandon, Card]          |
|2    |[Chris, Rock, is the man]|
+-----+-------------------------+
2
votes

We are always encouraged to use Spark SQL functions and to avoid UDFs where we can. Here is a simplified solution that relies only on Spark SQL functions.

Please find my approach below.

val d = Array((1, Array("asdf_ Brandon", "Ca%abc%rd")), (2, Array("fthhhhChris", "Rock", "is the %abc%man")))

val df = spark.sparkContext.parallelize(d).toDF("Grade","Name_Description")

This is how I created the input dataframe.

df.select('Grade, posexplode('Name_Description)).createOrReplaceTempView("data")

We explode the array along with the position of each element. I register the result as a temporary view so that the required output can be generated with a SQL query.

spark.sql("""
  select Grade, collect_list(Names)
  from (
    select Grade,
           case when pos = 0 then substring(col, 7)
                else replace(col, "%abc%", "")
           end as Names
    from data
  ) a
  group by Grade
""").show

This query produces the required output. Note that `substring(col, 7)` is 1-based, so it drops the first 6 characters, and that `collect_list` does not guarantee element order; if the order of names matters, carry `pos` along and sort on it.
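For readers new to posexplode and collect_list, the pipeline above can be mimicked with plain Scala collections. This is an illustrative analogy only, not Spark code: each (grade, names) row is exploded into (grade, pos, element) triples, cleaned with the same case logic, then re-grouped by grade:

```scala
// Input rows, matching the dataframe built above.
val rows = Seq(
  (1, Seq("asdf_ Brandon", "Ca%abc%rd")),
  (2, Seq("fthhhhChris", "Rock", "is the %abc%man"))
)

// posexplode analogue: one (grade, pos, cleanedElement) triple per array element.
val exploded = for {
  (grade, names) <- rows
  (elem, pos)    <- names.zipWithIndex
} yield (grade, pos, if (pos == 0) elem.drop(6) else elem.replace("%abc%", ""))

// collect_list analogue: group by grade, restoring element order via pos
// (unlike collect_list, the explicit sortBy makes the order deterministic).
val collected = exploded
  .groupBy(_._1)
  .map { case (grade, triples) => grade -> triples.sortBy(_._2).map(_._3) }
```

The explicit `sortBy(_._2)` step highlights what the SQL version leaves implicit: after a group-by, element order is only reliable if you sort on the recorded position.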