3
votes

I'm new to Scala and Spark, and I'm having a problem while trying to learn from some toy dataframes.

I have a dataframe having the following two columns:

 Name_Description        Grade

Name_Description is an array, and Grade is just a letter. It's Name_Description that I'm having trouble with: I'm trying to transform this column using Scala on Spark.

Name_Description is not a fixed-size array. It could be something like

['asdf_ Brandon', 'Ca%abc%rd'] ['fthhhhChris', 'Rock', 'is the %abc%man']

The only problems are the following:

 1. The first element of the array ALWAYS has 6 garbage characters, so the real content starts at the 7th character.

 2. %abc% randomly pops up in elements, so I want to erase it.

Is there any way to achieve these two things in Scala? For instance, I just want

 ['asdf_ Brandon', 'Ca%abc%rd'], ['fthhhhChris', 'Rock', 'is the %abc%man']

to change to

['Brandon', 'Card'], ['Chris', 'Rock', 'is the man']
So is Name_Description an array of arrays of strings? – Krzysztof Atłasik
@KrzysztofAtłasik Each row of Name_Description is an array of strings. ['asdf_ Brandon', 'Ca%abc%rd'] and ['fthhhhChris', 'Rock', 'is the %abc%man'] represent two rows of Name_Description. – user98235

2 Answers

2
votes

What you're trying to do might be hard to achieve with standard Spark functions, but you can define a UDF for it:

import scala.collection.mutable.WrappedArray
import org.apache.spark.sql.functions.udf

val removeGarbage = udf { arr: WrappedArray[String] =>
  // In case the array is empty, map over the Option
  arr.headOption
    // Drop the first 6 characters from the first element, then remove %abc% from the rest
    .map(head => head.drop(6) +: arr.tail.map(_.replace("%abc%", "")))
    .getOrElse(arr)
}
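The cleaning logic inside the UDF body can be checked without a Spark session at all; here is a minimal plain-Scala sketch of the same function (the `removeGarbage` name mirrors the UDF above, but this version operates on an ordinary `Seq[String]`):

```scala
// Plain-Scala version of the UDF body, for verifying the logic locally.
// Rules from the question: strip the first 6 garbage characters from the
// first element, and delete every "%abc%" occurrence from the remaining ones.
def removeGarbage(arr: Seq[String]): Seq[String] =
  arr.headOption
    .map(head => head.drop(6) +: arr.tail.map(_.replace("%abc%", "")))
    .getOrElse(arr) // empty array: nothing to clean
```

Because `headOption` is `None` for an empty array, the function simply returns empty input unchanged instead of throwing.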

Then you just need to use this UDF on your Name_Description column:

// Requires: import spark.implicits._ (for toDF)
val df = List(
    (1, Array("asdf_ Brandon", "Ca%abc%rd")),
    (2, Array("fthhhhChris", "Rock", "is the %abc%man"))
).toDF("Grade", "Name_Description")

df.withColumn("Name_Description", removeGarbage($"Name_Description")).show(false)

`show` prints:

+-----+-------------------------+
|Grade|Name_Description         |
+-----+-------------------------+
|1    |[Brandon, Card]          |
|2    |[Chris, Rock, is the man]|
+-----+-------------------------+
2
votes

We are always encouraged to use Spark SQL functions and to avoid UDFs where we can. Here is a simplified solution that relies only on Spark SQL functions.

Please find my approach below.

val d = Array((1, Array("asdf_ Brandon", "Ca%abc%rd")), (2, Array("fthhhhChris", "Rock", "is the %abc%man")))

val df = spark.sparkContext.parallelize(d).toDF("Grade","Name_Description")

This is how I created the input dataframe.

df.select('Grade, posexplode('Name_Description)).createOrReplaceTempView("data")

We explode the array along with the position of each element. I register the result as a temporary view so that the required output can be generated with a SQL query.

spark.sql("""
  select Grade, collect_list(Names)
  from (
    select Grade,
           case when pos = 0 then substring(col, 7)
                else replace(col, "%abc%", "")
           end as Names
    from data
  ) a
  group by Grade
""").show

This query produces the required output. Note that `substring(col, 7)` is 1-based, so it drops the first 6 characters, and that `collect_list` does not guarantee element order; if the order of names matters, carry `pos` along and sort on it.
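For readers new to posexplode and collect_list, the pipeline above can be mimicked with plain Scala collections. This is an illustrative analogy only, not Spark code: each (grade, names) row is exploded into (grade, pos, element) triples, cleaned with the same case logic, then re-grouped by grade:

```scala
// Input rows, matching the dataframe built above.
val rows = Seq(
  (1, Seq("asdf_ Brandon", "Ca%abc%rd")),
  (2, Seq("fthhhhChris", "Rock", "is the %abc%man"))
)

// posexplode analogue: one (grade, pos, cleanedElement) triple per array element.
val exploded = for {
  (grade, names) <- rows
  (elem, pos)    <- names.zipWithIndex
} yield (grade, pos, if (pos == 0) elem.drop(6) else elem.replace("%abc%", ""))

// collect_list analogue: group by grade, restoring element order via pos
// (unlike collect_list, the explicit sortBy makes the order deterministic).
val collected = exploded
  .groupBy(_._1)
  .map { case (grade, triples) => grade -> triples.sortBy(_._2).map(_._3) }
```

The explicit `sortBy(_._2)` step highlights what the SQL version leaves implicit: after a group-by, element order is only reliable if you sort on the recorded position.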