I have a question: I have a Spark DataFrame with several columns that look like this:
id Color
1 Red, Blue, Black
2 Red, Green
3 Blue, Yellow, Green
...
I also have a map file that looks like this:
Red,0
Blue,1
Green,2
Black,3
Yellow,4
What I need to do is map the color names to their ids, e.g. mapping "Red, Blue, Black" to the array [1,1,0,1,0] (Red, Blue, and Black occupy indices 0, 1, and 3, so those positions are set to 1). I wrote the code this way:
def mapColor(label_string: String): Array[Int] = {
  // split on commas and trim, since the values look like "Red, Blue, Black"
  val labels = label_string.split(",").map(_.trim)
  val index_array = new Array[Int](COLOR_LENGTH)
  for (label <- labels) {
    if (COLOR_MAP.contains(label)) {
      index_array(COLOR_MAP(label)) = 1
    } else {
      // dictionary does not contain the label, so the last index is set to one
      index_array(COLOR_LENGTH - 1) = 1
    }
  }
  index_array
}
COLOR_LENGTH is the size of the dictionary, and COLOR_MAP is the dictionary that holds the string -> id mapping.
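(For reference, COLOR_MAP can be built from the map file roughly like this; the file name here is just an example:)

val COLOR_MAP: Map[String, Int] =
  scala.io.Source.fromFile("color_map.txt").getLines() // lines like "Red,0"
    .map(_.split(","))
    .map { case Array(name, idx) => name -> idx.toInt }
    .toMap
val COLOR_LENGTH: Int = COLOR_MAP.size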
I call this function like this:
import org.apache.spark.sql.functions.{col, udf}

val color_function = udf(mapColor: (String) => Array[Int])
sql.withColumn("color_idx", color_function(col("Color")))
Multiple columns need this operation, but different columns need different dictionaries. Currently I duplicate this function for each column (changing only the dictionary and the length), which makes the code tedious. Is there a way to pass the length and the dictionary into the mapping function, such as:
def map(label_string: String, map: Map[String, Int], len: Int): Array[Int]
But how would I call such a function on the Spark DataFrame? There seems to be no way to pass the extra parameters in the declaration:
val color_function = udf(mapColor: (String) => Array[Int])
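What I am imagining is something like a factory that closes over the dictionary and returns a UDF, but I am not sure whether this is the right approach (the name labelMapper is just a placeholder):

def labelMapper(dict: Map[String, Int], len: Int) = udf { label_string: String =>
  val index_array = new Array[Int](len)
  for (label <- label_string.split(",").map(_.trim)) {
    // unknown labels fall back to the last index, as in mapColor
    index_array(dict.getOrElse(label, len - 1)) = 1
  }
  index_array
}

sql.withColumn("color_idx", labelMapper(COLOR_MAP, COLOR_LENGTH)(col("Color")))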