Assuming I've a Dataframe with many columns, some are type string others type int and others type map.
e.g.
field/columns types: stringType|intType|mapType<string,int>|...
|--------------------------------------------------------------------------
| myString1 |myInt1| myMap1 |...
|--------------------------------------------------------------------------
|"this_is_#string"| 123 |{"str11_in#map":1,"str21_in#map":2, "str31_in#map": 31}|...
|"this_is_#string"| 456 |{"str12_in#map":1,"str22_in#map":2, "str32_in#map": 32}|...
|"this_is_#string"| 789 |{"str13_in#map":1,"str23_in#map":2, "str33_in#map": 33}|...
|--------------------------------------------------------------------------
I want to remove some characters like '_' and '#' from all columns of String and Map type so the result Dataframe/RDD will be:
|------------------------------------------------------------------------
|myString1 |myInt1| myMap1|... |
|------------------------------------------------------------------------
|"thisisstring"| 123 |{"str11inmap":1,"str21inmap":2, "str31inmap": 31}|...
|"thisisstring"| 456 |{"str12inmap":1,"str22inmap":2, "str32inmap": 32}|...
|"thisisstring"| 789 |{"str13inmap":1,"str23inmap":2, "str33inmap": 33}|...
|-------------------------------------------------------------------------
I am not sure if it's better to convert the Dataframe into an RDD and work with it or perform the work in the Dataframe.
Also, not sure how to handle the regexp with different column types in the best way (I am sing scala). And I would like to perform this action for all column of these two types (string and map), trying to avoid using the column names like:
def cleanRows(mytabledata: DataFrame): RDD[String] = {
//this will do the work for a specific column (myString1) of type string
val oneColumn_clean = mytabledata.withColumn("myString1", regexp_replace(col("myString1"),"[_#]",""))
...
//return type can be RDD or Dataframe...
}
Is there any simple solution to perform this? Thanks