MaxU's answer is good and efficient. This post outlines another approach that's also efficient and helps keep your codebase clean (using the quinn library).
Suppose you have the following DataFrame:
+---+-----+--------+-------+
| id| name|emp.city|emp.sal|
+---+-----+--------+-------+
| 12| bob|New York| 80|
| 99|alice| Atlanta| 90|
+---+-----+--------+-------+
Here's how you can replace the dots with underscores in all the columns.
import quinn
def dots_to_underscores(s):
return s.replace('.', '_')
actual_df = df.transform(quinn.with_columns_renamed(dots_to_underscores))
actual_df.show()
Here's the resulting actual_df
:
+---+-----+--------+-------+
| id| name|emp_city|emp_sal|
+---+-----+--------+-------+
| 12| bob|New York| 80|
| 99|alice| Atlanta| 90|
+---+-----+--------+-------+
Let's use explain()
to verify that this function is executing efficiently:
actual_df.explain(True)
Here's the logical plans that are outputted:
== Parsed Logical Plan ==
'Project ['id AS id#50, 'name AS name#51, '`emp.city` AS emp_city#52, '`emp.sal` AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false
== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#29 AS id#50, name#30 AS name#51, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false
== Optimized Logical Plan ==
Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false
== Physical Plan ==
*(1) Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
You can see that the parsed logical plan is almost identical to the physical plan, so the Catalyst optimizer doesn't need to do much optimization work. It's converting id AS id#50
to id#29
, but that's not too much work.
The with_some_columns_renamed
method generates an even more efficient parsed plan.
def dots_to_underscores(s):
return s.replace('.', '_')
def change_col_name(s):
return '.' in s
actual_df = df.transform(quinn.with_some_columns_renamed(dots_to_underscores, change_col_name))
actual_df.explain(True)
This parsed plan only aliases the columns with dots.
== Parsed Logical Plan ==
'Project [unresolvedalias('id, None), unresolvedalias('name, None), '`emp.city` AS emp_city#42, '`emp.sal` AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false
== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false
== Optimized Logical Plan ==
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false
== Physical Plan ==
*(1) Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
More information why looping over the DataFrame and calling withColumnRenamed
multiple times creates overly complex parsed plans and should be avoided.