11
votes

I have a dataframe in pyspark which has 15 columns.

The column names are id, name, emp.dno, emp.sal, state, emp.city, zip, ...

Now I want to replace the '.' with '_' in every column name that contains one.

For example, 'emp.dno' should become 'emp_dno'.

I would like to do it dynamically.

How can I achieve that in PySpark?

4 Answers

29
votes

You can use something similar to this great solution from @zero323:

df.toDF(*(c.replace('.', '_') for c in df.columns))

Alternatively:

from pyspark.sql.functions import col

replacements = {c:c.replace('.','_') for c in df.columns if '.' in c}

# Backticks make Spark read the dot as part of the column name, not as struct field access
df.select([col(f'`{c}`').alias(replacements.get(c, c)) for c in df.columns])

The replacement dictionary then would look like:

{'emp.city': 'emp_city', 'emp.dno': 'emp_dno', 'emp.sal': 'emp_sal'}

UPDATE:

If the DataFrame also has spaces in its column names, how do I replace both '.' and spaces with '_'?

import re

df.toDF(*(re.sub(r'[\.\s]+', '_', c) for c in df.columns))
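One thing the regex version can do silently is map two different columns to the same name (for example 'emp.sal' and 'emp sal' both become 'emp_sal'). The sketch below is one possible guard, not part of the original answer: `sanitize_column_names` is a hypothetical helper name, and it works on a plain list of names so it can be checked without a Spark session.

```python
import re
from collections import Counter


def sanitize_column_names(columns, pattern=r"[.\s]+", replacement="_"):
    """Replace every run of dots/whitespace in each name with `replacement`.

    Hypothetical helper for illustration; it only touches the list of names.
    """
    new_names = [re.sub(pattern, replacement, c) for c in columns]
    # Refuse to continue if the resulting list contains duplicate names.
    duplicates = sorted(name for name, n in Counter(new_names).items() if n > 1)
    if duplicates:
        raise ValueError(f"Renaming would create duplicate columns: {duplicates}")
    return new_names


# Usage with a PySpark DataFrame `df` (not executed here):
#     df = df.toDF(*sanitize_column_names(df.columns))
```

The names are computed first and passed to toDF in one call, so a collision is reported before anything is renamed.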
5
votes

Wrote an easy & fast function for you to use. Enjoy! :)

def rename_cols(rename_df):
    for column in rename_df.columns:
        new_column = column.replace('.','_')
        rename_df = rename_df.withColumnRenamed(column, new_column)
    return rename_df
0
votes

The easiest way to do this is as follows:

Explanation:

  1. Get all the columns in the PySpark DataFrame using df.columns.
  2. Build a list by looping over each column from step 1.
  3. Each element of the list is an expression such as col("`col.1`").alias("col_1"). Apply it only to the columns that need renaming; you can leave the others out of the rename. The string replace method can substitute any substring.
  4. *[list] unpacks the list into the select call.

from pyspark.sql import functions as F

(df
 .select(*[F.col(f"`{c}`").alias(c.replace('.', "_")) for c in df.columns])
 .toPandas()
 .head()
)
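To illustrate the point about excluding some columns from the rename, here is a rough sketch that builds the select expressions as plain SQL strings. `build_rename_exprs` and its `exclude` parameter are made-up names for this example, and it assumes no column name contains a backtick itself.

```python
def build_rename_exprs(columns, exclude=()):
    """Build Spark SQL expressions that alias dotted columns, skipping `exclude`.

    Hypothetical helper; assumes column names contain no backticks.
    """
    exprs = []
    for c in columns:
        quoted = f"`{c}`"  # backticks keep the dot as part of the column name
        if "." in c and c not in exclude:
            new_name = c.replace(".", "_")
            exprs.append(f"{quoted} AS `{new_name}`")
        else:
            # Unchanged columns (and excluded ones) are selected as-is.
            exprs.append(quoted)
    return exprs


# Usage with a PySpark DataFrame `df` (not executed here):
#     df.selectExpr(*build_rename_exprs(df.columns, exclude={"emp.city"}))
```

The strings are meant to be passed to DataFrame.selectExpr, which accepts SQL expressions, so the whole rename still happens in one select.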

Hope this helps

0
votes

MaxU's answer is good and efficient. This post outlines another approach that's also efficient and helps keep your codebase clean (using the quinn library).

Suppose you have the following DataFrame:

+---+-----+--------+-------+
| id| name|emp.city|emp.sal|
+---+-----+--------+-------+
| 12|  bob|New York|     80|
| 99|alice| Atlanta|     90|
+---+-----+--------+-------+

Here's how you can replace the dots with underscores in all the columns.

import quinn

def dots_to_underscores(s):
    return s.replace('.', '_')

actual_df = df.transform(quinn.with_columns_renamed(dots_to_underscores))
actual_df.show()

Here's the resulting actual_df:

+---+-----+--------+-------+
| id| name|emp_city|emp_sal|
+---+-----+--------+-------+
| 12|  bob|New York|     80|
| 99|alice| Atlanta|     90|
+---+-----+--------+-------+

Let's use explain() to verify that this function is executing efficiently:

actual_df.explain(True)

Here are the logical plans that are output:

== Parsed Logical Plan ==
'Project ['id AS id#50, 'name AS name#51, '`emp.city` AS emp_city#52, '`emp.sal` AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false

== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#29 AS id#50, name#30 AS name#51, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false

== Optimized Logical Plan ==
Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]
+- LogicalRDD [id#29, name#30, emp.city#31, emp.sal#32], false

== Physical Plan ==
*(1) Project [id#29, name#30, emp.city#31 AS emp_city#52, emp.sal#32 AS emp_sal#53]

You can see that the parsed logical plan is almost identical to the physical plan, so the Catalyst optimizer doesn't need to do much optimization work. It's converting id AS id#50 to id#29, but that's not too much work.

The with_some_columns_renamed method generates an even more efficient parsed plan.

def dots_to_underscores(s):
    return s.replace('.', '_')

def change_col_name(s):
    return '.' in s

actual_df = df.transform(quinn.with_some_columns_renamed(dots_to_underscores, change_col_name))
actual_df.explain(True)

This parsed plan only aliases the columns with dots.

== Parsed Logical Plan ==
'Project [unresolvedalias('id, None), unresolvedalias('name, None), '`emp.city` AS emp_city#42, '`emp.sal` AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false

== Analyzed Logical Plan ==
id: string, name: string, emp_city: string, emp_sal: string
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false

== Optimized Logical Plan ==
Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]
+- LogicalRDD [id#34, name#35, emp.city#36, emp.sal#37], false

== Physical Plan ==
*(1) Project [id#34, name#35, emp.city#36 AS emp_city#42, emp.sal#37 AS emp_sal#43]

Looping over the DataFrame and calling withColumnRenamed multiple times creates overly complex parsed plans and should be avoided.
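If you would rather not add a dependency, the same idea (compute all the new names first, then rename once) can be sketched without quinn. This is not how quinn implements it; `rename_some_columns` is a hypothetical helper that relies only on df.columns and df.toDF, and the expectation that a single toDF call yields a single projection is an assumption you can check with explain(True).

```python
def rename_some_columns(df, rename_fn, should_rename):
    """Rename only the columns for which `should_rename` returns True.

    Hypothetical helper: all new names are computed up front, then applied
    with one toDF call instead of one withColumnRenamed call per column.
    """
    new_names = [rename_fn(c) if should_rename(c) else c for c in df.columns]
    return df.toDF(*new_names)


# Usage with a PySpark DataFrame `df` (not executed here):
#     actual_df = rename_some_columns(
#         df,
#         lambda s: s.replace('.', '_'),  # same as dots_to_underscores
#         lambda s: '.' in s,             # same as change_col_name
#     )
#     actual_df.explain(True)
```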