Another way to handle column mapping in PySpark is via a dictionary. A dictionary lets you map the columns of the initial DataFrame to the columns of the final DataFrame through its key/value structure, as shown below:
from pyspark.sql.functions import col
df = spark.createDataFrame([
[1, "John", "2019-12-01 10:00:00"],
[2, "Michael", "2019-12-01 11:00:00"],
[2, "Michael", "2019-12-01 11:01:00"],
[3, "Tom", "2019-11-13 20:00:00"],
[3, "Tom", "2019-11-14 00:00:00"],
[4, "Sofy", "2019-10-01 01:00:00"]
], ["A", "B", "C"])
col_map = {"A": "Z", "B": "X", "C": "Y"}  # old name -> new name
df.select(*[col(k).alias(col_map[k]) for k in col_map]).show()
# +---+-------+-------------------+
# | Z| X| Y|
# +---+-------+-------------------+
# | 1| John|2019-12-01 10:00:00|
# | 2|Michael|2019-12-01 11:00:00|
# | 2|Michael|2019-12-01 11:01:00|
# | 3| Tom|2019-11-13 20:00:00|
# | 3| Tom|2019-11-14 00:00:00|
# | 4| Sofy|2019-10-01 01:00:00|
# +---+-------+-------------------+
Here we map A, B, and C to Z, X, and Y, respectively.
If you want a more modular solution, you can also put everything inside a function:
def transform_cols(mappings, df):
    return df.select(*[col(k).alias(mappings[k]) for k in mappings])
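For instance, a quick usage sketch of this helper could look like the following (it reuses the df and col_map defined above; the renamed_df name is just illustrative):

# Apply the mapping through the helper function and show the result.
renamed_df = transform_cols(col_map, df)
renamed_df.show()
# Produces the same table as before, with columns Z, X, Y.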
Or, for an even more modular approach, use monkey patching to extend the existing functionality of the DataFrame class. Place the following code at the top of your PySpark code (you can also create a mini library and include it in your code when needed):
from pyspark.sql import DataFrame

def transform_cols(self, mappings):
    return self.select(*[col(k).alias(mappings[k]) for k in mappings])

DataFrame.transform = transform_cols
Then call it with:
df.transform(col_map).show()
PS: This can be a convenient way to extend DataFrame functionality: create your own libraries and expose them through the DataFrame class via monkey patching (extension methods, for those familiar with C#).
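As a rough illustration of that idea, a minimal sketch of such a mini library might look like the following (the module name df_extensions.py and the method names are purely hypothetical):

# df_extensions.py -- hypothetical mini library of DataFrame extension methods
from pyspark.sql import DataFrame
from pyspark.sql.functions import col

def transform_cols(self, mappings):
    # Rename columns according to an {old_name: new_name} dictionary.
    return self.select(*[col(k).alias(mappings[k]) for k in mappings])

def drop_cols(self, *columns):
    # Drop several columns at once; just another example of an extension method.
    return self.drop(*columns)

# Attach the functions to DataFrame so they can be called as methods.
DataFrame.transform_cols = transform_cols
DataFrame.drop_cols = drop_cols

After importing the module, you could call df.transform_cols(col_map) or df.drop_cols("Y") directly on any DataFrame. In this sketch the helper is attached under a distinct name (transform_cols) rather than transform, so it does not shadow any attribute that may already exist on DataFrame.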