
I have two PySpark DataFrames:

df1 has columns a, b, c, d, e, f
df2 has columns c, d, e (the column names change dynamically)

I want a DataFrame df3 extracted from df1 based on the column names in df2. So basically I want to

select columns from df1 based on the columns in df2 (df2's columns keep changing)

In the above example, the result DataFrame should have columns c, d, e (extracted from df1).

I am unable to find any method that can achieve this. Please help.


1 Answer


You can get the column names of the second DataFrame with df2.columns and select those columns from the first DataFrame.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df1 = spark.read.option("header", "true").option("inferSchema", "true").csv("test.csv")
df2 = spark.read.option("header", "true").option("inferSchema", "true").csv("test2.csv")

# select() accepts a list of column names, so df2.columns can be passed directly
df3 = df1.select(df2.columns)
df3.show(10, False)

+---+---+---+
|c  |d  |e  |
+---+---+---+
|3  |4  |5  |
+---+---+---+
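
Note that if df2 ever contains a column that does not exist in df1, the select will fail with an AnalysisException. A minimal sketch, assuming the same df1 and df2 as above, that guards against this by selecting only the columns the two DataFrames have in common:

# Keep only the df2 columns that also exist in df1, preserving df2's order,
# so the select cannot fail on a missing column.
common_cols = [c for c in df2.columns if c in df1.columns]
df3 = df1.select(common_cols)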