
I would like to read n CSV files using pyspark. The files have the same schema but different column names.


While reading those files, I would like to create an additional column 'pipeline' that contains a substring of the first column name.

How can I implement this?

 df = spark.read.format("csv") \
                .option("header", True) \
                .load(path + "*.csv") \
                .withColumn("pipeline", ...)  # how do I derive this from the first column name?

1 Answer

from pyspark.sql.functions import lit

df = spark.read.format("csv") \
               .option("header", "false") \
               .load(path + "*.csv") \
               .toDF("header_1") \
               .withColumn("pipeline", lit(path))

Note that `toDF()` must be given one name per column (here the file is assumed to have a single column), and `lit(path)` stamps the same glob path onto every row rather than a per-file value.
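Because a single `load(path + "*.csv")` reads all files into one DataFrame, the per-file column names are lost before you can derive anything from them. An alternative is to read each file separately, take its first column name, and union the results. The sketch below assumes a hypothetical substring rule (the prefix before the first underscore); the exact rule from the original screenshot is unknown, so adjust `pipeline_from_first_column` accordingly. `load_with_pipeline` is an illustrative helper name, not part of any API.

    from functools import reduce


    def pipeline_from_first_column(col_name):
        """Hypothetical rule: keep the part of the first column name
        before the first underscore, e.g. 'sales_2021_id' -> 'sales'.
        Replace with the actual substring logic you need."""
        return col_name.split("_", 1)[0]


    def load_with_pipeline(spark, file_paths):
        """Read each CSV on its own so its header is still visible,
        tag it with the derived 'pipeline' value, then union all parts."""
        from pyspark.sql.functions import lit

        parts = []
        for p in file_paths:
            df = spark.read.option("header", True).csv(p)
            pipeline = pipeline_from_first_column(df.columns[0])
            # Rename to shared positional names so the union lines up
            # even though each file uses different column names.
            df = df.toDF(*[f"c{i}" for i in range(len(df.columns))])
            parts.append(df.withColumn("pipeline", lit(pipeline)))
        return reduce(lambda a, b: a.union(b), parts)

You would call it with an explicit file list, e.g. `load_with_pipeline(spark, glob.glob(path + "*.csv"))`, since the loop needs one path per file rather than a glob pattern handed to Spark.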