
I would like to read n CSV files using pyspark. The files have the same schema but different column names.


While reading those files, I would like to create an additional column 'pipeline' that contains a substring of the first column name.

How can I implement this?

 df = spark.read.format("csv") \
                .option("header", True) \
                .load(path + "*.csv") \
                .withColumn("pipeline", ...)  # how do I derive this from the first column name?

1 Answer

from pyspark.sql.functions import lit

df = spark.read.format("csv") \
               .option("header", "false") \
               .load(path + "*.csv") \
               .toDF("header_1") \
               .withColumn("pipeline", lit(path))

Note that `toDF()` must be given one name per column (here the file is assumed to have a single column), and `lit(path)` stamps the same glob path onto every row rather than a per-file value.
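Because a single `load(path + "*.csv")` reads all files into one DataFrame, the per-file column names are lost before you can derive anything from them. An alternative is to read each file separately, take its first column name, and union the results. The sketch below assumes a hypothetical substring rule (the prefix before the first underscore); the exact rule from the original screenshot is unknown, so adjust `pipeline_from_first_column` accordingly. `load_with_pipeline` is an illustrative helper name, not part of any API.

    from functools import reduce


    def pipeline_from_first_column(col_name):
        """Hypothetical rule: keep the part of the first column name
        before the first underscore, e.g. 'sales_2021_id' -> 'sales'.
        Replace with the actual substring logic you need."""
        return col_name.split("_", 1)[0]


    def load_with_pipeline(spark, file_paths):
        """Read each CSV on its own so its header is still visible,
        tag it with the derived 'pipeline' value, then union all parts."""
        from pyspark.sql.functions import lit

        parts = []
        for p in file_paths:
            df = spark.read.option("header", True).csv(p)
            pipeline = pipeline_from_first_column(df.columns[0])
            # Rename to shared positional names so the union lines up
            # even though each file uses different column names.
            df = df.toDF(*[f"c{i}" for i in range(len(df.columns))])
            parts.append(df.withColumn("pipeline", lit(pipeline)))
        return reduce(lambda a, b: a.union(b), parts)

You would call it with an explicit file list, e.g. `load_with_pipeline(spark, glob.glob(path + "*.csv"))`, since the loop needs one path per file rather than a glob pattern handed to Spark.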