I have written a PySpark job to load files from an S3 bucket. The bucket contains a very large number of small files, and I am reading them one by one because I add a column to each file's data whose value is the S3 path the file was read from. Because of this, the Spark job spends most of its time iterating over the files one at a time.
Below is the code for that:
from pyspark.sql.functions import lit

for filepathins3 in awsfilepathlist:
    # read one file at a time and tag its rows with the S3 path it came from
    data = spark.read.format("parquet").load(filepathins3) \
        .withColumn("path_s3", lit(filepathins3))
The code above takes a long time because it reads the files one by one. If I pass the whole list of file paths in a single call, the job finishes quickly, but then I cannot add a column whose value is the path of the file each row came from.
Is there a way to solve this within the PySpark job itself, rather than writing a separate program to read the files, combine them, and then load them into Spark?
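For reference, a minimal sketch of the bulk-load pattern mentioned above: it assumes load() can take the full list of paths and uses pyspark.sql.functions.input_file_name() to recover each row's source path. input_file_name() is not mentioned in the original question, and whether its value (typically the s3a:// form of the path) matches the exact format needed is an assumption.

from pyspark.sql.functions import input_file_name

data = (
    spark.read.format("parquet")
         .load(awsfilepathlist)                      # load() accepts a list of paths
         .withColumn("path_s3", input_file_name())   # per-row source file path
)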
Comment: Why not use sc.parallelize(awsfilepathlist) and then RDD.mapPartitions? When you want the RDD returned as a DataFrame, just use the .toDF() method on the RDD. – JZimmerman
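A rough sketch of what that comment suggests, assuming pandas and an S3-capable fsspec backend such as s3fs are installed on the executors (these package choices are assumptions and are not part of the original post or comment):

import pandas as pd
from pyspark.sql import Row

def read_files_with_path(paths):
    # runs on the executors: read each parquet file and tag its rows with the source path
    for p in paths:
        pdf = pd.read_parquet(p)            # reads s3:// URLs when s3fs is installed
        pdf["path_s3"] = p
        for rec in pdf.to_dict("records"):
            yield Row(**rec)

data = sc.parallelize(awsfilepathlist).mapPartitions(read_files_with_path).toDF()

This avoids driver-side iteration: the path list itself is distributed, each partition reads its own files, and toDF() infers the schema from the yielded Rows.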