4
votes

Suppose I have multiple CSV files in the same directory, and these files all share the same schema:

/tmp/data/myfile1.csv, /tmp/data/myfile2.csv, /tmp/data/myfile3.csv, /tmp/data/myfile4.csv

I would like to read these files into a Spark DataFrame or RDD, and I would like each file to be a partition of the DataFrame. How can I do this?


1 Answer

2
votes

There are two options I can think of:

1) Use the input file name

Instead of trying to control the partitioning directly, add the name of the input file to your DataFrame and use that column for any grouping/aggregation operations you need to do. This is probably your best option, as it is more aligned with Spark's parallel-processing model, where you tell it what to do and let it figure out how. You can do this with code like the following:

SQL:

SELECT input_file_name() AS fname FROM dataframe

Or Python:

from pyspark.sql.functions import input_file_name

# Add a column recording the source file path for each row
newDf = df.withColumn("filename", input_file_name())
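
For example, here is a minimal end-to-end sketch (the /tmp/data path, the header/inferSchema options, and the per-file count are assumptions for illustration):

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.appName("per-file-grouping").getOrCreate()

# Read every CSV in the directory into a single DataFrame
df = spark.read.csv("/tmp/data/*.csv", header=True, inferSchema=True)

# Tag each row with its source file, then aggregate per file
withName = df.withColumn("filename", input_file_name())
withName.groupBy("filename").count().show()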

2) Gzip your CSV files

Gzip is not a splittable compression format, so Spark cannot split a gzipped file across multiple tasks. This means that when loading gzipped files, each file will be its own partition.
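
A quick way to check this is to count the partitions after loading (the /tmp/data/*.csv.gz path is an assumption; adjust it to wherever your gzipped files live):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("gzip-partitions").getOrCreate()

# Spark reads .gz files transparently, and because gzip is not splittable,
# each gzipped file maps to exactly one partition
df = spark.read.csv("/tmp/data/*.csv.gz", header=True)

# This should equal the number of .gz files in the directory
print(df.rdd.getNumPartitions())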