Is there any way to capture the input file name of multiple parquet files read in with a wildcard in Spark?

Question

I am using Spark to read multiple parquet files into a single RDD, using standard wildcard path conventions. In other words, I'm doing something like this:

val myRdd = spark.read.parquet("s3://my-bucket/my-folder/**/*.parquet")

However, sometimes these Parquet files will have different schemas. When I'm doing my transforms on the RDD, I can try to differentiate between them in the map functions by checking for the presence (or absence) of certain columns. However, a surefire way to know which schema a given row in the RDD uses, and the way I'm asking about specifically here, is to know which file path the row came from.

Is there any way, on an RDD level, to tell which specific parquet file the current row came from? So imagine my code looks something like this, currently (this is a simplified example):

val mapFunction = new MapFunction[Row, (String, Row)] {
  override def call(row: Row): (String, Row) = myJob.transform(row)
}

val pairRdd = myRdd.map(mapFunction, encoder=kryo[(String, Row)])

Within the myJob.transform( ) code, I'm decorating the result with other values, converting it to a pair RDD, and doing some other transforms as well.

I make use of the row.getAs( ... ) method to look up particular column values, and that's a really useful method. I'm wondering if there are any similar methods (e.g. row.getInputFile( ) or something like that) to get the name of the specific file that I'm currently operating on?

Since I'm passing in wildcards to read multiple parquet files into a single RDD, I don't have any insight into which file I'm operating on. If nothing else, I'd love a way to decorate the RDD rows with the input file name. Is this possible?

Raman Narasimhan · Accepted Answer · 2019-07-24T16:58:19

You can add a new column holding the file name, as shown below:

import org.apache.spark.sql.functions._
val myDF = spark.read.parquet("s3://my-bucket/my-folder/**/*.parquet").withColumn("inputFile", input_file_name())
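Once the column exists, the file name can be read per row with the same row.getAs( ... ) call mentioned in the question. The following is only a minimal sketch, not the asker's actual job: the object name, the temporary directory, and the sample data are hypothetical, and it runs in local mode with two folders that share one schema so it stays self-contained. It also uses myDF.rdd.map rather than a MapFunction with a kryo encoder, which is just one possible way to get a pair RDD.

```scala
import java.nio.file.Files

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.input_file_name

object InputFileNameSketch {
  def main(args: Array[String]): Unit = {
    // Local session for illustration only; a real job would use its own config.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("input-file-name-sketch")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data: two folders of Parquet files under a temp dir.
    val base = Files.createTempDirectory("parquet-demo").toString
    Seq((1, "a"), (2, "b")).toDF("id", "value").write.parquet(s"$base/first")
    Seq((3, "c")).toDF("id", "value").write.parquet(s"$base/second")

    // Wildcard read, decorated with the source file of each row.
    val myDF = spark.read
      .parquet(s"$base/*/*.parquet")
      .withColumn("inputFile", input_file_name())

    // Key every row by the file it was read from, giving a pair RDD.
    val pairRdd = myDF.rdd.map(row => (row.getAs[String]("inputFile"), row))

    // Print the source path next to each row's id.
    pairRdd.collect().foreach { case (path, row) =>
      println(s"$path -> id=${row.getAs[Int]("id")}")
    }

    spark.stop()
  }
}
```

Inside a transform, the path obtained this way could then be used to decide which schema-specific logic to apply to the row.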
