2 votes

I am using pyspark streaming to ETL input files from S3.

I need to be able to build an audit trail of all of the raw input files on s3:// and where my parquet output ends up on hdfs://.

Given a dstream, rdd, or even a specific rdd partition, is it possible to determine the original filename(s) of the input data in s3?

Currently the only way I know to do this is to take rdd.toDebugString() and attempt to parse it. However, this feels really hacky and does not work in some cases; for example, it does not work for the batch-mode imports I am also doing (using sc.textFile("s3://...foo/*")-style globs).
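For what it's worth, this is roughly the kind of scraping I mean (the helper name and the regex are just illustrative, and toDebugString() may return bytes depending on the PySpark version):

    import re

    def input_paths_from_debug_string(rdd):
        # Scrape anything that looks like an s3/s3n/s3a URI out of the
        # RDD lineage description. Purely illustrative, not robust.
        debug = rdd.toDebugString()
        if isinstance(debug, bytes):
            debug = debug.decode("utf-8")
        return set(re.findall(r"s3[an]?://\S+", debug))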

Does anyone have a sane way of determining the original filename(s)?

It seems some other Spark users have had this question in the past, for example:

http://apache-spark-user-list.1001560.n3.nabble.com/Access-original-filename-in-a-map-function-tt2831.html

Thanks!


1 Answer

1 vote

We had the same kind of problem. Our files were small enough, so we used sc.wholeTextFiles("s3://...foo/*"), which creates an RDD of ("<path/filename>", "<content>") pairs, and we appended the file name to the content of each file for later use.
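A rough PySpark sketch of the idea (the bucket/glob and the way the path is joined onto each record are placeholders, not our exact code):

    from pyspark import SparkContext

    sc = SparkContext(appName="filename-audit-sketch")

    # wholeTextFiles yields one (full path, entire file content) pair per file,
    # so the original S3 path travels with the data.
    pairs = sc.wholeTextFiles("s3://bucket/foo/*")  # placeholder glob

    # Tag every line with the file it came from before further processing.
    records = pairs.flatMap(
        lambda kv: [(kv[0], line) for line in kv[1].splitlines()]
    )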

Related: How to convert RDD[(String, String)] into RDD[Array[String]]?