I am using PySpark Streaming to ETL input files from S3.
I need to be able to build an audit trail of all of the raw input files on s3:// and where my Parquet output ends up on hdfs://.
Given a DStream, an RDD, or even a specific RDD partition, is it possible to determine the original filename(s) of the input data in S3?
Currently the only way I know to do this is to take rdd.toDebugString() and attempt to parse it. However, this feels really hacky and does not work in some cases. For example, parsing the debug output does not work for the batch-mode imports I am also doing (using sc.textFile("s3://...foo/*")-style globs).
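For reference, this is roughly what that hack looks like (the bucket/prefix below is made up, and the regex is just a guess at the lineage format, which is not a stable API):

    from pyspark import SparkContext
    import re

    sc = SparkContext(appName="audit-trail-example")

    # Hypothetical batch-mode read with a glob
    rdd = sc.textFile("s3://my-bucket/foo/*")

    # The lineage dump sometimes mentions the input paths, so grep it
    # for anything that looks like an S3 URI. This is the fragile part:
    # the format is not guaranteed, and for globbed reads it tends to
    # show only the glob itself, not the individual files it matched.
    debug = rdd.toDebugString()
    if isinstance(debug, bytes):  # PySpark returns utf-8 bytes here
        debug = debug.decode("utf-8")

    s3_paths = re.findall(r"s3[an]?://\S+", debug)
    print(s3_paths)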
Does anyone have a sane way of determining the original filename(s)?
It seems some other Spark users have had this question in the past, for example:
Thanks!