I am reading .txt files with wholeTextFiles() in PySpark. I know that the RDD returned by wholeTextFiles() has the format (filepath, content). I have multiple files to read. For each file, I want to cut the file name out of the filepath and save it to a Spark DataFrame, and use part of the file name (the date) as a folder name in the HDFS output location. But when saving, I do not get the corresponding file names. Is there any way to do this? Below is my code:
base_data = sc.wholeTextFiles("/user/nikhil/raw_data/")
data1 = base_data.map(lambda x: x[0]).flatMap(lambda x: x.split('/')).filter(lambda x: x.startswith('CH'))
data2 = data1.flatMap(lambda x: x.split('F_')).filter(lambda x: x.startswith('2'))
print(data1.collect())
print(data2.collect())
df.repartition(1).write.mode('overwrite').parquet(outputLoc + "/xxxxx/" + data2)
logdf = sqlContext.createDataFrame(
[(data1, pstrt_time, pend_time, 'DeltaLoad Completed')],
["filename", "process_start_time", "process_end_time", "status"])
Output:
data1: ['CHNC_P0BCDNAF_20200217', 'CHNC_P0BCDNAF_20200227', 'CHNC_P0BCDNAF_20200615', 'CHNC_P0BCDNAF_20200925']
data2: ['20200217', '20200227', '20200615', '20200925']
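
For reference, here is a minimal sketch of the per-file result I am after, written in plain Python so it runs without a cluster. The helper name parse_path and the sample paths are hypothetical, and it assumes the date is always the part after the last underscore in the file name, which holds for the names shown above but may not hold in general.

```python
import posixpath

def parse_path(filepath):
    """Return (filename, date_part) for one path key from wholeTextFiles()."""
    # Last path component, e.g. 'CHNC_P0BCDNAF_20200217.txt'
    base = posixpath.basename(filepath.rstrip('/'))
    # Drop an extension such as '.txt' if one is present
    filename = posixpath.splitext(base)[0]
    # Assumption: the date is whatever follows the last underscore
    date_part = filename.rsplit('_', 1)[-1]
    return filename, date_part

# Hypothetical paths shaped like the keys wholeTextFiles() returns
paths = [
    "hdfs://namenode/user/nikhil/raw_data/CHNC_P0BCDNAF_20200217.txt",
    "hdfs://namenode/user/nikhil/raw_data/CHNC_P0BCDNAF_20200227.txt",
]

# One (filename, date_part) pair per file, so each name stays with its date
pairs = [parse_path(p) for p in paths]
for filename, date_part in pairs:
    print(filename, date_part)
```

In Spark, the same function could presumably be applied with base_data.keys().map(parse_path).collect(), which gives a plain Python list of pairs rather than an RDD. Each pair could then supply the string for outputLoc + "/xxxxx/" + date_part and the filename value for the log row, one file at a time.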