I would like to recursively find the maximum date-time value of each subfolder and finally find the top parent's maximum timestamp. spark SQL is slower when I run. So I would like to implement this logic using UDF or data frame methods in pyspark.
+-----------+---------------+----------------+---------+
|File_Folder|Folder_File_Ind|folder_level_ind|Timestamp|
+-----------+---------------+----------------+---------+
| /A| parent-Folder| 1| null|
| /A/B| sub-folder| 2| null|
| /A/B/1.txt| file| 3| 02022021|
| /A/B/2.txt| file| 4| 02032021|
| /A/C| sub-folder| 2| null|
| /A/C/3.txt| file| 3| 02042021|
| /A/C/4.txt| file| 3| 02052021|
+-----------+---------------+----------------+---------+
The output should display like timestamp value ( Null values replaced with a maximum timestamp of each level)
Output
+-----------+---------------+---------+
|File_Folder|Folder_File_Ind|Timestamp|
+-----------+---------------+---------+
| /A| parent-Folder| 02052021|
| /A/B| sub-folder| 02032021|
| /A/B/1.txt| file| 02022021|
| /A/B/2.txt| file| 02032021|
| /A/C| sub-folder| 02052021|
| /A/C/3.txt| file| 02042021|
| /A/C/4.txt| file| 02052021|
+-----------+---------------+---------+
SQL I tried below SQL which is giving expected result. ( it's too slow when millions of records in the data frame) and I tried with data frame cache but it did not help. probably LIKE is a costly operation. ( FYI. I removed the time from the date format as it was not showing correctly. The time format is not a concern here) but subfolders and folder should
df.show()
+-----------+---------------+----------------+---------+
|File_Folder|Folder_File_Ind|folder_level_ind|Timestamp|
+-----------+---------------+----------------+---------+
| /A| parent-Folder| 1| null|
| /A/B| sub-folder| 2| null|
| /A/B/1.txt| file| 3| 02022021|
| /A/B/2.txt| file| 4| 02032021|
| /A/C| sub-folder| 2| null|
| /A/C/3.txt| file| 3| 02042021|
| /A/C/4.txt| file| 3| 02052021|
+-----------+---------------+----------------+---------+
>>> self_join_rec = sqlc.sql("SELECT \
... a.File_Folder, a.Folder_File_Ind, Max(b.Timestamp) Timestamp \
... FROM src_table a \
... JOIN src_table b on b.File_Folder LIKE Concat(a.File_Folder, '%') \
... GROUP BY \
... a.File_Folder, a.Folder_File_Ind \
... ORDER BY a.File_Folder,a.Folder_File_Ind"
... )
>>> self_join_rec.show()
+-----------+---------------+---------+
|File_Folder|Folder_File_Ind|Timestamp|
+-----------+---------------+---------+
| /A| parent-Folder| 02052021|
| /A/B| sub-folder| 02032021|
| /A/B/1.txt| file| 02022021|
| /A/B/2.txt| file| 02032021|
| /A/C| sub-folder| 02052021|
| /A/C/3.txt| file| 02042021|
| /A/C/4.txt| file| 02052021|
+-----------+---------------+---------+
join
with a condition using rlike or startswith. – jxc