I have about 60k files stored in HDFS; each file is small, roughly 4 KB to 70 KB. I'm trying to process them by running a regex search over a specific subset of files that I already know, but the processing takes far too long and something seems off ...
The Spark job runs on YARN.
Hardware specs: 3 nodes, each with 4 cores and 15 GB RAM.
import ntpath

# doc: a small file whose lines are the names of the files I want to process (3 file names)
targeted_files = sc.broadcast(sc.textFile(doc).collect())

# hdfs://hadoop.localdomain/path/to/directory/ contains ~60K files
df = sc.wholeTextFiles(
    "hdfs://hadoop.localdomain/path/to/directory/"
).filter(
    lambda pairRDD: ntpath.basename(pairRDD[0]) in targeted_files.value)

print('Result : ', df.collect())  # when I run this step alone, it takes ~15 minutes to finish

# this takes ~an hour and still doesn't finish
df = df.map(filterMatchRegex).toDF(['file_name', 'result'])
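For context, the mapping step does something roughly like this (a simplified, hypothetical sketch; the actual regex and return shape in my filterMatchRegex differ, but it operates on the same (file_path, file_content) pairs):

import re
import ntpath

# placeholder pattern; the real one is more involved
SEARCH_PATTERN = re.compile(r'some-pattern')

def filterMatchRegex(pair):
    # pair is (file_path, file_content) as produced by wholeTextFiles
    file_path, content = pair
    matches = SEARCH_PATTERN.findall(content)
    return (ntpath.basename(file_path), matches)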
Is using HDFS and Spark even the right choice for this task? I also assumed that, in the worst case, the processing time would be comparable to a multithreaded Java approach ... what am I doing wrong?
I came across this link, which addresses the same problem, but I'm not sure how to apply it in PySpark. It seems that all or most of the time is spent reading the files from HDFS. Is there a better way to store/read small files and process them with Spark?
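One idea I've been considering (not sure whether it's the right approach) is to skip scanning the whole directory and pass only the three target paths to wholeTextFiles, since it accepts a comma-separated list of paths. A rough, untested sketch, assuming the entries in targeted_files are plain basenames under that directory:

# Rough sketch: read only the targeted files instead of listing all ~60K.
base_dir = "hdfs://hadoop.localdomain/path/to/directory/"
target_paths = ",".join(base_dir + name for name in targeted_files.value)

df = sc.wholeTextFiles(target_paths) \
       .map(filterMatchRegex) \
       .toDF(['file_name', 'result'])

Would this avoid the overhead of listing/reading the whole directory, or is the small-files problem still going to dominate?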