I have about 60k files stored in HDFS; each file is small, roughly 4 KB to 70 KB. I'm trying to process them by running a regex search over a specific subset of files that I already know, but the processing takes far too long and something seems off ...
The Spark job runs on YARN.
Hardware specs: 3 nodes, each with 4 cores and 15 GB RAM.
import ntpath

# doc: a small file whose lines are the names of the files I want to process (3 file names)
targeted_files = sc.broadcast(sc.textFile(doc).collect())

# hdfs://hadoop.localdomain/path/to/directory/ contains ~60K files
df = sc.wholeTextFiles(
    "hdfs://hadoop.localdomain/path/to/directory/"
).filter(
    lambda pairRDD: ntpath.basename(pairRDD[0]) in targeted_files.value)

print('Result : ', df.collect())  # when I run this step alone, it takes ~15 minutes to finish

# this takes ~an hour and still doesn't finish
df = df.map(filterMatchRegex).toDF(['file_name', 'result'])
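For context, the mapping step does something roughly like this (a simplified, hypothetical sketch; the actual regex and return shape in my filterMatchRegex differ, but it operates on the same (file_path, file_content) pairs):

import re
import ntpath

# placeholder pattern; the real one is more involved
SEARCH_PATTERN = re.compile(r'some-pattern')

def filterMatchRegex(pair):
    # pair is (file_path, file_content) as produced by wholeTextFiles
    file_path, content = pair
    matches = SEARCH_PATTERN.findall(content)
    return (ntpath.basename(file_path), matches)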
Is using HDFS and Spark even the right choice for this task? I also assumed that, in the worst case, the processing time would be comparable to a multithreaded Java approach ... what am I doing wrong?
I came across this link, which addresses the same problem, but I'm not sure how to apply it in PySpark. It seems that all or most of the time is spent reading the files from HDFS. Is there a better way to store/read small files and process them with Spark?
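One idea I've been considering (not sure whether it's the right approach) is to skip scanning the whole directory and pass only the three target paths to wholeTextFiles, since it accepts a comma-separated list of paths. A rough, untested sketch, assuming the entries in targeted_files are plain basenames under that directory:

# Rough sketch: read only the targeted files instead of listing all ~60K.
base_dir = "hdfs://hadoop.localdomain/path/to/directory/"
target_paths = ",".join(base_dir + name for name in targeted_files.value)

df = sc.wholeTextFiles(target_paths) \
       .map(filterMatchRegex) \
       .toDF(['file_name', 'result'])

Would this avoid the overhead of listing/reading the whole directory, or is the small-files problem still going to dominate?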