0
votes

I have about 60k files stored in HDFS, each between 4 KB and 70 KB in size. I am trying to process them by running a regex search over a few specific files whose names I already know, yet the processing takes far too long, which does not seem right ...

The Spark job runs on YARN.

Hardware specs: 3 nodes, each with 4 cores and 15 GB of RAM

import ntpath

# Broadcast the target names as a set for fast membership tests (3 files)
targeted_files = sc.broadcast(set(sc.textFile(doc).collect()))

# hdfs://hadoop.localdomain/path/to/directory/ contains ~60K files
df = sc.wholeTextFiles(
    "hdfs://hadoop.localdomain/path/to/directory/").filter(
    lambda pairRDD: ntpath.basename(pairRDD[0]) in targeted_files.value)

print('Result : ', df.collect())  # run on its own, this step took 15 minutes to finish

# this runs for about an hour and still does not finish
df = df.map(filterMatchRegex).toDF(['file_name', 'result'])

Is using HDFS and Spark the right choice for this task? I also assumed that, in the worst case, the processing time would be about the same as a multithreaded approach in Java. What am I doing wrong?

I came across this link, which addresses the same problem, but I am not sure how to handle it in PySpark. It seems that most of the time is spent reading the files from HDFS. Is there a better way to read/store small files and process them with Spark?
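One possible way to avoid listing and reading all ~60K files is to pass only the wanted paths to wholeTextFiles, which accepts a comma-separated list of paths. The sketch below assumes the target files sit directly under the directory and that their names contain no commas; the helper names `build_target_paths` and `read_targeted_files` are hypothetical.

```python
import posixpath


def build_target_paths(base_dir, file_names):
    """Join each file name onto base_dir and return one comma-separated string."""
    return ",".join(posixpath.join(base_dir, name.strip()) for name in file_names)


def read_targeted_files(sc, base_dir, file_names):
    """Read only the listed files instead of scanning the whole directory.

    sc is an existing SparkContext; the result is an RDD of (path, content) pairs.
    """
    return sc.wholeTextFiles(build_target_paths(base_dir, file_names))
```

With this, the filter on ntpath.basename is no longer needed, because only the listed files are ever opened.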


3 Answers

1
votes

To be honest, this does not look like a good use case for Spark. Your dataset is small: even assuming 100 KB per file, 60k × 100 KB = 6,000 MB ≈ 6 GB, which is well within what a single machine can handle. Spark and HDFS add considerable overhead to processing, so the "worst case" is going to be clearly slower than a multithreaded approach on a single machine. For data of this size, parallelizing on a single machine (multithreading) will generally be faster than parallelizing across a cluster of nodes (Spark).
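As a rough illustration of the single-machine approach, here is a minimal sketch that runs a regex over many small files with a thread pool. It assumes the files have already been copied from HDFS to local disk; the names `search_file` and `search_files` and the `max_workers` default are made up for this example.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def search_file(path, pattern):
    """Return (file name, list of regex matches) for one file."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return Path(path).name, pattern.findall(text)


def search_files(paths, regex, max_workers=8):
    """Search many small files concurrently; reading them is I/O-bound."""
    pattern = re.compile(regex)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of the input paths
        return list(pool.map(lambda p: search_file(p, pattern), paths))
```

Threads mainly help with the file reads here. If the regex itself turns out to be CPU-heavy, swapping in ProcessPoolExecutor (with a module-level worker function instead of the lambda) is one option.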

0
votes

In general, the best tool for search in a Hadoop setting is Solr. It is optimized for searching, so although a tool like Spark can get the job done, you should not expect similar performance from it.

0
votes

Try df.coalesce(20) after loading to reduce the number of partitions and keep each one at roughly 128 MB. Perform transformations and actions afterwards.
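To make the partition count less of a guess, something like the following sketch could derive it from the total data size. The 128 MB target comes from the suggestion above; the helper names `target_partitions` and `coalesce_to_size` are hypothetical, and the total size has to be supplied by the caller (for example 60k files × 70 KB as an upper bound).

```python
import math


def target_partitions(total_bytes, partition_bytes=128 * 1024 * 1024):
    """Number of partitions needed so each holds at most ~partition_bytes."""
    return max(1, math.ceil(total_bytes / partition_bytes))


def coalesce_to_size(rdd, total_bytes):
    """Shrink an RDD's partition count without a full shuffle.

    coalesce only reduces the count by default, so the target is capped
    at the current number of partitions.
    """
    n = min(rdd.getNumPartitions(), target_partitions(total_bytes))
    return rdd.coalesce(n)
```

Usage would look like `df = coalesce_to_size(df, 60000 * 70 * 1024)` right after loading, before any map or collect.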