I need to process several thousand small log files.
I opted for Databricks to handle this problem because it has good parallel computing capabilities and interacts nicely with the Azure Blob storage account where the files are hosted.
After some research, I keep coming across the same snippet of code (in PySpark).
# Get the list of file paths with a custom function
list_of_files = get_my_files()
# Parallelize the paths and map a custom parse function over them
path_rdd = sc.parallelize(list_of_files)
content = path_rdd.map(parse_udf).collect()
Is there a better method to do this? Would you opt for a flatMap if the log files are in CSV format? (A sketch of what I mean is below.)
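For reference, here is a rough sketch of the flatMap variant I have in mind; get_my_files and parse_file are my own placeholder names, and it assumes the blob container is mounted so each path can be opened like a local file (e.g. under /dbfs/):

import csv
# Hypothetical helper returning the list of file paths (placeholder)
list_of_files = get_my_files()
def parse_file(path):
    # Open one small CSV file and yield its rows as lists of fields.
    # Plain Python I/O, so the path must be readable on the executors
    # (e.g. via a DBFS mount of the blob container).
    with open(path, newline="") as fh:
        for row in csv.reader(fh):
            yield row
path_rdd = sc.parallelize(list_of_files)
# flatMap flattens the per-file row generators into a single RDD of rows
rows = path_rdd.flatMap(parse_file).collect()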
Thank you!
spark.read.format('csv').load("folder_name") - this way you will leverage Spark's internal parallel processing instead of parsing every file with a UDF. - Hussain Bohra
df = spark.read.format("csv").option("header", "true").load("cars_data/") will automatically add year, month and date as columns, which you can use for filtering, and that will certainly give you a performance gain. - Hussain Bohra
spark.read.csv("location/*/*/") - Oliver W.
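Putting the commenters' suggestions together, a minimal sketch of the built-in reader approach (the mount path and the year/month column names below are placeholders, assuming the logs are headered CSVs stored in key=value partition folders):

# Spark lists every file under the base path and reads them in parallel,
# parsing the CSV in the JVM instead of in a per-file Python function
df = (spark.read
        .format("csv")
        .option("header", "true")
        .load("/mnt/logs/"))
# Folders named like year=2019/month=12/ surface as partition columns,
# so filtering on them prunes whole directories rather than opening every file
recent = df.filter((df["year"] == 2019) & (df["month"] == 12))

This should avoid shipping a Python parse function to every file, which is usually where most of the per-file overhead in the RDD approach comes from.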