I have zip files stored in Amazon S3 and a Python list of their paths, like ["s3://mybucket/file1.zip", ..., "s3://mybucket/fileN.zip"]. I need to unzip all of these files using a Spark cluster and store all the CSV files in a Delta table. I would like to find a faster processing approach than my current one:
1) I iterate over the Python list with a for loop.
2) I get each zip file from S3 using boto3's s3.bucket.Object(file).
3) I unzip the files with the following code:
import io
import boto3
import shutil
import zipfile

# outputZip is the local directory on the Driver Node where the CSVs are extracted
for file in ["s3://mybucket/file1.zip", ..., "s3://mybucket/fileN.zip"]:
    obj = s3.bucket.Object(file)
    # download the zip into memory and extract it on the Driver Node's local disk
    with io.BytesIO(obj.get()["Body"].read()) as tf:
        tf.seek(0)
        with zipfile.ZipFile(tf, mode='r') as zipf:
            for subfile in zipf.namelist():
                zipf.extract(subfile, outputZip)

# step 4: copy the extracted CSVs from the driver's local disk to DBFS
dbutils.fs.cp("file:///databricks/driver/{0}".format(outputZip), "dbfs:" + outputZip, True)

# step 6: clean up the driver's local disk and DBFS after the ingest
shutil.rmtree(outputZip)
dbutils.fs.rm("dbfs:" + outputZip, True)
4) The files are unzipped on the Driver Node, so the executors can't reach them (I haven't found a way to make that work), which is why I move all the CSV files to DBFS with dbutils.fs.cp() (a sketch of one alternative I'm considering appears after this list).
5) I read all the CSV files from DBFS into a PySpark DataFrame and write them to a Delta table:
df = self.spark.read.option("header", "true").csv("dbfs:" + file)
df.write.format("delta").save(path)
6) I delete the data from DBFS and the Driver Node
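Regarding step 4, would something like this work to skip the copy step entirely? A minimal sketch, assuming the /dbfs FUSE mount is available on the driver (as on most Databricks runtimes); obj is the boto3 object for one zip from the loop above and the target directory is a placeholder:

import io
import zipfile

extract_dir = "/dbfs/tmp/unzipped"   # hypothetical target; /dbfs/... is the DBFS FUSE mount on the driver

# extract straight into DBFS, so no dbutils.fs.cp from the driver's local disk is needed
with io.BytesIO(obj.get()["Body"].read()) as tf:
    with zipfile.ZipFile(tf, mode='r') as zipf:
        zipf.extractall(extract_dir)

# the executors can now read the same path through the dbfs: scheme
df = spark.read.option("header", "true").csv("dbfs:/tmp/unzipped")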
So, my goal is to ingest these zip files from S3 into a Delta table in less time than my current process takes. I suppose I can parallelize some of these steps, such as step 1. I would also like to avoid the copy to DBFS, because I don't need the data there, and I need to remove the CSV files after each ingest into the Delta table so the Driver Node's disk doesn't fill up. Any advice?
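Something like this is what I imagine for parallelizing step 1, but I'm not sure it's the right approach. A minimal sketch, assuming the executors can reach S3 with boto3 (e.g. through an instance profile) and that spark is the usual Databricks session; the staging prefix is a placeholder:

import io
import zipfile
import boto3

zip_urls = ["s3://mybucket/file1.zip", "s3://mybucket/fileN.zip"]  # the full list of zip paths
staging_prefix = "staging/unzipped"                                # hypothetical S3 prefix for the extracted CSVs

def extract_to_s3(url):
    # runs on an executor: download one zip, re-upload its CSVs to the staging prefix in S3
    bucket, key = url.replace("s3://", "").split("/", 1)
    s3 = boto3.resource("s3")  # created inside the task, since boto3 objects are not serializable
    body = s3.Object(bucket, key).get()["Body"].read()
    with zipfile.ZipFile(io.BytesIO(body)) as zipf:
        for name in zipf.namelist():
            if name.lower().endswith(".csv"):
                s3.Object(bucket, "{0}/{1}".format(staging_prefix, name)).put(Body=zipf.read(name))

# one zip per task, spread across the executors instead of the driver
spark.sparkContext.parallelize(zip_urls, len(zip_urls)).foreach(extract_to_s3)

# read every extracted CSV straight from S3 and append to the Delta table
(spark.read.option("header", "true")
      .csv("s3://mybucket/{0}/".format(staging_prefix))
      .write.format("delta").mode("append").save(path))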