
From what I understand, the Bronze table in the Delta Lake (medallion) architecture represents the raw and (more or less) unmodified data in table format. Does this mean that I also shouldn't partition the data for the Bronze table? Partitioning could be seen as something that depends on the use case, which points to the Silver or even Gold table.

Look at this example:

def read():
    # Read the raw, gzipped TSV file as-is: tab-delimited, with a header row
    return spark.read\
        .format("csv")\
        .option("delimiter", "\t")\
        .option("header", True)\
        .load("file.tsv.gz")

table_name = "file"
location = f"/mnt/storage/{table_name}"

# Write the raw data as a partitioned Delta table, then register it in the metastore
read().write.partitionBy("something").format("delta").save(location)

spark.sql(f"CREATE TABLE {table_name} USING DELTA LOCATION '{location}/'")

Notice the partitionBy("something"). Does this belong in a Bronze table?


1 Answer


Generally speaking, I would recommend not partitioning by a predicate in the bronze layer. Instead, use OPTIMIZE to maintain 'right-sized' files for subsequent reads without introducing additional bias into how the data is organized in storage.
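
For instance, a bronze ingest without partitionBy() might look like the following sketch, assuming a Databricks-style environment where spark is already defined; the table name and mount path are carried over from the question, and OPTIMIZE is the file-compaction step mentioned above:

# A minimal sketch of an unpartitioned bronze ingest; the table
# name and storage path are placeholders from the question
bronze_table = "file"
bronze_location = f"/mnt/storage/{bronze_table}"

# Land the raw data as-is: Delta format, no partitionBy()
read().write.format("delta").save(bronze_location)

spark.sql(f"CREATE TABLE {bronze_table} USING DELTA LOCATION '{bronze_location}'")

# Compact small files into 'right-sized' ones without imposing
# any business-logic layout on the data
spark.sql(f"OPTIMIZE {bronze_table}")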

Partitioning and Z-Ordering can speed up reads by improving data skipping. Implicit in your choice of a predicate to partition by, however, is some business logic, and that can introduce a form of bias to your data, with unintended downstream effects in your pipelines. The concept of 'bronze' is simply to land the data in the lake as it is, with as little changed as possible.
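
To make that concrete, here is a sketch of deferring the layout decision to a silver table built from the bronze one; the column names "something" and "event_date" are hypothetical stand-ins for whatever predicate your use case actually filters on:

# Hypothetical silver step: light cleansing plus the layout choice
# that was kept out of bronze; column names are placeholders
silver_table = "file_silver"
silver_location = f"/mnt/storage/{silver_table}"

spark.read.format("delta").load(bronze_location)\
    .dropDuplicates()\
    .write.partitionBy("something")\
    .format("delta")\
    .save(silver_location)

spark.sql(f"CREATE TABLE {silver_table} USING DELTA LOCATION '{silver_location}'")

# Z-Ordering clusters rows within files to improve data skipping
spark.sql(f"OPTIMIZE {silver_table} ZORDER BY (event_date)")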

There are likely exceptions to this, but these are the general considerations.