2
votes

I am trying to set up Delta Lake on S3 using the open-source Delta Lake API. My tables are partitioned by date, and I have to perform merges (a merge may also update old partitions). I am generating manifest files so that I can query the tables with AWS Athena, but when I run the manifest generation method, Delta Lake creates manifest files for all partitions. Is there a way to do incremental manifest generation, i.e. create/update files only for the last updated partitions, or to specify the partitions for which manifest files should be produced?

from delta.tables import DeltaTable

df = spark.read.csv("s3://temp/2020-01-01.csv")
delta_table = DeltaTable.forPath(spark, delta_table_path)

delta_table.alias("source").merge(df.alias("new_data"), condition).whenNotMatchedInsertAll().execute()

delta_table.generate("symlink_format_manifest")
1
Have you tried identifying the updated partitions and then passing the list to replaceWhere, as in .option("replaceWhere", "date = '2017-01-01'"), in an iterative fashion, rather than generating manifest files? - Prabhakar Reddy
I can derive the list of updated partitions; the thing is, I will have to check whether delta_table has the option attribute. I am not sure if you can do this: delta_table.option("replaceWhere", "date = '2017-01-01'").generate("symlink_format_manifest") - priyansh jain
@priyanshjain did you get this working? - Explorer
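A note on the exchange above: replaceWhere is a DataFrameWriter option, not a DeltaTable method, so the chained delta_table.option(...).generate(...) call will not work; it controls which partitions an overwrite replaces, not which manifests are generated. A minimal sketch of building such a predicate from a derived list of updated partitions (the helper name is my own):

```python
def replace_where_predicate(partition_col, values):
    """Build a replaceWhere predicate such as "date IN ('2017-01-01', '2017-01-02')"
    from a list of updated partition values. Helper name is hypothetical."""
    quoted = ", ".join("'%s'" % v for v in values)
    return "%s IN (%s)" % (partition_col, quoted)

# The predicate would then go to a Delta overwrite, e.g.:
# df.write.format("delta").mode("overwrite") \
#   .option("replaceWhere", replace_where_predicate("date", updated_partitions)) \
#   .save(delta_table_path)
```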

1 Answer

0
votes

I was facing the same issue, and running manifest generation on a huge table with tons of partitions was overkill. I was able to resolve it with the following two workarounds.

  1. The easy one: use Spark to create your Delta table in the Hive metastore with a DDL, providing the location of the S3 folder along with TBLPROPERTIES(delta.compatibility.symlinkFormatManifest.enabled=true). Then use Spark to load data into the same location, and this will create/update the manifest file for any partition as soon as its data is appended or overwritten.

spark.sql("""
    CREATE TABLE student (id INT, name STRING, age INT)
    USING delta
    PARTITIONED BY (age)
    LOCATION 's3://path/student'
    TBLPROPERTIES (delta.compatibility.symlinkFormatManifest.enabled = true)
""")

For a new table this should not be a problem; for a table that has already been created, however, the above is a workaround and will require a reload.
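As an aside, if a reload is undesirable, the same property can often be set on the existing table directly (worth verifying that your Delta Lake version supports ALTER TABLE ... SET TBLPROPERTIES on path-based tables):

```python
# Assumption: the Delta Lake version in use supports SET TBLPROPERTIES;
# the path is the example table location from above.
spark.sql("""
    ALTER TABLE delta.`s3://path/student`
    SET TBLPROPERTIES (delta.compatibility.symlinkFormatManifest.enabled = true)
""")
```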

  2. The other (trickier) option I followed: locate the metadata entry in the _delta_log folder (hdfs dfs -cat s3://path/student/_delta_log/*.json | grep 'metadata'). Add the same TBLPROPERTIES under commitInfo --> operationParameters as "properties":"{\"delta.compatibility.symlinkFormatManifest.enabled\":\"true\"}" and under metaData as "configuration":{"delta.compatibility.symlinkFormatManifest.enabled":"true"}. Create a new .json file, name it (last sequence number of the JSON files in _delta_log + 1).json, and move it into _delta_log. From the next load onwards you will see manifest files being created automatically.