0
votes

I have a Delta table, created using Spark 3.x and Delta 0.7.x:

data = spark.range(0, 5)
data.write.format("delta").mode("overwrite").save("tmp/delta-table")
# add some more files
data = spark.range(20, 100)
data.write.format("delta").mode("append").save("tmp/delta-table")

df = spark.read.format("delta").load("tmp/delta-table")
df.show()

Quite a few files are now generated in the table directory (many far-too-small Parquet files):

%ls tmp/delta-table

I want to compact them:

df.createGlobalTempView("my_delta_table")
spark.sql("OPTIMIZE my_delta_table ZORDER BY (id)")

fails with:

ParseException: 
mismatched input 'OPTIMIZE' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)

== SQL ==
OPTIMIZE my_delta_table ZORDER BY (id)
^^^

Question:

  1. How can I get this to work (OPTIMIZE) without the query failing?
  2. Is there a more native API than calling out to text-based SQL?

Notice:

Spark is started like this:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "io.delta:delta-core_2.12:0.7.0") \
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension") \
    .config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog") \
    .getOrCreate()

from delta.tables import *

2 Answers

2
votes

OPTIMIZE is not available in OSS Delta Lake. If you want to compact files, follow the instructions in the Compact files section of the Delta documentation. If you want ZORDER, you currently need to use Databricks Runtime.

0
votes

If you are running Delta locally, you must be using OSS Delta Lake. The OPTIMIZE command is available only in Databricks Delta Lake. For file compaction in OSS Delta, you can follow this: https://docs.delta.io/latest/best-practices.html#compact-files