3 votes

I have an Impala table backed by Parquet files which is used by another team. Every day I run a batch Spark job that overwrites the existing Parquet files (it creates a new data set; the existing files are deleted and new files are created).

Our Spark code looks like this:

    dataset.write.format("parquet").mode("overwrite").save(path)

During this update (overwriting the Parquet data files and then running REFRESH on the Impala table), if someone accesses the table they end up with an error saying the underlying data files are not there.

Is there any solution or workaround available for this issue? I do not want other teams to see the error at any point in time when they access the table.

Maybe I can write the new data files to a different location and then make the Impala table point to that location?

Can you explain a bit more about "override parquet data file"? Are you removing the parquet files first and writing new Parquet data files on the same directory using Spark? – Gomz
@Gomz thanks, edited my question and added more information – Kalaiselvam M
"..they would end up with error.." -- could you add the exact error you're getting when running a query? – mazaneicha

1 Answer

0 votes

The behaviour you are seeing is because of the way Impala is designed to work. Impala fetches a table's metadata, such as the table structure, partition details, and HDFS file paths, from the HMS, and the block details of those HDFS file paths from the NameNode. All of these details are fetched by the Catalog service and distributed to the Impala daemons for query execution.

When the table's underlying files are removed and new files are written outside of Impala, a REFRESH is necessary so that the new file details (the files and their corresponding block details) are fetched and distributed to the daemons. This way Impala becomes aware of the newly written files.
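
For reference, the refresh itself is a single statement issued from impala-shell (or any Impala client); the database and table names below are only placeholders:

    REFRESH my_db.my_table;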

Since you're overwriting the files, Impala queries fail to find the files they are aware of, because those files have already been removed while the new files are still being written. This is expected behaviour.

As a solution, you can do one of the following:

  1. Append the new files to the table's existing HDFS path instead of overwriting. This way, Impala queries against the table still return results; the results only reflect the older data (because Impala is not yet aware of the new files), but the error you mentioned is avoided while the write is in progress. Once the new files are created in the table's directories, you can perform an HDFS operation to remove the old files, followed by an Impala REFRESH statement for the table.

OR

  2. As you said, you can write the new Parquet files to a different HDFS path, and once the write is complete you can either [remove the old files, move the new files into the table's actual HDFS path, and then run a REFRESH] OR [issue an ALTER statement against the table to point its location to the new directory] (a sketch of this ALTER variant is shown after this list). If it's a daily process, you might have to implement this through a script that runs after the Spark write succeeds, taking the new and old directories as arguments.
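
As a rough sketch of the ALTER variant of option 2, assuming the Spark job has just finished writing the new Parquet files to a dated directory (all database, table, and path names below are placeholders, not your actual layout), the swap on the Impala side could look like this:

    -- repoint the table at the directory the Spark job just wrote
    ALTER TABLE my_db.my_table SET LOCATION 'hdfs:///data/my_table/2019_08_01';
    -- pick up the file and block metadata from the new location
    REFRESH my_db.my_table;

Since the old directory is left untouched until the table points to the new one, readers never hit missing-file errors; the old directory can be deleted afterwards with a normal HDFS operation.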

Hope this helps!