I'm writing a big batch job using PySpark that ETLs 200 tables and loads them into Amazon Redshift. All 200 tables are derived from a single input data source, so the batch job succeeds only when data is loaded into ALL 200 tables successfully. The job runs every day, appending each date's data to the tables.
For fault tolerance, reliability, and idempotency, my current workflow is as follows (a code sketch of the commit phase appears after the list):
1. Use staging tables: create a temporary Redshift table with `CREATE TEMP TABLE <staging_table> (LIKE <target_table>)`.
2. Transform and load the data into the staging table.
3. Repeat 1-2 for each of the 200 tables.
4. Start a transaction with `BEGIN`.
5. Copy each staging table's data into its target table using `INSERT INTO <target_table> SELECT * FROM <staging_table>`.
6. `END` the transaction.
7. `DROP` all staging tables.
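For concreteness, here is a minimal sketch of the commit phase (steps 4-7), assuming a psycopg2 connection to Redshift. The `tables` list, staging-table naming, endpoint, and credentials are all placeholders, and the staging tables must have been created and loaded over this same session, since Redshift TEMP tables are session-scoped.

```python
import psycopg2

# Hypothetical list of target tables; in the real job there are 200 of them.
tables = ["table_001", "table_002"]

conn = psycopg2.connect(
    host="redshift-cluster.example.com",  # placeholder endpoint
    port=5439,
    dbname="analytics",                   # placeholder database
    user="etl_user",                      # placeholder credentials
    password="...",
)
conn.autocommit = False  # transaction boundaries are controlled explicitly

try:
    with conn.cursor() as cur:
        # Single transaction: either all INSERTs land, or none do.
        for table in tables:
            cur.execute(f"INSERT INTO {table} SELECT * FROM staging_{table};")
        conn.commit()  # END of the transaction

    # TEMP tables vanish when the session ends anyway, but dropping them
    # explicitly keeps a long-lived connection clean.
    with conn.cursor() as cur:
        for table in tables:
            cur.execute(f"DROP TABLE IF EXISTS staging_{table};")
        conn.commit()
except Exception:
    conn.rollback()  # no partial data reaches the target tables
    raise
finally:
    conn.close()
```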
This way I can guarantee that if Step 3 fails (which is the most probable failure point), I don't have to worry about removing partial data from the target tables. Instead, I simply re-run the entire batch job, since the temporary tables are discarded when the JDBC connection closes.
While this solves most of the problems, it is inelegant, hackish, and consumes extra time. I would like to know whether Spark and/or Redshift provide standard tools to solve this very common problem in the ETL world.
Thanks