0
votes

I have a large table in CloudSQL that needs to be updated every hour, and I'm considering Airflow as a potential solution. What is the best way to update a large amount of data in a CloudSQL database from Airflow?

The constraints are:

  1. The table needs to still be readable while the job is running

  2. The table needs to be writable in case one of the jobs runs over time and two jobs end up running at the same time

Some of the ideas I have:

  1. Load the data that needs to be updated into a pandas DataFrame and run pd.to_sql

  2. Load the data into a CSV in Cloud Storage and execute LOAD DATA LOCAL INFILE

  3. Load the data into memory, break it into chunks, and run a multi-threaded process where each thread updates the table chunk by chunk, using a shared connection pool to prevent exhausting connection limits (a rough sketch of what I mean follows below)
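For option 3, something like the following is what I have in mind; this is only a sketch, and the table, columns, and connection URL are placeholders:

    # Rough sketch of idea 3: chunked updates through a shared connection pool.
    # Table name, column names, and connection URL are placeholders.
    from concurrent.futures import ThreadPoolExecutor

    import sqlalchemy

    # One engine (and therefore one pool) shared by all worker threads.
    engine = sqlalchemy.create_engine(
        "mysql+pymysql://user:password@10.0.0.1/mydb",  # placeholder Cloud SQL URL
        pool_size=5,        # keep well below the instance's connection limit
        max_overflow=0,     # never open more than pool_size connections
    )

    def update_chunk(rows):
        """Apply one chunk of updates inside a single transaction."""
        stmt = sqlalchemy.text(
            "UPDATE my_table SET value = :value WHERE id = :id"  # placeholder schema
        )
        with engine.begin() as conn:
            conn.execute(stmt, rows)  # executemany over the chunk

    def chunked(items, size):
        for i in range(0, len(items), size):
            yield items[i:i + size]

    rows = [{"id": i, "value": i * 10} for i in range(100_000)]  # data loaded in memory

    with ThreadPoolExecutor(max_workers=5) as pool:
        list(pool.map(update_chunk, chunked(rows, 1_000)))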


3 Answers

1
votes

One approach for your reference, based on a PostgreSQL partitioned table; it does require some DDL to define the partitioned table.

Currently, your main constraints are:

  1. The table needs to still be readable while the job is running

This means no table-level lock is allowed.

  2. The table needs to be writable in case one of the jobs runs over time and two jobs end up running at the same time

It should be capable of handling multiple writers at the same time.

I would add one thing for you to consider as well:

  3. Reasonable read performance while writing: performance and user experience are key

A partitioned table can meet all of these requirements, and it is transparent to the client application.

At present you are doing ETL, and you will soon face performance issues as the table grows quickly. A partitioned table is the only practical solution here.

The main steps are:

  1. Create a partitioned table with a partition list.

  2. Normal reading and writing to the table continue as usual.

  3. ETL process (can run in parallel; see the sketch after this list):

    - ETL the data and load it into a new table (very slow, minutes to hours, but with no impact on the main table).

    - Attach the new table to the main table's partition list (very fast, effectively instantaneous, which makes the new data visible in the main table).

  4. Normal reading and writing of the main table continue as usual, now with the new data.
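A minimal sketch of steps 1 and 3, assuming PostgreSQL list partitioning on a batch_id column and execution from Python with psycopg2; the table and column names here are only illustrative:

    # Sketch of the partition workflow, assuming PostgreSQL list partitioning
    # on a batch_id column; table/column names are illustrative only.
    import psycopg2

    conn = psycopg2.connect("host=10.0.0.1 dbname=mydb user=etl password=secret")
    conn.autocommit = True

    with conn.cursor() as cur:
        # Step 1: the main table is declared partitioned (done once).
        cur.execute("""
            CREATE TABLE IF NOT EXISTS main_table (
                id        bigint,
                batch_id  int,
                payload   text
            ) PARTITION BY LIST (batch_id);
        """)

        # Step 3a: slow ETL into a standalone staging table; readers of
        # main_table are not affected while this runs.
        cur.execute("CREATE TABLE staging_batch_42 (LIKE main_table INCLUDING ALL);")
        # ... bulk-load staging_batch_42 here (COPY, to_sql, etc.) ...

        # Step 3b: attaching the staging table as a partition is a fast
        # metadata operation, after which the new rows are visible in main_table.
        cur.execute("""
            ALTER TABLE main_table
            ATTACH PARTITION staging_batch_42 FOR VALUES IN (42);
        """)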

If you like the answer, please vote it up.

Best Regards, WY

0
votes

My recent Airflow-related ETL project could be a reference for you.

  • Input DB: large (billion-row-level Oracle)
  • Interim store: medium (tens-of-millions-level HDF5 file)
  • Output DB: medium (tens-of-millions-level MySQL)

In my experience, writing to the database is the main bottleneck in such an ETL process, so:

  • For the interim stage, I use HDF5 as the interim store for data transforming. The pandas to_hdf function provides seconds-level performance on large data; in my case, writing 20 million rows to HDF5 took less than 3 minutes. Below is the performance benchmarking for pandas IO; the HDF5 format is among the top-3 fastest and most popular formats. https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-perf

  • For the output stage, I use to_sql with the chunksize parameter. In order to speed up to_sql, you have to manually map the column types to the database column types and lengths, especially for string/varchar columns. Without manual mapping, to_sql maps strings to a BLOB/TEXT type or a generic varchar, and that default mode is about 10 times slower than the manually mapped mode. In total, writing 20 million rows to the database via to_sql (chunksize mode) took about 20 minutes. A sketch of both stages follows below.
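A rough sketch of the interim and output stages, assuming pandas with PyTables installed and a MySQL connection via SQLAlchemy; the file path, table name, columns, and connection URL are placeholders:

    # Sketch of the interim (HDF5) and output (to_sql) stages;
    # file path, table name, columns, and connection URL are placeholders.
    import pandas as pd
    import sqlalchemy
    from sqlalchemy import types

    df = pd.DataFrame({
        "id": range(1_000_000),
        "name": ["example"] * 1_000_000,
        "amount": [1.5] * 1_000_000,
    })

    # Interim stage: dump the transformed frame to HDF5 (fast local staging).
    df.to_hdf("staging.h5", key="batch", mode="w")

    # Output stage: write to MySQL in chunks, with explicit column types so
    # strings become a sized VARCHAR instead of the slower default mapping.
    engine = sqlalchemy.create_engine("mysql+pymysql://user:password@10.0.0.1/mydb")
    df.to_sql(
        "output_table",
        engine,
        if_exists="append",
        index=False,
        chunksize=10_000,
        dtype={
            "id": types.BigInteger(),
            "name": types.VARCHAR(length=64),
            "amount": types.Float(),
        },
    )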

If you like the answer, please vote it up.

0
votes

A crucial step to consider while setting up your workflow is to always use good connection management practices to minimize your application's footprint and reduce the likelihood of exceeding Cloud SQL connection limits. Database connections consume resources on both the server and the connecting application; one common pattern is to share a single pooled connection source, as sketched below.
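A minimal sketch of that pattern using SQLAlchemy pooling; the connection URL and pool sizes are placeholder assumptions to adapt to your instance's limits:

    # Sketch of connection management with SQLAlchemy pooling;
    # connection URL and pool sizes are placeholders.
    import sqlalchemy

    engine = sqlalchemy.create_engine(
        "mysql+pymysql://user:password@10.0.0.1/mydb",
        pool_size=5,          # steady-state connections kept open
        max_overflow=2,       # temporary extra connections under load
        pool_timeout=30,      # seconds to wait for a free connection
        pool_recycle=1800,    # recycle connections before the server drops them
        pool_pre_ping=True,   # check a connection is alive before using it
    )

    # Borrow a connection only for as long as you need it, then return it to the pool.
    with engine.connect() as conn:
        conn.execute(sqlalchemy.text("SELECT 1"))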

Cloud Composer has no limitations when it comes to your ability to interface with Cloud SQL. Therefore, either of the first two options is good.

A Python dependency is installable if it has no external dependencies and does not conflict with Composer’s dependencies. In addition, 14262433 explicitly explains the process of setting up a "Large data" workflow using Pandas.

LOAD DATA LOCAL INFILE requires you to use the --local-infile option for the mysql client. To import data into Cloud SQL, make sure to follow the best practices. A sketch of how that statement can be issued from Python follows below.
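As a rough illustration of option 2, assuming the CSV has already been downloaded from Cloud Storage to local disk, the same statement can be issued from a client that enables local infile, such as PyMySQL; host, table, and file names are placeholders:

    # Sketch of LOAD DATA LOCAL INFILE from Python via PyMySQL;
    # host, credentials, table, and file path are placeholders.
    import pymysql

    # local_infile must be enabled on the client side for LOAD DATA LOCAL INFILE.
    conn = pymysql.connect(
        host="10.0.0.1",
        user="user",
        password="password",
        database="mydb",
        local_infile=True,
    )

    try:
        with conn.cursor() as cur:
            cur.execute("""
                LOAD DATA LOCAL INFILE '/tmp/export.csv'
                INTO TABLE my_table
                FIELDS TERMINATED BY ','
                OPTIONALLY ENCLOSED BY '"'
                LINES TERMINATED BY '\\n'
                IGNORE 1 LINES
            """)
        conn.commit()
    finally:
        conn.close()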