1 vote

I am loading a batch CSV file into Postgres using Python (say Table A). I am using pandas to upload the data in chunks, which is quite fast.

for chunk in pd.read_csv(csv_file, sep='|', chunksize=chunk_size, low_memory=False):
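
For context, a full chunked load into Table A might look something like the sketch below; the SQLAlchemy engine, connection string, and table name table_a are assumptions for illustration, not details from the post.

import pandas as pd
from sqlalchemy import create_engine

# assumed connection string and target table name (placeholders)
engine = create_engine("postgresql+psycopg2://user:password@localhost:5432/mydb")

csv_file = "table_a.csv"   # placeholder path
chunk_size = 100000

for chunk in pd.read_csv(csv_file, sep='|', chunksize=chunk_size, low_memory=False):
    # append each chunk to Table A; method='multi' sends multi-row INSERTs
    chunk.to_sql('table_a', engine, if_exists='append', index=False, method='multi')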

Now I want to update another table (say Table B) from Table A based on the following rules:

  • if there are any new records in Table A that are not in Table B, then insert them as new records in Table B (based on the Id field)
  • if the values change in Table A for an ID that already exists in Table B, then update the records in Table B from Table A (there are several tables that I need to update based on Table A)

I am able to do that using the code below and then looping through each row, but Table A always has around 1,825,172 records and it becomes extremely slow. Can any forum member help speed this up or suggest an alternative approach to achieve the same?

cursor.execute(sql)
records = cursor.fetchall()

for row in records:
    id = 0 if row[0] is None else row[0]   # used to match with Table B and decide insert vs. update
    id2 = 0 if row[1] is None else row[1]
    id3 = 0 if row[2] is None else row[2]

2 Answers

2 votes

You could leverage Postgres's upsert syntax (INSERT ... ON CONFLICT DO UPDATE), where excluded refers to the row that was proposed for insertion:

insert into tableB as tb (id, col1, col2)
select ta.id, ta.col1, ta.col2 from tableA ta
on conflict (id) do update
    set col1 = excluded.col1, col2 = excluded.col2
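
A minimal sketch of running this upsert from the existing Python script with psycopg2 (the connection string is a placeholder; tableA, tableB, and the columns follow the SQL above) could be:

import psycopg2

# placeholder connection details
conn = psycopg2.connect("dbname=mydb user=postgres host=localhost")

upsert_sql = """
    insert into tableB as tb (id, col1, col2)
    select ta.id, ta.col1, ta.col2 from tableA ta
    on conflict (id) do update
        set col1 = excluded.col1, col2 = excluded.col2
"""

with conn, conn.cursor() as cur:
    cur.execute(upsert_sql)   # one set-based statement replaces the per-row Python loop
conn.close()

Because this runs as a single SQL statement inside Postgres, the 1.8M-row loop in Python disappears entirely.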
0 votes

You should do this completely inside the DBMS instead of looping through the records in your Python script. That allows the DBMS to optimize the whole operation.

UPDATE TableB
SET    x = TableA.y
FROM   TableA
WHERE  TableA.id = TableB.id;

INSERT INTO TableB (id, x)
SELECT id, y
FROM   TableA
WHERE  TableA.id NOT IN ( SELECT id FROM TableB );
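
If you drive this from Python, a sketch like the following (placeholder connection string; column names x and y follow the SQL above) runs both statements in one transaction, so the INSERT sees the state left by the UPDATE:

import psycopg2

# placeholder connection details
conn = psycopg2.connect("dbname=mydb user=postgres host=localhost")

update_sql = """
    UPDATE TableB
    SET    x = TableA.y
    FROM   TableA
    WHERE  TableA.id = TableB.id
"""
insert_sql = """
    INSERT INTO TableB (id, x)
    SELECT id, y
    FROM   TableA
    WHERE  TableA.id NOT IN ( SELECT id FROM TableB )
"""

with conn, conn.cursor() as cur:   # the with-block commits both statements together
    cur.execute(update_sql)
    cur.execute(insert_sql)
conn.close()

If id can ever be NULL in TableB, prefer NOT EXISTS over NOT IN, since NOT IN returns no rows when the subquery contains a NULL.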