1
votes

I have a CSV file that has all entries quoted i.e. with opening and closing quotes. When I import to the database using copy_from, the database table contains quotes on the data and where there is an empty entry I get quotes only i.e. "" entries in the column as seen below

[Bad Quoted text[1]

Is there a way to tell copy_from to ignore quotes so that when I import the file the text doesn't have quotes around it and empty entries are converted to Null as below?

Expected data entries

Here is my code:

with open(source_file_path) as inf:
  cursor.copy_from(inf, table_name, columns=column_list, sep=',', null="None")

UPDATE:

I still haven't got a solution to the above but for the sake of getting the file imported I went ahead and wrote the raw SQL code and executed it in SQLAlchemy connection and Pyscopg2's cursor as below and they both removed quotes and put Null where there were empty entries.

sql = "COPY table_name (col1, col2, col3, col4) FROM '{}' DELIMITER ',' CSV HEADER".format(csv_file_path)

SQL Alchemy:

conn = engine.connect()
trans = conn.begin()
conn.execute(sql)
trans.commit()
conn.close()

Psycopg2:

conn = psycopg2.connect(pg_conn_string)
conn.set_isolation_level(0)
cursor = conn.cursor(cursor_factory=psycopg2.extras.DictCursor)
cursor = conn.cursor()
cursor.execute(sql)

While still wishing the copy_from function would work, now am wondering if the above two are equally as fast as copy_from and if so, which of the two is faster?

1

1 Answers

0
votes

Probably a better approach would be to use the built-in CSV library to read the CSV file and transfer the rows to the database. Corollary to the UNIX philosophy of "do one thing and do it well" is to use the appropriate tool (the one that's specialized) for the job. What's good with the CSV library is that you have customization options on how to read the CSV like quoting characters and skipping initial rows (see documentation).

Assuming a simple CSV file with two columns: an integer "ID", and a quoted string "Country Code":

"ID", "Country Code"
1, "US"
2, "UK"

and a declarative SQLAlchemy target table:

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.ext.declarative import declarative_base


engine = create_engine("postgresql+psycopg2://<REMAINDER_OF_YOUR_ENGINE_STRING>")
Base = declarative_base(bind=engine)

class CountryTable(Base):
    __tablename__ = 'countries'

    id = Column(Integer, primary_key=True)
    country = Column(String)

you can transfer the data by:

import csv

from sqlalchemy.orm import sessionmaker

from your_model_module import engine, CountryTable


Session = sessionmaker(bind=engine)
with open("path_to_your.csv", "rb") as f:
    reader = csv.DictReader(f)
    session = Session()
    for row in reader:
        country_record = CountryTable(id=row["ID"], country=row["Country Code"])
        session.add(country_record)
        session.commit()
    session.close()

This solution is longer than a one line .copy_from method but it gives you better control without having to dig through the code/understanding the documentation of wrapper or convenience functions like .copy_from. You can specify selected columns to be transferred and handle exceptions at row level since data is transferred row by row with a commit. Rows can be transferred in a batch with a single commit through:

with open("path_to_your.csv", "rb") as f:
    reader = csv.DictReader(f)
    session = Session()
    session.add_all([
        CountryTable(id=row["ID"], country=row["Country Code"]) for row in reader
        ])
    session.commit()
    session.close()    

To compare the execution time of different approaches to your problem, use the timeit module (or rather the commandline command) that comes with Python. Caution however: it's better to be correct than fast.


EDIT:

I was trying to figure out where .copy_from is coded as I haven't used it before. It turns out to be a psycopg2 specific convenience function. It does not 100% support reading CSV files but only file like objects. The only customization argument applicable to CSVs is the separator. It does not understand quoting characters.