0
votes

I am using spark-csv to convert a dataframe to CSV before I load it into a Redshift table. spark-csv adds a backslash to escape the double quotes, which is what I would expect. But when the COPY command tries to load the file into Redshift, it adds a backslash to the backslash. How can I prevent this?

Here is a sample code:

df = sqlContext.createDataFrame([("1A", 123456, "first example"), ("1B", 123457, "It's a \"problem\"")], ['id', 'time', 'xstr'])

+---+------+----------------+
| id|  time|            xstr|
+---+------+----------------+
| 1A|123456|   first example|
| 1B|123457|It's a "problem"|
+---+------+----------------+

df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "false")\
.option("quoteMode", "NON_NUMERIC")\
.save(fname)
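The mismatch between the two quoting conventions can be reproduced with plain Python's `csv` module (a standalone illustration, not the Spark code itself): spark-csv's default is to escape embedded double quotes with a backslash, while `FORMAT CSV` expects the RFC 4180 convention of doubling them.

```python
import csv
import io

# The problem row from the question, serialized under both conventions.
row = ["1B", "123457", 'It\'s a "problem"']

# spark-csv's default: escape embedded quotes with a backslash.
backslash_style = io.StringIO()
csv.writer(backslash_style, quoting=csv.QUOTE_ALL,
           doublequote=False, escapechar="\\").writerow(row)

# RFC 4180 style: double the embedded quotes (what FORMAT CSV expects).
rfc4180_style = io.StringIO()
csv.writer(rfc4180_style, quoting=csv.QUOTE_ALL,
           doublequote=True).writerow(row)

print(backslash_style.getvalue().strip())  # "1B","123457","It's a \"problem\""
print(rfc4180_style.getvalue().strip())    # "1B","123457","It's a ""problem"""
```

A loader that only understands the second form will treat the backslash in the first form as a literal character, which is the corruption showing up in STL_LOAD_ERRORS.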

RS Copy command: COPY xxxx FROM YYYY delimiter ',' FORMAT CSV

It errors out and the STL_LOAD_ERRORS table contains this:

1B,123457,"It's a \\"problem\\""    

How do I prevent redshift from escaping an escape character?

1
BTW, you probably also want to use REMOVEQUOTES to take the outer quotes off your text strings. - John Rotenstein
Good catch. Thank you. - Arvind Kandaswamy

1 Answer

1
votes

Try including the ESCAPE keyword in your COPY statement, e.g.

COPY xxxx FROM YYYY 
delimiter ',' 
FORMAT CSV
ESCAPE
;
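The effect of telling the loader about the backslash can be sketched with Python's `csv` module (an approximation of what `ESCAPE` does, not Redshift's actual parser): once the reader knows the backslash is an escape character, the original field value comes back intact.

```python
import csv
import io

# A line in the shape the Spark write step produces (backslash-escaped quotes).
line = '"1B","123457","It\'s a \\"problem\\""\n'

# Declaring the backslash as the escape character lets the parser
# recover the embedded quotes instead of treating them literally.
escaped = next(csv.reader(io.StringIO(line),
                          doublequote=False, escapechar="\\"))

print(escaped[2])  # It's a "problem"
```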