11
votes

I am trying to write a DataFrame as a CSV file using Spark-CSV (https://github.com/databricks/spark-csv).

I am using the command below:

res1.write.option("quoteMode", "NONE").format("com.databricks.spark.csv").save("File")

But my CSV file is always written as

"London"
"Copenhagen"
"Moscow"

instead of

London
Copenhagen
Moscow

7 Answers

17
votes

Yes. To turn off the default escaping of the double quote character (") with the backslash character (\), add an .option() call with the right parameters after .write. The goal of this option is to change how the csv() method "finds" instances of the "quote" character. To do this, change what a "quote" actually means: replace the double quote character (") with the Unicode "\u0000" character (the NUL character, which should never occur within a well-formed JSON document).

val dataFrame =
  spark.sql("SELECT * FROM some_table_with_a_json_column")
val unitEmitCsv =
  dataFrame
    .write
    .option("header", true)
    .option("delimiter", "\t")
    .option("quote", "\u0000") // the magic happens here: NUL replaces " as the quote character
    .csv("/FileStore/temp.tsv")

This was only one of several lessons I learned attempting to work with Apache Spark and emitting .csv files. For more information and context on this, please see the blog post I wrote titled "Example Apache Spark ETL Pipeline Integrating a SaaS".

5
votes

The double quoting of the text can be removed by setting the quoteAll option to false:

dataframe.write
 .option("quoteAll", "false")
 .format("csv")
 .save("File") // output path taken from the question

This example is for Spark 2.1.0, without using the Databricks library.

3
votes

If your DataFrame has a single string column, you can write it out as a text file directly.

df.coalesce(1).map({ k:Row => k(0).toString}).toJavaRDD.saveAsTextFile("File")

If you have multiple columns, you can combine them into a single string before writing to the output file.

The other answers given may result in unwanted null or space characters being emitted in your output file.
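
For the multi-column case, a minimal sketch of the combining step could look like the following. It assumes Spark 2.x or later (for SparkSession), uses concat_ws to join the columns and the text writer to emit each row as-is. The object name, the joinColumns helper, and the sample data are hypothetical; the output path "File" is the one from the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws}

object WriteJoinedColumns {
  // Hypothetical helper: cast every column to string and join the values with a separator.
  // Note: concat_ws skips null values instead of writing an empty field for them.
  def joinColumns(df: DataFrame, sep: String = ","): DataFrame =
    df.select(concat_ws(sep, df.columns.map(c => col(c).cast("string")): _*).as("value"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("write-joined-columns")
      .master("local[*]") // assumption: local run for illustration
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the real DataFrame.
    val df = Seq(("London", 1), ("Copenhagen", 2), ("Moscow", 3)).toDF("city", "id")

    // The text writer emits each row verbatim, with no quoting or escaping.
    joinColumns(df).coalesce(1).write.mode("overwrite").text("File")

    spark.stop()
  }
}
```

Because the text writer applies no quoting at all, a value that contains the separator is written unchanged, so whoever reads the file back has to know that the separator never occurs in the data.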

2
votes

Use the emptyValue option:

.option("emptyValue", "")

This is available in Spark 2.4+.

0
votes

This problem bothered me for a long time until I read this: Adding custom Delimiter adds double quotes in the final spark data frame CSV output

This is a standard CSV feature. If there's an occurrence of delimiter in the actual data (referred to as Delimiter Collision), the field is enclosed in quotes. You can try df.write.option("delimiter" , somechar) where somechar should be a character that doesn't occur in your data.

You can simply concatenate multiple columns into one and use a delimiter that does not occur in your data.
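
To make the delimiter-collision rule concrete, here is a small plain-Scala sketch of minimal quoting. It is a simplified model in the style of RFC 4180, not Spark's actual writer code, and the names MinimalQuoting, formatField and formatRow are made up for illustration.

```scala
// Simplified model of minimal CSV quoting; this is not Spark's actual writer code.
object MinimalQuoting {
  // Quote a field only when it contains the delimiter, the quote character, or a line break.
  def formatField(field: String, delimiter: Char, quote: Char = '"'): String = {
    val needsQuoting =
      field.exists(c => c == delimiter || c == quote || c == '\n' || c == '\r')
    if (needsQuoting) {
      // Embedded quote characters are doubled, as in RFC 4180.
      val escaped = field.replace(quote.toString, quote.toString * 2)
      s"$quote$escaped$quote"
    } else field
  }

  // Format every field and join the results with the delimiter.
  def formatRow(fields: Seq[String], delimiter: Char): String =
    fields.map(formatField(_, delimiter)).mkString(delimiter.toString)

  def main(args: Array[String]): Unit = {
    val row = Seq("London", "Moscow, Russia")
    println(formatRow(row, ',')) // second field contains the delimiter, so it is quoted
    println(formatRow(row, '|')) // no collision, so no quotes are written
  }
}
```

With a comma as the delimiter, only the field containing a comma gets quotes; switching to a delimiter that is absent from the data leaves every field unquoted.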

0
votes

I have been confused by similar situations. I eventually found that the sep parameter can change the result. You can try this:

df.write.mode("overwrite").option("sep","\t").csv(path)

-3
votes

I was able to turn that off by setting the quote option to a single whitespace character:

df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").option("quote"," ").option("codec", "org.apache.hadoop.io.compress.GzipCodec").save("File path")

However, this only replaces the quote character: a space is written in place of the double quote (").

There is one more option. Quotes generally appear as a qualifier around a column value when that value contains the delimiter character,

so you can change the delimiter and get rid of the quotes automatically:

df.coalesce(1).write.format("com.databricks.spark.csv").option("header", "true").option("delimiter", "|").option("codec", "org.apache.hadoop.io.compress.GzipCodec").save("File path")

I hope this works in your case.