3
votes

I'm trying to read a CSV file that uses backslash to escape delimiters instead of using quotes. I've tried constructing the DataFrameReader without quotes and with an escape character, but it doesn't work. It seems the "escape" option can only be used to escape quote characters. Is there any way around this other than creating a custom input format?

Here are the options that I'm using for now:

  spark.read.options(Map(
    "sep" -> ",",
    "encoding" -> "utf-8",
    "quote" -> "",
    "escape" -> "\\",
    "mode" -> "PERMISSIVE",
    "nullValue" -> ""
  ))

For example, let's say we have the following sample data:

Schema: Name, City

    Joe Bloggs,Dublin\,Ireland
    Joseph Smith,Salt Lake City\,\
    Utah

That should return 2 records:

Name         | City
-------------|----------------
Joe Bloggs   | Dublin,Ireland
Joseph Smith | Salt Lake City,
             | Utah

Being able to escape newlines would be a nice-to-have, but escaping the column delimiter is required. For now I'm thinking about reading the lines with spark.textFile, then using some CSV library to parse the individual lines. That will fix my escaped column delimiter problem, but not escaped row delimiters.
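Something like this minimal sketch is what I have in mind, with a plain regex split on unescaped commas standing in for a real CSV library (the path and column names are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("escaped-csv").getOrCreate()
    import spark.implicits._

    // Read raw lines, split on commas that are NOT preceded by a backslash,
    // then strip the escape character from the values.
    val parsed = spark.sparkContext
      .textFile("/path/to/data.csv")                      // placeholder path
      .map { line =>
        val cols = line.split("""(?<!\\),""", -1).map(_.replace("\\,", ","))
        (cols(0), cols(1))
      }
      .toDF("Name", "City")

This would handle the escaped column delimiter, but not the escaped row delimiter, since textFile has already split the file into lines by then.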

2
I think you are right, please check – Ram Ghadiyaram
Spark 2.0 actually folds the databricks csv InputFormat into DataFrameReader. I haven't tried reverting to the databricks version yet, but nothing I've seen so far suggests that it would behave any differently. – Paul Zaczkiewicz

2 Answers

1
votes

It seems like this is not supported in the CSV reader (see https://github.com/databricks/spark-csv/issues/390).

I'm going to guess that the easiest way around this is to parse your rows manually; not at all ideal but still functional and not too hard.

You can split your lines using a negative lookbehind regex such as (?<!\\), which matches any comma not preceded by a backslash.
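As a quick sanity check on the first sample row (plain Scala, un-escaping the commas after the split):

    val line = """Joe Bloggs,Dublin\,Ireland"""
    val cols = line.split("""(?<!\\),""", -1).map(_.replace("\\,", ","))
    // cols: Array("Joe Bloggs", "Dublin,Ireland")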

0
votes

I am also getting the same issue with Spark 2.3. But when I tried Spark 1.6, which uses Apache commons-csv by default for parsing CSV, it was parsed fine with option("escape","\\"). When I used option("parserLib","univocity") in Spark 1.6 it started giving an error, so my understanding is that univocity is not able to handle it.

In Spark 2 the CSV parser is univocity. I was not able to use the "commons" parserLib in Spark 2.
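For reference, a rough sketch of the Spark 1.6 read I mean, via the databricks spark-csv package (the path is a placeholder):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc: existing SparkContext

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("escape", "\\")              // backslash escapes the delimiter
      // .option("parserLib", "univocity") // this is the variant that failed for me
      .load("/path/to/data.csv")           // placeholder path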