3
votes

I'm trying to read a CSV file that uses backslash to escape delimiters instead of using quotes. I've tried constructing the DataFrameReader without quotes and with an escape character, but it doesn't work. It seems the "escape" option can only be used to escape quote characters. Is there any way around this other than creating a custom input format?

Here are the options that I'm using for now:

  spark.read.options(Map(
    "sep" -> ",",
    "encoding" -> "utf-8",
    "quote" -> "",
    "escape" -> "\\",
    "mode" -> "PERMISSIVE",
    "nullValue" -> ""
  ))

For example, let's say we have the following sample data:

Schema: Name, City

    Joe Bloggs,Dublin\,Ireland
    Joseph Smith,Salt Lake City\,\
    Utah

That should return 2 records:

Name         | City
-------------|----------------
Joe Bloggs   | Dublin,Ireland
Joseph Smith | Salt Lake City,
             | Utah

Being able to escape newlines would be a nice-to-have, but escaping the column delimiter is required. For now I'm thinking about reading the lines with spark.textFile, then using some CSV library to parse the individual lines. That will fix my escaped column delimiter problem, but not escaped row delimiters.
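Something like this minimal sketch is what I have in mind, with a plain regex split on unescaped commas standing in for a real CSV library (the path and column names are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("escaped-csv").getOrCreate()
    import spark.implicits._

    // Read raw lines, split on commas that are NOT preceded by a backslash,
    // then strip the escape character from the values.
    val parsed = spark.sparkContext
      .textFile("/path/to/data.csv")                      // placeholder path
      .map { line =>
        val cols = line.split("""(?<!\\),""", -1).map(_.replace("\\,", ","))
        (cols(0), cols(1))
      }
      .toDF("Name", "City")

This would handle the escaped column delimiter, but not the escaped row delimiter, since textFile has already split the file into lines by then.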

2
I think you are right, please check – Ram Ghadiyaram
Spark 2.0 actually folds the databricks csv InputFormat into DataFrameReader. I haven't tried reverting to the databricks version yet, but nothing I've seen so far suggests that it would behave any differently. – Paul Zaczkiewicz

2 Answers

1
votes

It seems like this is not supported in the CSV reader (see https://github.com/databricks/spark-csv/issues/390).

I'm going to guess that the easiest way around this is to parse your rows manually; not at all ideal but still functional and not too hard.

You can split your lines using a negative lookbehind regex such as (?<!\\), which matches any comma not preceded by a backslash.
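As a quick sanity check on the first sample row (plain Scala, un-escaping the commas after the split):

    val line = """Joe Bloggs,Dublin\,Ireland"""
    val cols = line.split("""(?<!\\),""", -1).map(_.replace("\\,", ","))
    // cols: Array("Joe Bloggs", "Dublin,Ireland")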

0
votes

I am also getting the same issue with Spark 2.3. But when I tried Spark 1.6, which uses Apache commons-csv by default for parsing CSV, it was parsed fine with option("escape","\\"). When I used option("parserLib","univocity") in Spark 1.6 it started giving an error, so my understanding is that univocity is not able to handle it.

In Spark 2 the CSV parser is univocity. I was not able to use the "commons" parserLib in Spark 2.
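For reference, a rough sketch of the Spark 1.6 read I mean, via the databricks spark-csv package (the path is a placeholder):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc: existing SparkContext

    val df = sqlContext.read
      .format("com.databricks.spark.csv")
      .option("escape", "\\")              // backslash escapes the delimiter
      // .option("parserLib", "univocity") // this is the variant that failed for me
      .load("/path/to/data.csv")           // placeholder path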