0 votes

I would like to create a Spark DataFrame (without double quotes) by reading input from a CSV file, as shown below.

(screenshot of the input CSV, with the values wrapped in quotes)

Here is my code, but it has not worked so far.

val empDF = spark.read.format("com.databricks.spark.csv")
  .option("header", "true")
  .option("inferSchema", "true")
  .option("quote", "\"")
  .option("escape", "\"")
  .load("EmpWithQuotes.csv")
  .toDF()

My expected output should not contain the double quotes, but instead I am getting junk characters:

+---+-----+----------+----+
|eno|ename|      eloc|esal|
+---+-----+----------+----+
| 11|�abx�| �chennai�|1000|
| 22|�abr�|     �hyd�|3000|
+---+-----+----------+----+
is it possible to post the exact data instead of an image? – Srinivas

3 Answers

0 votes

I tried this with Spark in Scala and it removed the quotes from the columns:

import org.apache.spark.sql.functions.{col, regexp_replace}

val cleanedDF = df
  .withColumn("ename", regexp_replace(col("ename"), "“", ""))
  .withColumn("eloc", regexp_replace(col("eloc"), "“", ""))
  .withColumn("ename", regexp_replace(col("ename"), "”", ""))
  .withColumn("eloc", regexp_replace(col("eloc"), "”", ""))

There must be something similar in Spark's Python API too.

0 votes

It looks like they are not normal double quotes. You could try to find out which character it actually is and escape it, or, if you are confident every row has a leading and a trailing quote, take the substring (using the SQL expression form here, since the Scala substring function only accepts literal positions):

import org.apache.spark.sql.functions.expr
empDF.withColumn("ename", expr("substring(ename, 2, length(ename) - 2)"))
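
To find out which character it actually is, one option is to print the Unicode code points of a sample value; a minimal sketch, assuming the empDF from the question (typographic quotes would show up as U+201C and U+201D):

// Print each character of the first ename value together with its code point
val sample = empDF.select("ename").head().getString(0)
sample.foreach(c => println(f"'$c' -> U+${c.toInt}%04X"))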
0 votes

If you use Spark's built-in csv format instead of com.databricks.spark.csv, it should work as expected:

import org.apache.spark.sql.functions._

object EscapeQuotes {
  def main(args: Array[String]): Unit = {
    val spark = Constant.getSparkSess
    // Match either the opening or the closing curly quote
    val pattern = "“|”"
    spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .option("quote", "\"")
      .option("escape", "\"")
      .csv("src/main/resources/sample.csv")
      .withColumn("eloc", regexp_replace(col("eloc"), pattern, ""))
      .withColumn("ename", regexp_replace(col("ename"), pattern, ""))
      .show()
  }
}
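
Constant.getSparkSess appears to be a custom helper; for a self-contained run, a plain SparkSession could be used instead (a minimal sketch):

import org.apache.spark.sql.SparkSession

// Local session, equivalent in spirit to the helper used above
val spark = SparkSession.builder()
  .appName("EscapeQuotes")
  .master("local[*]")
  .getOrCreate()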