
Hello, I am using Spark DataFrames and Scala for some data processing. I have a requirement where I need to read multiple columns of the same data type (struct type, in my case) from a parquet file, and then create a new dataframe whose schema matches the struct's fields (field1, field2, and field3), populated with the data from all of those columns. An example is shown below.

E.g., suppose I have 3 columns:

a)column1: struct (nullable = true)
     |-- field1: string (nullable = true)
     |-- field2: string (nullable = true)
     |-- field3: string (nullable = true)

b)column2: struct (nullable = true)
     |-- field1: string (nullable = true)
     |-- field2: string (nullable = true)
     |-- field3: string (nullable = true)

c)column3: struct (nullable = true)
     |-- field1: string (nullable = true)
     |-- field2: string (nullable = true)
     |-- field3: string (nullable = true)

I am able to read all the values from the columns using the code snippet below:

dataframe.select("column1","column2","column3")

The above code returns Row objects:

[[column1field1,column1field2,column1field3],null,null]
[null,[column2field1,column2field2,column2field3],null]
[null,null,[column3field1,column3field2,column3field3]]
[[column1field1,column1field2,some record, with multiple,separator],null,null]

The concern here is that I am able to read the values from the Row object using a "," separator and populate the dataframe with 3 fields, but because the fields are strings, there are records in the parquet where the string data itself contains multiple "," characters, as shown in the last Row object above. That breaks the dataframe schema: splitting the Row object's values on "," gives me more than 3 fields. How can I get rid of this error? Is there any provision in Spark to change the separator of the Row's array values to fix this?

Thanks in advance.
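To make the concern concrete, here is a minimal sketch of the parsing described above (the Row literal and variable names are hypothetical test data, not taken from the actual parquet file):

import org.apache.spark.sql.Row

// Hypothetical Row standing in for one result of dataframe.select(...).
val row = Row("column1field1", "column1field2", "some record, with multiple,separator")

// Naive approach described in the question: serialize the Row and split on ",".
// The embedded commas in the last field cause over-splitting.
val naive = row.mkString(",").split(",")
println(naive.length) // 5, not the expected 3

// Reading fields positionally from the Row avoids the separator entirely,
// so commas inside the string data never matter.
val safe = (0 until row.length).map(row.getString)
println(safe.length) // 3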

I'm not sure I see where the error is, nor what you are asking for. – eliasah
Yes eliasah, I put a wrong title, not relevant to this error; I will change it. But I hope you understand the issue. – nilesh1212

1 Answer


Yes, you can load with a different separator, like:

sqlContext.load("com.databricks.spark.csv", yourSchema, Map("path" -> yourDataPath, "header" -> "false", "delimiter" -> "^"))

OR

sqlContext.read.format("com.databricks.spark.csv").schema(yourSchema).options(Map("path" -> yourDataPath, "header" -> "false", "delimiter" -> "^")).load()

depending on which version of Spark you're using.
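Putting the second form together as a runnable sketch (yourSchema and yourDataPath are placeholders to adjust for your data, and sqlContext is assumed to be the one provided by spark-shell on Spark 1.4+):

import org.apache.spark.sql.types.{StructField, StringType, StructType}

// Hypothetical schema with the three string fields; adjust to your data.
val yourSchema = StructType(Seq(
  StructField("field1", StringType, nullable = true),
  StructField("field2", StringType, nullable = true),
  StructField("field3", StringType, nullable = true)
))

val yourDataPath = "/path/to/your/data" // placeholder

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .schema(yourSchema)
  .option("header", "false")
  .option("delimiter", "^")
  .load(yourDataPath)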

As for the delimiters in your strings, you either need to escape them before loading with the ',' delimiter or use a different delimiter.
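If escaping is an option, spark-csv can also keep "," as the delimiter, provided that comma-containing fields are quoted in the file; the library's "quote" option defaults to the double-quote character. A sketch under that assumption, reusing yourSchema and yourDataPath from above:

// A line like the following parses into exactly 3 fields because the
// middle field is quoted:
//   column1field1,"some record, with multiple,separator",column1field3
val dfQuoted = sqlContext.read
  .format("com.databricks.spark.csv")
  .schema(yourSchema)
  .option("header", "false")
  .option("delimiter", ",")
  .option("quote", "\"") // the default; shown here for clarity
  .load(yourDataPath)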