2 votes

I'm trying to read data from a JSON file that has an array of lat/long values, something like [48.597315,-43.206085], and I want to parse them in Spark SQL as a single string. Is there a way I can do that?

My JSON input looks something like this:

{"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}

I'm trying to push this to an RDBMS store, and when I try to cast position.coordinates to string it gives me

Can't get JDBC type for array<string> 

since the destination datatype is nvarchar. Any help is appreciated!
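
For context, the failing write looks roughly like the sketch below; the connection details are hypothetical placeholders:

// hypothetical connection details, for illustration only
val props = new java.util.Properties()
props.setProperty("user", "...")
props.setProperty("password", "...")

// fails while coordinates is still an array column: Spark's JDBC writer
// has no SQL type mapping for array types
df.select($"id", $"position.coordinates")
  .write.mode("append")
  .jdbc("jdbc:sqlserver://host:1433;databaseName=mydb", "positions", props)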

I assume that you will get an answer much faster if you actually provide a minimal reproducible example that shows clearly what the problem is. Currently, it's a bit too short... - Andrey Tyukin
Updated the question. @AndreyTyukin - knowledge_seeker

2 Answers

1 vote

You can read your JSON file into a DataFrame, then (1) use concat_ws to stringify your lat/lon array into a single column, and (2) use struct to re-assemble the position struct-type column, as follows:

// jsonfile:
// {"id":"11700","position":{"type":"Point","coordinates":[48.597315,-43.206085]}}

import org.apache.spark.sql.functions._
val df = spark.read.json("/path/to/jsonfile")

// printSchema:
// root
//  |-- id: string (nullable = true)
//  |-- position: struct (nullable = true)
//  |    |-- coordinates: array (nullable = true)
//  |    |    |-- element: double (containsNull = true)
//  |    |-- type: string (nullable = true)

df.withColumn("coordinates", concat_ws(",", $"position.coordinates")).
  select($"id", struct($"coordinates", $"position.type").as("position")).
  show(false)
// +-----+----------------------------+
// |id   |position                    |
// +-----+----------------------------+
// |11700|[48.597315,-43.206085,Point]|
// +-----+----------------------------+

// printSchema:
// root
//  |-- id: string (nullable = true)
//  |-- position: struct (nullable = false)
//  |    |-- coordinates: string (nullable = false)
//  |    |-- type: string (nullable = true)

[UPDATE]

Using Spark SQL:

df.createOrReplaceTempView("position_table")

spark.sql("""
    select id, concat_ws(',', position.coordinates) as position_coordinates
    from position_table
  """).
  show(false)
//+-----+--------------------+
//|id   |position_coordinates|
//+-----+--------------------+
//|11700|48.597315,-43.206085|
//|11800|49.611254,-43.90223 |
//+-----+--------------------+
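
With coordinates flattened to a plain string, the result should write over JDBC without the array-type error. A minimal sketch, assuming a SQL Server target with placeholder connection details:

val flat = spark.sql("select id, concat_ws(',', position.coordinates) as position_coordinates from position_table")

// hypothetical connection details; both columns are now plain strings,
// so the JDBC writer can map them (e.g. to nvarchar)
val props = new java.util.Properties()
props.setProperty("user", "...")
props.setProperty("password", "...")
flat.write.mode("append").jdbc("jdbc:sqlserver://host:1433;databaseName=mydb", "positions", props)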
0 votes

You have to transform the given column into a string before loading it into the target data source. For example, the following code creates a new column named position.coordinates whose value is the array of doubles joined into a single string: the array is cast to its string representation, and the brackets are then removed.

df.withColumn("position.coordinates", regexp_replace($"position.coordinates".cast("string"), "\\[|\\]", ""))

Alternatively, you can use a UDF to create a custom transformation function on Row objects. That way you can maintain the nested structure of the column. The following question (see its second answer) can give you an idea of how to use a UDF for your case: Spark UDF with nested structure as input parameter.
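
For illustration, a minimal sketch of the UDF approach under the schema shown above. Rather than operating on Row directly, this simpler variant takes the array itself and re-assembles the struct afterward, as in the first answer (joinCoords is a hypothetical name):

import org.apache.spark.sql.functions.{struct, udf}
import spark.implicits._

// hypothetical UDF that joins the lat/lon array into one comma-separated string
val joinCoords = udf((coords: Seq[Double]) => coords.mkString(","))

df.select($"id",
    struct(joinCoords($"position.coordinates").as("coordinates"), $"position.type").as("position"))
  .show(false)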