I am trying to process a JSON column from a PostgreSQL database. I am able to connect to the database using:
import os
import findspark
# findspark must be initialised before the pyspark imports
findspark.init(os.environ['SPARK_HOME'])
from pyspark import SparkContext
from pyspark.sql import SQLContext
# DB credentials
user = os.environ['EVENTS_DEV_UID']
password = os.environ['EVENTS_DEV_PWD']
host = os.environ['EVENTS_DEV_HOST']
port = os.environ['EVENTS_DEV_PORT']
db = os.environ['EVENTS_DEV_DBNAME']
# Initiate Spark and SQL contexts
sc = SparkContext()
spark = SQLContext(sc)
# Set properties
properties = {"user": user, "password": password, "driver": "org.postgresql.Driver"}
# Load data
df = df.read.jdbc(
    url='jdbc:postgresql://' + host + ':' + port + '/' + db,
    table='events',
    properties=properties)
The problem starts when casting the JSON field: Spark doesn't recognize the struct format of params. When I print the schema:
df.printSchema()
root
|-- time: timestamp (nullable = true)
|-- name: string (nullable = true)
|-- params: string (nullable = true)
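For reference, the raw values do look like JSON text; a quick way to eyeball them (just a diagnostic, not part of the pipeline):

df.select('params').show(1, truncate=False)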
When I try to cast the string to a struct:
df = df.withColumn('params', df.params.cast('struct'))
I get the following error:
ParseException: '\nDataType struct is not supported.(line 1, pos 0)\n\n== SQL ==\nstruct\n^^^\n'
I guess the problem is the escape characters. Does anybody have an idea how to proceed?
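In case it helps, the direction I have been looking at is pyspark.sql.functions.from_json with an explicit schema (available since Spark 2.1). This is only a sketch: the field names in params_schema are hypothetical, since they depend on what the actual params payload contains:

from pyspark.sql.functions import from_json
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical schema -- replace the field names with the real keys in params
params_schema = StructType([
    StructField("event_id", StringType(), True),
    StructField("source", StringType(), True)
])

# Parse the JSON string column into a proper struct column
df = df.withColumn('params', from_json(df.params, params_schema))

Is this the right approach, and if so, how should the schema be defined for nested or optional keys?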