I would like to write the contents of parquet files to PostgreSQL. I am using Spark, and to write the data I am using the Spark DataFrame's write.jdbc function. Everything works fine for parquet column types like long, decimal, or text. The problem is with complex types like Map. I would like to store the Map as JSON in my PostgreSQL table, and since I know that PostgreSQL can cast the text datatype into json (using the cast operation), I am dumping the map to a JSON string before writing.
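For reference, this is roughly how I serialize the map column before writing; the column name "value" is a placeholder for illustration:

import json
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Serialize the MapType column to a JSON string (column name "value" is a placeholder)
map_to_json = udf(lambda m: json.dumps(m) if m is not None else None, StringType())
data_frame = data_frame.withColumn("value", map_to_json(data_frame["value"]))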
But the Spark program complains that we are trying to insert a "character varying" value into a column of type "json", which makes it clear that PostgreSQL is not converting "character varying" into JSON automatically.
I went ahead, logged into my database, and manually tried to insert a JSON string into a JSON-typed column of a table, and it worked.
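The manual test was roughly equivalent to the following sketch (the table name "my_table" and the connection details are placeholders; I ran the raw SQL in psql, not through psycopg2):

import psycopg2

# Connect with placeholder credentials
conn = psycopg2.connect(host="host_name", dbname="db_name", user="user", password="password")
cur = conn.cursor()
# The target column "value" is of type json; a plain string literal is accepted here
cur.execute("""INSERT INTO my_table (value) VALUES ('{"key": 1}')""")
conn.commit()
cur.close()
conn.close()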
My question is: why is my Spark program complaining about the cast?
I am using Spark 1.6.1, PostgreSQL 4.3, and the PostgreSQL JDBC driver 42.1.1.
Here is the code snippet:
# Write the DataFrame to PostgreSQL over JDBC
url = "jdbc:postgresql://host_name:host_port/db_name"
data_frame.write.jdbc(url, table_name, properties={"user": user, "password": password})
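For context, after the JSON dump the map column is a plain string in the DataFrame schema, which presumably is why the driver sends it as "character varying". A quick check (the printed schema is illustrative, assuming an id column plus the serialized map):

data_frame.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- value: string (nullable = true)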
Error stack trace:
Hint: You will need to rewrite or cast the expression.
Position: 66
Call getNextException to see other errors in the batch.
at org.postgresql.jdbc.BatchResultHandler.handleError(BatchResultHandler.java:148)
at org.postgresql.core.ResultHandlerDelegate.handleError(ResultHandlerDelegate.java:50)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2190)
at org.postgresql.core.v3.QueryExecutorImpl.flushIfDeadlockRisk(QueryExecutorImpl.java:1325)
at org.postgresql.core.v3.QueryExecutorImpl.sendQuery(QueryExecutorImpl.java:1350)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:458)
at org.postgresql.jdbc.PgStatement.executeBatch(PgStatement.java:791)
at org.postgresql.jdbc.PgPreparedStatement.executeBatch(PgPreparedStatement.java:1547)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.savePartition(JdbcUtils.scala:215)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:277)
at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$saveTable$1.apply(JdbcUtils.scala:276)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1$$anonfun$apply$33.apply(RDD.scala:920)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1858)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
... 1 more
Caused by: org.postgresql.util.PSQLException: ERROR: column "value" is of type json but expression is of type character varying
Hint: You will need to rewrite or cast the expression.
Position: 66
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2476)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2189)
... 18 more
Comment from Roberto Congiu: The output of data_frame.show() would also greatly help to give you an answer.