2
votes

I am trying to query BigQuery from Apache Spark 2.2.0 using the Simba JDBC driver (BigQuery JDBC 4.2).

This is my code:

import spark.implicits._
import org.apache.spark.sql.SQLContext

val sqlcontext = new SQLContext(sc)

// Wrap the query as a derived table so it can be pushed down via the "dbtable" option
val pushdown_query = "(select * from mydb.mytable) AS T"

val dataframe_bq = sqlcontext.read
  .format("jdbc")
  .option("url", "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=xxxx;OAuthType=0;[email protected];OAuthPvtKeyPath=/path/file.json;")
  .option("driver", "com.simba.googlebigquery.jdbc42.Driver")
  .option("dbtable", pushdown_query)
  .load()

dataframe_bq.show()

This is the error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.sql.SQLDataException: 

[Simba][JDBC](10140) Error converting value to Timestamp.

The conversion error can occur for any data type, for example also:

[Simba][JDBC](10140) Error converting value to Long.

Spark 2.x double-quotes every column name!

So, for example, SELECT UserID, UserName FROM USERS

gets rewritten to SELECT "UserID", "UserName" FROM USERS

Since BigQuery treats a double-quoted token as a string literal rather than a column identifier, the query returns the literal strings "UserID" and "UserName" for every row, and those strings then fail to convert to types like Timestamp or Long.

And the dataframe will look like this:

+------+--------+ 
|UserID|UserName| 
+------+--------+ 
|UserID|UserName| 
|UserID|UserName| 
|UserID|UserName| 
+------+--------+ 

The proposed solution is to write a custom Spark JDBC dialect, but I have not yet been able to get it running properly.

I would appreciate it if you could let me know how to solve this issue. Thanks in advance.

Did you ever resolve this? I have the same issue. - doug
Is there any solution to this? - Phoenix

4 Answers

0
votes

Faced the same issue.

If you need to use Spark JDBC, there is no option other than implementing a custom JDBC dialect. You can take a look at https://github.com/trK54Ylmz/spark-bigquery; it worked for me with minor changes.

If JDBC is not a hard requirement, you can also try one of the available connectors: the one from Spotify (currently in maintenance mode), Google's native connector for Dataproc (quite limited, unfortunately), or this third-party version, which looks promising.
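
For illustration, here is a minimal sketch of what a non-JDBC read typically looks like with such a connector. The "bigquery" format name and "table" option follow the spark-bigquery connector's DataSource API, and the table path is a placeholder; check the specific connector's documentation:

// Minimal sketch, assuming the spark-bigquery connector jar is on the classpath.
// "myproject.mydb.mytable" is a placeholder table path.
val df = spark.read
  .format("bigquery")
  .option("table", "myproject.mydb.mytable")
  .load()

df.show()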

0
votes

To solve this issue, I had to edit the Apache Spark source code (version 2.1 in my case) and recompile it, using the following procedure:

1) Prerequisites

 Maven 3.3.9 or newer and Java 8+ 

2) Download the Spark source code; in my case it was version 2.1.

3) Update the source code to prevent Spark from adding double quotes to column names: search for the code below and replace it accordingly.

Source code reference file: https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala#L90

Edit this function:

def quoteIdentifier(colName: String): String = {
    s""""$colName""""
}

and change it to this:


def quoteIdentifier(colName: String): String = {
    colName
}

4) Build it with Maven:

    ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn

Hope that helps.

0
votes

When using the Simba JDBC driver with Spark, the connection option UseNativeQuery=0 solved the following issue for me:

[Simba][JDBC](10140) Error converting value to Timestamp

So I did not have to develop any custom dialect.
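
For example, reusing the connection settings from the question (only the UseNativeQuery=0 property is new here, appended to the URL):

// Same JDBC read as in the question, with UseNativeQuery=0 added to the URL.
val url = "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;" +
  "ProjectId=xxxx;OAuthType=0;[email protected];" +
  "OAuthPvtKeyPath=/path/file.json;UseNativeQuery=0;"

val dataframe_bq = sqlcontext.read
  .format("jdbc")
  .option("url", url)
  .option("driver", "com.simba.googlebigquery.jdbc42.Driver")
  .option("dbtable", pushdown_query)
  .load()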

0
votes

The issue is that one of the pre-registered JDBC dialects adds extra quotes around the field names.

So the query select column1, column2 from table becomes select "column1", "column2" from table.

To resolve the issue in Spark, add the code below after creating the Spark context and before creating the DataFrame. [This solution is specifically for the Simba driver.]

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

JdbcDialects.registerDialect(new JdbcDialect() {

  // Apply this dialect only to BigQuery JDBC connection URLs
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:bigquery:")

  // Return the column name as-is instead of wrapping it in double quotes
  override def quoteIdentifier(column: String): String = column
})
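
Register the dialect before the first JDBC read; after that, the read from the question works unchanged, for example:

// Dialect registered above, so column names are no longer double-quoted.
val dataframe_bq = sqlcontext.read
  .format("jdbc")
  .option("url", "jdbc:bigquery://...") // same connection options as in the question
  .option("driver", "com.simba.googlebigquery.jdbc42.Driver")
  .option("dbtable", pushdown_query)
  .load()

dataframe_bq.show()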