I am trying to query BigQuery from Apache Spark 2.2.0 using the Simba JDBC driver (BigQuery JDBC 4.2).
This is my code:
import spark.implicits._
import org.apache.spark.sql.SQLContext
val sqlcontext = new org.apache.spark.sql.SQLContext(sc)
val pushdown_query = "(select * from mydb.mytable) AS T"
val dataframe_bq = sqlcontext.read
  .format("jdbc")
  .option("url", "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=xxxx;OAuthType=0;[email protected];OAuthPvtKeyPath=/path/file.json;")
  .option("driver", "com.simba.googlebigquery.jdbc42.Driver")
  .option("dbtable", pushdown_query)
  .load()
dataframe_bq.show()
This is the error:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.sql.SQLDataException:
[Simba][JDBC](10140) Error converting value to Timestamp.
The same conversion error can occur for other data types as well, for example:
[Simba][JDBC](10140) Error converting value to Long.
Spark 2.x double-quotes every column name in the query it generates!
So, for example, SELECT UserID, UserName FROM USERS
is rewritten as SELECT "UserID", "UserName" FROM USERS
And the dataframe will look like this:
+------+--------+
|UserID|UserName|
+------+--------+
|UserID|UserName|
|UserID|UserName|
|UserID|UserName|
+------+--------+
Presumably BigQuery then treats the double-quoted identifiers as string literals, so every row returns the literal column names, and the driver fails when converting those strings to the column types (hence the Timestamp/Long errors). The proposed solution is to register a custom Spark JDBC dialect, but I have not been able to get it working properly.
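For reference, this is a minimal sketch of the kind of dialect I believe is needed, under two assumptions: that the Simba driver's URLs start with jdbc:bigquery: (as in my connection string above) and that BigQuery standard SQL quotes identifiers with backticks; the object name BigQueryDialect is my own.

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Replaces Spark's default double-quote identifier quoting with backticks
object BigQueryDialect extends JdbcDialect {
  // Claim any URL produced by the BigQuery JDBC driver (assumed prefix)
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:bigquery:")
  // Quote column names the way BigQuery standard SQL expects
  override def quoteIdentifier(column: String): String = s"`$column`"
}

// Must run before sqlcontext.read...load(); Spark then picks the
// dialect automatically via canHandle for matching URLs
JdbcDialects.registerDialect(BigQueryDialect)

The idea is that the generated query then becomes SELECT `UserID`, `UserName` FROM USERS, but I am not sure this is the right way to wire the dialect in.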
I would appreciate it if you could let me know how to solve this issue. Thanks in advance.