2
votes

I am trying to query BigQuery from Apache Spark 2.2.0 using the Simba JDBC driver (BigQuery JDBC 4.2).

This is my code:

import spark.implicits._
import org.apache.spark.sql.SQLContext

val sqlcontext = new SQLContext(sc)

// Wrap the query as a derived table so it can be pushed down via the "dbtable" option
val pushdown_query = "(select * from mydb.mytable) AS T"

val dataframe_bq = sqlcontext.read
  .format("jdbc")
  .option("url", "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;ProjectId=xxxx;OAuthType=0;[email protected];OAuthPvtKeyPath=/path/file.json;")
  .option("driver", "com.simba.googlebigquery.jdbc42.Driver")
  .option("dbtable", pushdown_query)
  .load()

dataframe_bq.show()

This is the error:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): java.sql.SQLDataException: 

[Simba][JDBC](10140) Error converting value to Timestamp.

The conversion error can occur for any data type, for example also:

[Simba][JDBC](10140) Error converting value to Long.

Spark 2.x double-quotes every column name!

So, for example, SELECT UserID, UserName FROM USERS

gets rewritten to SELECT "UserID", "UserName" FROM USERS

Since BigQuery treats a double-quoted token as a string literal rather than a column identifier, the query returns the literal strings "UserID" and "UserName" for every row, and those strings then fail to convert to types like Timestamp or Long.

And the dataframe will look like this:

+------+--------+ 
|UserID|UserName| 
+------+--------+ 
|UserID|UserName| 
|UserID|UserName| 
|UserID|UserName| 
+------+--------+ 

The proposed solution is to write a custom Spark JDBC dialect, but I have not yet been able to get it running properly.

I would appreciate it if you could let me know how to solve this issue. Thanks in advance.

Did you ever resolve this? I have the same issue. - doug
Is there any solution to this? - Phoenix

4 Answers

0
votes

Faced the same issue.

If you need to use Spark JDBC, there is no option other than implementing a custom JDBC dialect. You can take a look at https://github.com/trK54Ylmz/spark-bigquery; it worked for me with minor changes.

If JDBC is not a hard requirement, you can also try one of the available connectors: the one from Spotify (currently in maintenance mode), Google's native connector for Dataproc (quite limited, unfortunately), or this third-party version, which looks promising.
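
For illustration, here is a minimal sketch of what a non-JDBC read typically looks like with such a connector. The "bigquery" format name and "table" option follow the spark-bigquery connector's DataSource API, and the table path is a placeholder; check the specific connector's documentation:

// Minimal sketch, assuming the spark-bigquery connector jar is on the classpath.
// "myproject.mydb.mytable" is a placeholder table path.
val df = spark.read
  .format("bigquery")
  .option("table", "myproject.mydb.mytable")
  .load()

df.show()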

0
votes

To solve this issue, I had to edit the Apache Spark source code (version 2.1 in my case) and recompile it, using the following procedure:

1) Prerequisites

 Maven 3.3.9 or newer and Java 8+ 

2) Download the Spark source code; in my case it was version 2.1.

3) Update the source code to prevent Spark from adding double quotes to column names: search for the code below and replace it accordingly.

Source code reference file: https://github.com/apache/spark/blob/branch-2.1/sql/core/src/main/scala/org/apache/spark/sql/jdbc/JdbcDialects.scala#L90

Edit this function:

def quoteIdentifier(colName: String): String = {
    s""""$colName""""
}

and change it to this:


def quoteIdentifier(colName: String): String = {
    colName
}

4) Build it with Maven:

    ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn

Hope that helps.

0
votes

When using the Simba JDBC driver with Spark, the connection option UseNativeQuery=0 solved the following issue for me:

[Simba][JDBC](10140) Error converting value to Timestamp

So I did not have to develop any custom dialect.
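
For example, reusing the connection settings from the question (only the UseNativeQuery=0 property is new here, appended to the URL):

// Same JDBC read as in the question, with UseNativeQuery=0 added to the URL.
val url = "jdbc:bigquery://https://www.googleapis.com/bigquery/v2:443;" +
  "ProjectId=xxxx;OAuthType=0;[email protected];" +
  "OAuthPvtKeyPath=/path/file.json;UseNativeQuery=0;"

val dataframe_bq = sqlcontext.read
  .format("jdbc")
  .option("url", url)
  .option("driver", "com.simba.googlebigquery.jdbc42.Driver")
  .option("dbtable", pushdown_query)
  .load()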

0
votes

The issue is that one of the pre-registered JDBC dialects adds extra quotes around the field names.

So the query select column1, column2 from table becomes select "column1", "column2" from table.

To resolve the issue in Spark, add the code below after creating the Spark context and before creating the DataFrame. [This solution is specifically for the Simba driver.]

import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

JdbcDialects.registerDialect(new JdbcDialect() {

  // Apply this dialect only to BigQuery JDBC connection URLs
  override def canHandle(url: String): Boolean =
    url.toLowerCase.startsWith("jdbc:bigquery:")

  // Return the column name as-is instead of wrapping it in double quotes
  override def quoteIdentifier(column: String): String = column
})
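
Register the dialect before the first JDBC read; after that, the read from the question works unchanged, for example:

// Dialect registered above, so column names are no longer double-quoted.
val dataframe_bq = sqlcontext.read
  .format("jdbc")
  .option("url", "jdbc:bigquery://...") // same connection options as in the question
  .option("driver", "com.simba.googlebigquery.jdbc42.Driver")
  .option("dbtable", pushdown_query)
  .load()

dataframe_bq.show()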