3
votes

I am using the JDBC driver from Simba Technologies Inc. to connect to Google Cloud Spanner. It works as expected with plain java.sql. However, when I tried to use the Simba JDBC driver with Spark's JDBC reader to read the query output as a DataFrame, it gives wrong output.

Here is my spanner table:

UserID  UserName
1   Vaijnath
2   Ganesh
3   Rahul

Metadata: UserID (String), UserName (String)

I am executing the query: SELECT * FROM users

This query fetches the correct data when I use the Simba JDBC driver with plain java.sql, but it fails to fetch the data when I use it with Spark SQL's JDBC reader.

It returns the DataFrame as:

+------+--------+
|UserID|UserName|
+------+--------+
|UserID|UserName|
|UserID|UserName|
|UserID|UserName|
+------+--------+

As we can see, it returns the correct metadata and number of rows, but each row contains the column names instead of the actual data.

Here is the code I am using:

import java.util.Properties
import org.apache.spark.sql.{DataFrame, SparkSession}

object spannerIn {
  val sparkSession = SparkSession
    .builder()
    .appName("Spark SQL basic example")
    .master("local")
    .config("spark.sql.warehouse.dir", "file:///tmp")
    .config("spark.sql.shuffle.partitions", 1)
    .getOrCreate()

  val properties = new Properties()
  properties.setProperty("user", "")
  properties.setProperty("password", "")
  properties.setProperty("driver", "com.simba.cloudspanner.core.jdbc42.CloudSpanner42Driver")

  val connectionURL = "jdbc:cloudspanner://localhost;Project=abc;Instance=pqr;Database=xyz;PvtKeyPath=FilePath"
  val selectQuery = "(select * from users)"

  def main(args: Array[String]): Unit = {
    val df = createJdbcDataframe()
    df.show()
  }

  def createJdbcDataframe(): DataFrame =
    sparkSession.read.jdbc(connectionURL, selectQuery, properties)
}

My question is: can I use the Simba JDBC driver with Spark? If yes, what extra things do I need to add? Any help appreciated.


1 Answer

3
votes

This occurs because Spark by default quotes all identifiers using double quotes ("), meaning the following query is generated:

SELECT "UserID", "UserName" FROM USERS

Cloud Spanner interprets this as selecting two fixed strings. It is essentially the same as this in most other databases:

SELECT 'UserID', 'UserName' FROM USERS

Google Cloud Spanner uses backticks (`) for quoting identifiers, and expects this:

SELECT `UserID`, `UserName` FROM USERS

To fix this, you need to register a JDBC dialect for Google Cloud Spanner that quotes identifiers with backticks, like this:

    import java.util.Properties;

    import org.apache.spark.sql.Dataset;
    import org.apache.spark.sql.Row;
    import org.apache.spark.sql.SparkSession;
    import org.apache.spark.sql.jdbc.JdbcDialect;
    import org.apache.spark.sql.jdbc.JdbcDialects;

    // Make sure the JDBC driver is loaded and on the classpath.
    Class.forName("nl.topicus.jdbc.CloudSpannerDriver");
    SparkSession spark = SparkSession.builder()
            .appName("Java Spark SQL basic example")
            .config("spark.some.config.option", "some-value")
            .master("local")
            .getOrCreate();
    String sparkURL = "jdbc:cloudspanner://localhost;Project=project-id;Instance=instance-id;Database=db;PvtKeyPath=pathToKeyFile.json";
    // Register a dialect that makes Spark quote identifiers with backticks
    // for all jdbc:cloudspanner: URLs instead of the default double quotes.
    JdbcDialects.registerDialect(new JdbcDialect()
    {
        private static final long serialVersionUID = 1L;

        @Override
        public boolean canHandle(String url)
        {
            // Apply this dialect to all Cloud Spanner JDBC URLs.
            return url.toLowerCase().startsWith("jdbc:cloudspanner:");
        }

        @Override
        public String quoteIdentifier(String column)
        {
            // Quote identifiers with backticks, as Cloud Spanner expects.
            return "`" + column + "`";
        }
    });
    Dataset<Row> dataset = spark.read().jdbc(sparkURL, "ACCOUNT", new Properties());
    dataset.show();
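Since the question's code is in Scala, the same fix can be sketched in Scala as well. This is a sketch only, reusing the Simba driver class, connection URL, and query from the question; the key point is that the dialect must be registered before calling sparkSession.read.jdbc:

```scala
import java.util.Properties

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

object SpannerDialectExample {
  def main(args: Array[String]): Unit = {
    // Register a dialect that quotes identifiers with backticks,
    // as Cloud Spanner expects, instead of Spark's default double quotes.
    // This must happen before the read.jdbc call below.
    JdbcDialects.registerDialect(new JdbcDialect {
      override def canHandle(url: String): Boolean =
        url.toLowerCase.startsWith("jdbc:cloudspanner:")

      override def quoteIdentifier(column: String): String =
        s"`$column`"
    })

    val sparkSession = SparkSession.builder()
      .appName("Spark SQL basic example")
      .master("local")
      .getOrCreate()

    val properties = new Properties()
    properties.setProperty("driver", "com.simba.cloudspanner.core.jdbc42.CloudSpanner42Driver")

    // Same connection URL and query as in the question.
    val connectionURL = "jdbc:cloudspanner://localhost;Project=abc;Instance=pqr;Database=xyz;PvtKeyPath=FilePath"
    val df = sparkSession.read.jdbc(connectionURL, "(select * from users)", properties)
    df.show()
  }
}
```

Running this requires the Spark and Spanner JDBC driver dependencies plus a reachable Spanner instance, so it cannot be verified standalone.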

Please note that I have not tested the above with the Simba driver, but only with this driver: https://github.com/olavloite/spanner-jdbc. I expect it to work with the Simba driver as well.