
I have a working notebook on Azure Databricks runtime 6.3 (Spark 2.4.4).

This notebook ingests data into Azure Synapse Analytics using its connector.
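
The write is done roughly like this (a simplified PySpark sketch, where df is the DataFrame described below; the JDBC URL, tempDir, and table name are placeholders, and the options follow the Azure Databricks Synapse connector documentation):

# Sketch of the Synapse (SQL DW) connector write; the URL, tempDir and
# dbTable values below are placeholders, not the real ones.
(df.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>")
    .option("tempDir", "wasbs://<container>@<account>.blob.core.windows.net/tmp")
    .option("forwardSparkAzureStorageCredentials", "true")
    .option("dbTable", "dbo.IncrementalDestination")
    .mode("append")
    .save())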

When I upgraded the notebook to runtime 7.0 (Spark 3.0.0), the process began to fail with the following error:

com.microsoft.sqlserver.jdbc.SQLServerException: HdfsBridge::recordReaderFillBuffer - Unexpected error encountered filling record reader buffer: ClassCastException: class java.lang.Long cannot be cast to class java.lang.Integer (java.lang.Long and java.lang.Integer are in module java.base of loader 'bootstrap') [ErrorCode = 106000] [SQLState = S0001]

This is the table schema in Synapse Analytics:

CREATE TABLE [dbo].[IncrementalDestination]
(
[Id] [int] NOT NULL,
[VarChar] [varchar](1000) NULL,
[Char] [char](1000) NULL,
[Text] [varchar](1000) NULL,
[NVarChar] [nvarchar](1000) NULL,
[NChar] [nchar](1000) NULL,
[NText] [nvarchar](1000) NULL,
[Date] [date] NULL,
[Datetime] [datetime] NULL,
[Datetime2] [datetime2](7) NULL,
[Smalldatetime] [smalldatetime] NULL,
[Bigint] [bigint] NULL,
[Bit] [bit] NULL,
[Decimal] [decimal](18, 0) NULL,
[Int] [int] NULL,
[Money] [money] NULL,
[Numeric] [numeric](18, 0) NULL,
[Smallint] [smallint] NULL,
[Smallmoney] [smallmoney] NULL,
[Tinyint] [tinyint] NULL,
[Float] [float] NULL,
[Real] [real] NULL,
[Column With Space] [varchar](1000) NULL,
[Column_ç_$pecial_char] [varchar](1000) NULL,
[InsertionDateUTC] [datetime] NOT NULL,
[De_LastUpdated] [datetime2](3) NOT NULL
)
WITH
(
DISTRIBUTION = ROUND_ROBIN,
CLUSTERED COLUMNSTORE INDEX
)
GO

This is the schema generated by Databricks after reading a bunch of Parquet files from Azure Blob Storage:

root
 |-- Id: long (nullable = true)
 |-- VarChar: string (nullable = true)
 |-- Char: string (nullable = true)
 |-- Text: string (nullable = true)
 |-- NVarChar: string (nullable = true)
 |-- NChar: string (nullable = true)
 |-- NText: string (nullable = true)
 |-- Date: timestamp (nullable = true)
 |-- Datetime: timestamp (nullable = true)
 |-- Datetime2: timestamp (nullable = true)
 |-- Smalldatetime: timestamp (nullable = true)
 |-- Bigint: long (nullable = true)
 |-- Bit: boolean (nullable = true)
 |-- Decimal: long (nullable = true)
 |-- Int: long (nullable = true)
 |-- Money: double (nullable = true)
 |-- Numeric: long (nullable = true)
 |-- Smallint: long (nullable = true)
 |-- Smallmoney: double (nullable = true)
 |-- Tinyint: long (nullable = true)
 |-- Float: double (nullable = true)
 |-- Real: double (nullable = true)
 |-- Column_With_Space: string (nullable = true)
 |-- Column_ç_$pecial_char: string (nullable = true)
 |-- InsertionDateUTC: timestamp (nullable = true)
 |-- De_LastUpdated: timestamp (nullable = false)
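
The schema above comes from a plain Parquet read, roughly like this (the Blob Storage path is a placeholder):

# Sketch of how the DataFrame and the schema above are produced;
# the wasbs path is a placeholder.
df = spark.read.parquet("wasbs://<container>@<account>.blob.core.windows.net/incremental/")
df.printSchema()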

I noticed this in the generated schema:

Int: long (nullable = true)

But what can I do?

Shouldn't this conversion be natural and easily done?

I think something broke with these new features =]
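
One thing I could do is cast the mismatched columns explicitly before the write (just a sketch covering a few columns; the full list would mirror the table schema above, and I'd rather not hand-maintain that for every table):

# Sketch of explicit casts so the DataFrame types match the Synapse columns.
# Only a handful of columns are shown as an example.
from pyspark.sql.functions import col

df_casted = (df
    .withColumn("Id", col("Id").cast("int"))
    .withColumn("Int", col("Int").cast("int"))
    .withColumn("Smallint", col("Smallint").cast("smallint"))
    .withColumn("Tinyint", col("Tinyint").cast("tinyint"))
    .withColumn("Decimal", col("Decimal").cast("decimal(18,0)"))
    .withColumn("Date", col("Date").cast("date")))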


1 Answer


I believe this is caused by the following change, described in the Spark 3.0 migration guide:

In Spark 3.0, when inserting a value into a table column with a different data type, the type coercion is performed as per ANSI SQL standard. Certain unreasonable type conversions such as converting string to int and double to boolean are disallowed. A runtime exception is thrown if the value is out-of-range for the data type of the column. In Spark version 2.4 and below, type conversions during table insertion are allowed as long as they are valid Cast. When inserting an out-of-range value to an integral field, the low-order bits of the value is inserted(the same as Java/Scala numeric type casting). For example, if 257 is inserted to a field of byte type, the result is 1. The behavior is controlled by the option spark.sql.storeAssignmentPolicy, with a default value as “ANSI”. Setting the option as “Legacy” restores the previous behavior.

So you may try to set spark.sql.storeAssignmentPolicy to Legacy and re-run the code.
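
In a notebook cell that would look like this (a minimal sketch; spark is the SparkSession already provided by Databricks):

# Restore the Spark 2.4 store-assignment behaviour described in the migration guide.
spark.conf.set("spark.sql.storeAssignmentPolicy", "LEGACY")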