
We've run into an issue when migrating to a newer Spark version and are not sure how to troubleshoot it.

We have two Spark instances, the first running version 2.3.0 and the second running version 2.4.0. Both instances receive the same command:

spark.sql("SELECT array_contains(array(1), '1')")

On the older version, we get the following:

[{"array_contains(array(1), CAST(1 AS INT))":true}]

i.e. the parameter is automatically cast to match the array's element type. On the newer version, this fails with an error:

cannot resolve 'array_contains(array(1), '1')' due to data type mismatch: Input to function array_contains should have been array followed by a value with same element type, but it's [array<int>, string].; line 1 pos 7;
'Project [unresolvedalias(array_contains(array(1), 1), None)]
+- OneRowRelation

Since we control neither the actual data types nor the SQL code passed to us (both are managed by our customers), we'd like to understand whether they have to change their data or whether we can work around this issue ourselves. Is there something we can do on the Spark side?

If there's something we should check other than the Spark versions, feel free to leave a comment and I'll add the necessary details to the question.


1 Answer


This is actually documented in the Spark SQL Upgrading Guide:

In Spark version 2.3 and earlier, the second parameter to array_contains function is implicitly promoted to the element type of first array type parameter. This type promotion can be lossy and may cause array_contains function to return wrong result. This problem has been addressed in 2.4 by employing a safer type promotion mechanism

So in short, the implicit cast of the second parameter to the array's element type was removed in version 2.4, and you have to explicitly pass a value of the correct type:

spark.sql("SELECT array_contains(array(1), 1)")
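If the literal itself cannot be changed upstream, the same fix can be expressed with an explicit cast instead (a sketch, assuming the array's element type is known to be INT):

```sql
-- Explicitly cast the string literal to the array's element type;
-- this is what Spark 2.3 did implicitly and what Spark 2.4 now
-- requires to be spelled out in the query.
SELECT array_contains(array(1), CAST('1' AS INT))
```

This produces the same result as the old 2.3 behavior, but the (potentially lossy) cast is now visible in the query text.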

Spark 2.4 also introduced a built-in higher-order function, exists, that still performs the cast, though with a different syntax:

spark.sql("SELECT exists(array(1), x -> x=='1')").show()

It takes the array column and a lambda function, and is transformed to:

exists(array(1), lambdafunction((namedlambdavariable() = CAST(1 AS INT)), namedlambdavariable()))

As you can see, the cast is still done by Spark.
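Putting it together, a customer query could in principle be rewritten mechanically from array_contains(arr, value) to exists(arr, x -> x == value) to keep the implicit promotion working (a sketch; the array contents and the compared literal are illustrative):

```sql
-- exists still promotes the string literal '2' to the array's element
-- type inside the lambda comparison, so queries mixing types keep
-- working under Spark 2.4's stricter type checking.
SELECT exists(array(1, 2, 3), x -> x == '2')
```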