We use AWS DMS to dump SQL Server databases into S3 as Parquet files; the idea is to run analytics over the Parquet output with Spark. Once a full load completes, the Parquet files cannot be read because their schemas contain UINT fields, and Spark refuses them with Parquet type not supported: INT32 (UINT_8). We use transformation rules to override the data type of the UINT columns, but they do not seem to be picked up by the DMS engine. Why?
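For reference, a minimal reproduction of the failing read (the bucket and prefix are hypothetical placeholders for the DMS full-load output):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dms-parquet-check").getOrCreate()

# Point this at the DMS full-load output prefix; the path is a placeholder.
df = spark.read.parquet("s3a://example-dms-bucket/example-schema/example-table/")
# Fails with: org.apache.spark.sql.AnalysisException:
#   Parquet type not supported: INT32 (UINT_8)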
There are a number of rules like the "convert uint to int" one below (note that UINT1 is the 1-byte unsigned type among the DMS data types):
{
  "rule-type": "transformation",
  "rule-id": "7",
  "rule-name": "uintToInt",
  "rule-action": "change-data-type",
  "rule-target": "column",
  "object-locator": {
    "schema-name": "%",
    "table-name": "%",
    "column-name": "%",
    "data-type": "uint1"
  },
  "data-type": {
    "type": "int4"
  }
}
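One thing worth double-checking (a sketch, not a confirmed cause): DMS applies transformation rules only to tables included by a selection rule in the same table-mappings document, so the mapping should contain at least one selection rule alongside the transformation. A hedged example of pushing such a mapping with boto3; the task ARN and rule names are hypothetical, and the task must be stopped before it can be modified:

import json
import boto3

table_mappings = {
    "rules": [
        {
            # At least one selection rule is required; transformation
            # rules only fire for tables a selection rule includes.
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "includeAll",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        },
        {
            "rule-type": "transformation",
            "rule-id": "7",
            "rule-name": "uintToInt",
            "rule-action": "change-data-type",
            "rule-target": "column",
            "object-locator": {
                "schema-name": "%",
                "table-name": "%",
                "column-name": "%",
                "data-type": "uint1",
            },
            "data-type": {"type": "int4"},
        },
    ]
}

dms = boto3.client("dms")
dms.modify_replication_task(
    ReplicationTaskArn="arn:aws:dms:eu-west-1:123456789012:task:EXAMPLE",  # hypothetical
    TableMappings=json.dumps(table_mappings),
)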
The S3 endpoint uses DataFormat=parquet;ParquetVersion=parquet_2_0, and the DMS engine version is 3.3.2.
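For completeness, these endpoint settings can also be applied with boto3 (the endpoint ARN is hypothetical; the attribute string is the one quoted above):

import boto3

dms = boto3.client("dms")
dms.modify_endpoint(
    EndpointArn="arn:aws:dms:eu-west-1:123456789012:endpoint:EXAMPLE",  # hypothetical
    ExtraConnectionAttributes="DataFormat=parquet;ParquetVersion=parquet_2_0",
)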
However, the resulting Parquet schemas still contain uint columns:
id: int32
name: string
value: string
status: uint8
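(The schema above can be inspected without Spark, e.g. with pyarrow; the file name is a placeholder for any part file from the full load:)

import pyarrow.parquet as pq

# Prints the embedded Parquet schema, e.g. "status: uint8".
print(pq.read_schema("LOAD00000001.parquet"))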
Attempting to read such a Parquet file with Spark gives:
org.apache.spark.sql.AnalysisException: Parquet type not supported: INT32 (UINT_8);
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.typeNotSupported$1(ParquetSchemaConverter.scala:100)
at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:136)
Why is the DMS transformation rule not applied?
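Until the mapping works, one possible workaround (a sketch, assuming rewriting the files is acceptable; file names are placeholders) is to widen the unsigned columns to signed types with pyarrow before Spark reads them. Newer Spark releases (3.2+) are also reported to read unsigned Parquet types directly.

import pyarrow as pa
import pyarrow.parquet as pq

# Widen unsigned types so Spark's Parquet reader accepts them.
# Note: uint64 values above the int64 maximum would fail the cast.
UNSIGNED_TO_SIGNED = {
    pa.uint8(): pa.int32(),
    pa.uint16(): pa.int32(),
    pa.uint32(): pa.int64(),
    pa.uint64(): pa.int64(),
}

def rewrite_signed(src, dst):
    table = pq.read_table(src)
    fields = [f.with_type(UNSIGNED_TO_SIGNED.get(f.type, f.type))
              for f in table.schema]
    pq.write_table(table.cast(pa.schema(fields)), dst)

rewrite_signed("LOAD00000001.parquet", "LOAD00000001-signed.parquet")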
"Changes to the source table structure during full load are not supported" (docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html) – Anton