
We use AWS DMS to dump SQL Server databases into S3 as Parquet files, with the goal of running Spark analytics over them. Once a full load completes, Spark cannot read the Parquet files because their schemas contain unsigned integer (UINT) fields; the read fails with "Parquet type not supported: INT32 (UINT_8)". We added transformation rules to override the data type of the UINT columns, but the DMS engine does not seem to pick them up. Why?

There are a number of rules like the "convert uint to int" rule below (note that uint1 is the 1-byte unsigned type among the DMS data types):

{
  "rule-type": "transformation",
  "rule-id": "7",
  "rule-name": "uintToInt",
  "rule-action": "change-data-type",
  "rule-target": "column",
  "object-locator": {
    "schema-name": "%",
    "table-name": "%",
    "column-name": "%",
    "data-type": "uint1"
  },
  "data-type": {
    "type": "int4"
  }
}

The S3 endpoint settings are DataFormat=parquet;ParquetVersion=parquet_2_0, and the DMS engine version is 3.3.2.

However, I am still getting Parquet schemas with uint columns. See below:

id: int32
name: string
value: string
status: uint8

Attempting to read such a Parquet file with Spark gives me:

org.apache.spark.sql.AnalysisException: Parquet type not supported: INT32 (UINT_8);
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.typeNotSupported$1(ParquetSchemaConverter.scala:100)
    at org.apache.spark.sql.execution.datasources.parquet.ParquetToSparkSchemaConverter.convertPrimitiveField(ParquetSchemaConverter.scala:136)

Why is the DMS transformation rule not triggered?

Comments:

Anton: Is it a limitation I hit? "Changes to the source table structure during full load are not supported" (docs.aws.amazon.com/dms/latest/userguide/CHAP_Target.S3.html)

jcunhafonte: Did you figure out how to make this work? Thank you.

2 Answers


Transforming the data types directly from UINT to INT in DMS fixes this issue. Your mapping rules should look like:

{
"rules": [
    ...
    {
        "rule-type": "transformation",
        "rule-id": "2",
        "rule-name": "unit1-to-int1",
        "rule-action": "change-data-type",
        "rule-target": "column",
        "object-locator": {
            "schema-name": "acessa",
            "table-name": "%",
            "column-name": "%",
            "data-type": "uint1"
        },
        "data-type": {
            "type": "int1"
        }
    },
    {
        "rule-type": "transformation",
        "rule-id": "3",
        "rule-name": "unit2-to-int2",
        "rule-action": "change-data-type",
        "rule-target": "column",
        "object-locator": {
            "schema-name": "acessa",
            "table-name": "%",
            "column-name": "%",
            "data-type": "uint2"
        },
        "data-type": {
            "type": "int2"
        }
    },
    {
        "rule-type": "transformation",
        "rule-id": "4",
        "rule-name": "unit4-to-int4",
        "rule-action": "change-data-type",
        "rule-target": "column",
        "object-locator": {
            "schema-name": "acessa",
            "table-name": "%",
            "column-name": "%",
            "data-type": "uint4"
        },
        "data-type": {
            "type": "int4"
        }
    },
    {
        "rule-type": "transformation",
        "rule-id": "5",
        "rule-name": "unit8-to-int8",
        "rule-action": "change-data-type",
        "rule-target": "column",
        "object-locator": {
            "schema-name": "acessa",
            "table-name": "%",
            "column-name": "%",
            "data-type": "uint8"
        },
        "data-type": {
            "type": "int8"
        }
    }
]}

Documentation: https://docs.aws.amazon.com/dms/latest/userguide/CHAP_Tasks.CustomizingTasks.TableMapping.html#CHAP_Tasks.CustomizingTasks.TableMapping.SelectionTransformation.Transformations
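Since the four rules differ only in the integer width, they can also be generated programmatically instead of copy-pasted. A small sketch (the uint_rules helper is hypothetical; "acessa" is the schema name from the rules above):

```python
import json

def uint_rules(schema_name, start_id=2):
    """Build one uintN-to-intN change-data-type rule per DMS uint width."""
    rules = []
    for i, n in enumerate([1, 2, 4, 8]):  # DMS uint1/uint2/uint4/uint8
        rules.append({
            "rule-type": "transformation",
            "rule-id": str(start_id + i),
            "rule-name": f"uint{n}-to-int{n}",
            "rule-action": "change-data-type",
            "rule-target": "column",
            "object-locator": {
                "schema-name": schema_name,
                "table-name": "%",
                "column-name": "%",
                "data-type": f"uint{n}",
            },
            "data-type": {"type": f"int{n}"},
        })
    return rules

print(json.dumps({"rules": uint_rules("acessa")}, indent=4))
```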


The only way I was able to get the transformation working on the Parquet files was by specifying the exact column to transform. For example:

{
   "rules": [
   ...
   {
    "rule-type": "transformation",
    "rule-id": "2",
    "rule-name": "unit1-to-int1",
    "rule-action": "change-data-type",
    "rule-target": "column",
    "object-locator": {
        "schema-name": "acessa",
        "table-name": "<table_name>",
        "column-name": "<column_name>"
    },
    "data-type": {
        "type": "int1"
    }
   }
  ]
}

Using the wildcard % for the column name in the object locator simply wasn't working.
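If many columns need explicit rules, a helper can emit one rule per (table, column) pair. A hedged sketch, where the per_column_rules helper and the example (table, column, target-type) tuples are hypothetical and note that, as in the rule above, the object locator carries no data-type key:

```python
import json

def per_column_rules(schema_name, columns, start_id=2):
    """columns: list of (table_name, column_name, target_type) tuples."""
    return [
        {
            "rule-type": "transformation",
            "rule-id": str(start_id + i),
            "rule-name": f"{table}-{column}-to-{target}",
            "rule-action": "change-data-type",
            "rule-target": "column",
            "object-locator": {
                "schema-name": schema_name,
                "table-name": table,
                "column-name": column,
            },
            "data-type": {"type": target},
        }
        for i, (table, column, target) in enumerate(columns)
    ]

# Hypothetical columns to convert in the "acessa" schema.
rules = per_column_rules("acessa", [("orders", "status", "int1"),
                                    ("orders", "flags", "int2")])
print(json.dumps({"rules": rules}, indent=4))
```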