5 votes

I'm having a very weird problem with Glue. I'm using it to run some ETL on data I'm moving from MySQL RDS to Redshift, with the same code I used on another table, where it worked fine and copied all the data as it should have.

However, on the second table it doesn't copy the data in the id column from MySQL for some reason. The id column on Redshift is completely blank.

query_df = spark.read.format("jdbc") \
    .option("url", args['RDSURL']) \
    .option("driver", args['RDSDRIVER']) \
    .option("dbtable", args['RDSQUERY']) \
    .option("user", args['RDSUSER']) \
    .option("password", args['RDSPASS']) \
    .load()

datasource0 = DynamicFrame.fromDF(query_df, glueContext, "datasource0")

logging.info(datasource0.show())

applymapping1 = ApplyMapping.apply(frame = datasource0,
    mappings = [("id", "int", "id", "int"), ...],
    transformation_ctx = "applymapping1")

logging.info(applymapping1.show())

From the logs I print above I can see that the DynamicFrame contains the id field even after ApplyMapping.

datasink2 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = applymapping1,
    catalog_connection = args['RSCLUSTER'],
    connection_options = {"dbtable": args['RSTABLE'], "database": args['RSDB']},
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink2")

The problem seems to be happening here, I think? After this the job completes, but on checking Redshift the id column is completely empty.

I'm very puzzled by this behaviour. The exact same code worked fine on another table; the only difference between the id columns in these two tables is that this table has id as int(11) unsigned, while the table where the code worked had id as int(10) signed.
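For what it's worth, a minimal way to check what types Spark and Glue have actually inferred for the column, using the frame names from the snippets above, is something like:

# Sketch: inspect the inferred types at each stage before the Redshift write.
query_df.printSchema()        # Spark's view of the JDBC source, including the id column's type
datasource0.printSchema()     # the DynamicFrame's view after fromDF
applymapping1.printSchema()   # types after ApplyMapping, immediately before writing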

Can you remove the data type of the source column in ApplyMapping, as [("id", "id", "int")], and retry loading the data? If that doesn't work, check the schema and data immediately before writing the data to Redshift. – Prabhakar Reddy
I did do the latter, i.e. I printed the contents of applymapping1 to the logs and it did contain the id values. For some reason glueContext.write_dynamic_frame is not writing the id into the Redshift table. In any case, I've decided to move away from Glue for the time being in favor of Lambda and EC2. – Suresh Kasipandy
How do you add signed/unsigned in Redshift? Redshift doesn't have such a datatype. – Sandeep Fatangare
@SandeepFatangare That's not the point. MySQL does, and if Glue can't convert data types to match those of Redshift it's pretty useless. In comparison, the pandas-redshift package in Python could move all the data. It could even create a table in Redshift with appropriate data types for the columns based on the definition of the table I pulled from MySQL. – Suresh Kasipandy
Not sure if bigint will work instead of int. Unsigned increases the int limit, which may not be the case in Redshift, as there is no unsigned int datatype in Redshift. So the safer bet is to go to the next higher datatype. You may use pandas-redshift with the new feature added in Glue, the Python shell. docs.aws.amazon.com/glue/latest/dg/a – Sandeep Fatangare

1 Answer

1 vote

I've had exactly this behaviour when extracting from MySQL RDS using Glue. For anyone seeking the answer to this, the reason is as follows: AWS Glue has the concept of a 'type choice', where the exact type of a crawled column can remain as a number of possibilities throughout the ETL job, since the crawler only samples a subset of a column's data to determine the probable type and doesn't decide definitively. This is why converting to an explicit schema rather than a crawler will fix the issue, as it doesn't involve any type choices.
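For example, one way to make the schema explicit on the read side is Spark's customSchema JDBC option. A minimal sketch using the option names from the question, where the column list and the choice of BIGINT are assumptions about the source table:

# Sketch: read with an explicit schema instead of relying on type inference.
# The column names and types here are assumptions about the source table.
query_df = spark.read.format("jdbc") \
    .option("url", args['RDSURL']) \
    .option("driver", args['RDSDRIVER']) \
    .option("dbtable", args['RDSQUERY']) \
    .option("user", args['RDSUSER']) \
    .option("password", args['RDSPASS']) \
    .option("customSchema", "id BIGINT, name STRING") \
    .load()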

When the job runs (or you look at a preview), Spark will attempt to process the entire column dataset. At this point it's possible that the column type is resolved to a type that is incompatible with the data set, i.e. the interpreter can't decide on the right type choice, and this results in empty data for the column in question. I have experienced this when transforming a number of tables from a MySQL DB, and I haven't been able to determine any apparent pattern as to why some fail and some don't, although it must be related to the data in the source DB column.

The solution is to add an explicit resolution of the choices to your script, casting the column that's failing to the desired target type with something like the following:

df.resolveChoice(specs = [('id', 'cast:int')])

where df is the DynamicFrame. This forces the column to be interpreted as the intended type and should result in the expected output of data in this column. This has always worked for me.
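Slotted into the script from the question, that would look roughly like the following sketch (resolved1 is just an illustrative name):

# Sketch: resolve the ambiguous column type before writing to Redshift.
resolved1 = applymapping1.resolveChoice(specs = [('id', 'cast:int')])

datasink2 = glueContext.write_dynamic_frame.from_jdbc_conf(frame = resolved1,
    catalog_connection = args['RSCLUSTER'],
    connection_options = {"dbtable": args['RSTABLE'], "database": args['RSDB']},
    redshift_tmp_dir = args["TempDir"],
    transformation_ctx = "datasink2")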

Note that for those using the Glue Studio visual editor it's now possible to add a 'Custom Transformation' step which contains code to perform this for you. In this instance the code for the transformation should look as follows:

def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    df = dfc.select(list(dfc.keys())[0])
    df_resolved = df.resolveChoice(specs = [('id', 'cast:int')])
    return DynamicFrameCollection({"CustomTransform0": df_resolved}, glueContext)

Note also that in this scenario it will be necessary to follow this Custom Transformation node with a 'Select from Collection' transformation, since the Custom Transformation returns a collection rather than a single frame.
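For reference, the equivalent step in a generated script looks roughly like this sketch, where the variable names are placeholders and "CustomTransform0" matches the key used above:

from awsglue.transforms import SelectFromCollection

# Sketch: pick the single frame back out of the collection returned by the
# Custom Transformation node.
selected = SelectFromCollection.apply(dfc = custom_transform_collection,
    key = "CustomTransform0",
    transformation_ctx = "select_from_collection")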