
I have a column that is larger than the varchar(max) data type, which to my understanding is the largest string type AWS Glue supports, and I get the error "String length exceeds DDL length" when I try to load my tables because of it. I am now trying to truncate that column, since it isn't all that important, but cannot figure out how to do that in Glue. I know I can use TRUNCATECOLUMNS as an option on a COPY command if I connect to my database using psql on an EC2 instance, and I can actually get my tables to load successfully that way. However, my boss insists that I use Glue for this job, so I am looking for a way to truncate the columns in a Glue script. I have looked through a lot of documentation but cannot find anything similar. Thank you.

Here is some working code for anyone else who may have this issue and needs a complete reference. Note that varchar(65535) is the maximum size a varchar column can be in Redshift, and that the limit is measured in bytes:

import com.amazonaws.services.glue.{ChoiceOption, DynamicFrame}
import com.amazonaws.services.glue.util.JsonOptions
import org.apache.spark.sql.functions._

// Null-safe UDF that truncates a string to at most 29999 characters
val truncColUdf = udf((str: String) =>
  if (str == null) null
  else if (str.length > 29999) str.substring(0, 29999)
  else str)

val datasource30 = glueContext.getCatalogSource(database = "database", tableName = "entry", redshiftTmpDir = "", transformationContext = "datasource30").getDynamicFrame()

// Convert to a Spark DataFrame so the UDF can be applied to the "message" column
val revDF30 = datasource30.toDF()
  .withColumn("message", truncColUdf(col("message")))

// Convert back to a DynamicFrame for the rest of the Glue pipeline
val truncDynamicFrame30 = DynamicFrame(revDF30, glueContext)
val applymapping30 = truncDynamicFrame30.applyMapping(mappings = Seq(
    ("id", "bigint", "id", "bigint"),
    ("message", "string", "message", "varchar(65535)"),
    ("state", "string", "state", "varchar(256)"),
    ("created_at", "timestamp", "created_at", "timestamp"),
    ("depth", "int", "depth", "int")
  ), caseSensitive = false, transformationContext = "applymapping30")
val resolvechoice30 = applymapping30.resolveChoice(choiceOption = Some(ChoiceOption("make_cols")), transformationContext = "resolvechoice30")
val dropnullfields30 = resolvechoice30.dropNulls(transformationContext = "dropnullfields30")
val datasink30 = glueContext.getJDBCSink(catalogConnection = "databaseConnection", options = JsonOptions("""{"dbtable": "entry", "database": "database"}"""), redshiftTmpDir = args("TempDir"), transformationContext = "datasink30").writeDynamicFrame(dropnullfields30)

Here is an example line of data being read:

01,"<p>Here is the message where the quotations are in case of commas within the message, like so.</p>",active,2017-08-27 23:38:40,1
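One caveat: the UDF above truncates by character count, but Redshift's varchar limits are measured in UTF-8 bytes, so a string of multi-byte characters can still exceed the column size after truncation. A byte-aware helper might look like the sketch below (plain Scala, callable from inside a UDF; the name truncateUtf8 is my own, not part of any API):

```scala
object Utf8Truncate {
  // Truncate str so its UTF-8 encoding fits in maxBytes,
  // without splitting a multi-byte character.
  def truncateUtf8(str: String, maxBytes: Int): String = {
    if (str == null) return null
    val bytes = str.getBytes("UTF-8")
    if (bytes.length <= maxBytes) return str
    var end = maxBytes
    // Back up while the cut point lands on a UTF-8 continuation byte (10xxxxxx)
    while (end > 0 && (bytes(end) & 0xC0) == 0x80) end -= 1
    new String(bytes, 0, end, "UTF-8")
  }
}
```

For example, truncating "héllo" to 2 bytes yields "h" rather than a broken half of the two-byte "é".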

2 Answers


Convert the DynamicFrame to a Spark DataFrame, then use a user-defined function to truncate the column value (Scala):

import com.amazonaws.services.glue.DynamicFrame
import org.apache.spark.sql.functions._

val truncColUdf = udf((str: String) =>
  if (str == null) null
  else if (str.length > 20) str.substring(0, 20)
  else str)
val truncDataFrame = dynamicFrame.toDF()
  .select("text_long")
  .withColumn("text_short", truncColUdf(col("text_long")))
  .withColumn("text_short_length", length(col("text_short")))

truncDataFrame.show(5, false)

val truncDynamicFrame = DynamicFrame(truncDataFrame, glueContext)

...

//write to sink

Output:

+-----------------------+--------------------+-----------------+
|text_long              |text_short          |text_short_length|
+-----------------------+--------------------+-----------------+
|I'd rather not answer  |I'd rather not answe|20               |
|Agree                  |Agree               |5                |
|Custom Answer Favorable|Custom Answer Favora|20               |
|Agree                  |Agree               |5                |
|Sometimes              |Sometimes           |9                |
+-----------------------+--------------------+-----------------+

You can pass "TRUNCATECOLUMNS" in the "extracopyoptions" parameter of your DynamicFrameWriter: https://aws.amazon.com/premiumsupport/knowledge-center/sql-commands-redshift-glue-job/
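For example, reusing the sink from the question's code, the option can be added to the JsonOptions passed to getJDBCSink (a sketch only; "entry", "database", "databaseConnection", and dropnullfields30 are the names from the question's script):

```scala
val datasink = glueContext.getJDBCSink(
    catalogConnection = "databaseConnection",
    options = JsonOptions(
      """{"dbtable": "entry", "database": "database", "extracopyoptions": "TRUNCATECOLUMNS"}"""),
    redshiftTmpDir = args("TempDir"),
    transformationContext = "datasink"
  ).writeDynamicFrame(dropnullfields30)
```

With this in place Redshift truncates oversized values during the COPY itself, so no UDF pass over the data is needed.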