Transform json in AWS-GLUE and upload in Amazon Redshift

Question

i came forward reading this amazon article about flattening a json file and uploading in redshift.

https://aws.amazon.com/blogs/big-data/simplify-querying-nested-json-with-the-aws-glue-relationalize-transform/

My plan is to transform the json file and upload it in s3 then crawl the file again into the aws-glue to the data catalog and upload the data as tables in amazon redshift.

Now the problem with the code in 'Sample 3: Python code to transform the nested JSON and output it to ORC' shows some errors:

NameError: name 'spark' is not defined

Not I am at lost because I am new to aws-glue and I need to upload json (they are nested arrays)in redshift.

Here is my code:

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
#from awsglue.transforms import Relationalize



# Begin variables to customize with your information
glue_source_database = "DATABASE"
glue_source_table = "TABLE_NAME"
glue_temp_storage = "s3://XXXXX"
glue_relationalize_output_s3_path = "s3://XXXXX"
dfc_root_table_name = "root" #default value is "roottable"
# End variables to customize with your information

glueContext = GlueContext(spark.sparkContext)
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")
dfc = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")
blogdata = dfc.select(dfc_root_table_name)
blogdataoutput = glueContext.write_dynamic_frame.from_options(frame = blogdata, connection_type = "s3", connection_options = {"path": glue_relationalize_output_s3_path}, format = "orc", transformation_ctx = "blogdataoutput")

botchniaque botchniaque · Accepted Answer · 2018-05-24T07:07:52

You create wrongly the GlueContext. Your code should look like

from pyspark.context import SparkContext

glueContext = GlueContext(SparkContext.getOrCreate())

You can have a look at Glue code examples from AWS.

Transform json in AWS-GLUE and upload in Amazon Redshift

2 Answers