Requirement: I need a Glue job to export AWS DynamoDB data (a nested structure combining maps and lists) to S3.
My approach: First, I used a Glue DynamicFrame to load all the data from DynamoDB into a single dynamic frame.
datasource = glueContext.create_dynamic_frame.from_options(
    "dynamodb",
    connection_options={
        "dynamodb.input.tableName": table_name,
        "dynamodb.throughput.read.percent": read_percentage,
        "dynamodb.splits": "100",
    },
)
After this, I have the datasource dynamic frame with all the data.
I want to apply some transformations and filters here, which is why I converted it to a PySpark DataFrame.
df0 = datasource.toDF()
My input DataFrame df0 contains the JSON data in the collection column as a struct, so I used to_json to convert the struct into a JSON string. I need a JSON string here, not a struct.
df1 = df0.select(to_json("collection"))
From df1, I access whatever I need.
Major Issue
Some of the attributes in the collection appear like this:
collection : {
    "name" : "aaa",
    "id" : "111",
    "address" : "some address",
    "price" : {"string" : 1212.0},
    "retailer" : {"string" : "xxxx"},
    "categories" : {"array" : ["7216"]}
}
As the example above shows, for price, retailer, and categories the data type appears as a nested attribute wrapping the actual value.
I want output like this:
collection : {
    "name" : "aaa",
    "id" : "111",
    "address" : "some address",
    "price" : "1212.0",
    "retailer" : "xxxx",
    "categories" : "[7216]"
}
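To make the mapping between the two shapes concrete, here is a minimal plain-Python sketch of the transformation I am after. It is not Glue or Spark code. The helper names (unwrap, to_text, flatten_collection, flatten_json) and the TYPE_TAGS set are hypothetical, and I am assuming the only wrapper keys are the ones visible in my example ("string" and "array").

```python
import json

# Assumption: these are the only type-tag keys that wrap a value.
# Only "string" and "array" appear in the example above.
TYPE_TAGS = {"string", "array"}


def unwrap(value):
    """Return the inner value if `value` is a single-key dict keyed by a type tag."""
    if isinstance(value, dict) and len(value) == 1:
        (key, inner), = value.items()
        if key in TYPE_TAGS:
            return inner
    return value


def to_text(value):
    """Render a scalar or a list as the plain string shown in the desired output."""
    if isinstance(value, list):
        return "[" + ", ".join(str(item) for item in value) + "]"
    return str(value)


def flatten_collection(collection):
    """Unwrap type-tagged attributes and stringify every top-level value."""
    return {name: to_text(unwrap(value)) for name, value in collection.items()}


def flatten_json(json_string):
    """Same as flatten_collection, but takes and returns a JSON string."""
    return json.dumps(flatten_collection(json.loads(json_string)))


sample = {
    "name": "aaa",
    "id": "111",
    "address": "some address",
    "price": {"string": 1212.0},
    "retailer": {"string": "xxxx"},
    "categories": {"array": ["7216"]},
}

print(flatten_collection(sample))
```

This only handles top-level attributes, and it would also unwrap a genuine map attribute that happens to have a single key named "string" or "array". Something like flatten_json could presumably be applied to the to_json output through a PySpark UDF, but I have not verified that approach, and I would prefer a way to avoid the wrappers in the first place.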
How can I resolve this? Please let me know.