Retrieving arrays of nested information in AppSync schema

Question

I have worked out a fairly complex chain of DynamoDB resolvers on a GraphQL AppSync query. What I am curious to know is if I could have possibly designed this in a way to require fewer DynamoDB queries.

Here is my GraphQL Schema:

type Tag {
    PartitionKey: ID!
    SortKey: ID!
    TagName: String!
    TagType: String
}

type Model {
    PartitionKey: ID!
    Name: String
    Version: Int
    FBX: String
    # ms since epoch
    CreatedAt: AWSTimestamp
    Description: String
    Tags: [Tag]
}

type Query {
    GetAllModels(count: Int, nextToken: String): PaginatedModels!
}

This is the query that I am doing:

query GetAllModels {
  GetAllModels {
    Models {
      PartitionKey
      Name
      Version
      CreatedAt
      Description
      Tags {
        TagName
        TagType
      }
    }
  }
}

My DynamoDB table is set up as so:

PartitionKey | SortKey   | TagName | TagType | ModelName | Description
Model-0      | Model-0   |         |         | ModelZero | Blah Blah
Model-0      | Tag-Pine  |         |         |           |
Model-0      | Tag-Apple |         |         |           |
Tag-Pine     | Tag-Pine  | Pine    | Tree    |           |
Tag-Apple    | Tag-Apple | Apple   | Fruit   |           |

So in my resolvers I am going:

GetAllModels will scan with two filters: one for PartitionKey beginning with 'Model-' and another for SortKey beginning with 'Model-'. This is to get all Models.
Next there is a resolver attached to 'Tags' in the Model object. This will query with two expressions: one for PartitionKey = source.PartitionKey and a second for SortKey begins_with 'Tag-'. This gets me all of the tags on a model.
Next there are two resolvers on the Tag object, one on TagName and another on TagType. These do a direct GetItem to get their appropriate value, with PartitionKey = source.SortKey and SortKey = source.SortKey set as the keys.

So each scanned Model ends up firing off 3 more queries to DynamoDB. This just seems a bit excessive to me. But I cannot see any other way to do this. Is there some way to be able to get both TagName and TagType in one query?

Is there a better way to approach this?

mparis · Accepted Answer · 2018-08-29T02:39:39

I see a few things that I would personally change. The first is that I would avoid the nested DynamoDB scan operations. At least one of these can be replaced with a much faster query operation. The second is that I would consider rethinking how you are storing the data. Currently, there is no good way to list model objects.

Why is there no good way to list model objects?

Assuming each model object will have multiple tags, you are going to have a table that is sparsely populated by model objects, i.e. out of 100 rows you may have 20 to 50 models depending on how many tags the average model has. In DynamoDB, a table is split up based on the partition key, so rows that share the same partition key are stored near each other to speed up query operations. With your setup, where the partition key is essentially the unique id of a single model object, we can easily get a single model object. You can also quickly get the tags for a single model since those records are nearby as well.

The issue.

The DynamoDB scan operation looks at each partition one at a time, reads as many records as the request's limit allows (or all of them if the limit is sufficiently large), and only then, after reading the records from the individual partitions, applies the filter expression before returning the final result. This means you may ask for the first 10 models but, since the limit is applied before the scan filter, you may very well get back only 1 model (if that one model had 9 or more tags, which would exhaust the limit while DynamoDB was reading the first partition). This may seem strange when coming from other database systems, but it is an important consideration of DynamoDB's design.
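
To make the limit-before-filter behaviour concrete, here is a minimal Python sketch. It is a toy in-memory model of a scan, not the DynamoDB API: the row order, the Model-1 rows, and the helper names (is_model, scan, scan_models) are all assumptions made for illustration.

```python
# Toy in-memory stand-in for a DynamoDB Scan; this is not the real API.
# Rows follow the question's layout; the Model-1 rows are made up for illustration.
ROWS = [
    {"PartitionKey": "Model-0", "SortKey": "Model-0"},
    {"PartitionKey": "Model-0", "SortKey": "Tag-Apple"},
    {"PartitionKey": "Model-0", "SortKey": "Tag-Pine"},
    {"PartitionKey": "Model-1", "SortKey": "Model-1"},
    {"PartitionKey": "Model-1", "SortKey": "Tag-Pine"},
    {"PartitionKey": "Tag-Apple", "SortKey": "Tag-Apple"},
    {"PartitionKey": "Tag-Pine", "SortKey": "Tag-Pine"},
]


def is_model(row):
    """Filter from the question: both keys begin with 'Model-'."""
    return (row["PartitionKey"].startswith("Model-")
            and row["SortKey"].startswith("Model-"))


def scan(rows, limit, start=0):
    """Read up to `limit` rows first, then filter only what was read."""
    page = rows[start:start + limit]
    items = [row for row in page if is_model(row)]
    end = start + len(page)
    next_start = end if end < len(rows) else None  # plays the role of nextToken
    return items, next_start


def scan_models(rows, wanted, limit):
    """Keep paging until `wanted` models are collected or the rows run out."""
    models, start = [], 0
    while start is not None and len(models) < wanted:
        items, start = scan(rows, limit, start)
        models.extend(items)
    return models[:wanted]
```

With these rows, scan(ROWS, limit=3) reads Model-0 and its two tag rows and hands back a single model plus a continuation position, even though three items were requested. A caller that really needs N models has to keep paging, as scan_models does.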

Here are two solutions to address this concern:

1. Store Models in one table and Tags in another.

NoSQL databases like DynamoDB allow you to store many types of data in the same table, but there is nothing wrong with splitting them out. Traditionally it can be a pain to work with multiple tables in a NoSQL database that lacks a join operation or something similar, but fortunately for us we can use GraphQL to "join" data for us. With this approach, the Model table has a single partition key named "id" and your GetAllModels resolver is still a scan, but this time on the Model table. This way the table is not sparse and you will get 10 models when you ask for 10 models. The Tag table should have a partition key of modelId and a sort key of tagId. You would then have a resolver on the Model.tags field that does a query against the Tag table and looks for rows with modelId == $ctx.source.id.
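
As a rough sketch of the two requests this layout needs, the Python below builds the keyword arguments that a low-level DynamoDB client (for example boto3's) accepts for Scan and Query. The table names "Model" and "Tag" and the function names are assumptions for illustration; in AppSync the same expressions would live in the resolver mapping templates instead.

```python
def list_models_scan(count, exclusive_start_key=None, table_name="Model"):
    """Scan arguments for the Model table (table name is an assumption).

    No filter is needed because every item in this table is a model.
    """
    params = {"TableName": table_name, "Limit": count}
    if exclusive_start_key is not None:
        # Resume from where the previous page stopped.
        params["ExclusiveStartKey"] = exclusive_start_key
    return params


def tags_for_model_query(model_id, table_name="Tag"):
    """Query arguments for all tags of one model (partition key: modelId)."""
    return {
        "TableName": table_name,
        "KeyConditionExpression": "modelId = :modelId",
        "ExpressionAttributeValues": {":modelId": {"S": model_id}},
    }
```

With boto3 installed and credentials configured, these could be passed along as client.scan(**list_models_scan(10)) and client.query(**tags_for_model_query(model_id)). The second call is the one the Model.tags resolver would issue once per model, with model_id taken from $ctx.source.id.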

This is essentially how @model and @connection work in the new GraphQL Transform tooling launched as part of the Amplify CLI. You can see more here, although the docs are, as of writing, still being improved: https://aws-amplify.github.io/amplify-js/media/api_guide

2. Store Models and Tags in the same table but change the key structure.

This approach works if you can reliably say that you will have less than 10GB of data per data type (e.g. Model and Tag). For this approach you have a single table with a partition key of Type and a sort key of id. When you create objects, you create them with a Type (e.g. "Tag" or "Model") and a unique id (like a uuid). To list objects of the same type, you do a DynamoDB query operation on the partition key of the type to list, e.g. "Tag" or "Model". You can then use GSIs to efficiently look up related objects. In your case you would store a "modelId" in every Tag object. You would then make a GSI using "modelId" as the partition key. To list all the tags for a given model you could then do a DynamoDB query operation against that GSI.
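
A hedged sketch of the two queries for this single-table layout, again as plain Python that builds low-level DynamoDB Query arguments. The table name "Objects", the index name "modelId-index", and the function names are hypothetical; only the Type/id keys and the modelId attribute come from the description above.

```python
MODEL_ID_INDEX = "modelId-index"  # hypothetical GSI name; partition key is modelId


def list_by_type_query(type_name, count, table_name="Objects"):
    """Query arguments that list objects of one type, e.g. "Model" or "Tag"."""
    return {
        "TableName": table_name,
        # "#type" is a name placeholder, which avoids any reserved-word clash.
        "KeyConditionExpression": "#type = :type",
        "ExpressionAttributeNames": {"#type": "Type"},
        "ExpressionAttributeValues": {":type": {"S": type_name}},
        "Limit": count,
    }


def tags_by_model_query(model_id, table_name="Objects"):
    """Query arguments that list the tags of one model through the GSI."""
    return {
        "TableName": table_name,
        "IndexName": MODEL_ID_INDEX,
        "KeyConditionExpression": "modelId = :modelId",
        "ExpressionAttributeValues": {":modelId": {"S": model_id}},
    }
```

Because list_by_type_query("Model", 10) is a query on a single partition with no filter, the limit applies to the models themselves, so asking for 10 yields 10 when that many exist. Only items that carry a modelId attribute appear in the GSI, so the second query returns tag objects and never models.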

I'm sure there are many more ways to do this but hopefully this helps point in the right direction.
