I am working on a project that manages production of a large number of documents in batches. The workflow is:
- The user creates a new "Batch" using the application, based on a template that defines its requirements (requirements are usually files that the user will have to upload and that the system will process).
- Once all requirements are met, the system processes all inputs and generates a large number of documents (thousands).
- Those documents need to be post-processed just-in-time.
- Some operations apply to the batch as a whole, for example publishing all documents, in which case all of those documents need to be post-processed first.
- There are constraints on which operations can run simultaneously, each document can be post-processed at most once, and so on (a rough sketch of the document lifecycle follows this list).
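
To make those constraints concrete, here is an illustrative sketch of the lifecycle each document goes through. The names are made up for this question, not my actual code:

```csharp
using System;

// Illustrative only: the lifecycle a produced document moves through,
// and the invariants I need to enforce across thousands of documents.
public enum DocumentState
{
    Produced,      // generated from the batch inputs
    PostProcessed, // may happen at most once per document
    Published      // requires prior post-processing
}

public class Document
{
    public Guid Id { get; private set; }
    public DocumentState State { get; private set; }

    public void PostProcess()
    {
        // Enforces the "post-processed at most once" constraint.
        if (State != DocumentState.Produced)
            throw new InvalidOperationException("Document was already post-processed.");
        State = DocumentState.PostProcessed;
    }

    public void Publish()
    {
        // Publishing the batch requires every document to reach this state first.
        if (State != DocumentState.PostProcessed)
            throw new InvalidOperationException("Document must be post-processed before publishing.");
        State = DocumentState.Published;
    }
}
```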
I have currently modeled the "Batch" itself as an aggregate root, but I don't store the list of produced documents in the "Batch" object itself; instead, I retrieve those documents from my data store using a collection id that is persisted in the "Batch" object. The only reason I chose this design was to keep the aggregate root from containing a large collection and becoming bloated, but it is now getting in the way of developing the business logic, because I have to deal with consistency issues across the documents in the "batch" myself.
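
For reference, this is roughly the current shape of the model, heavily simplified (the repository interface here is illustrative, not my real generic repository):

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// The aggregate root keeps only a reference (collection id) to its
// documents instead of holding the documents themselves.
public class Batch
{
    public Guid Id { get; private set; }
    public Guid DocumentCollectionId { get; private set; }
    // ... template/requirement tracking, batch status, etc.
}

// The documents live outside the aggregate boundary and are fetched
// separately, which is exactly where the cross-document consistency
// problems show up.
public interface IDocumentRepository
{
    Task<IReadOnlyList<Document>> GetByCollectionIdAsync(Guid collectionId);
}
```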
My question is: in DDD/CQRS, when using a document database for persistence and/or when using event sourcing, how should one deal with aggregates that contain large collections?
I have seen this post and this post, but neither addresses my concern. One uses NHibernate collection filters, which is not an option for me, and which I don't think is the right way to deal with this issue anyway, since it leaks storage logic into the domain model; the other is more about accessing objects in nested aggregates and doesn't address storage/retrieval issues.
FYI: I'm using .NET/C# with a service bus and an oversimplified generic repository backed by SQL, and I'm planning to switch to MongoDB in the very near future.