3
votes

For clarification: BackupableThing is a hardware device with a program written into it (which is what gets backed up).

Updated clarification: This question is more about CQRS/ES implementation than about DDD modelling.

Say I have 3 aggregate roots:

class BackupableThing
{
    Guid Id { get; }
}

class Project
{
    Guid Id { get; }

    string Description { get; }
    byte[] Data { get; }
}

class Backup
{
    Guid Id { get; }

    Guid ThingId { get; }
    Guid ProjectId { get; }
    DateTime PerformedAt { get; }
}

Whenever I need to back up a BackupableThing, I first create a new Project and then create a new Backup with ProjectId set to the new Project's Id. Everything works as long as each new Backup gets its own new Project.

But really I should create a Project only if one doesn't already exist, where the unique identity of an existing Project is its Data property (some kind of hash of the byte[] array). So when any other BackupableThing gets backed up and the system sees that another BackupableThing has already been backed up with the same result (Data), it should show the already created and working project with all its descriptions and everything set.

First I thought of approaching this problem by somehow encoding the hash into the Guid, but this seems hacky and not straightforward, and it also increases the chance of collision with randomly generated Guids.

Then I came up with the idea of a separate table (with its own repository) that holds two columns: the hash of the data (some int/long) and PlcProjectId (Guid). But this looks very much like a projection, and it is in fact going to be a kind of projection, so in theory I could rebuild it from my domain events in the Event Store. I have read that it's bad to query the read side from domain services / aggregates / repositories (i.e. from the write side), and I haven't been able to come up with anything else so far.
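
To make the idea more concrete, this is roughly what I have in mind (the event shape, table row and data access types below are made up just for illustration):

using System;
using System.Security.Cryptography;

// Row of the separate lookup table: hash of Project.Data -> PlcProjectId.
public class ProjectHashEntry
{
    public long DataHash { get; set; }
    public Guid PlcProjectId { get; set; }
}

// Projection-like handler that keeps the lookup table up to date from domain events.
public class ProjectHashLookupBuilder
{
    private readonly IProjectHashLookup _lookup; // hypothetical data access abstraction

    public ProjectHashLookupBuilder(IProjectHashLookup lookup) => _lookup = lookup;

    public void Handle(ProjectCreated e)
    {
        _lookup.Add(new ProjectHashEntry
        {
            DataHash = ComputeDataHash(e.Data),
            PlcProjectId = e.ProjectId
        });
    }

    // Any stable hash works; here the first 8 bytes of SHA-256 over the raw data.
    public static long ComputeDataHash(byte[] data) =>
        BitConverter.ToInt64(SHA256.HashData(data), 0);
}

public interface IProjectHashLookup
{
    void Add(ProjectHashEntry entry);
}

// Hypothetical domain event, only for illustration.
public class ProjectCreated
{
    public Guid ProjectId { get; set; }
    public byte[] Data { get; set; }
}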

Update

So basically I create a read side inside the domain, to which only the domain has access, and query it before adding a new Project, so that if the Project already exists I just use the existing one? Yes, I already thought of that overnight, and it seems that not only do I have to build such domain storage and query it before creating a new aggregate, I also have to introduce some compensating action. For example, if multiple requests to create the same Project are sent simultaneously, two identical Projects would be created. So I need my domain storage to be an event handler, and if a user created the same Project, I need to fire a compensating command to remove/move/recreate this Project using the existing one...

Update 2

I'm also thinking of creating another aggregate for this purpose - an aggregate for the scope in which my Project must be unique (in this specific scenario a GlobalScopeAggregate or DomainAggregate) which will hold a {name, Guid} key-value reference. A separate GlobalScopeHandler will be responsible for the ProjectCreated, ProjectArchived and ProjectRenamed events and will ultimately fire compensating actions if a ProjectCreated event occurs with a name that has already been used. But I am confused about the compensating actions. How should I react if the user has already made a backup and has a view related to the project in his interface? He could change the description, name, etc. of the wrong project, which has already been removed by a compensating action. Also, my compensating action would remove the Project and Backup aggregates and create a new Backup aggregate with the existing ProjectId, because my Backup aggregate doesn't have a setter on its ProjectId field (it is an immutable record of a performed backup). Is this normal?
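
Roughly, the aggregate I have in mind would look something like this (a heavily simplified sketch just to illustrate the idea; the real one would be event sourced):

using System;
using System.Collections.Generic;

public class GlobalScopeAggregate
{
    // name (or data hash) -> id of the Project that first claimed it
    private readonly Dictionary<string, Guid> _claims = new Dictionary<string, Guid>();

    // Returns false when the key is already taken by another Project, so the
    // caller (GlobalScopeHandler) can fire a compensating command.
    public bool TryClaim(string key, Guid projectId)
    {
        if (_claims.TryGetValue(key, out var existingProjectId))
            return existingProjectId == projectId;

        _claims.Add(key, projectId);
        return true;
    }

    // Called when a ProjectArchived/ProjectRenamed event frees up a key.
    public void Release(string key) => _claims.Remove(key);
}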

Update 3 - DOMAIN clarification

There are a number of industrial devices (BackupableThings - programmable controllers) on a wide network, each with some firmware programmed into it. Customers update the firmware and upload it into the controllers (the backupable things). It is this program that gets backed up. There are a lot of controllers of the same type, so it's very likely that customers will upload the same program over and over again to multiple controllers, as well as to the same controller (as a means to revert some changes). The user needs to repeatedly back up all those controllers.

A Backup is some binary data (the program stored in the controller) plus the date the backup was performed. A Project is an entity that encapsulates the binary data as well as all information related to the backup. Given that I can't back up the program in the state in which it was originally uploaded (I can only get unreadable raw binary data, which I can also upload back into the controller again), I need a separate Project aggregate which holds the Data property as well as a number of attached files (for example, firmware project files), a description, a name and other fields.

Now, whenever some controller is backed up, I don't want to show "just binary data without any description" and force the user to fill in all the descriptive fields again. I want to look up whether a backup with the same binary data has already been made, and then just link that project to this new backup, so that the user who backed up another controller instantly sees lots of information about what currently lives in that controller :)

So I guess this is a case of set-based validation that occurs very often (as opposed to regular unique constraints), and I would also have lots of backups, so a separate aggregate holding it all in memory would be unwise.

Also, I just realized there's another problem. I can't just compute a hash of the binary data and tolerate even a small risk of two different backups being considered the same project. This is an industrial domain which needs a precise and robust solution. At the same time, I can't put a unique constraint on the binary data column (varbinary in SQL), because my binary data could be relatively big. So I guess I need to create a separate table of [int (hash of binary data), Guid (id of the project)] relations, and if the hash of a new backup's binary data is found there, I need to load the related aggregate and make sure the binary data really is the same. And if it's not - I also need some mechanism to store more than one relation with the same hash.

Current implementation

I ended up creating a separate table with two columns: DataHash (int) and AggregateId (Guid). Then I created a domain service with a factory method GetOrCreateProject(Guid id, byte[] data). This method looks up aggregate ids by the calculated data hash (it gets multiple values if there are multiple rows with the same hash), loads each of those aggregates and compares the data parameter with the aggregate's Data property. If they are equal, the existing loaded aggregate is returned. If they are not equal, a new hash entry is added to the hash table and a new aggregate is created.
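
In code, the service looks roughly like this (the repository and hash table interfaces, the hash function and the simplified Project are placeholders for the sake of the example):

using System;
using System.Collections.Generic;
using System.Linq;

public class ProjectFactoryService
{
    private readonly IProjectRepository _projects;   // loads/saves Project aggregates
    private readonly IProjectHashTable _hashTable;   // DataHash (int) -> AggregateId (Guid)

    public ProjectFactoryService(IProjectRepository projects, IProjectHashTable hashTable)
    {
        _projects = projects;
        _hashTable = hashTable;
    }

    public Project GetOrCreateProject(Guid id, byte[] data)
    {
        int hash = ComputeDataHash(data);

        // Several rows may share the same hash, so every candidate is loaded
        // and its actual Data compared byte by byte.
        foreach (Guid candidateId in _hashTable.GetAggregateIds(hash))
        {
            Project candidate = _projects.Load(candidateId);
            if (candidate.Data.SequenceEqual(data))
                return candidate; // same binary data -> reuse the existing project
        }

        // No project with identical data: register the hash and create a new aggregate.
        var project = new Project(id, data);
        _hashTable.Add(hash, id);
        _projects.Save(project);
        return project;
    }

    private static int ComputeDataHash(byte[] data)
    {
        // FNV-1a (32-bit), just an example of a stable hash over the raw bytes.
        unchecked
        {
            uint hash = 2166136261;
            foreach (byte b in data)
            {
                hash = (hash ^ b) * 16777619;
            }
            return (int)hash;
        }
    }
}

public interface IProjectRepository
{
    Project Load(Guid id);
    void Save(Project project);
}

public interface IProjectHashTable
{
    IEnumerable<Guid> GetAggregateIds(int hash);
    void Add(int hash, Guid aggregateId);
}

// Simplified Project for the sake of the example; the real aggregate has more fields.
public class Project
{
    public Project(Guid id, byte[] data) { Id = id; Data = data; }
    public Guid Id { get; }
    public byte[] Data { get; }
}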

This hash table is now part of the domain, which means part of the domain is no longer event sourced. Every future need for uniqueness validation (the name of a BackupableThing, for example) would imply creating such tables, which adds state-based storage to the domain side. This increases overall complexity and couples the domain tightly. This is the point where I'm starting to wonder whether event sourcing even applies here, and if not, where does it apply at all? I tried to apply it to a simple system as a way to increase my knowledge and fully understand the CQRS/ES patterns, but now I'm fighting the complexities of set-based validation and starting to think that simple state-based relational tables with some kind of ORM would be a much better fit (since I don't even need an event log).

So the only thing that can tell you if a BackupableThing has already been backed up with the same result... is to compute the result anyway? Since it does not improve performance, what do you gain from that uniqueness rule then? I mean, it's not like a backup was a business object needing unique identification... – guillaume31
"I can't just compute a hash of the binary data and tolerate even a small risk of two different backups being considered the same project" - why? What would be the consequences? – guillaume31
@guillaume31 If a user backs up some controller, he expects to be able to restore that backup at some point and have the controller work as expected. Let's say, for example, that after performing a successful backup an already created project with the same data hash (but different actual data) is found. Not only will the user see the wrong description, but if he chooses to upload "the same" program again to restore the backup, this action will restore the wrong program and the controller will start to malfunction. These controllers control water pump stations, so this can lead to catastrophic results. – EwanCoder
How does that relate to DDD? If your concerns are correct, you have a hash function entropy problem, not a DDD problem, right? I mean, if you find another project that has the same hash as the backup you're currently doing, how will you know if it's the same data or not? – guillaume31
I would load up the related aggregate and check. It's better than having a 2-MByte data field as a unique key; that's an implementation detail. If I went pure DDD and forgot about performance problems, I would make my byte[] field the unique key and create a new Project only if there's none with this byte[] value. – EwanCoder

3 Answers

2
votes

You are prematurely shoehorning your problem into DDD patterns when major aspects of the domain haven't been fully analyzed or expressed. This is a dangerous mix.

  • What is a Project, if you ask an expert of your domain? (hint: probably not "Project is some entity to encapsulate binary data")
  • What is a Backup, if you ask an expert of your domain?
  • What constraints about them should be satisfied in the real world?
  • What is a typical use case around Backupping?

We're progressively finding out more about some of these as you add updates and comments to your question, but it's the wrong way around.

Don't take Aggregates and Repositories and projections and unique keys as a starting point. Instead, first write clear definitions of your domain terms. What business processes are users carrying out? Since you say you want to use Event Sourcing, what events are happening? Figure out if your domain is rich enough for DDD to be a relevant modelling approach. When all of this is clearly stated, you will have the words to describe your backup uniqueness problem and approach it from a more relevant angle. I don't think you have them now.

1
votes

No need to "query the read side" - that is indeed a bad idea. What you do is create a domain storage model just for the domain.

So you'll have the domain objects saved to EventStore and some special things saved somewhere else (SQL, key-value, etc.), plus a read consumer building your read models in SQL.

For instance, in my app my domain instances listen to events to build domain query models, which I save to Riak KV.

Here's a simple example which should illustrate my meaning. Queries are handled via a query processor, a popular pattern:

class Handler :
    IHandleMessages<Events.Added>,
    IHandleMessages<Events.Removed>,
    IHandleQueries<Queries.ObjectsByName>
{
    // Keep the domain query model up to date from domain events.
    public void Handle(Events.Added e) {
        _orm.Add(new { ObjectId = e.ObjectId, Name = e.Name });
    }
    public void Handle(Events.Removed e) {
        _orm.Remove(x => x.ObjectId == e.ObjectId && x.Name == e.Name);
    }
    // Answer queries from that same model.
    public IEnumerable<object> Handle(Queries.ObjectsByName q) {
        return _orm.Query(x => x.Name == q.Name);
    }
}
0
votes

My answer is quite generic as I'm not sure I fully understand your problem domain, but there are only two main ways to tackle set validation problems.

1. Enforce strong consistency

Enforcing strong consistency means that the invariant will be protected transactionally and can therefore never be violated.

Enforcing strong consistency will most likely limit the scalability of your system, but if you can afford it then it may be the simplest way to go: preventing the conflict from occurring rather than dealing with the conflict after the fact is usually easier.

There are numerous ways strong consistency can be enforced, but here are two common ones:

  1. Rely on a database unique constraint: If you have a datastore that supports them, and both your event store and this datastore can participate in the same transaction, then you can use this approach (a more concrete sketch follows right after this list).

    E.g. (pseudo-code)

    transaction {
        uniquenessService.reserve(uniquenessKey); //writes to a DB unique index
    
        //save aggregate that holds uniquenessKey
    }
    
  2. Use an aggregate root: This approach is very similar to the one described above, but one difference is that the rule lives explicitly in the domain rather than in the DB. The aggregate will be responsible for maintaining an in-memory set of uniqueness keys.

    Given that the entire set of keys will have to be brought into memory every time you need to record a new one, you should probably cache these kinds of aggregates in memory at all times.

    I usually use this approach only when there's a very small set of potential unique keys. It could also be useful in scenarios where the uniqueness rule is very complex in itself and not a simple key lookup.

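To make the first option more concrete, here is a rough C# sketch of the pseudo-code above. All of the types (the uniqueness service, the event store repository, the aggregate) are made-up placeholders, and it assumes your event store can enlist in the same transaction as the relational table behind the unique index:

using System;
using System.Transactions;

public class ProjectCreationService
{
    private readonly IUniquenessService _uniqueness;     // backed by a table with a unique index
    private readonly IEventStoreRepository _eventStore;  // must enlist in the same transaction

    public ProjectCreationService(IUniquenessService uniqueness, IEventStoreRepository eventStore)
    {
        _uniqueness = uniqueness;
        _eventStore = eventStore;
    }

    public void CreateProject(Guid projectId, string uniquenessKey, byte[] data)
    {
        // Both writes commit or roll back together: if another transaction already
        // reserved the key, the unique index throws and nothing is persisted.
        using var scope = new TransactionScope();

        _uniqueness.Reserve(uniquenessKey);  // INSERT into the unique index
        _eventStore.Save(new ProjectAggregate(projectId, uniquenessKey, data));

        scope.Complete();
    }
}

public interface IUniquenessService
{
    void Reserve(string key);
}

public interface IEventStoreRepository
{
    void Save(ProjectAggregate aggregate);
}

// Minimal stand-in for the aggregate that holds the uniqueness key.
public class ProjectAggregate
{
    public ProjectAggregate(Guid id, string uniquenessKey, byte[] data)
    {
        Id = id;
        UniquenessKey = uniquenessKey;
        Data = data;
    }

    public Guid Id { get; }
    public string UniquenessKey { get; }
    public byte[] Data { get; }
}

The duplicate aggregate is never persisted because the unique index violation rolls the whole transaction back.
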
Please note that even when enforcing strong consistency the UI should probably prevent invalid commands from being sent. Therefore, you could also have the uniqueness information available through a read model which would be consumed by the UI to detect conflicts early.

2. Eventual consistency

Here you would allow the rule to get violated, but then perform some compensating actions (either automated or manual) to resolve the problem.

Sometimes it's just overly limiting or challenging to enforce strong consistency. In these scenarios, you can ask the business whether they would accept resolving the broken rule after the fact. Duplicates are usually extremely rare, especially if the UI validates the command before sending it like it should (hackers could bypass the client-side check, but that is another story).

Events are great hooks when it comes to resolving consistency problems. You could listen to events such as SomeThingThatShouldBeUniqueCreated and then issue a query to check if there are duplicates.

Duplicates would be handled in the way the business wants them to be. For instance, you could send a message to an administrator so that he can manually resolve the problem.
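
As a rough illustration of such a hook (all of the names here are made up; adapt them to your own messaging and read model infrastructure):

using System;
using System.Threading.Tasks;

// Illustrative event published when something that should be unique is created.
public class SomeThingThatShouldBeUniqueCreated
{
    public Guid AggregateId { get; set; }
    public string UniquenessKey { get; set; }
}

public class UniquenessViolationDetector
{
    private readonly IDuplicateQuery _duplicates;  // query against a read model
    private readonly IAdminNotifier _notifier;     // or a bus firing a compensating command

    public UniquenessViolationDetector(IDuplicateQuery duplicates, IAdminNotifier notifier)
    {
        _duplicates = duplicates;
        _notifier = notifier;
    }

    public async Task Handle(SomeThingThatShouldBeUniqueCreated e)
    {
        // The rule may already be violated at this point; we only detect it
        // after the fact and trigger whatever resolution the business wants.
        if (await _duplicates.ExistsOtherWithKeyAsync(e.UniquenessKey, e.AggregateId))
        {
            await _notifier.ReportDuplicateAsync(e.UniquenessKey, e.AggregateId);
        }
    }
}

public interface IDuplicateQuery
{
    Task<bool> ExistsOtherWithKeyAsync(string key, Guid exceptAggregateId);
}

public interface IAdminNotifier
{
    Task ReportDuplicateAsync(string key, Guid duplicateAggregateId);
}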


Even though we may think that strong consistency is always needed, in many scenarios it is not. You have to explore, together with business experts, the risks of allowing a rule to be violated for a period of time and determine how often that would occur. Sometimes you may realize that there is no real risk for the business and that strong consistency was artificially imposed by the developer.