I would consider making features a separate collection, and for each category or product, have a list of feature IDs. So for example:
Features collection:
{ id: "XXX", name: "A" }, { id: "YYY", name: "B" }
Categories collection:
{ features: [ { featureId: "XXX", value: "C" } ] }
Products collection:
{ features: [ { featureId: "YYY", value: "D" } ] }
This has several advantages:
- Conceptually, I would argue that features are independent of both categories and products. Unless you are sure that two categories will never share a feature, you shouldn't have duplicate definitions of a single feature; otherwise, if you ever want to update the feature later (e.g. its name, or other attributes), it will be a pain to do so.
- This makes it easy to tie features to products and/or categories without coupling them so tightly to the definitions within each category.
- This allows you to essentially override category features in a product, if you want, by including
the same feature in a category and a specific product. You can
decide what this situation means to you. But one way to define this
condition is that the product definition of the feature supersedes
the category definition, making for a very flexible schema.
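To illustrate that supersede rule, here's a minimal sketch in plain JavaScript, with arrays of objects standing in for the MongoDB documents (the IDs and values are made up for illustration):

```javascript
// Category and product documents referencing shared features by ID.
const category = {
  features: [
    { featureId: "XXX", value: "red" },
    { featureId: "YYY", value: "large" },
  ],
};
const product = {
  features: [
    { featureId: "YYY", value: "small" }, // overrides the category's value
  ],
};

// Product-level feature values supersede category-level ones.
function effectiveFeatures(category, product) {
  const merged = new Map();
  for (const f of category.features) merged.set(f.featureId, f.value);
  for (const f of product.features) merged.set(f.featureId, f.value);
  return merged;
}

const resolved = effectiveFeatures(category, product);
// resolved.get("XXX") === "red", resolved.get("YYY") === "small"
```

The merge order is the whole policy: apply category features first, then let the product's entries overwrite any shared IDs.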
- It allows users to search for single features across categories and products. For example, in the future, you may wish to allow users to search for a specific color across multiple categories and products. Treating features as first-class objects lets you do that without needing to kludge around it by translating a user request into multiple category_feature_ids.
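Here's a rough sketch of what that cross-catalog search looks like, again with plain objects standing in for the collections (all names here are illustrative):

```javascript
// Features as first-class records, referenced by ID from products.
const featureDocs = [
  { id: "F1", name: "color:blue" },
  { id: "F2", name: "color:red" },
];
const productDocs = [
  { name: "mug", features: [{ featureId: "F1" }] },
  { name: "shirt", features: [{ featureId: "F2" }] },
  { name: "pen", features: [{ featureId: "F1" }] },
];

// One lookup by feature name, then one filter by the resulting ID --
// no per-category feature IDs involved.
function productsWithFeature(featureName) {
  const feature = featureDocs.find((f) => f.name === featureName);
  if (!feature) return [];
  return productDocs.filter((p) =>
    p.features.some((f) => f.featureId === feature.id)
  );
}

const blue = productsWithFeature("color:blue").map((p) => p.name);
// blue is ["mug", "pen"]
```

The same two-step query works against categories, since they reference features by the same IDs.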
- You don't need a category_feature_id field because each feature has the same id across products and categories, so it's easy to reference between a product and a category.
Anyway, this is my recommendation. And if you add an index on the features array (e.g. on features.featureId) in both the categories and products collections, then doing db operations like lookups, joins, filters, etc. will be very fast.
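For reference, the index and a server-side join would look roughly like this in the MongoDB shell (collection and field names are assumptions based on the example above; $lookup requires MongoDB 3.2+, so this is a sketch, not something to paste verbatim):

```javascript
// Index the embedded feature references in both collections:
db.categories.createIndex({ "features.featureId": 1 })
db.products.createIndex({ "features.featureId": 1 })

// Join a product's feature references against the features collection:
db.products.aggregate([
  { $lookup: {
      from: "features",
      localField: "features.featureId",
      foreignField: "_id",
      as: "featureDetails"
  } }
])
```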
EDIT (to respond to your comment):
The decision to denormalize the feature name is orthogonal to the decision of where to store the feature record. Let me translate that :-)
Normalized data means you keep only one copy of any data, and then reference that data whenever you need it. This way, there is only ever one definitive source for the data, and you don't run into problems where different copies of the data end up being changed and are no longer consistent.
Under relational theory, you want to normalize data as much as possible, because it's the easiest way to maintain consistency. If you only have one place to record a customer address, for example, you'll never end up in a situation where you have two addresses and you don't know which one is the right one. However, people frequently de-normalize data for performance reasons, namely, to avoid expensive and/or frequent queries. The decision to de-normalize data must weigh the performance benefits against the costs of manually maintaining data consistency (you must now write application code to ensure that the various copies of the data stay consistent when any one of them gets updated).
That's what I mean by de-normalization is orthogonal to the data structure: you choose the data structure that makes the most sense to accurately represent your data. Then you selectively de-normalize it for performance reasons. Of course, you don't choose a final data structure without considering performance impact, but conceptually, they are two different goals. Does that make sense?
So let's take a look at your example. Currently, you copy the feature name from the category feature list to the product feature list. This is a denormalization, one that allows you to avoid querying the category collection every time you need to list the product. You need to balance that performance advantage against the issues with data consistency: now, if someone changes the name in either the product or the category record, you need application code to manually update the corresponding record in the other collection. And if you change the name on the category side, that might entail changing hundreds of product records.
I'm assuming you thought through these trade-offs and believe the performance advantage of the de-normalization is worth it. If that's the case, then nothing prevents you from de-normalizing from a separate feature collection as well. Just copy the name from the feature collection into the category or product document. You still gain all the advantages I listed, and the performance will be no worse than your current system.
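To make that maintenance cost concrete, here's a sketch of the fan-out a rename requires once the name is de-normalized (plain objects stand in for the documents; IDs and names are illustrative):

```javascript
// De-normalized layout: each product carries a copy of the feature name.
const featureList = [{ id: "F1", name: "Color" }];
const productList = [
  { sku: "A", features: [{ featureId: "F1", name: "Color", value: "blue" }] },
  { sku: "B", features: [{ featureId: "F1", name: "Color", value: "red" }] },
];

// Application code must fan a rename out to every copy to stay consistent.
function renameFeature(featureId, newName) {
  const feature = featureList.find((f) => f.id === featureId);
  feature.name = newName; // the one authoritative copy
  for (const p of productList) {
    for (const f of p.features) {
      if (f.featureId === featureId) f.name = newName; // each copied name
    }
  }
}

renameFeature("F1", "Colour");
// every embedded copy now reads "Colour"
```

In MongoDB the inner loop would be an updateMany over every document embedding that featureId, and it has to run everywhere the name was copied.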
OTOH, if you haven't thought through the performance advantages, and are just following this paradigm because "noSQL doesn't do joins" then my recommendation is don't be so dogmatic! :-) You can do joins in MongoDB quite fast, just as you can denormalize data in SQL tables quite easily. These aren't hard and fast rules.
FWIW, IMHO, I think de-normalization to avoid a simple query is a case of premature optimization. Unless you have a website serving >10k product pages a second along with >1k inserts or updates / sec causing extensive locking delays, an additional read query to a features collection (especially if you're properly indexed) will add very minimal overhead. And even in those scenarios, you can optimize the queries a lot before you need to start denormalizing (e.g., in a category page showing multiple products, you can do one batch query to retrieve all the feature records in a single query).
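The batch-query idea can be sketched like this: gather the distinct feature IDs for the whole page, then fetch them in one round trip, the equivalent of a single db.features.find({ _id: { $in: ids } }) in MongoDB (the data below is made up; a filter over an in-memory array simulates the batch read):

```javascript
// Products shown on one category page, each referencing features by ID.
const pageProducts = [
  { sku: "A", features: [{ featureId: "F1" }, { featureId: "F2" }] },
  { sku: "B", features: [{ featureId: "F2" }, { featureId: "F3" }] },
];

// 1. Collect the distinct feature IDs across the whole page.
const ids = [
  ...new Set(pageProducts.flatMap((p) => p.features.map((f) => f.featureId))),
];
// ids is ["F1", "F2", "F3"] -- four references collapsed into three IDs

// 2. One batch read instead of one query per product.
const allFeatures = [
  { id: "F1", name: "Color" },
  { id: "F2", name: "Size" },
  { id: "F3", name: "Weight" },
];
const fetched = allFeatures.filter((f) => ids.includes(f.id));
```

One $in query per page keeps the normalized schema while avoiding the per-product lookup that usually motivates de-normalizing.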
Note: there's one way to avoid both, which is to make each feature name unique, and then use that as the key. That is, don't store the featureId, just store the feature name, and query based on that if you need additional data from the features collection. However, I strongly recommend against this. The one thing I personally am dogmatic about is that a primary key should never contain any useful information. You may think it's clever right now, but a year from now, you will be cursing your decision (e.g. what happens when you decide to internationalize the site, and each feature has multiple names? What if you want to have more extensive filters, where each feature has multiple synonyms, many of which overlap?). So I don't recommend this route. Personally, I'd rather take the minimal additional overhead of a query.