Data Modelling for Document Databases
1. Introduction
Designing the schema for a document database requires a very different approach from the typical relational database method. There are many applications for which both relational and document databases are a viable choice. However, designing a document database in the same way one would a relational database will result in a poorly performing and unscalable application.
We will go over how to make the right decisions when designing a document database schema.
2. Denormalization
Well-designed document databases are denormalized. This means that they duplicate the same data in multiple entries in the database.
By contrast, relational databases are normalized as they typically contain a single entry per unique entity. When using relational databases, if one entity must be retrieved together with a related entity, the entries must be joined together.
Document databases typically do not support joins. The ideal document database approach for this is to include related entity data in the main entity’s document. This allows us to efficiently retrieve the desired data by reading a single document.
Consider an application that contains a user view, which displays user data along with the name of the user’s role.
The application also contains a role view, which displays the role’s name and the role’s description. Given these requirements, we could expect the document for a role to look like this:
{ "id": "77f14fb8-5909-469c-b8fc-77a332aac81f", "name": "admin", "description": "this user can do anything" }
How can we retrieve the role’s name together with the other user data? We already know that we cannot join the user document with the role document.
The solution is to denormalize the role data by duplicating the role name in to the user document like this:
{ "id": "d93a74f9-beea-4e27-adbb-9e5397019cc9", "username": "user1", "role":{ "name":"admin" } }
The above document allows us to efficiently retrieve the user data and the role’s name by simply reading one document.
3. One-to-Few vs One-to-Many
When modelling relational databases, we often speak of one-to-one, one-to-many and many-to-many relationships. When modelling document databases, we must also consider one-to-few relationships.
A one-to-few relationship is conceptually very similar to a one-to-many relationship. However, while one-to-many relationships link the independent entity to a potentially unlimited number of dependent entities, a one-to-few relationship must only link a bounded number of dependent entities to the independent entity. The upper bound should be a reasonably low number. Let’s compare an example of a one-to-many relationship with a one-to-few relationship.
3.1 One-to-Many Example
Consider a country as the independent entity, and consider professional soccer players who are born in that country as the dependent entities.
A country can easily have over a thousand soccer players linked to it. Thus the number of links is not bounded by a reasonably small number. Furthermore, the number of players linked to the country can increase over time in an unpredictable way. We can thus conclude that this is a one-to-many relationship.
3.2 One-to-Few Example
Consider the starting lineup for a soccer team as the independent entity, and the players who are in that lineup as the dependent entities.
We know that a single soccer team’s starting lineup for a game cannot have more than eleven players. The number of links is thus bounded by a reasonably small number. Furthermore, we know that the number of links will not grow over time as it can never be greater than eleven. We can thus conclude that this is a one-to-few relationship.
4. Embedding vs Referencing
We already know that document databases require denormalization. However, not every entity needs to be denormalized. There are many cases where including a reference to the related document will provide better performance and scalability. The question of whether to embed related data or to reference the related document is thus crucial.
The user document from section 2 is an example of embedding related data on a document. If we choose instead to reference the role document, the user document will look like this:
{ "id": "d93a74f9-beea-4e27-adbb-9e5397019cc9", "username": "user1", "roleid": "77f14fb8-5909-469c-b8fc-77a332aac81f" }
Two key factors to consider when choosing to embed or reference, are the type of relationship between the entities and the common queries performed by the application.
4.1 Relationship Type
4.1.1 One-to-One
For one-to-one relationships whereby the application must retrieve the main entity and the linked entity together, the linked entity should be embedded in the main entity’s document. This will allow for a very efficient retrieval of all required data by reading one document.
Consider an application with a user view which displays both user and user-settings data. The user-settings entity represents the settings for that individual user and they are thus one-to-one. We will embed the user-settings data in the user document :
{ "id": "93ea242e-6105-4ca9-8afb-0150942e8981", "username": "user1", "usersettings" : { "language":"english" } }
4.1.2 One-to-Few
One-to-few relationships have the same considerations as one-to-one relationships. If the application must retrieve the dependent entities’ data together with the independent entity, the dependent entities must be embedded in the independent entity’s document.
Consider an application which must display names of the players in the starting lineup for a soccer team. Our document for the starting lineup would look like this:
{ "id": "e4dc7da1-7593-4360-bf27-6410495c019a", "name": "manchester united", "players" : [ {"name":"de gea"}, {"name":"wan-bissaka"}, {"name":"lindelof"}, {"name":"maguire"}, {"name":"shaw"}, {"name":"fred"}, {"name":"matic"}, {"name":"greenwood"}, {"name":"fernandes"}, {"name":"james"}, {"name":"martial"} ] }
4.1.3 One-to-Many
Referencing should be used for one-to-many relationships. The dependent entity’s document can contain a field which indicates the id of the independent entity.
Embedding can result in the independent entity’s document growing to an enormous size. Many document databases define a size limit for an individual document. For example, Cosmos Db currently allows a maximum size of 20 MB.
Furthermore, embedding the dependent entities can result in the independent entity’s document being updated very frequently. Every update to a relevant field on the dependent entity as well as any new links to dependent entities will require an update on the independent entity’s document.
Consider an application that tracks the countries which professional soccer players were born in. A country’s document could be defined like so:
{ "id":"d3decce1-d6d2-40fa-8979-fe64dc619f34", "name": "france" }
Each player would then have their own document which contains a reference to the country:
{ "id": "c538962f-671e-4749-a55d-f19d45165521", "countryid":"d3decce1-d6d2-40fa-8979-fe64dc619f34", "name": "kylian mbappe" }
4.2 Common Queries
Whether or not an application must retrieve related entities together is an important consideration when deciding whether to reference or embed.
Let’s revisit our one-to-one relationship example from section 4.1.1. We had decided that the user-settings data should be embedded in the user document. However, we can imagine a slightly different set of requirements which will make referencing a better option.
The application no longer needs to display the user and user-settings data together. Now the application must have a user view which displays only the user data. We can click a link on the user view to navigate to the user-settings view, which will only display the user-settings data.
With these new requirements, the ideal solution is to reference the user-settings id in the user document like this:
{ "id": "93ea242e-6105-4ca9-8afb-0150942e8981", "username": "user1", "usersettingsid" : "bcfd3188-6dc7-443a-bcdf-48c7f8eb89ff" }
When the application loads the user view, it will retrieve the above document which contains all the data it needs to display. When we click the link to navigate to the user-settings view, the application will use the value of the usersettingsid field to retrieve the user-settings document:
{ "id" : "bcfd3188-6dc7-443a-bcdf-48c7f8eb89ff", "language" : "english" }
4.3 Referential Integrity
Unlike relational databases, document databases will not maintain referential integrity themselves. It is the application’s responsibility to maintain referential integrity and to update all instances of denormalized data when needed.
5. Partitioning
One of the greatest advantages of document databases is that they are horizontally scalable. This means that they can be distributed across multiple servers. However, we must tell the document database how to split the data over the different servers. We do this by defining a partition key.
A partition key must be a field which exists on every document in the database. Documents which have the same value for their partition key field are considered to be in the same partition. All documents within a single partition must live on the same server.
The choice of partition key can have a huge impact on the application’s performance, scalability and infrastructure costs. Let’s review some important factors when selecting a partition key.
5.1 Cross-Partition Queries
Consider a document database which has five large partitions, each of which live on a different server. It’s easy to imagine that a query which must search all five partitions will be much slower than a query which only searches a single partition.
Queries over multiple partitions are called cross-partition queries and due to their low performance, we should seek to minimize them when selecting a partition key. This doesn’t mean that an application should make no cross-partition queries at all, but it does mean that cross-partition queries should be the exception rather than the application’s most frequently used queries.
Consider an application which allows users to view, create and modify to-do lists. A user can only view and modify their own to-do lists and cannot access the to-do lists of other users. A good choice for a partition key would be the userid. This would result in zero cross-partition queries, because a user would only ever be access data within their own partition.
One may worry that if the application has a thousand users, this would result in a thousand partitions as well. However, having a large number of partitions is not a concern. A thousand partitions does not mean a thousand servers are required. One server can contain many partitions. It is the individual partition which cannot be spread over multiple servers.
5.2 Number of Partitions
It is generally better to have many smaller partitions instead of a few big ones. A document database with a thousand partitions could technically scale to a thousand servers if necessary. A document database with three partitions would be restricted to a maximum of three servers. Thus, it is generally advisable to pick a partition key which results in a reasonably large number of partitions.
There are some cases where a small number of partitions or even a single partition can work well. A suitable case for a single partition would be for an application which always retrieves all documents in the database together. If there were multiple partitions, this would result in a slow cross-partition query.
A single partition would only be advisable for smaller databases. Several document databases define a maximum size for a single partition. For example, Cosmos Db currently has a maximum size limit of 20GB for a single partition.
5.3 Distributing Workload
Ideally, every physical partition (server) in a document database should receive a similar number of reads and writes. Partitions which receive a much higher workload than the others are known as hot partitions.
Hot partitions can result in the underutilization of the provisioned capacity of other partitions, which would lead to unnecessarily high infrastructure costs. This is why we must select a partition key which distributes workload somewhat uniformly between partitions.
Consider an application which allows users to create to-do lists for a particular date. Most users will create to-do lists for today. If we chose the date field as the partition key, it would result in a much higher write workload for today’s partition. A better choice of partition key would be userid, since the maximum difference in writes between two users is much lower than between two days.