DynamoDB Fundamentals
Intro
In the past few years I've been heavily using DynamoDB for various projects, and I want to share my experience in a series of blog posts. DynamoDB is a great tool, but depending on your goal it shouldn't be your default choice.
DynamoDB (and other NoSQL databases) makes it very easy to get started: it accepts unstructured data, so you can just stick it in the database and figure things out as you go. And then you end up with extra technical debt and performance issues because you didn't anticipate some of the access patterns.
For prototypes and applications that are still evolving, you will be better off starting with an SQL database, as your access patterns might still change as you build out the early version of your product. This is a constant reminder to myself as well.
This is definitely not a full deep dive into DynamoDB. There are way more features to go through, but this is a good foundation, and everything else will build on top of this post.
What is DynamoDB?
DynamoDB is a NoSQL database service provided by Amazon Web Services. It's highly scalable and can serve requests with single-digit millisecond latency (at p99) at a very large scale: during Amazon Prime Day in 2021, DynamoDB handled almost 90 million requests per second.
If you design your access patterns and partition keys correctly, you don't need to do anything extra to keep getting those benefits as your workload grows. And you don't need to expect a huge scale to use DynamoDB; you can use it for smaller applications too, but you need to know your access patterns before you design the table.
Core DynamoDB concepts
Table, Items, Attributes
You store data in DynamoDB. No surprises here.
Your data is stored as attributes. You can store various types of data: numbers, strings, binary, booleans, null values, lists and maps. These fall into three categories: scalar types (a single value), document types (can contain multiple values, i.e. lists and maps), and set types (collections of unique scalar values).
For example:
id = "ABCD123"
If you want to store a few values together in DynamoDB, you'd have the following:
id = "ABCD123"
score = 5
createdAt = "2024-10-20T11:00:00Z"
extra = {"min": 3, "max": 7}
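Under the hood, the DynamoDB API represents this item in its low-level "attribute value" JSON format, where every attribute is tagged with its type descriptor. A sketch of what the item above looks like on the wire (the attribute names come from the example; the type tags are DynamoDB's):

```python
# The same item in DynamoDB's low-level attribute-value format, as used by
# the PutItem API: S = string, N = number, M = map.
# Note that numbers travel as strings to avoid precision loss.
item = {
    "id": {"S": "ABCD123"},
    "score": {"N": "5"},
    "createdAt": {"S": "2024-10-20T11:00:00Z"},
    "extra": {"M": {"min": {"N": "3"}, "max": {"N": "7"}}},
}
```

Most SDKs offer a higher-level client that hides this wrapping from you, but it's useful to know it's there when reading raw API responses.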
Multiple values are contained together as an item. There is a limit of 400KB per item, and it's not adjustable; if you want to store more data, you can store it elsewhere and keep a pointer in the database. And your table is a collection of items.
The main idea behind many NoSQL databases is that you don't need a schema to add items to the database: you can add anything now and add extra attributes in the future. They are considered schema-less.
The only attribute required in a DynamoDB table is a primary key.
Q: Are NoSQL databases really schema-less?
It's true you can add data without any schema, but there still is a schema; it's just checked when you read the data from the database (schema-on-read). In SQL databases, the schema is enforced on write: a column with a specific type has to exist before you can add a record.
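Schema-on-read in practice means the reading code carries the schema. A tiny illustration (example data only): two items of different shapes live in the same table, and the reader decides how to handle the missing attribute.

```python
# Two items in the same table with different shapes: DynamoDB accepts both.
# The "schema" only exists in the reading code, which has to handle the
# older item that lacks the newer attribute.
items = [
    {"id": "ABCD123", "score": 5},
    {"id": "EFGH456", "score": 7, "tags": ["priority"]},
]

def read_tags(item):
    # Schema-on-read: the reader supplies a default for missing attributes.
    return item.get("tags", [])

print([read_tags(i) for i in items])  # [[], ['priority']]
```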
WCUs and RCUs
Writing and reading each item in the database consumes Write and Read Capacity Units.
Writing 1KB of data to a DynamoDB table consumes 1 WCU (writes are billed per 1KB, rounded up).
For reads, you have strongly consistent and eventually consistent reads. One strongly consistent read of up to 4KB consumes 1 RCU. When you read data with eventually consistent reads, you might get stale data that hasn't been updated yet: since DynamoDB is a distributed system under the hood, and data is written to multiple separate nodes, some nodes might still be catching up after your write, so the data could be old for a fraction of a second. Depending on your application, this might be acceptable. If you don't need strong consistency, you pay less: one eventually consistent read of up to 8KB costs 1 RCU.
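The capacity arithmetic can be sketched as a small helper. This is a rough model only, ignoring per-request minimums and transactional multipliers:

```python
import math

def write_units(item_kb: float) -> int:
    # Writes are billed per 1KB, rounded up, for the whole item.
    return math.ceil(item_kb)

def read_units(item_kb: float, consistent: bool = True) -> float:
    # Strongly consistent reads: 1 RCU per 4KB, rounded up.
    # Eventually consistent reads cost half as much.
    units = math.ceil(item_kb / 4)
    return units if consistent else units / 2

print(write_units(100))                 # 100 WCUs to write a 100KB item
print(read_units(100))                  # 25 RCUs to read it back, strongly consistent
print(read_units(8, consistent=False))  # 1.0 RCU for an 8KB eventually consistent read
```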
Another quick question for you:
If you write an item that's 100KB, it consumes 100 WCUs, and reading it back costs 25 RCUs. How many WCUs would an update of just 1KB within this 100KB item consume?
100 WCUs: updates are charged for the whole item, not per attribute. We'll discuss what to do about it in the next post. (It's vertical partitioning of data.)
Primary Key
The other core part of DynamoDB is the primary key. The primary key is the only required item attribute. You can use either a Partition Key alone (a single value) or a Composite Key: a combination of a Partition Key and a Sort Key. You have to choose between a Partition Key and a Composite Key when you create the table, and you can't change it later. We'll dive deeper into Partition Keys and Sort Keys in the next section.
If you only have a Partition Key, its value has to be unique; if you have a Composite Key, the combination of Partition Key and Sort Key has to be unique.
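The uniqueness rule can be illustrated with a tiny in-memory sketch (example data only): with a Composite Key, the (PK, SK) pair identifies an item, so writing the same pair again overwrites the item (PutItem semantics) rather than creating a duplicate.

```python
# In-memory sketch of Composite Key uniqueness: the table is keyed by the
# (Partition Key, Sort Key) pair, so a repeated pair replaces the item.
store = {}

def put_item(pk, sk, **attrs):
    store[(pk, sk)] = {"PK": pk, "SK": sk, **attrs}

put_item("ORDER#1", "ORDER#META", status="new")
put_item("ORDER#1", "ORDER#LOG#1", msg="created")
put_item("ORDER#1", "ORDER#META", status="paid")  # overwrite, not a duplicate

print(len(store))  # 2
```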
Partition Key
In DynamoDB, the Partition Key identifies the nodes where your data is stored: an internal hash function takes the Partition Key value and calculates a hash that maps to the physical node holding that data.
We haven't touched on limits yet, but a partition has one: a single Partition Key can hold at most 10GB of data. Ideally you want to avoid a single partition growing too large, as that can affect the performance of your requests. So even if you use a combination of Partition and Sort Keys, all data for that Partition Key still has to fit in 10GB.
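The hash-to-partition idea can be sketched in a few lines. DynamoDB's internal hash function is not public, so MD5 and the fixed partition count here are illustrative stand-ins only:

```python
import hashlib

NUM_PARTITIONS = 4  # hypothetical; DynamoDB manages the real count for you

def partition_for(pk: str) -> int:
    # Stand-in for DynamoDB's internal hash: map the Partition Key value
    # to one of the partitions deterministically.
    digest = hashlib.md5(pk.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# All items sharing a Partition Key land on the same partition,
# regardless of their Sort Key.
assert partition_for("ORDER#1") == partition_for("ORDER#1")
print(partition_for("ORDER#1"), partition_for("ORDER#2"))
```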
Sort Key
For Sort Keys, you can use strings, binary or numbers, but you almost always want strings, since they enable hierarchical prefix patterns. You can't change the SK type once the table has been created.
Sort Keys aren't only for sorting. In the next post, I'll dive deeper into how to use a single DynamoDB table for storing multiple entity types; this will also help you optimise your database reads for consistently low latency. Just to give you an example, these are perfectly valid Sort Keys (simplified for illustration; for larger datasets you'd use more concise naming):
PK: ORDER#1
SK: ORDER#META
PK: ORDER#1
SK: ORDER#LOG#20241124T150327Z
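A small in-memory sketch of how a Query with a Sort Key prefix condition (the equivalent of DynamoDB's begins_with) behaves against the items above. Example data only; a real table keeps items within a Partition Key sorted by Sort Key for you:

```python
# Items within one Partition Key, as in the example above.
table = [
    {"PK": "ORDER#1", "SK": "ORDER#META", "status": "shipped"},
    {"PK": "ORDER#1", "SK": "ORDER#LOG#20241124T150327Z", "msg": "packed"},
    {"PK": "ORDER#1", "SK": "ORDER#LOG#20241125T090000Z", "msg": "shipped"},
]

def query(pk: str, sk_prefix: str):
    # Sketch of Query + begins_with(SK, prefix): filter by Partition Key,
    # match the Sort Key prefix, return results in Sort Key order.
    matches = [i for i in table if i["PK"] == pk and i["SK"].startswith(sk_prefix)]
    return sorted(matches, key=lambda i: i["SK"])

logs = query("ORDER#1", "ORDER#LOG#")
print([i["msg"] for i in logs])  # ['packed', 'shipped']
```

Because the timestamps are zero-padded ISO-like strings, lexicographic Sort Key order is also chronological order, which is exactly why strings are the usual SK choice.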
Q: What benefit does this SK method provide us?
You can fetch an order together with all its related records, or just its logs via a prefix match on the Sort Key, in a single query, already sorted by timestamp.
Indexes
In addition to the main table, DynamoDB supports indexes: eventually consistent alternative representations of your data.
DynamoDB has two types of indexes: Local and Global Secondary Indexes. It's highly unlikely you'll use Local Secondary Indexes, so I'll focus only on Global Secondary Indexes (GSIs). The main difference is that local indexes use the same Partition Key with a different Sort Key, while GSIs can have their own Partition and Sort Keys.
Global Secondary Indexes allow you to query your data by another primary key. For example, an item with PK ORDER#1 and SK ORDER#META could also carry a secondary index PK and SK, such as PK: toprocess, or PK: USER#userid with SK: timestamp. This lets you query all data for a user, or get the user's records sorted by time.
Secondary index primary keys don't have to be unique, i.e. you can have duplicate primary keys in the index, but the other constraints are the same: the 10GB limit per partition still applies, and you can still run into hot partitions.
You also don't have to propagate all attributes into a secondary index, so it can be more lightweight. Indexes are eventually consistent, meaning there can be a slight delay before the data appears in them. Additionally, propagating data to secondary indexes consumes extra WCUs when you write items to the main table.
And you don't have to add both the GSI PK and SK to every item, which acts as a filter too: if you remove the ToProcess PK attribute from an item, the item is removed from the secondary index. By the way, be careful with a constant value like ToProcess as an index PK, depending on the amount of data you have: you can't have more than 10GB of items with PK ToProcess, and it can easily become a hot partition.
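This pattern is often called a sparse index. A quick in-memory sketch, with GSI1PK as a hypothetical attribute name (example data only): only items that carry the index's PK attribute appear in the index, so dropping the attribute drops the item from the index.

```python
# Sketch of a sparse GSI: querying the index only sees items that still
# carry the index's Partition Key attribute ("GSI1PK", a name we made up).
items = [
    {"PK": "ORDER#1", "SK": "ORDER#META", "GSI1PK": "ToProcess"},
    {"PK": "ORDER#2", "SK": "ORDER#META"},  # processed: GSI1PK removed
]

def gsi_query(pk_value: str):
    # Items without the attribute simply never enter the index.
    return [i for i in items if i.get("GSI1PK") == pk_value]

print([i["PK"] for i in gsi_query("ToProcess")])  # ['ORDER#1']
```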
Time To Live
Another valuable feature of DynamoDB is Time To Live (TTL). You can mark items in the database to be automatically deleted when they expire. Deletion isn't guaranteed to happen exactly at the expiry time (it can take days), as it might take some time for the table to remove the items, so you need to filter expired items out in your queries. But this feature lets you avoid removing these values manually.
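Because deletion is lazy, application reads have to treat expired-but-not-yet-deleted items as gone. A sketch of that filtering, assuming a TTL attribute named expiresAt (the attribute name is whatever you configure when enabling TTL; DynamoDB expects it as a Unix epoch in seconds):

```python
import time

def live_items(items, now=None):
    # Keep only items whose TTL hasn't passed; items without the TTL
    # attribute never expire.
    now = now if now is not None else int(time.time())
    return [i for i in items if i.get("expiresAt", float("inf")) > now]

items = [
    {"id": "A", "expiresAt": 1_700_000_000},  # expired, awaiting deletion
    {"id": "B", "expiresAt": 9_999_999_999},  # far in the future
    {"id": "C"},                              # no TTL set: never expires
]
print([i["id"] for i in live_items(items, now=1_800_000_000)])  # ['B', 'C']
```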
How data is stored in DynamoDB
DynamoDB data is stored on nodes. Each node holds SSD-backed partitions of up to 10GB, and each partition has a limit of 3,000 RCUs and 1,000 WCUs. The data is also replicated across multiple Availability Zones in AWS.
When your data grows, a partition is automatically split into two or more partitions (splits also happen to accommodate provisioned throughput). This is a background process your application doesn't need to handle.
Each partition has a B-tree index over its items for optimised requests, organised in multiple levels: the first level is the Partition Key, and the second level (if present) is the Sort Key.
For optimal requests, you want a nice distribution of your Partition Keys across partitions. If one partition holds 10GB of data while all the others hold just 1-2GB, requests against the 10GB partition will take longer to process.
What’s next
There are way more topics to go through. In this post I've focused only on the fundamentals that I've seen help people the most in understanding DynamoDB. In the next post I'll go through single-table design: using one table to store multiple entity types. If you've only used SQL databases, this sounds wild, but it does work.