Deep dive into the internals of an immutable database, immudb
Table of Contents
- About immudb
- Basic Building Blocks
- Lifecycle of a transaction in immudb
I've been working on a fast immutable database, called immudb (developed by Codenotary), and was very curious about the immutable nature of storage. I have written this blog to explain how data is stored in an immutable manner in the database and to understand the data structures (and cryptographic) algorithms used internally.
Immutable databases are used for tracking changes in sensitive data, and for auditing purposes. Tamper-evident data means you can cryptographically prove that the data hasn’t been unexpectedly changed. Knowing the provenance, integrity, and sequence in which your data was created means you’ll be able to monitor that the data has not been modified by an unknown user in an unauthorized way, which simplifies regulatory compliance and audit. This is an essential feature within the scope of many activities that collect critical user information and solve issues like insider attack threats, abuses, and patterns on the system through an audit.
A few decades back, storage was expensive, and so was CPU. This is why research in the past on database storage strongly considered storing information efficiently whilst utilising the least amount of space on disk. Few of the earliest users of databases were finance (and accounting) companies, as it made sense to store their ledger (or accounting) data in a database. Accounting ledgers are immutable. Entries are only appended to the ledger. Due to disk constraints, many databases added the support to UPDATE a record to utilise the disk efficiently. This is one reason why modern database storage systems are a bit complicated when compared to append-only storage. And this is why they are slow too, as efficient on disk data management (which requires UPDATES) requires complex data management techniques.
immudb is an open source Immutable Database that supports Cryptographical verification, tamper-resistance, and audit. It has support for both Key-Value and SQL and has high-performance and scalability solutions when compared to its competitors in the market.
Let's dive deep into how a record is stored on disk by immudb, and the data structures and algorithms used internally.
Basic Building Blocks
immudb consists of the following data structures:
- Append-only Logs
- Merkle Trees
- Timed B-Tree
Append-only log, also called write-ahead log, is a family of techniques for providing atomicity and durability (two of the ACID properties) in database systems. An append-only log is an auxiliary disk-resident structure used for crash and transaction recovery. Append-only log files keep a record of data changes that occur by writing each change to the end of the file. In doing this, anyone could recover the entire dataset by replaying the append-only log from the beginning to the end. It is also fast, as compared to other database storage systems, because writes are only being appended to a file.
It is a fundamental building block for our immutable database system. Since new updates are layered over the previous ones, developers can time-travel and look into the past versions of a record.
A Merkle tree is an authenticated data structure organized as a tree.
Authenticated means the integrity of the data structure can be efficiently verified using the (Merkle) root of the tree. The dataset cannot be altered without changing the Merkle root. The underlying data structure is organized as a tree, where each parent node is obtained by hashing the data from the child nodes in the layer below.
This is how a merkle tree is built:
- Individually hash each element of the original dataset into a leaf node.
- Hash together pairs of leaf nodes and store the resulting value in a parent branch node. The hashing function to obtain a leaf or branch node is different.
- Keep hashing pairs of branch nodes until you get to a single top branch node, the root of the tree.
Unlike an ordinary Merkle tree, immudb tree can produce inclusion and consistency proofs that allow verifying if a particular key is present in the transaction (inclusion proof) and if the later version includes everything in the earlier version, and all new entries come after the entries in the older version (consistency proof).
Inclusion proofs tell us to not simply trust the record, but to verify it by asking for proof.
Why create a tree when a hash chain could just work as such? Well, with a tree, it is possible to create proofs-of-inclusion for any data present in the original dataset. When a client gets a record from immudb, it does not simply trust the record. Instead, it requests an inclusion proof from the log and verifies that the proof is valid. This links the content of that record back to a tree head hash. The client then calculates the hash of the record and the chain of hashes up the tree to the tree head hash.
The root hash is the only part that needs to be stored on chain. To prove a certain value, you provide all the hashes that need to be combined with it to obtain the root. For example, to prove
C you provide
There are two entities involved in a proof-of-inclusion: the Verifier and the Prover. The Verifier’s responsibility is to check that certain data is part of the original dataset, while the role of the Prover is to prepare the proof and give it to the Verifier.
Inclusion Proof calculation is as follows:
- Verifier sends the Prover the transaction for which it wishes to obtain a proof-of-inclusion.
- The Prover finds this transaction in the Merkle tree and its index, indicating the transaction position in the tree.
- The Prover collects the intermediary hashes, and the transaction root and sends them to the Verifier.
- The Verifier uses the hashes and the transaction data it already has to recalculate the Merkle root and check that the root is equal to the one it originally had.
Consistency proofs tell us nothing was tampered with.
A consistency proof verifies the consistency between a log at two points in time. It verifies that the later version includes everything in the earlier version, in the same order, and all new entries come after the entries in the older version.
Merkle trees that accept input datasets with varying length are susceptible to second pre-image attacks. In a second pre-image attack, the goal of the attacker is to find a second input that hashes to the same value of another given input x. In Merkle trees, this corresponds to having two different input datasets giving the same Merkle root. To solve this problem, we prepend a different flag to the data before hashing it, depending on whether the resulting node is a leaf or branch node.
immudb Merkle Tree internals
immudb maintains three types of merkle trees:
Main Merkle Tree
Each database has one main Merkle Tree where its inputs are built from transaction hashes/ This tree is persisted in the storage layer. It is also appendable - each new transaction adds one extra leaf node with transaction hash (
Alh) and rebuilds the path to the root.
Internal Merkle Tree
Each transaction contains its internal Merkle Tree built from Key-Value-Metadata entries that are part of that single transaction. The root hash of that Merkle Tree (
EH) along with transaction metadata and additional linear proof information is used to build the transaction hash that is then input for the main Merkle Tree. This tree is not stored on disk and is recalculated on-demand from data that is building up the transaction.
The additional linear proof for the transaction is a mechanism implemented in immudb that is meant for handling peak transaction writes where the writes are much faster than the main Merkle Tree recalculation. It also forms a linear chain for all transactions since the beginning of the database.
You can read more about the proofs in detail in this excellent documentation from the team here.
Indexing is the first thing that comes to mind when we consider a database's performance. B-tree is an m-way tree that self-balances itself. Due to their balanced structure, such trees are frequently used to manage and organise enormous databases and facilitate searches, especially range queries. A B-Tree index speeds up data access because the storage engine doesn’t have to scan the whole table to find the desired data.
- keeps keys in sorted order for sequential traversing
- uses a hierarchical index to minimize the number of disk reads
- uses partially full blocks to speed up insertions and deletions
- keeps the index balanced with a recursive algorithm
immudb stores the index in a slightly modified version of a typical B-Tree, called a timed B-tree. A TBtree (a timed B-Tree) stores indexes for records in a database for lookups by the full key, a key range, or a key prefix. They are useful only if the lookup uses a leftmost prefix of the index. The index is useful for the following kinds of queries:
- Match the full key
- Match a leftmost prefix
- Match a range of values
- Match one part exactly and match a range on another part
As the tree’s nodes are sorted, they are helpful for both lookups and ORDER BY queries (finding keys in sorted order). In general, if a B-Tree can help find a row in a particular way, it can help you sort rows by the same criteria. So, the index will be helpful for ORDER BY clauses that match all the types of lookups we just listed. Why is this required you may ask? For SQL support.
Our SQL support layer is built on top of the TBtree too. Any SQL schema for a table contains information about the column and datatype.
CREATE TABLE table_name ( column1 datatype, column2 datatype, column3 datatype, .... );
immudb stores the schema and records for a table as any other key-value pair in it's append-only log storage. But to identify the SQL records, a prefix is appended on the key to help identify the schema (or catalog in immudb terminology).
With help of B-Tree for lookups by the full key, a key range, or a key prefix, immudb managed B-Tree support. Also revisions/history for keys are stored in the TBtree, this provides support for fast time-travel for any key by just querying the index.
Note that the values are not stored against the key in the Btree, rather we store the offset of the value in the value-log (explained below) against a key for faster lookups.
Each new transaction in immudb currently creates a snapshot to allow concurrent operations on the database. I'll write more on how MVCC support works in immudb in an upcoming blog.
Persistence (storage on disk)
The immudb storage layout consists of the following append-only logs per database:
- AHT: Append-only Hash Tree. Each database has one main Merkle Tree where its inputs are built from transaction hashes. This tree is persisted in the storage layer. It is also appendable - each new transaction adds one extra leaf node with transaction hash (
Alh) and rebuilds the path to the root. The AHT (Append-only Hash Tree) is a versioned Merkle tree where records are stored as digests left-to-right at the lowest level, following an append-only model. Like a hash chain, this AHT is persistent, with the difference that it supports efficient inclusion and consistency proofs. We can see that the AHT grows always from left to right. This growth property allows us to make efficient claims about the past because we can reconstruct old versions of the tree by pruning specific branches from right to left.
- Transaction Log: An append-only log for storing the transaction headers for a transaction. This helps in reading the header information of a transaction easily, as the header size is fixed.
- Commit Log: An append-only log for storing the information about the commits to a database. This log stores the transaction in an ordered way and is used for database recovery as transactions written in this log can be considered fully committed.
- Value Log: An append-only log for storing the actual values of key-value pairs within a transaction. The underlying data structure is again an append-only log. This log is kept separate for faster reads, and because many other data structures internally refer to the values, storing it in a separate log provides ease of access. The B-Tree index and transaction headers do not store the value itself but refer to the offset of the value in the value-log appendable.
- Index Log: An append-only log for storing the index for transactions in a database. Internally, immudb uses a B-Tree to index database records, and provide SQL support, and the index log is the B-Tree storage on disk.
Lifecycle of a transaction in immudb
When a transaction is sent to the immudb server, the following events take place:
- Key-Values are encoded in a transaction
- The values are recorded in the value-log appendable, and the offset of those values are stored in the transaction header (transaction log).
- The accumulated linear hash (ALH) is built by chaining the hash of the current transaction + previous transactions ALH + the inner hash of the transaction. This value is then appended to the AHT where the merkle tree is stored and proofs are requested from.
- There could be many concurrent transactions happening on the server, and the commit log stores information of transaction ordering. The Transaction header offset (from the transaction log) and size of the transaction are the only information stored in the commit log.
- Lastly the keys and value offset information is stored in the B-Tree index, and the indexing happens asynchronously and can be recreated on a restart of the database.
You can check out the codebase here, also feel free to raise any doubts you have by creating an issue or joining our Discord: