How Do Merkle Trees Work?

Merkle Trees are a fundamental data structure in cryptography, especially when it comes to ensuring the integrity of data. They play a critical role in how we can verify large datasets efficiently, without needing to trust the source completely. This ability to verify the correctness of data in a decentralized or distributed environment makes Merkle Trees an essential tool in various applications, such as blockchain, peer-to-peer networks, and file-sharing systems.

In this blog, we’ll dive into what Merkle Trees are, how they work, and why they are so powerful in cryptographic systems. We’ll start by explaining the basic structure of a Merkle Tree and how it helps us check the integrity of data.

Whether you’re new to cryptography or already familiar with it, understanding Merkle Trees will give you a deeper appreciation of how we can ensure data integrity in today’s digital world.

To understand Merkle Trees, we first need to review the concept of a cryptographic hash function. A hash function takes an input (for example, a file or a block of data) and produces a fixed-size string of characters. A key property of these functions is that even the smallest change in the input results in a completely different output, and given some output. Additionally, it is computationally infeasible to deduce the original input.

At a high level, Merkle Trees are a cryptographic data structure that allows us to verify the integrity of large sets of data efficiently. Imagine a tree structure with nodes connected in a hierarchy: the leaves are at the bottom, and the root is at the top. In Merkle Trees, all data is stored in the leaves, and the non-leaf nodes hold cryptographic hashes of their child nodes. The root of the tree acts as a “fingerprint” for all the data stored in the leaves.

One key benefit of this structure is that if any part of the data changes, the hash of the affected leaf will change, which changes the hashes of its parent nodes, all the way up to the root hash. This cascading effect allows for quick and efficient detection of changes by comparing the root hashes.

Why a Tree Structure Instead of an Array?

We could organize data into blocks, hash each block to form an array of hashes, and then hash the entire array to produce a single digest. To validate a piece of data, we would need the hashes of all the blocks we don’t have, reconstruct the array of hashes, and hash it to compare the result with the trusted digest. This is how it would look:

However, Merkle Trees have a significant advantage: they require far fewer hashes to compute the digest and compare it with the trusted one. Only the sibling hashes leading up to the root are needed. As the number of nodes increases, the proving time for an array-based approach grows linearly, whereas the Merkle Tree approach grows logarithmically, making it much more scalable.

For example, in the next diagram, if we want to validate the data in block 4, we only need three other hashes (the green nodes) to reconstruct the root hash:

Key Advantages of Merkle Trees:

Locating Corrupted Data: If we are comparing one tree to a trusted one, and the root hash doesn’t match, we can trace the problem through the tree. By comparing the hashes of child nodes, we can find exactly which part of the data is corrupted. This reduces the amount of data we need to check. We start from the root leading up to the node.

Efficient Verification of Single Pieces: When verifying a single chunk of data, instead of rehashing the entire set, we only need the sibling hashes. This is because the tree structure allows us to verify the integrity of a piece by reconstructing the root using just a small portion of the data, saving both time and computational resources.

Example

Let’s work on an example to wrap up what we’ve learned in this entry:

Imagine you want to share a large file using a P2P (peer-to-peer) protocol. Typically, the file is split into smaller chunks and distributed across different peers. Once all the pieces are downloaded, they are reassembled to recreate the original file. However, this process comes with risks. A malicious peer could alter one of the file chunks and provide you with corrupted data, resulting in the wrong file once you combine all the parts. To prevent this, Merkle Trees offer an effective solution. Here’s how it works:

The original, trusted source of the file generates a Merkle Tree from the file’s chunks. Each chunk of data is hashed, and these hashes are combined into a tree, where the final root hash represents the entire file.
The trusted source shares only the root hash with the network. This root hash serves as a “fingerprint” for the file, ensuring that the file's integrity can be verified.
As you download chunks of the file from various peers, you can recompute the Merkle Tree based on the received chunks and compare the computed root hash with the one shared by the trusted source. If the computed root hash doesn’t match the trusted root hash, you know that one or more chunks have been altered.
Instead of rechecking all chunks, you can use the Merkle Tree to efficiently trace which chunk was corrupted. You do this by requesting the hashes of the two child nodes of the root hash, and comparing them with the hashes you calculated. This allows you to narrow down the corrupted chunk by splitting the tree in half at each step.
You continue this process, comparing hashes and descending the tree, until you locate the exact chunk that caused the issue. You can then discard the corrupted chunk and request a valid one from another peer.

This method of verifying data chunks is exactly how systems like BitTorrent ensure file integrity. Each piece of the file is independently verifiable, and the root hash ensures that the entire file can be trusted once all pieces are correctly assembled.

Try It Yourself

When studying this topic for the first time, I was looking for some way of visualizing Merkle Trees and couldn’t find any. That’s why I coded this little project so you can easily see how the hashes change when some chunk of data changes, and how the root hash is recomputed using only some nodes.

https://merkle-tree-visualizer.vercel.app/

If you need any instructions on how to use the app, hit the information icon in the upper right-hand side, or contact me on Twitter/X.