# on my rusty sparse merkle tree experiment

By [bt3gl's symposium](https://paragraph.com/@go-outside) · 2023-09-19

---

tl; dr
------

today i go over my implementation of a simple library for authenticated data structures and sparse merkle trees.

the source code, in rust, can be found [here](https://github.com/go-outside-labs/blockchain-science-rs).

* * *

🎵 today’s mood
---------------

[https://open.spotify.com/track/6AtBumPb5RsfgG3Xo6UsTW?si=2368f6231a324825](https://open.spotify.com/track/6AtBumPb5RsfgG3Xo6UsTW?si=2368f6231a324825)

**_“the first half of life is devoted to forming a healthy ego, the second half is going inward and letting go of it.” - carl jung_**

**_“the ego refuses to be distressed by the provocations of reality, to let itself be compelled to suffer. it insists that it cannot be affected by the traumas of the external world; it shows, in fact, that such traumas are no more than occasions for it to gain pleasure.” - sigmund freud_**

* * *

001\. authenticated data structures
-----------------------------------

an **authenticated data structure** (ADS) is an advanced data structure on which an **_untrusted prover_** can query for an entry, receiving a **_result and a proof_** so that the **_response_** can be **_efficiently checked for authenticity_**.

![](https://storage.googleapis.com/papyrus_images/06014b4aec8f8ef50d057f27b6750b7c7a00a9a4b4b000aac467233b25895b19.png)

you can think of ADSs as cryptographic upgrades of the [classic algorithms we are used to](https://github.com/go-outside-labs/master-algorithms-py) (such as hash maps, binary trees, or tries), with an extra operation added to the good an old `insert()`, `lookup()`, `delete()`, `update()`: the `commit()`.

in other words, we can utilize **_well-known cryptographic hash functions_** (such as SHA-256 or SHA-3) to calculate a **_collision-free_** data structure representation.

we call this hash representation string **_the_** **_commitment_** (e.g., a small and unique cryptographic representation of the data structure). commitments must uniquely determine the data structure's value (_i.e._, if `a.commit() == b.commit()`, then `a == b`).

because we trust our hash function, **_every query should have a valid collision-free proof_** + **_a valid proof should imply that the result is correct_**. this means that proofs for queries must be **complete** and **sound**, where completeness means that every valid query result has a valid proof and soundness means that valid proofs imply that the query results are correct.

> 💡 _according to seminal_ [_Miller et al_](https://www.cs.umd.edu/~mwh/papers/gpads.pdf)_, "an authenticated data structure (ADS) is a data structure whose operations can be carried out by an untrusted prover, the results of which a verifier can efficiently check as authentic. this is done by having the prover produce a compact proof that the verifier can check along with each operation’s result. ADSs thus support outsourcing data maintenance and processing tasks to untrusted servers without loss of integrity."_

### cryptographic hash functions

a _cryptographic hash function_ `H` is a special function that takes an arbitrarily long sequence of bytes and returns some fixed-size "digest" of that sequence. cryptographic hash functions have two special properties for our purposes:

*   preimage resistance: given a digest `d`, it's infeasible to calculate a string `s` such that `H(s) = d`.
    
*   collision resistance: it's infeasible to find two strings `s1` and `s2` such that `H(s1) == H(s2)`.
    

for this library, we will be using the `SHA-256` hash function, which has a 256-bit digest.

* * *

010\. authenticated (sorted) key-value stores
---------------------------------------------

an authenticated key-value store is an ADS of an "associative array" or "map".

the methods of the data structure are described in `src/kv_trait.rs` of my code:

    fn new() -> Self;
    fn commit(&self) -> Self::Commitment;
    fn check_proof(key: Self::K, res: Option<Self::V>, pf: &Self::LookupProof,
            comm: &Self::Commitment) -> Option<()>;
    fn insert(self, key: Self::K, value: Self::V) -> Self;
    fn get(&self, key: Self::K) -> (Option<Self::V>,Self::LookupProof);
    fn remove(self, key: Self::K) -> Self;
    

note that `insert()`, `get()`, and `remove()` behave like the same methods in `std::collections::HashMap`, except that `insert()` and `remove()` return copies of the ADS rather than taking `&mut self`.

you can check the full implementation of a sorted key-value store class, represented by a vectorial abstraction of a (unbalanced) binary tree (_i.e._, the underlying data structure is an array), at `src/sorted_kv.rs`.

![](https://storage.googleapis.com/papyrus_images/e6cbd92b23cc1911e7ee620eaebb6eed72f5378976c5b11d20bac3525cd3c7f6.png)

although the key sorts the structure, this is not a binary search tree, as it allows repeated entries. rather, the structure's commitment is an overall hash calculated recursively as a binary tree (this digest is obtained by mapping the merkle hash of each pair - named _"merkle mountain range"_).

* * *

011\. merkle trees
------------------

**merkle trees** are the canonical and original example of **authenticated data structures**, designed for **easy inclusion proofs** but not **easy exclusive proofs** (_e.g._, a prover can efficiently convince the verifier that a particular entry is present but not absent - _you will understand this better in the next session_).

for example, we could use a vectorial abstraction of a binary tree to represent a collection of keys and values at nodes given by an index `i`.

although there are multiple representations of an associative array, a general rule for this representation is that:

*   a right sibling could be found through `2 * ix + 2`, so `(i % 2 == 0) == true`
    
*   a left sibling could be found through `2 * ix + 1`, so `(i % 2 == 1) == true`
    

the path from a `(k, v)` node to the root would be calculated as the overall digest hash of the tree from the `(k, v)` node pointer, with all the sibling hashes.

![](https://storage.googleapis.com/papyrus_images/dcff9c6dea5258746dddb9482cd6173a1e50a991af4d60d07ac7d4cde3a524c5.png)

in other words, the calculation consists of iterating each sibling in the array of siblings' digests, attributing the hashes for left and right sub-trees, and then iteratively hashing them. in this logic, the root hash is the entire commitment of the tree.

> 💡 _since_ **_inserting at a specific position cannot be done efficiently_** _(as the whole tree would need to be recomputed), these trees are unsuitable for authenticated key-value maps._

* * *

100\. sparse merkle trees
-------------------------

now, let’s talk about something even more awesome: **sparse merkle trees**, which provide **efficient exclusion proofs** by design (try to figure out why 🕵🏻‍♀️!).

in a sparse merkle tree, a particular `(k, v)` is **represented by a leaf node** such that its path from the root is encoded by one of all possible hash digests of the chosen hash function.

so, if the chosen hash function is represented by `N` bits, the tree's height is `N`, and the paths down to the `2^N` leaf nodes are represented by (`N`\-nodes-deep) bit-strings.

![in the case of SHA-256, the tree has a height of 256 and 2^{256} leaf nodes represented by 256-bit paths.](https://storage.googleapis.com/papyrus_images/7175cb5c7ce719986b92d2b3997772f4fc6303d20da89c4c995960d83d5ca743.png)

in the case of SHA-256, the tree has a height of 256 and 2^{256} leaf nodes represented by 256-bit paths.

🕵🏻‍♀️ when the `(k, v)` entry is not present in the map, an empty leaf is assigned (for instance, with an all-`0` digest).

in other words, each possible `(k, v)` entry corresponds to a unique leaf node and is uniquely linked to its position in the tree. on the other hand, each leaf is a unique representation of the `2^N` possibilities presented by the digest size `N` of the cryptographic hash function.

### how about the branch nodes?

branches have left and right subtrees given by the digest of its child subtrees digest (something like `hash_branch(left_digest_until_here, right_digest_until_here)`, a concatenated hash of its child nodes).

so any parent branch node can be obtained by hashing together two child nodes recursively until the top to the merkle root of the tree.

> 💡 _note that different hash functions should be used for leaf and branch nodes to prevent_ [_preimage attacks_](https://en.wikipedia.org/wiki/Preimage_attack)_._

each child subtree position is found by looking at the bits of `hash_function(k)`. a left branch is denoted as `0` and a right branch as `1`, so the most left leaf's key is `0x000..00`, the next is `0x00..0`, the most right key is `0x11..11`, and so on.

in addition, we see that most leaf nodes are empty, and this is cool because the hashes of empty nodes are identical.

the same is true for interior nodes whose children are all empty: subtrees with no leaf nodes are represented as an empty subtree with a digest such as the darling all-`0` digest (and hashes of hashes of hashes of (…) of empty nodes are all predictable).

> 💡 _contrary to simple merkle trees, sparse merkle trees are a great choice for authenticated key-value maps due to the_ **_history independence_** _of the merkle root from element insertion order._

### “well, `2^N` with a loooot of zeroes sounds like waste”

yes, to naively create a sparse merkle tree, one would generate all possible `2^N` outputs of the hash function as a leaf in the tree and initialize them to the empty value. this generates a nearly unbounded number of data.

however, since required storage is only proportional to the number of items inserted into the tree, some optimizations are possible:

*   items on the leaf node are initially `None` and, therefore, do not need to be stored.
    
*   subtrees with no leaf nodes are represented as "empty subtree", with an all-`0` digest.
    
*   internal nodes all have fixed predictable values that can be re-computed as hashes of `None` values.
    

therefore, a sparse merkle tree can simply store a set of leaf nodes for the `(k, v)` pairs plus a set of empty hashes representing the sparse areas of the tree.

and because a sparse merkle is nearly balanced, if there are `N` entries in the tree, a particular entry could be _theoretically_ reached in `log2(N)` steps!

### rust is the best language, period

as we implement a sparse merkle tree in rust (using SHA-256), we can think of two different types of nodes: `SparseMerkleTreeNodeLeaf` and `SparseMerkleTreeNodeBranch` (the full source code for this class is available at `src/sparse_merkle_tree.rs`).

![](https://storage.googleapis.com/papyrus_images/1cbd27fb78b67bc58034c3be097763d1792bf5aa4bd91c8dec89ca54380fe758.png)

a very simple lookup function could look like this:

![](https://storage.googleapis.com/papyrus_images/fec3c3d9ccf65829953d241391dee1cdedf78f942a99af9ac6ad91f148f13766.png)

finally, a commit and checking proof could use the following path process:

![](https://storage.googleapis.com/papyrus_images/0242fdb0ed16b12b745abb476a6e782ad56ac0a976cf589b3974863e014ebb56.png)

* * *

101\. unit tests and fuzzing
----------------------------

to conclude this article, let’s go over some of the testing approaches in this library. first of all, i use [nix](https://\(https://nixos.org/manual/nix/stable/quick-start\)) to set up my virtual environments and you should too.

rust differentiates between "unit" tests and "integration" tests, as [described by its documentation](https://web.mit.edu/rust-lang_v1.25/arch/amd64_ubuntu1404/share/doc/rust/html/book/second-edition/ch11-03-test-organization.html). in my code, unit tests are annotated with `#[test] macro`, and integration tests are annotated with`#[cfg(test)] macro`.

alternatively, the test `hash_btree_insert_get()` is an example of a "simulation test", generating scenarios where two different data structures (`HashMap` and `BTreeMap`) are tested to behave "the same" for `insert()` and `get()` through this check:

    assert_eq!(hmap.get(&k), bmap.get(&k));
    

we extended this concept to compare the behavior of `SortedKV` and `SparseMerkleTree` against `HashMap` through, for example, `hash_sortedkv_insert_get()` (making sure we test for `SortedKV::check_proof()`):

![](https://storage.googleapis.com/papyrus_images/e1df3201f1ba43bc2a4075b8fd30b57a663675bf03d3a166116c7f884955cf3c.png)

we then use [quickcheck](https://docs.rs/quickcheck/latest/quickcheck/) for property testing, which creates several randomly generated (non-deterministic) inputs, such as in:

![](https://storage.googleapis.com/papyrus_images/2584a49281d159aa60a0eb7a7401e55aee47a3a97a13d03359f61810296dead3.png)

fun.

* * *

**◻️ motherofbots.eth**
-----------------------

---

*Originally published on [bt3gl's symposium](https://paragraph.com/@go-outside/on-my-rusty-sparse-merkle-tree-experiment)*
