A Summary of How Fraud Proofs and Data Availability Sampling Work

1. Overview

Data availability (DA) is crucial to the functionality and security of blockchains. In traditional monolithic blockchain structures, full nodes retrieve blocks from peers to verify the integrity of the entire history of their chain based on predefined rules. This process relies on the assumption that data on the blockchain remains consistently "available" for access by all nodes in the network. This principle holds not just for layer-1 blockchains but also for layer-2 blockchains, which rely on data availability to inherit the security properties of their underlying layer-1 chains. For instance, optimistic rollups require DA for fraud-proof constructions, while ZK-rollups depend on DA to allow users to keep track of their balances and ensure that the rollup full-state is recoverable so that users can withdraw their funds to L1 without the rollup operator.

In the blockchain space, there is an approach called modular architecture, which breaks a monolithic blockchain into multiple function-specific chains. A data availability (DA) layer is one of them. Projects like Celestia, EigenDA, and Espresso's Tiramisu DA are at the forefront of creating these DA layers, setting a foundation for rollups and future modular blockchains. Similarly, Ethereum aims to enhance its rollups' data availability by introducing data blobs (through danksharding and proto-danksharding).

This article delves into the key concepts presented in the paper: https://arxiv.org/abs/1809.09044. This paper introduced the data availability problem, the key idea used in Celestia, other DA layers, and data availability sampling in Danksharding—a part of Ethereum's rollup-centric roadmap. Interestingly, the approach from the paper shows that nodes can probabilistically verify data availability without downloading an entire block. Such an approach minimizes hardware requirements, allowing even light nodes to enhance the network's security. Paired with fraud proofs, light clients can be highly confident in the integrity of the blocks they receive.

2. Light Clients and Full nodes

Light clients allow users to interact with and access blockchain data (e.g., account balances and transaction history) securely without fully trusting the data provider. Instead of syncing and validating the entire blockchain like full nodes do, light clients operate on block headers. They rely on full nodes or centralized RPC services to provide the necessary data and associated proofs, allowing them to verify the information without downloading the full block content.

However, efficiency comes with trade-offs. Light clients cannot independently verify entire blocks. They need a way to ensure that the block they receive has been approved by a majority of validators, which is typically achieved by having validators sign the block header. Unfortunately, a light client cannot practically verify the signatures of all validators, especially in blockchains with thousands of them. Ethereum, for example, addresses this challenge by using a rotating set of 512 validators, known as the sync committee. This committee changes every 1.1 days (or 27 hours) and is responsible for attesting to block headers.

While this approach allows light clients to remain efficient, it does not mean they are performing validation themselves. Instead, they rely on the assumption that the majority of validators—or more specifically, the majority of the selected committee—are honest. This creates a potential vulnerability: if the majority is dishonest, light clients may follow a corrupted consensus without the evidence needed to reject it.

In contrast, full nodes download and validate the entire blockchain, which gives them the ability to reject any invalid block, even if it is supported by a dishonest validator majority. This leads to an important question: can honest full nodes, even if they are in the minority, help light clients reject invalid blocks and avoid relying on the assumption of an honest majority?

The answer is yes. That is where fraud proofs come in.

3.Fraud Proofs

The idea of fraud proofs is to allow a full node to generate a proof showing the invalidity of a block. This could be due to an invalid state transition, an incorrect state root, or the inclusion of invalid transactions. The proof can then be relayed to light clients, convincing them that the block is invalid. For efficiency, a fraud proof should include only the data necessary to validate the disputed part of the block, allowing light clients to process just those fragments and determine whether the block is valid or not, without needing to verify the entire block.

Next, let's explore an overview of how fraud proofs can be effectively integrated into a block, enabling a blockchain to natively support fraud proofs. It's important to note that ensuring data availability emerges as a natural prerequisite for generating fraud proofs efficiently.

Rather than storing just a list of transactions, the paper introduces snapshots of state transitions called traces. Traces function like checkpoints within a block. They represent intermediate states of the blockchain at specific points and are interleaved with transactions. As illustrated in Figure 1:

$trace1$ is the intermediate state resulting from applying transactions $txn1$ to $txn3$ to the final state of the previous block (i−1).
$trace2$ is produced by applying $txn4$ to $txn6$ to $trace1$ , and so on.

Figure 1 shows an example of how transactions and traces are arranged within a block.

This structure allows full nodes to efficiently generate fraud proofs using only a small subset of the block, that is, the relevant invalid intermediate state(s) and associated transactions. Light clients can then verify this subset without having to execute the entire block.

When a light client initiates a data request, it can ask the network for a fraud proof corresponding to the block in question. If no fraud proof is received within a predetermined timeout, the light client can be highly confident that the block is valid. However, this approach depends on the light client being connected to at least one honest full node that is capable of generating and serving the required fraud proof.

4. Data Availability Proofs

Although fraud proofs offer an enhanced layer of security and bring light clients closer to the security level of full nodes, a light client must remain connected to at least one honest full node to receive fraud proofs when needed. This setup depends on one critical requirement: data availability. Full nodes need access to the block data in order to construct fraud proofs.

Data availability remains essential even when moving from fraud-proof schemes to validity-proof schemes, such as those based on Zero-Knowledge Proofs (ZKPs). Without access to data, nodes may be unable to reconstruct the blockchain state. As a result, users and nodes could find themselves unable to interact with the chain, even if they are confident that the state transitions are valid. This is the essence of the data availability problem: ensuring that data is not only published but also accessible to the network.

But how can light clients verify that data is actually available without downloading the entire block?

A simple approach is to have each light client independently and randomly sample small portions of the block. With repeated sampling, a light client can be confident with high probability that the block data is available. However, this method is not fully reliable. There is always a chance that some missing parts of the block go unsampled, causing light clients to mistakenly believe the block is fully available when it is not.

To solve this issue, the research paper introduces the use of erasure codes to extend the data. Erasure coding adds redundancy, allowing the original data to be reconstructed even if parts are missing. When combined with data sampling, this approach results in a highly efficient scheme that ensures, with very high probability, that if any part of the data is missing, at least one light client will detect it during sampling.

4.1 1D Reed-Solomon Coding

An erasure code is a method that extends a message of length $k$ into a longer message of length $n$ , allowing the original message to be reconstructed from any subset of $k$ elements of the extended message. For instance, consider a message $m$ of length $k$ , denoted as $(m_0, m_1, ..., m_{k-1})$ . Using erasure codes, this message can be extended to length

Reed-Solomon coding is a specific method of encoding a message using polynomials. By interpreting the message $m = (m_0, m_1, ..., m_{k-1})$ as a polynomial $P(x)$ , we can construct an extended message as follows:

Define a polynomial $P(x)$ such that $P(i) = m_i$ for all $0 ≤ i < k$ .

The resulting message $(m_0, m_1, ..., m_{k-1}, ..., m_{n-1})$ satisfies $P(i) = m_i$ for all $0 ≤ i < n$ .

To make this more intuitive, you may recall from high school math class that, given two points on an $xy$ coordinate plane, one can create a line passing through those points. Think of erasure codes similarly. Given the initial message, represented as two points $(0, P(0) = m_0)$ and $(1, P(1) = m_1)$ , you can draw a line passing through these points. This line represents the Reed-Solomon code of the initial message. Specifically, all the points on that line are the extended message under Reed-Solomon encoding.

It is important to note that in the presence of errors, Reed-Solomon codes can correct up to (n - k)/2 erroneous elements, not (n - k). A full explanation of error correction is beyond this article's scope, but those interested can refer to source [8] for a more detailed explanation.

Figure 2 illustrates how block i can be divided into chunks.

Returning to the paper, the data we are encoding consists of the sequence of transactions and intermediate states, as introduced in Section 3 (see Figure 1). We interpret this data as a message and divide it into small, equal-sized chunks, with each chunk treated as an element of the message. As shown in Figure 2, block $i$ is partitioned into four chunks:

Chunk 1: $(txn1, txn2, txn3, Tr1)$
Chunk 2: $(txn4, txn5, txn6, Tr2)$

We then apply Reed-Solomon coding to encode this message.

However, one-dimensional encoding is not optimal. As the message length increases, the degree of the polynomial also increases, leading to higher computational complexity. In addition, if a one-dimensional encoding is generated incorrectly, it can corrupt the entire dataset, resulting in an unrecoverable block.

A solution to this problem is to encode the data in two or more dimensions. With this multi-dimensional encoding approach, an incorrectly generated code may only affect a portion of the data along specific dimensions, rather than corrupting the entire dataset as in the one-dimensional case.

In the one-dimensional setting, the size of a proof for detecting an incorrect encoding is $O(n)$ . In contrast, when using d dimensions, the proof size decreases to approximately $O(n^{1/d})$ . Conceptually, this involves reshaping a one-dimensional message of length $n$ into a $d$ -dimensional hypercube, where the length of each side is $n^{1/d}$ . This structure enables the construction of proofs along specific axes, which significantly reduces the proof size to $O(n^{1/d})$ .

4.2 2D Reed-Solomon Coding

To apply two-dimensional Reed-Solomon encoding, we first arrange the message into a square matrix instead of a one-dimensional vector. The process for encoding a message of length $n$ using 2D Reed-Solomon coding is as follows:

Reshape the matrix: Convert the message of length $n$ into a square matrix of dimensions $k \times k$ , as shown in Figure 3. If $n$ does not equal $k^2$ , pad the message as needed so that it fits into the matrix properly.
Row and column encoding: Apply one-dimensional Reed-Solomon encoding to each row and each column of the square matrix. This produces the encoded rows and columns, as represented in the upper right and lower left matrices of Figure 4.
Vertical extension encoding

Figure 3: This illustration shows the message from Figure 1, rearranged into a two-dimensional square matrix. The top portion of the figure provides a detailed breakdown of the message contents, while the bottom part presents a simplified matrix representation of the same data.

Figure 4: This figure visually depicts the 2D Reed-Solomon encoding process as it is applied to the square matrix.

Figure 5: This figure shows the final matrix resulting from the 2D Reed-Solomon encoding process.

4.3 Unrecoverable block

Understanding the minimum amount of data a malicious block producer must withhold to make a block unrecoverable is crucial. As an example, consider the matrix of original data structured as a $k \times k$ matrix, which is then extended to a $2k \times 2k$ matrix, as shown in Figure 6. Because the original data accounts for 25 percent of the total data—since the matrix expands from $k^2$ elements to $4k^2$ —one might expect that the encoding could even tolerate up to 75 percent data loss. As an example, consider the case of 50 percent data loss, where both the upper left and upper right quarters are missing. It might seem possible to recover them using the lower left quarter.

Figure 6: The image shows a kk matrix that gets extended to a 2k2k matrix, which we will use throughout this section.

Unfortunately, a malicious block producer needs to withhold only $(k + 1)^2$ out of $4k^2$ elements, which is just a bit more than 25 percent of the encoded data, to prevent an honest full node from reconstructing the entire block. This is much lower than the 75 percent data loss we previously expected. Let’s explore this more systematically:

Case (a): Suppose a malicious actor withholds the upper left quarter of the extended matrix, marked red in Figure 7(a). Although this removes a significant portion of the data, the block can still be recovered using either the upper right or lower left quarters.
Case (b): The actor tries harder by also withholding the first column of the upper right quarter, as shown in Figure 7(b). Yet even with this additional missing data, the block remains recoverable using the lower half of the matrix.
Case (c): Finally, the malicious actor withholds both the first row of the lower left quarter and the first element of the lower right quarter, as illustrated in Figure 7(c). At this point, the block becomes unrecoverable. From case (b), we know the upper right quarter cannot recover the upper left. Due to symmetry, the same limitation applies to using the lower left to recover the upper left. The remaining hope would be to use the lower right quarter to recover the lower left, then use the lower left to reconstruct the upper left, and finally restore the upper right. However, once the first element of the lower right quarter is missing, the first row of the lower left becomes unrecoverable, and the entire reconstruction process fails.

It is important to note that case (c) represents a worst-case scenario. In this case, the block producer must withhold only $(k + 1)^2$ out of $4k^2$ elements to render the block unrecoverable. However, creating such a scenario requires deliberately withholding specific parts of the block data.

Figure 7: The three diagrams illustrate the extended matrix. Red slots represent unavailable chunks, while green slots indicate available ones. In cases (a) and (b), despite having k² and k(k + 1) missing chunks respectively, the block remains recoverable. However, in case (c), with (k + 1)² missing chunks, the block becomes unrecoverable.

4.4 Security Analysis

When a light client receives a block header, the paper suggests that it should randomly sample the contents of the block. This step mitigates the risk of a malicious block producer selectively revealing only portions of the block data. The client continues sampling until it becomes confident that the data is sufficiently available for an honest full node. This assurance is essential, as it ensures that a full node can reconstruct the block and produce a fraud proof if necessary. In this way, the security of the light client relies on the process of data availability sampling, which enables an honest full node to support the light client with a fraud proof if the block is invalid.

However, one might ask: How many samples are required for a light client to be reasonably confident—say, with high probability—that the block data is available?

Building on the idea from the previous section, we know that a malicious block producer needs to withhold only $(k + 1)^2$ out of $4k^2$ elements from the extended matrix to make the block unrecoverable. The goal is to determine the probability that a light client, while randomly sampling from the matrix, encounters at least one of these unavailable elements. Assume the encoded matrix is of size $2k \times 2k$ , and the light client randomly picks $s$ independent samples. The matrix contains $4k^2 − (k + 1)^2$ available elements, meaning $(k + 1)^2$ elements are missing.

Recall from high school math that when selecting r elements from n elements without replacement, the number of possible combinations is represented as "n choose r", or

$\binom{n}{r} = \frac{n!}{(n - r)! \cdot r!}.$

Consequently, the probability that a light client samples only available elements—meaning it never encounters any unavailable data—is given by:

$p(x = 0) = \frac{\binom{4k^2 - (k + 1)^2}{s}}{\binom{4k^2}{s}}.$

The numerator, $\binom{4k^2 - (k + 1)^2}{s}$ , represents the number of ways the light client can sample s elements such that all of them fall within the available portion of the matrix. The denominator, $\binom{4k^2}{s}$ , represents the total number of possible ways to choose s elements from the entire matrix, regardless of whether they are available or not. As a result, $p(x = 0)$ gives the probability that all s samples land on available data, which is simply:

$p(x = 0) = \frac{\text{number of favorable outcomes}}{\text{total number of possible outcomes}}.$

From this, we can derive the probability that the light client encounters at least one unavailable element:

$p(x \geq 1) = 1 - p(x = 0)$

or equivalently,

$p(x \geq 1) = 1 - \frac{\binom{4k^2 - (k + 1)^2}{s}}{\binom{4k^2}{s}}.$

To obtain a cleaner form of the above equation, one could rearrange it with a few lines of math, but instead, I’ll just copy the simplified expression from the paper, which is:

$p(x \geq 1) = 1 - \prod_{i = 0}^{s - 1} \left( 1 - \frac{(k + 1)^2}{4k^2 - i} \right).$

Let’s now plot a graph of this equation to analyze the probability, which will help us understand the security of a light client more easily.

Figure 8 illustrates the probabilities p(x≥1) for various values of k.

In Figure 8, it is evident that for smaller values of $k$ , such as $k = 8$ , the probability approaches 1 much faster than for higher values of $k$ . Interestingly, once k becomes large enough—as illustrated by $k = 256$ and $k = 512$ —the probability becomes almost independent of $k$ . This raises an important question: why choose a larger value of $k$ if it requires the light client to sample more elements from the matrix to achieve the same level of confidence?

The answer lies in the definition of $k$ . As explained in Sections 4.1 and 4.2, a message is divided into small, equal-sized chunks, with the number of chunks denoted by $k$ . As $k$ increases, each chunk becomes smaller, which reduces the total size of data that a light client needs to download.

At first glance, it might seem counterintuitive that as the chunk size decreases (with increasing $k$ ), the light client needs to sample more times (s) to maintain the same level of confidence. Since the total amount of data downloaded by the light client is approximately $k \times s$ , one might expect that these two effects would cancel each other out. However, the situation is not so simple.

Because the sampling probability becomes nearly independent of $k$ once $k$ is sufficiently large, we can hold s constant while increasing k, thereby reducing the chunk size. By optimizing in this way, the total bandwidth required by the light client can be reduced without increasing the number of samples. This strategy allows the client to operate efficiently, requiring only a small amount of data beyond the block header.

As a result, a light client can achieve security close to that of a full node, while requiring only slightly more bandwidth than a traditional light client that downloads only block headers. As shown in Figure 9, the number of samples required to achieve a 99 percent probability of detecting at least one unavailable chunk stabilizes after $k = 50$ . Therefore, even if $k$ increases further (reducing the chunk size), the total amount of data the light client needs to download continues to decrease.

Figure 9: This plot shows the number of samples a light client must take to detect at least one unavailable chunk with a 99 percent probability (that is, when p(x ≥1)≥0.99.)

1. Overview

2. Light Clients and Full nodes

The answer is yes. That is where fraud proofs come in.

3.Fraud Proofs

$trace1$ is the intermediate state resulting from applying transactions $txn1$ to $txn3$ to the final state of the previous block (i−1).
$trace2$ is produced by applying $txn4$ to $txn6$ to $trace1$ , and so on.