
How to properly measure a (blockchain) system is one of the least talked about but most significant steps in its design and evaluation. There are numerous consensus protocols and variations with various performance and scalability tradeoffs. But as of yet, there is still no universally agreed-upon, reliable method that enables apples-to-apples comparisons. In this blog post, we outline a method inspired by measurements in data-center systems and discuss common errors to avoid when evaluating a blockchain network.
Key metrics and their interaction
Two important metrics should be taken into account when developing a blockchain system: latency and throughput.
The first thing that users are concerned with is transaction latency, or the amount of time between initiating a transaction or payment and receiving confirmation that it is valid (for instance, that they have enough money). In classical BFT systems (e.g. PBFT, Tendermint, Tusk & Narwhal, etc), a transaction is finalized once it gets confirmed, whereas in longest-chain consensus (e.g. Nakamoto Consensus, Solana/Ethereum PoS), a transaction may get included in a block and then reorged. As a result, we need to wait until a transaction is "k-blocks deep," resulting in a latency that is significantly greater than a single confirmation.
Second, the throughput of the system is typically important to system designers. This is the total load that the system handles per unit of time, expressed typically in transactions per second.
At first glance, these two key metrics appear to be the inverse of one another. Because throughput is measured in transactions per second and latency is measured in seconds, we would naturally expect that Throughput = Load / Latency.
This, however, is not the case. This realization is difficult because many systems tend to produce graphs that display either the throughput or the latency on the y-axis with something like the number of nodes on the x-axis. Instead, a better graph to generate is a throughput/latency graph, which makes it apparent that the relationship between the two is not linear.
When there is little contention, latency is constant, and throughput can be varied simply by changing the load. This occurs because there is a fixed minimum cost to commit a transaction and the queue delay is zero at low contention, resulting in "whatever comes in, comes out directly."
At high contention, throughput is constant, but latency can vary simply by changing the load.
This is because the system is already overloaded, and adding more load causes the wait queues to grow indefinitely. Even more counterintuitively, the latency appears to vary with experiment length. This is an artifact of infinitely growing queues.
All of this is visible on the classic "hockey stick graph" or "L-graph," depending on the interarrival distribution (as discussed later). As a result, the key takeaway from this blog post is that we should measure in the hot zone, where both throughput and latency affect our benchmark, rather than at the edges, where only one or the other matters.
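Both regimes fall out of a toy single-server queue (a sketch for intuition, not a model of any particular blockchain); the 1 ms service time, and hence the ~1000 TPS capacity, is an arbitrary assumption:

```python
import random

def mean_latency(rate_tps, service_s, n=20000, seed=0):
    """Toy single-server FIFO queue: Poisson arrivals at rate_tps,
    fixed per-transaction service time. Returns the mean latency
    (queueing delay + service) over n simulated transactions."""
    rng = random.Random(seed)
    arrival, server_free_at, total = 0.0, 0.0, 0.0
    for _ in range(n):
        arrival += rng.expovariate(rate_tps)  # Poisson interarrival gap
        start = max(arrival, server_free_at)  # wait if the server is busy
        server_free_at = start + service_s
        total += server_free_at - arrival     # commit time minus send time
    return total / n

# 1 ms of service per transaction -> capacity around 1000 TPS.
# Latency is nearly flat at low load and blows up near saturation.
for load in (100, 500, 900, 990):
    print(f"{load:4d} TPS -> {mean_latency(load, 0.001) * 1000:.2f} ms")
```

At 100 TPS the mean latency sits just above the fixed service cost (the flat part of the curve); at 990 TPS, queues dominate and the measured latency also becomes sensitive to how long the experiment runs, exactly the artifact described above.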
Measuring Methodology
When conducting an experiment, there are three main design options:
Open vs. closed loop
There are two primary methods for controlling the flow of requests to the target. An open-loop system is modeled by n = ∞ clients that send requests to the target according to a rate λ and an interarrival distribution, e.g., Poisson. A closed-loop system limits the number of outstanding requests at any given time. The distinction between an open- and a closed-loop system is a characteristic of a particular deployment, and the same system can be deployed in different scenarios. For instance, a key-value store may serve thousands of application servers in an open-loop deployment or just a few blocking clients in a closed-loop deployment.
Testing for the correct scenario is essential because, in contrast to closed-loop systems, which typically have latencies constrained by the number of potential outstanding requests, open-loop systems can produce significant queuing and, as a result, longer latencies. Generally speaking, blockchain protocols can be used by any number of clients and are more accurately evaluated in an open-loop environment.
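A minimal sketch of the two deployment models (the rates, client counts, and function names are illustrative, not from any real benchmark harness):

```python
import random

def open_loop_sends(rate_tps, duration_s, seed=0):
    """Open loop (n = infinity clients): send times follow a Poisson
    process at rate lambda, independent of how fast the system replies."""
    rng = random.Random(seed)
    t, sends = 0.0, []
    while True:
        t += rng.expovariate(rate_tps)  # exponential interarrival gap
        if t >= duration_s:
            return sends
        sends.append(t)

def closed_loop_sends(n_clients, response_s, duration_s):
    """Closed loop: each client blocks until its reply arrives, so at
    most n_clients requests are ever outstanding."""
    sends = []
    for _ in range(n_clients):
        k = 0
        while k * response_s < duration_s:
            sends.append(k * response_s)  # next send only after the reply
            k += 1
    return sorted(sends)

open_n = len(open_loop_sends(1000, 1.0))        # ~1000, randomly varying
closed_n = len(closed_loop_sends(4, 0.1, 1.0))  # exactly 4 clients * 10 sends
```

Note how the closed-loop offered load collapses as the response time grows (doubling `response_s` halves the send count), which is exactly why closed-loop tests can hide queuing that real open-loop traffic would expose.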
Interarrival distribution for synthetic benchmarks
A natural question to ask when creating a synthetic workload is how to submit requests. Many systems preload the transactions before the measurement begins, but this biases the measurements because the system starts from the unusual state of an empty pipeline. Furthermore, preloaded requests are already in main memory and thus bypass the networking stack.
A slightly better approach would be to send requests at a deterministic rate (for example, 1000 TPS). This would lead to an L-shaped graph (orange) since there is optimal usage of the system’s capacity.
However, open systems frequently don't act in such a predictable way. They instead have periods of high and low load. To model this, we can employ a probabilistic interarrival distribution, which is typically based on the Poisson distribution. This will result in the "hockey stick" graph (blue line) because the Poisson bursts will cause some queuing delay (periods at max capacity) even if the average rate is less than optimal. This is beneficial to us because we can see how the system handles high load and how quickly it recovers when the load returns to normal.
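A sketch of the two submission schedules at the same average rate shows why Poisson arrivals stress queuing in a way a deterministic schedule does not (the 1000 TPS rate and 10 ms window are arbitrary choices for illustration):

```python
import random

def schedule(rate_tps, n, poisson, seed=0):
    """Send timestamps for n requests at the same average rate, with
    either deterministic or Poisson (exponential) spacing."""
    rng = random.Random(seed)
    t, out = 0.0, []
    for _ in range(n):
        t += rng.expovariate(rate_tps) if poisson else 1.0 / rate_tps
        out.append(t)
    return out

def max_burst(times, window_s=0.01):
    """Largest number of requests falling inside any sliding window
    (two-pointer sweep over the sorted timestamps)."""
    best, j = 0, 0
    for i in range(len(times)):
        while times[j] < times[i] - window_s:
            j += 1
        best = max(best, i - j + 1)
    return best

det = max_burst(schedule(1000, 5000, poisson=False))  # ~10-11 per 10 ms
poi = max_burst(schedule(1000, 5000, poisson=True))   # noticeably larger
```

The deterministic schedule never exceeds its average density, so capacity looks perfectly used (the L-graph); the Poisson schedule packs transient bursts above capacity, producing the queuing that bends the curve into a hockey stick.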
Warm-up Phase
A final point to consider is when to begin measuring. We want the pipeline to be full of transactions before we begin; otherwise, warm-up delays will be measured. This should ideally be accomplished by measuring latency during the warm-up phase until the measurements follow the expected distribution.
How to compare
The final difficulty is comparing the system's various deployments on an apples-to-apples basis. Again, the difficulty is that latency and throughput are interdependent, so it may be difficult to produce a fair throughput/number of nodes chart. Instead of simply pushing each system to its maximum throughput (where latency is meaningless), the best approach is to define a Service Level Objective (SLO) and measure the throughput at this point. Drawing a horizontal line at the throughput/latency graph that intersects the Latency axis at the SLO and sampling the points there is a nice way to visualize this.
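The SLO-based comparison amounts to a simple lookup over the measured points; the curve below is hypothetical, purely to show the mechanics:

```python
def throughput_at_slo(samples, slo_s):
    """samples: measured (offered_load_tps, latency_s) points on a
    deployment's throughput/latency curve. Returns the highest measured
    throughput whose latency still meets the SLO, or None if none does."""
    within = [tps for tps, lat in samples if lat <= slo_s]
    return max(within) if within else None

# Hypothetical hockey-stick measurements for one deployment.
curve = [(100, 0.8), (500, 0.9), (900, 1.4), (1100, 3.0), (1200, 30.0)]
print(throughput_at_slo(curve, slo_s=5.0))  # 1100
print(throughput_at_slo(curve, slo_s=1.0))  # 500
```

Comparing two systems at the same SLO answers "how much load can each sustain while still serving users acceptably", which is the question peak-throughput numbers dodge.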
But I have set an SLO of 5 seconds and it only takes 2 seconds!
Someone might be tempted to increase the load here in order to take advantage of the marginally higher throughput available after the saturation point. But this is dangerous. If a system deployment is underprovisioned, an unexpected burst of requests will cause the system to reach full saturation, resulting in an explosion of latency and a very rapid breach of the SLO. In essence, operating past the saturation point is an unstable equilibrium. As a result, there are two points to consider:
Overprovision your system. In essence, the system should operate under the saturation point so that bursts in the interarrival distribution are absorbed rather than lead to increased queueing delays.
If you have room under your SLO, increase the batch size. This will add load on the critical path of the system instead of the queuing delay and get you the higher throughput for higher latency tradeoff you are looking for.
I am generating an enormous load. How can I measure latency?
When the load is high, trying to access the local clock and add a timestamp to every transaction that arrives on the system can lead to skewed results. Instead, there are two more viable options. The first and simplest method is to sample transactions; for example, there may be a magic number in some transactions that are the only ones for which the client keeps a timer. After commit time, anyone can inspect the blockchain to determine when these transactions were committed and thus compute their latency. The main advantage of this practice is that it does not interfere with the interarrival distribution. However, it may be considered "hacky" because some transactions must be modified.
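A sketch of the sampling approach; the `MAGIC` marker, field names, and in-memory `send_fn` are hypothetical stand-ins for whatever a real client would embed in the transaction payload:

```python
import time

SAMPLE_EVERY = 100   # keep a timer for 1% of transactions
MAGIC = 0xFEEDFACE   # hypothetical marker placed in the payload

def submit(send_fn, n_txs):
    """Send n_txs transactions at full rate, starting a client-side
    timer only for the tagged ones, so the workload's interarrival
    distribution is left untouched."""
    start_times = {}
    for i in range(n_txs):
        tx = {"nonce": i}
        if i % SAMPLE_EVERY == 0:
            tx["marker"] = MAGIC
            start_times[i] = time.monotonic()
        send_fn(tx)
    return start_times

def sampled_latencies(start_times, commit_times):
    """commit_times: nonce -> commit timestamp, read back later by
    anyone scanning the chain for the marker."""
    return {i: commit_times[i] - t0
            for i, t0 in start_times.items() if i in commit_times}
```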
A more systematic approach would be to have two load generators. The first is the main load generator, which follows the Poisson distribution. The second request generator measures latency and has a much lower load; think of it as a single client in comparison to the rest of the system. Even if the system sends back replies to each and every request (as some systems do, such as a KV-store), we can easily drop all replies to the load generator and only measure the latency from the request generator. The only tricky part is that the actual interarrival distribution is the sum of the two random variables; however, the sum of two Poisson distributions is still a Poisson distribution, so the math isn't that difficult :).
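The two-generator setup can be sanity-checked in a few lines: merging a main generator with a low-rate probe (the 900/100 TPS split is illustrative) yields interarrival gaps consistent with a single Poisson process at the combined rate:

```python
import random
import statistics

def arrivals(rate_tps, horizon_s, rng):
    """One Poisson arrival stream truncated at horizon_s."""
    t, out = 0.0, []
    while True:
        t += rng.expovariate(rate_tps)
        if t >= horizon_s:
            return out
        out.append(t)

rng = random.Random(1)
main_gen = arrivals(900, 20.0, rng)    # main load generator
probe_gen = arrivals(100, 20.0, rng)   # low-rate latency probe
merged = sorted(main_gen + probe_gen)
gaps = [b - a for a, b in zip(merged, merged[1:])]

# The merged stream should look like one Poisson process at ~1000 TPS,
# i.e. a mean gap of roughly 1 ms.
print(round(statistics.mean(gaps) * 1000, 3), "ms mean gap")
```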
Conclusions
Measuring a large-scale distributed system is crucial for recognizing bottlenecks and profiling expected behaviour under stress. We hope that by using the above methods, we can all take the first step toward a common language, which will eventually lead to blockchain systems that are better suited for the work they do and the promises they make to end users.
In future work, we plan to apply this methodology to existing consensus systems. If that's something of interest, please reach out on Twitter!
Acknowledgements: All these are lessons learned with my co-authors during the design and implementation of Narwhal & Tusk (Best Paper Award @ Eurosys 2022) as well as comments on earlier drafts by Marios Kogias, Joachim Neu, Georgios Konstantopoulos, and Dan Robinson.