Base Mainnet 09/21/24 Incident Postmortem

Lessons learned from Base’s recent block building outage

Base is committed to building in the open, including public retrospectives to share learnings when issues arise.

On 09/21/2024 at 15:14 UTC, Base Mainnet experienced a 17-minute block building outage. The integrity of the chain was not affected, all funds on Base were safe, and block production resumed after we mitigated the incident. This retrospective covers the root cause, the impact, how we mitigated the incident, and what we plan to improve moving forward.

The root cause of the block building outage was a misconfiguration on our sequencer cluster. When the current block producer became unhealthy, the cluster was unable to start block building on another instance. The incident was mitigated by manually starting block production on a correctly configured instance.

Graph displaying the stall in the progression of the chain, which is measured by unsafe head block number

Impact

Block Production

No blocks were produced for 17 minutes, beginning at 15:14 UTC. Blocks 20071146 to 20071691 contain no user transactions, as they were created by the protocol after sequencing resumed.
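
As a rough cross-check, the empty block range above can be converted into wall-clock time. This is a minimal sketch that assumes a 2-second block time and treats the range as inclusive; both are assumptions made for illustration, not figures taken from the incident data.

```python
# Rough cross-check: convert the empty block range into wall-clock time.
# Assumptions: a 2-second block time and an inclusive block range.
BLOCK_TIME_SECONDS = 2
FIRST_EMPTY_BLOCK = 20071146
LAST_EMPTY_BLOCK = 20071691


def empty_block_span(first: int, last: int, block_time: int = BLOCK_TIME_SECONDS) -> tuple[int, float]:
    """Return (number of blocks, minutes covered) for an inclusive block range."""
    count = last - first + 1
    minutes = count * block_time / 60
    return count, minutes


count, minutes = empty_block_span(FIRST_EMPTY_BLOCK, LAST_EMPTY_BLOCK)
print(f"{count} blocks cover about {minutes:.1f} minutes")
```

Under these assumptions the range spans 546 blocks, or about 18 minutes, which is on the order of the reported 17-minute outage. The small difference may depend on where the boundaries of the outage are drawn.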

Transaction Processing

Transactions are submitted to Base through the `eth_sendRawTransaction` RPC call, which places them in the mempool. During the incident, the mempool instances continued to function correctly. However, fewer transactions were submitted in that time frame, which can be seen in the graph below.
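
For readers unfamiliar with the call, the sketch below builds the JSON-RPC 2.0 request body for `eth_sendRawTransaction`. The helper name and the placeholder transaction bytes are hypothetical; a real request carries a fully signed, hex-encoded transaction and is sent over HTTP to an RPC endpoint.

```python
import json


def build_send_raw_transaction_request(signed_tx_hex: str, request_id: int = 1) -> str:
    """Build the JSON-RPC 2.0 body for eth_sendRawTransaction (hypothetical helper)."""
    if not signed_tx_hex.startswith("0x"):
        raise ValueError("signed transaction must be 0x-prefixed hex")
    payload = {
        "jsonrpc": "2.0",
        "method": "eth_sendRawTransaction",
        "params": [signed_tx_hex],  # single parameter: the signed transaction
        "id": request_id,
    }
    return json.dumps(payload)


# Placeholder bytes for illustration only, not a valid signed transaction.
body = build_send_raw_transaction_request("0xdeadbeef")
print(body)
```

A successful response returns the transaction hash, meaning the transaction was accepted into the mempool. It does not mean the transaction has been included in a block, which is why the mempool could keep accepting submissions while block production was halted.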

There was an immediate drop in both successful and failed `eth_sendRawTransaction` requests after the outage started, followed by a slow rise in failed requests. Our current hypothesis is that fewer transactions were submitted because applications were impaired by the halt in block production.

Graph of RPC request status for `eth_sendRawTransaction` calls to our routing services

Once block production resumed, many of the transactions that were submitted during the incident were included in the blocks immediately following 20071691.

Root Cause

Background

Over the past year, Base has designed and built op-conductor to improve the reliability of block production. Our goal is to increase the overall availability of the system, with a target of 99.99%. Prior to op-conductor, any failure of the sequencer would result in an outage. op-conductor enables us to operate multiple sequencers and, upon a failure, start block production on a healthy instance.
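
To put the 99.99% target in perspective, the sketch below converts an availability target into a downtime budget. It assumes the target is measured over a 365-day year, which the text above does not specify, so treat the numbers as illustrative.

```python
# Convert an availability target into an allowed-downtime budget.
# Assumption: the target is measured over a 365-day year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability: float, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the minutes of downtime allowed per period at the given availability."""
    return (1 - availability) * period_minutes


budget = downtime_budget_minutes(0.9999)
outage_share = 17 / budget  # share of the budget used by a 17-minute outage
print(f"budget: {budget:.2f} min/year, 17-minute outage uses {outage_share:.0%}")
```

Under that assumption, 99.99% availability allows a little under 53 minutes of downtime per year, so a single 17-minute outage would consume roughly a third of the annual budget.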

Diagram of before and after migrating to the op-conductor enabled sequencer cluster

On 09/20/2024, we migrated block production from the single sequencer to the op-conductor cluster. However, the op-conductor instances were in a misconfigured state in which op-node was not submitting new unsafe block payloads to op-conductor.

Trigger

On 09/21/2024 at 15:14 UTC, the currently active sequencer experienced delays in block production. op-conductor correctly detected the issue and began the process to transfer leadership to another instance. As part of the leadership transfer, op-conductor stopped the local op-node from building blocks.

Due to the misconfiguration, the new block producer was unable to start production: the start operation requires the unsafe payload from op-conductor, which the previous leader had never written. This left the cluster in a state in which no instance was able to become the active block producer.
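
The failure mode described above can be illustrated with a toy model. This is a simplified sketch, not op-conductor's actual code: the class names, fields, and functions are all hypothetical, and the real system replicates state across the cluster with a consensus protocol rather than a single shared object.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyConductorState:
    """Hypothetical stand-in for the state shared across the cluster."""
    latest_unsafe_payload: Optional[str] = None


@dataclass
class ToySequencer:
    name: str
    submits_payloads: bool  # False models the misconfiguration
    active: bool = False

    def build_block(self, state: ToyConductorState, block_id: str) -> None:
        # A correctly configured node hands each new unsafe payload to the conductor.
        if self.submits_payloads:
            state.latest_unsafe_payload = block_id

    def stop(self) -> None:
        self.active = False

    def start(self, state: ToyConductorState) -> bool:
        # Starting requires the unsafe payload written by the previous leader.
        if state.latest_unsafe_payload is None:
            return False
        self.active = True
        return True


def transfer_leadership(old: ToySequencer, new: ToySequencer, state: ToyConductorState) -> bool:
    """Stop the old leader first, then try to start the new one."""
    old.stop()
    return new.start(state)


def simulate(submits_payloads: bool) -> bool:
    """Return True if some instance is producing blocks after a failover."""
    state = ToyConductorState()
    leader = ToySequencer("seq-a", submits_payloads, active=True)
    standby = ToySequencer("seq-b", submits_payloads)
    leader.build_block(state, "block-1")
    transfer_leadership(leader, standby, state)
    return leader.active or standby.active


print("misconfigured cluster still producing:", simulate(False))
print("configured cluster still producing:", simulate(True))
```

In the misconfigured run the old leader has already been stopped when the new leader's start fails, so neither instance is active. That mirrors the stalled state described above, where no instance could become the block producer.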

Below is a log snippet containing one sample of a failed leadership transfer:

Mitigation

The incident was mitigated by reverting to the single sequencer topology while the op-conductor cluster configuration was fixed.

What we’re fixing going forward

We implemented a bidirectional handshake between op-node and op-conductor at startup to ensure proper communication configuration.
We are improving our internal configuration management process to prevent and detect misconfigurations.
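
The idea behind a bidirectional startup handshake can be sketched as a two-way configuration check. The post does not describe how the handshake is implemented, so everything below (the config classes, field names, and endpoints) is hypothetical and only illustrates checking both directions before allowing block production.

```python
from dataclasses import dataclass


@dataclass
class ToyNodeConfig:
    """Hypothetical view of the node's settings."""
    own_rpc: str
    conductor_enabled: bool
    conductor_rpc: str


@dataclass
class ToyConductorConfig:
    """Hypothetical view of the conductor's settings."""
    own_rpc: str
    node_rpc: str


def handshake(node: ToyNodeConfig, conductor: ToyConductorConfig) -> list[str]:
    """Return a list of problems; an empty list means both directions agree."""
    problems = []
    # Direction 1: the node must be set up to report payloads to this conductor.
    if not node.conductor_enabled:
        problems.append("node is not configured to submit payloads to a conductor")
    elif node.conductor_rpc != conductor.own_rpc:
        problems.append("node points at a different conductor endpoint")
    # Direction 2: the conductor must be set up to control this node.
    if conductor.node_rpc != node.own_rpc:
        problems.append("conductor points at a different node endpoint")
    return problems


good = handshake(
    ToyNodeConfig("http://node-a:8545", True, "http://conductor-a:8547"),
    ToyConductorConfig("http://conductor-a:8547", "http://node-a:8545"),
)
bad = handshake(
    ToyNodeConfig("http://node-a:8545", False, ""),
    ToyConductorConfig("http://conductor-a:8547", "http://node-a:8545"),
)
print("good:", good)
print("bad:", bad)
```

Failing fast at startup when the list is non-empty would surface a misconfiguration like the one in this incident before a failover depends on it.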
