Applied scientist studying algorithmic reputation and identity.

Sybil Detection by XGBoost
Public blockchains are transparent, accurate, and comprehensive records of their entire history. These freely available data sets are some of the largest and cleanest in the world, and they are highly amenable to the application of machine learning. In 2024, the Ethereum blockchain has about 200 million addresses, which is the same as the count of active websites on the internet. It is a global, public, financial dataset of internet scale. Ranking, classification, personalization, co-occurren...




A critical threat to the integrity of peer-to-peer networks, especially within blockchain ecosystems, is the Sybil attack. First identified by Douceur, this type of attack involves a single adversary creating multiple fake identities—known as Sybils—to undermine the network. These false identities can be leveraged to manipulate consensus mechanisms, disrupt the fair distribution of resources, or even execute double-spending attacks.
Sybil attacks are prevalent across a variety of domains, particularly in blockchain and decentralized finance. One common example occurs during token airdrops, where attackers generate multiple wallet addresses to claim an unfair share of the distributed tokens, compromising the fairness and intended distribution. In decentralized governance systems, Sybil attacks enable a single adversary to amass undue influence over voting processes, skewing decisions and potentially destabilizing the ecosystem. Similarly, in quadratic funding mechanisms, attackers exploit Sybil identities to disproportionately increase their influence and rewards, undermining the system's integrity.
Detecting Sybil attacks is inherently difficult, as attackers can convincingly mimic the behavior of legitimate users, often employing advanced techniques to bypass simple heuristics or rule-based defenses. Traditional Sybil detection methods typically fall into two categories: graph-based techniques and trust-based mechanisms. Graph-based approaches analyze the connectivity patterns of users within social networks to identify anomalies, while trust-based methods assess user reputation to differentiate between genuine and malicious actors. Although these techniques have demonstrated potential, they often face significant challenges, including high computational demands, limited applicability across domains, and vulnerability to evolving attack strategies.
Machine learning approaches, such as XGBoost, present a powerful alternative for addressing the Sybil attack problem. XGBoost, a highly scalable and efficient gradient boosting algorithm, excels in capturing complex non-linear relationships, making it a strong performer across diverse classification tasks. By utilizing features extracted from user behavior, interaction patterns, and network topologies, XGBoost provides a robust framework for distinguishing legitimate users from Sybil entities with remarkable precision.
This blog post delves into the application of XGBoost for Sybil detection, focusing on feature engineering strategies and model evaluation metrics. As a representative case, we consider the LayerZero airdrop of June 2024.
LayerZero, a cross-chain protocol, announced an airdrop in December 2023. Between the announcement and the distribution of tokens in June 2024, LayerZero went through a rigorous Sybil detection process. The process included community detection algorithms such as those used to mitigate Sybils in the first Arbitrum airdrop, a self-reporting bounty for Sybil attackers, and a crowdsourced Sybil detection challenge open to the general public. LayerZero cross-checked the results across sources, for example by applying k-means clustering to validate the crowdsourced results. For this reason, the final Sybil list reported by LayerZero is considered a good source of truth for Sybil detection.
The final LayerZero list has known problems, such as the inclusion of the operating wallet of the cross-chain DEX Layerswap and many reported false positives. Despite this noise, the list is a usable source of labels, and it allows traditional supervised machine learning to be applied to the problem of Sybil detection.
The application of XGBoost to the LayerZero Sybil detection problem reveals a useful taxonomy, consisting of three disjoint categories of interacting addresses.
Several data sources are useful for predicting Sybil attackers. The primary sources used here are Flipside, an analytics platform for onchain data, and the BigQuery public data sets for onchain data.
LayerZero Features
The first set of features has to do with the use of the LayerZero protocol. These features are sourced from Flipside. Flipside’s data encompasses a wide array of blockchain transactions, including LayerZero cross-chain interactions and Ethereum-specific transaction details. This rich and diverse dataset provides the foundation for feature extraction, enabling machine learning models to identify patterns indicative of Sybil attacks.
l0_avg_native_drop_usd / l0_max_native_drop_usd: The average and maximum amount in USD of a native transfer on LayerZero.
l0_to_eth_tx_time_span: Time span between the earliest and latest LayerZero transactions to Ethereum from the given address on another chain.
l0_tx_time_span: Time span between the earliest and latest LayerZero transactions from the address on the Ethereum network to another chain.
earliest_l0_to_eth_tx_time: Timestamp of the earliest Ethereum-targeting transaction. Indicates when the address first interacted with Ethereum. Note that Sybil attackers may be less likely to have very early transactions, or that their transactions may cluster around key dates for the LayerZero airdrop.
n_l0_dest_contracts: Number of distinct destination contracts in LayerZero transactions. This feature measures breadth of interaction within the LayerZero ecosystem.
n_l0_txs: Total number of LayerZero transactions, which provides a general measure of activity volume.
n_l0_source_chains: Number of distinct source chains for LayerZero transactions from this address.
n_l0_dest_chains: Number of distinct destination chains from this address.
earliest_l0_tx_time: Timestamp of the earliest LayerZero transaction from the address on the Ethereum network. Establishes historical activity context. Sybil attackers may be less likely to have very early transactions.
The heatmap of these features shows low correlation with the Sybil label, ranging between -0.06 and 0.11. The features in themselves are not highly predictive of Sybil activity, but they will be shown to be predictive in combination with other features.
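As a rough illustration of how several of the features listed above could be aggregated, here is a minimal sketch in plain Python. The record keys (address, source_chain, dest_chain, dest_contract, timestamp) are hypothetical and do not reflect Flipside's actual schema, and the sketch covers a single transfer direction only.

```python
from collections import defaultdict


def layerzero_features(txs):
    """Aggregate per-address features from LayerZero transaction records.

    Each record is a dict with hypothetical keys: address, source_chain,
    dest_chain, dest_contract, timestamp (Unix seconds).
    """
    grouped = defaultdict(list)
    for tx in txs:
        grouped[tx["address"]].append(tx)

    features = {}
    for address, rows in grouped.items():
        times = [r["timestamp"] for r in rows]
        features[address] = {
            "n_l0_txs": len(rows),
            "n_l0_source_chains": len({r["source_chain"] for r in rows}),
            "n_l0_dest_chains": len({r["dest_chain"] for r in rows}),
            "n_l0_dest_contracts": len({r["dest_contract"] for r in rows}),
            "earliest_l0_tx_time": min(times),
            "l0_tx_time_span": max(times) - min(times),
        }
    return features


# Made-up sample records, for illustration only
sample_txs = [
    {"address": "0xa", "source_chain": "ethereum", "dest_chain": "arbitrum",
     "dest_contract": "0xc1", "timestamp": 1_700_000_000},
    {"address": "0xa", "source_chain": "ethereum", "dest_chain": "optimism",
     "dest_contract": "0xc2", "timestamp": 1_700_086_400},
    {"address": "0xb", "source_chain": "ethereum", "dest_chain": "arbitrum",
     "dest_contract": "0xc1", "timestamp": 1_700_000_500},
]
l0_features = layerzero_features(sample_txs)
```

An address with a single transaction gets a time span of zero, which may itself be a useful signal.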

Ethereum Network Features
min_tx_value_out: Minimum value of outbound Ethereum transactions. This highlights small-value transactions, which may suggest automated or bot-like behavior.
num_transactions: Total number of outbound Ethereum transactions from the address.
out_degree_per_block_out: Ratio of unique recipient addresses to block span.
tx_value_per_block_out: Ratio of total transaction value to block span.
earliest_tx_block_in: Earliest block number of an inbound transaction to this address.
max_tx_value_in: Maximum value of inbound Ethereum transactions.
avg_tx_value_in: Average value of inbound transactions.
indegree_per_block_in: Ratio of unique senders to block span.
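The per-block ratios in this list can be sketched as follows. The post does not define "block span" precisely, so this sketch assumes an inclusive span (last block minus first block, plus one); the tuple layout and sample values are also made up.

```python
def outbound_features(txs):
    """Compute outbound features for one sender.

    txs: list of (block_number, recipient, value_eth) tuples (hypothetical layout).
    """
    blocks = [block for block, _, _ in txs]
    values = [value for _, _, value in txs]
    # Assumption: block span is inclusive of the first and last block
    block_span = max(blocks) - min(blocks) + 1
    return {
        "num_transactions": len(txs),
        "min_tx_value_out": min(values),
        "out_degree_per_block_out": len({r for _, r, _ in txs}) / block_span,
        "tx_value_per_block_out": sum(values) / block_span,
    }


sample_out = [(100, "0x1", 0.5), (104, "0x2", 1.5), (109, "0x1", 2.0)]
out_features = outbound_features(sample_out)
```

The inbound features (indegree_per_block_in and the inbound value statistics) would follow the same pattern with senders in place of recipients.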

While the individual correlation of these features with the Sybil label is low, their distributions show visible differentiation between Sybil and non-Sybil accounts. Combined with other metrics in a multi-feature model, these transaction-based features add complementary value by capturing nuances in transaction behavior:
Volume and Value: Features like num_transactions, min_tx_value_out, and avg_tx_value_in highlight the activity intensity and transaction size, which can reveal irregular patterns when viewed together.
Distribution Patterns: Metrics like out_degree_per_block_out and indegree_per_block_in provide insights into how transactions are spread across time and participants.
Temporal Context: Features such as earliest_tx_block_in differentiate younger Sybil accounts from established non-Sybil addresses.
These metrics are particularly valuable in conjunction with advanced models such as XGBoost, where non-linear relationships can amplify their predictive utility despite low individual correlations. By retaining these features, we ensure the model has a robust view of transaction dynamics to identify signals of Sybil behavior.


Stargate-Related Features
Additional features, more specific to LayerZero, can be shown to be accretive to metrics when combined with complementary features. For example, Stargate is a cross-chain liquidity protocol built on LayerZero, designed to facilitate seamless asset transfers across different blockchains. A Stargate swap allows users to exchange assets between chains with unified liquidity pools, making it a cornerstone of LayerZero's interoperability framework.
l0_to_eth_max_stargate_swap: Maximum USD value of Stargate swaps for transactions coming in to the Ethereum network.
l0_to_eth_avg_stargate_swap: Average Stargate swap value for Ethereum-targeting transactions.
l0_max_stargate_swap: Maximum Stargate swap value across all LayerZero transactions going from Ethereum to another network.
The heatmap shows that there is, again, low correlation between these Stargate features and the Sybil label, but features can interact to give better precision.

To demonstrate the effectiveness of these features in combination with the other model features, the three charts below show the AUC for the Stargate feature in isolation, the AUC of a complementary feature (on the x-axis), and the AUC for the two features combined. In every case, the Stargate feature combined with the complementary feature shows better AUC.
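The idea that two weak features can be stronger together can be shown on a toy example. This is a sketch only: the data is made up, and the "combined" score is a hand-written product of the two features rather than a trained model, which is presumably what the charts in this post are based on. AUC is computed directly from its pairwise definition.

```python
def auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data: an address is a Sybil only when both binary features fire together
feature_a = [1, 1, 1, 0, 0, 1, 0]
feature_b = [1, 1, 0, 1, 0, 0, 1]
is_sybil = [1, 1, 0, 0, 0, 0, 0]

# Hypothetical combination: the product acts as a logical AND of the two features
combined = [a * b for a, b in zip(feature_a, feature_b)]

auc_a = auc(feature_a, is_sybil)
auc_b = auc(feature_b, is_sybil)
auc_combined = auc(combined, is_sybil)
```

On this toy data each feature alone reaches an AUC of 0.8, while the combination separates the classes perfectly. A tree-based model can learn this kind of interaction without it being specified by hand.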



Features from the Gas Provision Network
The gas provision network is a directed graph that maps interactions between gas providers and the addresses they activate. Gas provision here means the first allocation of the resource ("gas") that an address needs in order to execute transactions and call smart contracts. On the Ethereum network, gas is paid in ETH, so gas provision is the first transfer of ETH to an address.
This network captures the flow of gas provision activities and offers critical insights into transactional behavior, Sybil activity, and address interactions. Note that this network is typically a forest: a collection of disjoint trees, with no cycles.
Nodes:
Gas Providers: Nodes that act as sources, supplying the first Ethereum to other addresses.
Activated Addresses: Nodes that represent recipient accounts or entities that receive gas from gas providers.
Edges: Directed edges connect a gas provider to an activated address, representing the provision of gas.
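The edges of this network could be derived from raw transfers roughly as follows. This is a minimal sketch: the tuple layout is hypothetical, and ties within the same block are broken by sorting order here, whereas real data would use the transaction index.

```python
def gas_providers(transfers):
    """Map each recipient to the sender of its earliest inbound ETH transfer.

    transfers: iterable of (block_number, sender, recipient) tuples.
    """
    first_provider = {}
    for _block, sender, recipient in sorted(transfers):
        # setdefault keeps only the first (earliest) sender seen per recipient
        first_provider.setdefault(recipient, sender)
    return first_provider


sample_transfers = [
    (5, "cex", "a"),   # later top-up, not the gas provision
    (3, "p", "a"),     # earliest inbound transfer to "a"
    (7, "a", "b"),     # "a" later activates "b"
]
providers = gas_providers(sample_transfers)
```

Each entry in the result is one directed edge from a gas provider to an activated address. Because every address has at most one provider, the edges naturally form a forest.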
The gas provision network is analyzed as a hierarchical structure. Starting with root addresses (those receiving ETH directly), a recursive process computes metrics for each tree or subtree of gas provisioning.
The following features are extracted:
Tree-Based Metrics
provider_fan_out: The number of distinct addresses activated by a gas provider. High values may indicate star-like patterns or potential Sybil behavior.
tree_size: The total number of nodes in the gas provision tree.
max_depth: The maximum depth of the tree, representing how far ETH provisioning propagates.
balance_factor: Difference between the deepest and shallowest leaf depths.
branching_factor: The average number of child nodes per parent. Higher values suggest broader distribution patterns.
star_like_ratio: The proportion of nodes in the tree exhibiting star-like behavior.
longest_chain_ratio: The ratio of the longest path depth to the total tree size, indicating linear provisioning structures.
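Most of the tree-based metrics above can be computed in a single traversal. The sketch below assumes depth is counted in edges from the root and that branching_factor averages over nodes with at least one child. star_like_ratio is left out because the post does not define what counts as star-like.

```python
def tree_metrics(children, root):
    """Compute tree metrics for one gas provision tree.

    children: dict mapping a provider to the list of addresses it activated.
    """
    leaf_depths, n_nodes, n_parents, n_edges = [], 0, 0, 0
    stack = [(root, 0)]
    while stack:  # iterative depth-first traversal
        node, depth = stack.pop()
        n_nodes += 1
        kids = children.get(node, [])
        if kids:
            n_parents += 1
            n_edges += len(kids)
            stack.extend((kid, depth + 1) for kid in kids)
        else:
            leaf_depths.append(depth)
    max_depth = max(leaf_depths)
    return {
        "provider_fan_out": len(children.get(root, [])),
        "tree_size": n_nodes,
        "max_depth": max_depth,
        "balance_factor": max_depth - min(leaf_depths),
        "branching_factor": n_edges / n_parents if n_parents else 0.0,
        "longest_chain_ratio": max_depth / n_nodes,
    }


# Toy tree: the root activates a, b, c; a activates d; d activates e
sample_tree = {"root": ["a", "b", "c"], "a": ["d"], "d": ["e"]}
metrics = tree_metrics(sample_tree, "root")
```

An explicit stack is used instead of recursion so that very deep provisioning chains do not hit Python's recursion limit.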
Gas Distribution Metrics:
gas_distribution_entropy: Entropy of gas provisions across the nodes, measuring randomness.
$$H = -\sum_{i=1}^{N} p_i \log(p_i)$$
where $p_i$ is the proportion of the total gas provision allocated to node $i$, and $N$ is the total number of nodes.
gas_distribution_skewness: Skewness of gas provision amounts. High skewness may indicate disproportionately large provisions to a few nodes.
$$\text{skewness} = \frac{\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^3}{\left(\frac{1}{N}\sum_{i=1}^{N} (x_i - \bar{x})^2\right)^{3/2}}$$
where $x_i$ is the gas provision amount for node $i$, $\bar{x}$ is the mean gas provision, and $N$ is the total number of nodes.
For the leaf-level variant, $x_i$ is the gas provision amount for leaf node $i$, $\bar{x}$ is the mean gas provision at the leaf nodes, and $N$ is the total number of leaf nodes.
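The two distribution metrics can be sketched in a few lines. The entropy follows the formula given above with the natural logarithm. For skewness, the sketch assumes the population (moment-based) definition, which is a common convention but is not confirmed by the post.

```python
import math


def distribution_entropy(amounts):
    """Shannon entropy (natural log) of the share of gas provided to each node."""
    total = sum(amounts)
    shares = [a / total for a in amounts if a > 0]  # zero shares contribute nothing
    return -sum(p * math.log(p) for p in shares)


def distribution_skewness(amounts):
    """Population skewness: third central moment over variance to the power 3/2."""
    n = len(amounts)
    mean = sum(amounts) / n
    m2 = sum((a - mean) ** 2 for a in amounts) / n
    m3 = sum((a - mean) ** 3 for a in amounts) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


even = [1.0, 1.0, 1.0, 1.0]       # identical provisions to every node
lopsided = [1.0, 1.0, 1.0, 10.0]  # one node receives far more than the rest
```

Identical provisions give the maximum entropy of log(N) and zero skewness, while a single oversized provision lowers the entropy and pushes the skewness positive.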

About 4.4% of the Ethereum addresses that interacted with LayerZero are labeled as Sybils. Because the classes are this imbalanced, we balance the training set before fitting the binary classifier.
We split the data into 70% training and 30% test, and reserve 30% of the training set for validation before balancing. The remainder of the training set is then balanced, so the model trains on balanced data and is validated and tested on imbalanced data.
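The split described above might look like the following sketch. The post does not say how the training set is balanced, so random undersampling of the majority class is an assumption here, as are the function name, the fixed seed, and the synthetic rows.

```python
import random


def split_and_balance(rows, seed=0):
    """Split rows into balanced train, imbalanced validation, imbalanced test.

    rows: list of (features, is_sybil) pairs.
    """
    rng = random.Random(seed)
    rows = rows[:]
    rng.shuffle(rows)

    n_test = len(rows) * 3 // 10            # 30% held out for testing
    test, train_full = rows[:n_test], rows[n_test:]
    n_val = len(train_full) * 3 // 10       # 30% of training held out for validation
    validation, train = train_full[:n_val], train_full[n_val:]

    # Assumption: balance only the remaining training rows by undersampling
    sybils = [r for r in train if r[1]]
    others = [r for r in train if not r[1]]
    minority, majority = sorted([sybils, others], key=len)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced, validation, test


# Synthetic rows with a 4% positive rate, for illustration only
rows = [({"id": i}, i % 25 == 0) for i in range(1000)]
train, validation, test = split_and_balance(rows)
```

Keeping validation and test sets at the natural class ratio means the reported precision reflects what the model would see in production, where Sybils are rare.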
To detect Sybil behavior, we implemented an XGBoost classifier. XGBoost is well-suited for handling large datasets with complex relationships, thanks to its tree-based architecture. We set the hyperparameters to typical values:
Objective: binary:logistic
Evaluation Metric: logloss
Number of Estimators: 2000
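In code, this configuration could be expressed as below. This is a sketch that assumes the scikit-learn style XGBClassifier wrapper and a recent xgboost release that accepts eval_metric as a constructor argument. Every other parameter is left at the library default, which may differ from what was actually used here.

```python
# Hyperparameters listed in the post; everything else stays at library defaults
XGB_PARAMS = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "n_estimators": 2000,
}


def build_classifier(params=XGB_PARAMS):
    """Construct the classifier (hypothetical helper, not from the post)."""
    # Imported lazily so the parameters can be inspected without xgboost installed
    from xgboost import XGBClassifier

    return XGBClassifier(**params)
```

A typical call would then be `build_classifier().fit(X_train, y_train)`, where `X_train` and `y_train` stand for the balanced training features and labels.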
The results of the XGBoost classifier point to three distinct scenarios in which interacting addresses are provisioned, each with its own traits and its own level of predictive performance.
Addresses are classified by the characteristics of their gas provider, which may have interacted with LayerZero, may be a labeled address (like a cex or dex), or may be an EOA which did not interact with LayerZero.
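This classification rule can be written as a small function. The sketch is hypothetical: the set names and category strings are illustrative, and the order of the checks (interacting first, then labeled) is an assumption for the case where a provider could match both.

```python
def provider_category(provider, interacting, labeled):
    """Assign an address to a category based on its gas provider.

    interacting: set of addresses that interacted with LayerZero.
    labeled: set of labeled addresses, such as exchanges.
    """
    if provider in interacting:
        return "interacting"
    if provider in labeled:
        return "labeled"
    # Anything else is treated as an EOA that did not interact with LayerZero
    return "non_interacting_eoa"


interacting_addresses = {"0xaaa"}
labeled_addresses = {"0xcex"}
```

For example, `provider_category("0xcex", interacting_addresses, labeled_addresses)` returns `"labeled"`.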

Interacting Addresses Provisioned by Other Interacting Addresses
The first category of addresses is those provisioned by another address, which itself interacted with LayerZero.
These addresses represent a peer-to-peer dynamic, where one interacting address directly provisions another. This scenario is particularly noteworthy because the provisioning and receiving entities share behavioral patterns that are more aligned with genuine network interactions.

The XGBoost model excels at detecting Sybil tendencies in this group, achieving high precision and recall metrics. In particular, the F1 score is 0.844.
Confusion Matrix (Optimal Threshold, Validation Set):
[[7149   34]
 [  66  270]]
Classification Report (Optimal Threshold, Validation Set):
              precision    recall  f1-score   support
       False      0.991     0.995     0.993      7183
        True      0.888     0.804     0.844       336
    accuracy                          0.987      7519
   macro avg      0.940     0.899     0.918      7519
weighted avg      0.986     0.987     0.986      7519
For this case, the shape of the precision-recall curve is close to optimal.
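The Sybil-class metrics in the report can be checked against the confusion matrix for this category. The sketch assumes the scikit-learn layout, with rows as true labels and columns as predictions, which is consistent with the support counts in the report.

```python
def binary_metrics(matrix):
    """Precision, recall, and F1 for the positive class.

    matrix: [[TN, FP], [FN, TP]] with rows as true labels, columns as predictions.
    """
    (tn, fp), (fn, tp) = matrix
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = binary_metrics([[7149, 34], [66, 270]])
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")  # prints "0.888 0.804 0.844"
```

Of the 304 addresses flagged as Sybils, 270 are on the LayerZero list, and 66 listed Sybils are missed.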

Interacting Addresses Provisioned by Labeled Addresses
The second category of addresses consists of addresses which were provisioned by a cex, dex, or other labeled address.
Here, the provisioning entity is a well-known centralized or decentralized exchange (CEX or DEX). These labeled entities play a critical role in facilitating interactions on the network. The addresses they provision tend to exhibit semi-regular patterns of activity that reflect both legitimate and Sybil-like behaviors. Models perform moderately well in this category, balancing between precision and false positives, as these addresses often blur the line between organic and anomalous behavior.

Confusion Matrix (Optimal Threshold, Validation Set):
[[59102   990]
 [  787  2548]]
Classification Report (Optimal Threshold, Validation Set):
              precision    recall  f1-score   support
       False      0.987     0.984     0.985     60092
        True      0.720     0.764     0.741      3335
    accuracy                          0.972     63427
   macro avg      0.854     0.874     0.863     63427
weighted avg      0.973     0.972     0.972     63427
Interacting Addresses Provisioned by Non-Interacting EOAs
The third scenario involves addresses provisioned by externally owned accounts (EOAs) that are not themselves active interactors. These addresses provision interacting accounts but lack the behavioral depth seen in the previous categories.
This absence of robust interaction data makes them harder to evaluate effectively. As a result, the model struggles, often failing to accurately distinguish between genuine and Sybil-like patterns in this group. These provisioning accounts might be dormant, newly created, or simply sporadic in their activity, which challenges standard detection mechanisms.

Classification Report (Optimal Threshold, Validation Set):
              precision    recall  f1-score   support
       False      0.996     0.999     0.997     20207
        True      0.734     0.448     0.556       154
    accuracy                          0.995     20361
   macro avg      0.865     0.723     0.777     20361
weighted avg      0.994     0.995     0.994     20361
While the F1 score is low for this category, the precision-recall curve shows very high precision (1.0) for the highest-scoring observations in the validation set. This implies that, if the threshold for Sybil detection is set sufficiently high, the classifier will avoid most false positives.
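The trade-off between an F1-oriented threshold and a high-precision threshold can be sketched on toy scores. The post does not say how its "optimal threshold" was chosen, so maximizing F1 on the validation set is an assumption here, and the scores and labels are made up.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score at or above the threshold is flagged."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_f1_threshold(scores, labels):
    """Return the observed score that maximizes F1 when used as the threshold."""
    def f1(threshold):
        p, r = precision_recall_at(scores, labels, threshold)
        return 2 * p * r / (p + r) if p + r else 0.0

    return max(sorted(set(scores)), key=f1)


# Toy validation scores: the two highest-scoring addresses are both Sybils
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.30, 0.10]
labels = [True, True, False, True, False, False, False]

f1_threshold = best_f1_threshold(scores, labels)
high_precision, high_recall = precision_recall_at(scores, labels, 0.90)
```

On this toy data the F1-maximizing threshold is 0.60, which catches every Sybil at the cost of one false positive. Raising the threshold to 0.90 removes the false positive but misses one Sybil.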

This third category is the least understood, and the best candidate for future investigation.
Overall Result
Combining the three categories of interacting addresses gives a positive result overall, with an F1 score of 0.743.
Classification Report (Optimal Threshold, Validation Set):
              precision    recall  f1-score   support
       False      0.988     0.991     0.989     87481
        True      0.770     0.719     0.743      3824
    accuracy                          0.979     91305
   macro avg      0.879     0.855     0.866     91305
weighted avg      0.979     0.979     0.979     91305
This investigation has uncovered three categories of addresses interacting with the LayerZero network: those provisioned by another interacting address, those provisioned by a labeled address such as a CEX or DEX, and those provisioned by an EOA that did not interact with LayerZero. The final category proves to be the hardest to classify, and deeper analysis of this category's metrics may be a productive way to refine the XGBoost model.
A critical threat to the integrity of peer-to-peer networks, especially within blockchain ecosystems, is the Sybil attack. First identified by Douceur, this type of attack involves a single adversary creating multiple fake identities—known as Sybils—to undermine the network. These false identities can be leveraged to manipulate consensus mechanisms, disrupt the fair distribution of resources, or even execute double-spending attacks.
Sybil attacks are prevalent across a variety of domains, particularly in blockchain and decentralized finance. One common example occurs during token airdrops, where attackers generate multiple wallet addresses to claim an unfair share of the distributed tokens, compromising the fairness and intended distribution. In decentralized governance systems, Sybil attacks enable a single adversary to amass undue influence over voting processes, skewing decisions and potentially destabilizing the ecosystem. Similarly, in quadratic funding mechanisms, attackers exploit Sybil identities to disproportionately increase their influence and rewards, undermining the system's integrity.
Detecting Sybil attacks is inherently difficult, as attackers can convincingly mimic the behavior of legitimate users, often employing advanced techniques to bypass simple heuristics or rule-based defenses. Traditional Sybil detection methods typically fall into two categories: graph-based techniques and trust-based mechanisms. Graph-based approaches analyze the connectivity patterns of users within social networks to identify anomalies, while trust-based methods assess user reputation to differentiate between genuine and malicious actors. Although these techniques have demonstrated potential, they often face significant challenges, including high computational demands, limited applicability across domains, and vulnerability to evolving attack strategies.
Machine learning approaches, such as XGBoost, present a powerful alternative for addressing the Sybil attack problem. XGBoost, a highly scalable and efficient gradient boosting algorithm, excels in capturing complex non-linear relationships, making it a strong performer across diverse classification tasks. By utilizing features extracted from user behavior, interaction patterns, and network topologies, XGBoost provides a robust framework for distinguishing legitimate users from Sybil entities with remarkable precision.
This blog post delves into the application of XGBoost for Sybil detection, focusing on feature engineering strategies and model evaluation metrics. As a representative case, we consider the LayerZero airdrop of June 2024.
LayerZero, a cross-chain protocol, announced an airdrop in December 2023. Between the announcement and the distribution of tokens in June 2024, LayerZero went through a rigorous Sybil detection process. The process included community detection algorithms such as those used to mitigate sybils in the first Arbitrum airdrop, a self-reporting bounty for sybil attackers, and a crowdsourced sybil detection challenge, open to the general public. LayerZero cross-checked the results across sources, for example, applying k-means methodologies to validate the crowdsourced results. For this reason, the final sybil list reported by LayerZero is considered a good source of truth for actual Sybil detection.
There are known problems with the final LayerZero list, such as inclusion of the operating wallet of the cross-chain DEX Layerswap, and many reported false positives. In spite of the known noise, this file is an excellent source of truth, and it allows application of traditional machine learning methodology to the problem of Sybil detection.
The application of XGBoost to the LayerZero Sybil detection problem reveals a useful taxonomy, consisting of three disjoint categories of interacting addresses.
Several data sources are useful for predicting Sybil attackers, from different sources. The primary sources used here are Flipside, an analytics platform for onchain data, and The BigQuery public data sets for onchain data.
LayerZero Features
The first set of features has to do with the use of the LayerZero protocol. These features are sourced from Flipside. Flipside’s data encompasses a wide array of blockchain transactions, including LayerZero cross-chain interactions and Ethereum-specific transaction details. This rich and diverse dataset provides the foundation for feature extraction, enabling machine learning models to identify patterns indicative of Sybil attacks.
l0_avg_native_drop_usd / l0_max_native_drop_usd:The average and max amount in USD of a native transfer on LayerZero.
l0_to_eth_tx_time_span: Time span between the earliest and latest LayerZero transactions to Ethereum from the given address on another chain.
l0_tx_time_span: Time span between the earliest and latest LayerZero transactions from the address on the Ethereum network to another chain.
earliest_l0_to_eth_tx_time: Timestamp of the earliest Ethereum-targeting transaction. Indicates when the address first interacted with Ethereum. Note that Sybil attackers may be less likely to have very early transactions, or that their transactions may cluster around key dates for the LayerZero airdrop.
n_l0_dest_contracts: Number of distinct destination contracts in LayerZero transactions. This feature measures breadth of interaction within the LayerZero ecosystem.
n_l0_txs: Total number of LayerZero transactions, which provides a general measure of activity volume.
n_l0_source_chains: Number of distinct source chains for LayerZero transactions from this address.
n_l0_dest_chains: Number of distinct destination chains from this address.
earliest_l0_tx_time: Timestamp of the earliest LayerZero transaction from the address on the Ethereum network. Establishes historical activity context. Sybil attackers may be less likely to have very early transactions.
The heatmap of these features shows low correlation with the Sybil label, ranging between -0.06 and 0.11. The features in themselves are not highly predictive of Sybil activity, but they will be shown to be
predictive in combination with other features.

Ethereum Network Features
min_tx_value_out: Minimum value of outbound Ethereum transactions. Ths highlights small-value transactions, which may suggest automated or bot-like behavior.
num_transactions: Total number of outbound Ethereum transactions from the address.
out_degree_per_block_out: Ratio of unique recipient addresses to block span.
tx_value_per_block_out: Ratio of total transaction value to block span.
earliest_tx_block_in: Earliest block number of an inbound transaction to this address.
max_tx_value_in: Maximum value of inbound Ethereum transactions.
avg_tx_value_in: Average value of inbound transactions.
indegree_per_block_in: Ratio of unique senders to block span.

While the individual correlation of these features with the Sybil label is low, their distributions show visible differentiation between Sybil and non-Sybil accounts. Combined with other metrics in a multi-feature model, these transaction-based features add complementary value by capturing nuances in transaction behavior:
Volume and Value: Features like num_transactions, min_tx_value_out, and avg_tx_value_in highlight the activity intensity and transaction size, which can reveal irregular patterns when viewed together.
Distribution Patterns: Metrics like out_degree_per_block_out and indegree_per_block_in provide insights into how transactions are spread across time and participants.
Temporal Context: Features such as earliest_tx_block_in differentiate younger Sybil accounts from established non-Sybil addresses.
These metrics are particularly valuable in conjunction with advanced models such as XGBoost, where non-linear relationships can amplify their predictive utility despite low individual correlations. By retaining these features, we ensure the model has a robust view of transaction dynamics to identify signals of Sybil behavior.


Stargate-Related Features
Additional features, more specific to LayerZero, can be shown to be accretive to metrics when combined with complementary features. For example, Stargate is a cross-chain liquidity protocol built on LayerZero, designed to facilitate seamless asset transfers across different blockchains. A Stargate swap allows users to exchange assets between chains with unified liquidity pools, making it a cornerstone of LayerZero's interoperability framework.
l0_to_eth_max_stargate_swap: Maximum USD value of Stargate swaps for transactions coming in to the Ethereum network.
l0_to_eth_avg_stargate_swap: Average Stargate swap value for Ethereum-targeting transactions.
l0_max_stargate_swap: Maximum Stargate swap value across all LayerZero transactions going from Ethereum to another network.
The heatmap shows that there is, again, low correlation between these stargate features and the sybil label, but features can interact to give better precision.

To demonstrate the effectiveness of these features in combination with the other model features, the three charts below show the AUC for the Stargate feature in isolation, the AUC of a complementary feature (on the x-axis), and the AUC for the two features combined. In every case, the Stargate feature combined with the complementary feature shows better AUC.



Features from the Gas Provision Network
The gas provision network represents a directed graph that maps interactions between gas providers and activated addresses within a blockchain ecosystem. Gas provision, a fundamental process in blockchain systems, involves the first allocation of computational resources (or "gas") necessary for executing transactions and smart contracts. In the case of the Ethereum network, gas provision involves Ethereum.
This network captures the flow of gas provision activities and offers critical insights into transactional behavior, sybil activity, and address interactions. Note that this network is typically a forest, a collection of multiple graphs with no cycles.
Nodes:
Gas Providers: Nodes that act as sources, supplying the first Ethereum to other addresses.
Activated Addresses: Nodes that represent recipient accounts or entities that receive gas from gas providers.
Edges: Directed edges connect a gas provider to an activated address, representing the provision of gas.
The gas provision network is analyzed as a hierarchical structure. Starting with root addresses (those receiving ETH directly), a recursive process computes metrics for each tree or subtree of gas provisioning.
The following features are extracted:
Tree-Based Metrics
provider_fan_out: The number of distinct addresses activated by a gas provider. High values may indicate star-like patterns or potential Sybil behavior.
tree_size: The total number of nodes in the gas provision tree.
max_depth: The maximum depth of the tree, representing how far ETH provisioning propagates.
balance_factor: Difference between the deepest and shallowest leaf depths.
branching_factor: The average number of child nodes per parent. Higher values suggest broader distribution patterns.
star_like_ratio: The proportion of nodes in the tree exhibiting star-like behavior.
longest_chain_ratio: The ratio of the longest path depth to the total tree size, indicating linear provisioning structures.
Gas Distribution Metrics:
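gas_distribution_entropy: Entropy of gas provisions across the nodes, measuring randomness.
$$H = -\sum_{i=1}^{N} p_i \log(p_i)$$
where $p_i$ is the proportion of the total gas provision allocated to node $i$, and $N$ is the total number of nodes.
gas_distribution_skewness: Skewness of gas provision amounts. High skewness may indicate disproportionately large provisions to a few nodes.
$$\text{Skewness} = \frac{\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^3}{\left(\frac{1}{N} \sum_{i=1}^{N} (x_i - \bar{x})^2\right)^{3/2}}$$
where $x_i$ is the gas provision amount for node $i$, $\bar{x}$ is the mean gas provision, and $N$ is the total number of nodes.
The leaf-level variants, leaf_gas_distribution_entropy and leaf_gas_distribution_skewness, apply the same two statistics to the leaf nodes only, where $x_j$ is the gas provision amount for leaf node $j$, $\bar{x}_{\text{leaf}}$ is the mean gas provision at the leaf nodes, and $M$ is the total number of leaf nodes.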
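The two distribution statistics can be sketched in a few lines of Python. This assumes the natural logarithm for the entropy (the text does not state the base) and the population form of skewness, that is, the third central moment divided by the second central moment raised to the power 3/2. Passing only the leaf amounts gives the leaf-level variants.

```python
import math


def distribution_entropy(amounts):
    """Shannon entropy of the shares p_i = x_i / sum(x), natural log (assumed)."""
    total = sum(amounts)
    if total <= 0:
        return 0.0
    # Zero amounts contribute nothing, following the convention 0 * log(0) = 0.
    return -sum((x / total) * math.log(x / total) for x in amounts if x > 0)


def distribution_skewness(amounts):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(amounts)
    if n == 0:
        return 0.0
    mean = sum(amounts) / n
    m2 = sum((x - mean) ** 2 for x in amounts) / n
    m3 = sum((x - mean) ** 3 for x in amounts) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


# Toy amounts (hypothetical): three small provisions and one large one.
print(distribution_entropy([1, 1, 1, 1]))
print(distribution_skewness([1, 1, 1, 10]))
```

Equal provisions maximize the entropy, while a single outsized provision pushes the skewness well above zero.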

About 4.4% of the Ethereum addresses that interacted with LayerZero are Sybils. Because the classes are this imbalanced, we balance the training set before fitting the binary classifier.
We divide the data into 70% training and 30% test, and reserve 30% of the training set for validation before balancing. The remainder of the training set is balanced, so we train on balanced data and validate and test on imbalanced data.
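The split can be sketched as follows. The text does not say how the training set is balanced, so this example assumes random undersampling of the majority class; the function name, the seed, and the toy rows are likewise hypothetical, and the split is not stratified.

```python
import random


def split_and_balance(rows, label_of, seed=0):
    """70/30 train/test split, 30% of train held out for validation,
    then the remaining training rows are balanced by undersampling (assumed)."""
    rng = random.Random(seed)
    rows = list(rows)
    rng.shuffle(rows)

    n_test = round(0.30 * len(rows))
    test, train_full = rows[:n_test], rows[n_test:]

    # Validation is carved out before balancing, so it stays imbalanced.
    n_val = round(0.30 * len(train_full))
    val, train = train_full[:n_val], train_full[n_val:]

    pos = [r for r in train if label_of(r)]
    neg = [r for r in train if not label_of(r)]
    k = min(len(pos), len(neg))
    balanced = rng.sample(pos, k) + rng.sample(neg, k)
    rng.shuffle(balanced)
    return balanced, val, test


# Toy data (hypothetical): 1000 rows, 5% labeled as Sybil.
toy_rows = [(i, i % 20 == 0) for i in range(1000)]
train, val, test = split_and_balance(toy_rows, label_of=lambda r: r[1])
print(len(train), len(val), len(test))
```

With these proportions the validation set is 21% of all rows and the unbalanced training pool is 49%.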
To detect Sybil behavior, we implemented an XGBoost classifier. XGBoost is well suited to large datasets with complex relationships, thanks to its tree-based architecture. We set the hyperparameters to typical values:
Objective: binary:logistic
Evaluation Metric: logloss
Number of Estimators: 2000
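For readers who want to reproduce this configuration, a minimal sketch using the xgboost Python package is shown below. Only the three listed hyperparameters come from the text; the helper name is hypothetical, and every other setting is left at the library default, which is an assumption.

```python
# The three hyperparameters listed above; everything else is left at the
# library defaults (an assumption, since the text lists only these three).
XGB_PARAMS = {
    "objective": "binary:logistic",
    "eval_metric": "logloss",
    "n_estimators": 2000,
}


def build_classifier(**overrides):
    """Return an XGBClassifier with the parameters above.

    Requires the third-party xgboost package, imported lazily so the
    configuration can be inspected without it installed.
    """
    from xgboost import XGBClassifier

    return XGBClassifier(**{**XGB_PARAMS, **overrides})


# Typical use (X_train, y_train, X_val, y_val are placeholders):
#   clf = build_classifier()
#   clf.fit(X_train, y_train, eval_set=[(X_val, y_val)])
#   sybil_scores = clf.predict_proba(X_val)[:, 1]
```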
Studying the results of the XGBoost classifier, we find three distinct scenarios in which interacting addresses are provisioned, each with its own traits and predictive performance.
Addresses are classified by the characteristics of their gas provider, which may have interacted with LayerZero, may be a labeled address (such as a CEX or DEX), or may be an externally owned account (EOA) that did not interact with LayerZero.
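This three-way grouping can be expressed as a small lookup. The sketch below assumes two precomputed sets, one of addresses that interacted with LayerZero and one of labeled addresses; the names and the precedence given to the interacting set when a provider is in both are assumptions.

```python
def provider_category(provider, interacting, labeled):
    """Assign an address to a category based on its gas provider.

    `interacting` is the set of addresses that interacted with LayerZero;
    `labeled` is the set of labeled addresses (for example CEX or DEX).
    Assumption: a provider found in both sets counts as interacting.
    """
    if provider in interacting:
        return "interacting"
    if provider in labeled:
        return "labeled"
    return "non_interacting_eoa"


# Toy sets (hypothetical addresses).
interacting = {"0xaaa"}
labeled = {"0xcex"}
for provider in ["0xaaa", "0xcex", "0xzzz"]:
    print(provider, provider_category(provider, interacting, labeled))
```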

Interacting Addresses Provisioned by Other Interacting Addresses
The first category consists of addresses provisioned by another address that itself interacted with LayerZero.
These addresses represent a peer-to-peer dynamic, where one interacting address directly provisions another. This scenario is particularly noteworthy because the provisioning and receiving entities share behavioral patterns that are more aligned with genuine network interactions.

The XGBoost model excels at detecting Sybils in this group: on the Sybil class it reaches a precision of 0.888 and a recall of 0.804, for an F1 score of 0.844.
Confusion Matrix (Optimal Threshold, Validation Set; rows are actual classes, columns are predicted classes):
[[7149   34]
 [  66  270]]
Classification Report (Optimal Threshold, Validation Set):
              precision  recall  f1-score  support
False             0.991   0.995     0.993     7183
True              0.888   0.804     0.844      336
accuracy                            0.987     7519
macro avg         0.940   0.899     0.918     7519
weighted avg      0.986   0.987     0.986     7519
For this case, the shape of the precision-recall curve is close to optimal.
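The Sybil-class figures in the report follow directly from the confusion matrix. As a quick check, the sketch below recomputes them, assuming the usual layout in which rows are actual classes and columns are predicted classes; the function name is hypothetical.

```python
def sybil_metrics(confusion):
    """Precision, recall and F1 for the positive (Sybil) class.

    `confusion` is [[TN, FP], [FN, TP]], with rows as actual classes.
    """
    (tn, fp), (fn, tp) = confusion
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


precision, recall, f1 = sybil_metrics([[7149, 34], [66, 270]])
print(f"{precision:.3f} {recall:.3f} {f1:.3f}")  # 0.888 0.804 0.844
```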

Interacting Addresses Provisioned by Labeled Addresses
The second category consists of addresses that were provisioned by a CEX, DEX, or other labeled address.
Here, the provisioning entity is a well-known centralized or decentralized exchange (CEX or DEX). These labeled entities play a critical role in facilitating interactions on the network. The addresses they provision tend to exhibit semi-regular patterns of activity that reflect both legitimate and Sybil-like behaviors. The model performs moderately well in this category, with a precision of 0.720 and a recall of 0.764 on the Sybil class (F1 score 0.741), as these addresses often blur the line between organic and anomalous behavior.

Confusion Matrix (Optimal Threshold, Validation Set; rows are actual classes, columns are predicted classes):
[[59102   990]
 [  787  2548]]
Classification Report (Optimal Threshold, Validation Set):
              precision  recall  f1-score  support
False             0.987   0.984     0.985    60092
True              0.720   0.764     0.741     3335
accuracy                            0.972    63427
macro avg         0.854   0.874     0.863    63427
weighted avg      0.973   0.972     0.972    63427
Interacting Addresses Provisioned by Non-Interacting EOAs
The third scenario involves addresses provisioned by externally owned accounts (EOAs) that did not themselves interact with LayerZero. These accounts provision interacting addresses but lack the behavioral depth seen in the previous categories.
This absence of interaction data makes them harder to evaluate. As a result, the model struggles in this group, recovering fewer than half of the Sybils (recall 0.448, F1 score 0.556). These provisioning accounts might be dormant, newly created, or simply sporadic in their activity, which challenges standard detection mechanisms.

Classification Report (Optimal Threshold, Validation Set):
              precision  recall  f1-score  support
False             0.996   0.999     0.997    20207
True              0.734   0.448     0.556      154
accuracy                            0.995    20361
macro avg         0.865   0.723     0.777    20361
weighted avg      0.994   0.995     0.994    20361
While the F1 score is low for this category, the precision-recall curve shows very high precision (1.0) for the highest-scoring observations in the validation set. This implies that, if the threshold for Sybil detection is set sufficiently high, the classifier will avoid most false positives.
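The trade-off described here can be illustrated with a small threshold sweep. The text does not define how the "optimal threshold" was chosen, so the sketch below assumes it is the threshold that maximizes F1 on the validation set; the scores and labels are toy values, not data from the study.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when every score >= threshold is flagged as Sybil."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def best_f1_threshold(scores, labels):
    """Return (threshold, f1) for the F1-maximizing threshold (an assumed criterion)."""
    best_f1, best_t = 0.0, None
    for t in sorted(set(scores)):
        p, r = precision_recall_at(scores, labels, t)
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_t, best_f1


# Toy validation scores (hypothetical): 1 marks a Sybil.
scores = [0.95, 0.90, 0.70, 0.65, 0.40, 0.30, 0.20, 0.10]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(precision_recall_at(scores, labels, 0.90))  # high threshold
print(best_f1_threshold(scores, labels))          # F1-maximizing threshold
```

In this toy example a threshold of 0.90 flags no false positives but finds only half of the Sybils, while the F1-maximizing threshold trades some precision for full recall.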

This third category is the least understood, and the best candidate for future investigation.
Overall Result
Combining the three categories of interacting addresses gives a positive result overall, with an F1 score of 0.743.
Classification Report (Optimal Threshold, Validation Set):
              precision  recall  f1-score  support
False             0.988   0.991     0.989    87481
True              0.770   0.719     0.743     3824
accuracy                            0.979    91305
macro avg         0.879   0.855     0.866    91305
weighted avg      0.979   0.979     0.979    91305
This investigation has uncovered three categories of addresses interacting with the LayerZero network: those provisioned by another interacting address, those provisioned by a labeled address such as a CEX or DEX, and those provisioned by an EOA that did not interact with LayerZero. The final category proves to be the hardest to classify, and deeper analysis of this category's metrics may be productive for refining this XGBoost model.
leaf_gas_distribution_entropy: Entropy of gas provision amounts at the leaf nodes.
$$H_{\text{leaf}} = -\sum_{j=1}^{M} p_j \log(p_j)$$
where $p_j$ is the proportion of the total gas provision at leaf node $j$, and $M$ is the number of leaf nodes.
leaf_gas_distribution_skewness: Skewness of gas amounts at leaf nodes.
$$\text{Skewness}_{\text{leaf}} = \frac{\frac{1}{M} \sum_{j=1}^{M} (x_j - \bar{x}_{\text{leaf}})^3}{\left(\frac{1}{M} \sum_{j=1}^{M} (x_j - \bar{x}_{\text{leaf}})^2\right)^{3/2}}$$