
Sybil Detection by XGBoost
Public blockchains are transparent, accurate, and comprehensive records of their entire history. These freely available data sets are some of the largest and cleanest in the world, and they are highly amenable to the application of machine learning. In 2024, the Ethereum blockchain has about 200 million addresses, which is the same as the count of active websites on the internet. It is a global, public, financial dataset of internet scale. Ranking, classification, personalization, co-occurren...

A Taxonomy of LayerZero Network Users Arising from Sybil Analysis
IntroductionA critical threat to the integrity of peer-to-peer networks, especially within blockchain ecosystems, is the Sybil attack. First identified by Douceur, this type of attack involves a single adversary creating multiple fake identities—known as Sybils—to undermine the network. These false identities can be leveraged to manipulate consensus mechanisms, disrupt the fair distribution of resources, or even execute double-spending attacks. Sybil attacks are prevalent across a variety of ...

Decentralized Onchain Microcredit
(Cover image credit: World Bank Flickr, used under a Creative Commons License.)An Onchain Approach to MicrocreditThe World Bank estimates that 1.4 billion people worldwide remained excluded from the global financial system, unable to open accounts, secure loans, or build credit histories. For these individuals, the barriers to economic opportunity are immense. Traditional credit scoring relies on access to formal financial data, including bank accounts, payment histories, and income documenta...
Applied scientist studying algorithmic reputation and identity.



Sybil Detection by XGBoost
Public blockchains are transparent, accurate, and comprehensive records of their entire history. These freely available data sets are some of the largest and cleanest in the world, and they are highly amenable to the application of machine learning. In 2024, the Ethereum blockchain has about 200 million addresses, which is the same as the count of active websites on the internet. It is a global, public, financial dataset of internet scale. Ranking, classification, personalization, co-occurren...

A Taxonomy of LayerZero Network Users Arising from Sybil Analysis
IntroductionA critical threat to the integrity of peer-to-peer networks, especially within blockchain ecosystems, is the Sybil attack. First identified by Douceur, this type of attack involves a single adversary creating multiple fake identities—known as Sybils—to undermine the network. These false identities can be leveraged to manipulate consensus mechanisms, disrupt the fair distribution of resources, or even execute double-spending attacks. Sybil attacks are prevalent across a variety of ...

Decentralized Onchain Microcredit
(Cover image credit: World Bank Flickr, used under a Creative Commons License.)An Onchain Approach to MicrocreditThe World Bank estimates that 1.4 billion people worldwide remained excluded from the global financial system, unable to open accounts, secure loans, or build credit histories. For these individuals, the barriers to economic opportunity are immense. Traditional credit scoring relies on access to formal financial data, including bank accounts, payment histories, and income documenta...
Share Dialog
Share Dialog
Applied scientist studying algorithmic reputation and identity.

Subscribe to Scott Onchain

Subscribe to Scott Onchain
In decentralized networks, a sybil attacker refers to one entity impersonating many for the purpose of exploiting the network. Detection and mitigation of of sybil attackers is crucial to maintaining the stability of decentralized networks. A common example in blockchain is when a network or application performs an “airdrop”, distributing a reward to users of the network. Such airdrops invite sybil attackers, trying to multiply their rewards.
The clustering coefficient of a node is a measure of the degree to which ’s neighborhood is complete. It is defined in terms of triangles of the node . For an undirected graph, a triangle of is a connected subgraph consisting of and two neighbors.
In decentralized networks, a sybil attacker refers to one entity impersonating many for the purpose of exploiting the network. Detection and mitigation of of sybil attackers is crucial to maintaining the stability of decentralized networks. A common example in blockchain is when a network or application performs an “airdrop”, distributing a reward to users of the network. Such airdrops invite sybil attackers, trying to multiply their rewards.
The clustering coefficient of a node is a measure of the degree to which ’s neighborhood is complete. It is defined in terms of triangles of the node . For an undirected graph, a triangle of is a connected subgraph consisting of and two neighbors.

For a directed graph, the connected subgraph must be a cycle.

In social networks, the clustering coefficient is used to find “small world” networks, i.e. communities that are highly interconnected. The use of clustering coefficients in this way is very similar to community detection used in sybil mitigation, in particular the Louvain Community Detection algorithm. The Louvain algorithm was used effectively to mitigate sybils in the first Arbitrum airdrop.
For the case of a weighted, undirected graph, the definition of the clustering coefficient is not standardized, but a common definition is average weight of all “triangles”:
$$c_u = \frac{1}{deg(u)(deg(u)-1))} \sum_{vw} (\hat{w}{uv} \hat{w}{uw} \hat{w}_{vw})^{1/3}.$$
Here, the weights are normalized by dividing by the maximum edge weight in the neighborhood. Note:
The denominator is the number of triangles if the neighborhood is fully connected (a clique).
The summand $$(\hat{w}{uv}\hat{w}{vw}\hat{w}_{wu})^{1/3}$$ is the geometric mean of the normalized weights.
In the case where there are as many triangles as in the hypothetical clique, the clustering coefficent is 1.
For directed, weighted graphs, the clustering component is typically defined as the number of directed triangles from divided by the number of possible directed triangles from . Let represent the number of actual directed triangles in the graph, and let denote the total number of possible directed triangles that could be formed, given the edges connecting to in either direction. Then the clustering coefficient is:
In the case where no self-interactions or bilateral edges are present, this simplifies to:
$$c_u = \frac{1}{deg_{out}(u)deg_{in}(u)} \sum_{vw} (\hat{w}{uv} \hat{w}{uw} \hat{w}_{vw})^{1/3},$$
which is similar to the undirected case, with the exception of the denominator.
Note that, for blockchain applications, both self-loops and bilateral edges are common, and this is a heuristic approximation to the referenced definition. In particular, it overcounts the number of possible triangles in the denominator when bilateral edges are present.
Using the preceding heuristic, it is straightforward to efficiently extract the clustering coefficients using a SQL dialect like Trino or SparkSQL. To illustrate, we will calculate the heuristic clustering coefficient for each node in the set of all Ethereum addresses that were subject to the Layer0 sybil mitigation effort. These are all Ethereum addresses that interacted with the Layer0 protocol through a smart contract prior to May 1, 2024.
The data analytics platform Flipside has a powerful free account which allows this SQL extraction. The Flipside query here performs the extraction in feasible time. Note that the QUALIFY clause is used to chunk the output data to the limitations of the free account, which allows at most 100000 record download at a time.
This article described a heuristic for estimating the directed, weighted clustering coefficent feasibly in SQL. This is a feature which can be used, for example, in machine learning models for sybil detection.
This note estimates the weighted directed clustering coefficient. Specifically for the case of sybil detection, unweighted directed coefficients can be of potentially greater utility. In the case where there are many small transaction amounts in the neighborhood of , which is often the case in a sybil attack, small transaction amounts can reduce the value of the numerator and the clustering coefficient.

For a directed graph, the connected subgraph must be a cycle.

In social networks, the clustering coefficient is used to find “small world” networks, i.e. communities that are highly interconnected. The use of clustering coefficients in this way is very similar to community detection used in sybil mitigation, in particular the Louvain Community Detection algorithm. The Louvain algorithm was used effectively to mitigate sybils in the first Arbitrum airdrop.
For the case of a weighted, undirected graph, the definition of the clustering coefficient is not standardized, but a common definition is average weight of all “triangles”:
$$c_u = \frac{1}{deg(u)(deg(u)-1))} \sum_{vw} (\hat{w}{uv} \hat{w}{uw} \hat{w}_{vw})^{1/3}.$$
Here, the weights are normalized by dividing by the maximum edge weight in the neighborhood. Note:
The denominator is the number of triangles if the neighborhood is fully connected (a clique).
The summand $$(\hat{w}{uv}\hat{w}{vw}\hat{w}_{wu})^{1/3}$$ is the geometric mean of the normalized weights.
In the case where there are as many triangles as in the hypothetical clique, the clustering coefficent is 1.
For directed, weighted graphs, the clustering component is typically defined as the number of directed triangles from divided by the number of possible directed triangles from . Let represent the number of actual directed triangles in the graph, and let denote the total number of possible directed triangles that could be formed, given the edges connecting to in either direction. Then the clustering coefficient is:
In the case where no self-interactions or bilateral edges are present, this simplifies to:
$$c_u = \frac{1}{deg_{out}(u)deg_{in}(u)} \sum_{vw} (\hat{w}{uv} \hat{w}{uw} \hat{w}_{vw})^{1/3},$$
which is similar to the undirected case, with the exception of the denominator.
Note that, for blockchain applications, both self-loops and bilateral edges are common, and this is a heuristic approximation to the referenced definition. In particular, it overcounts the number of possible triangles in the denominator when bilateral edges are present.
Using the preceding heuristic, it is straightforward to efficiently extract the clustering coefficients using a SQL dialect like Trino or SparkSQL. To illustrate, we will calculate the heuristic clustering coefficient for each node in the set of all Ethereum addresses that were subject to the Layer0 sybil mitigation effort. These are all Ethereum addresses that interacted with the Layer0 protocol through a smart contract prior to May 1, 2024.
The data analytics platform Flipside has a powerful free account which allows this SQL extraction. The Flipside query here performs the extraction in feasible time. Note that the QUALIFY clause is used to chunk the output data to the limitations of the free account, which allows at most 100000 record download at a time.
This article described a heuristic for estimating the directed, weighted clustering coefficent feasibly in SQL. This is a feature which can be used, for example, in machine learning models for sybil detection.
This note estimates the weighted directed clustering coefficient. Specifically for the case of sybil detection, unweighted directed coefficients can be of potentially greater utility. In the case where there are many small transaction amounts in the neighborhood of , which is often the case in a sybil attack, small transaction amounts can reduce the value of the numerator and the clustering coefficient.
<100 subscribers
<100 subscribers
No activity yet