# Ethereum analytics with BigQuery

By [Nick Johnson](https://paragraph.com/@nick-2) · 2021-11-27

---

While Dune Analytics has rightly gotten a lot of positive attention recently for providing an easy-to-use platform for doing analytics over Ethereum datasets, you may not know that Google also provides a comprehensive dataset of Ethereum data on its large-scale analytics querying platform, [BigQuery](https://cloud.google.com/bigquery/).

Not only are the datasets comprehensive - even including a table of call execution traces for every Ethereum transaction - they also benefit from BigQuery’s distributed nature, meaning that even large and complex queries can be executed quickly. I used BigQuery for both efficiently calculating the ENS airdrop amounts and for our [ENS dashboard](https://datastudio.google.com/s/nkFLZVr363c), and I thought people might find a few tips on how to put the data to work useful.

Where to find it
================

The Ethereum data can be found in two public projects in the [BigQuery console](https://console.cloud.google.com/bigquery). The first is the `crypto_ethereum` dataset under the `bigquery-public-data` project, which contains general Ethereum-wide datasets: Tables of blocks, logs, transactions, balances, etc.

The second project is `blockchain-etl`, which contains numerous public datasets for Ethereum projects. For example, `blockchain-etl.ethereum_ens` contains tables of events for most major contracts in ENS. If you’re wanting to produce analytics or do data mining for a specific project, look here first.

How to query it
===============

If you’ve used another analytics product like Dune Analytics, this will mostly be pretty familiar; BigQuery uses an SQL dialect just like Dune. For example, if you wanted to get a list of every address used as an ENS resolver, you could do it like this:

    SELECT
      resolver,
      COUNT(*) AS count
    FROM
      `blockchain-etl.ethereum_ens.ENSRegistryWithFallback_event_NewResolver`
    GROUP BY
      resolver
    ORDER BY
      count DESC;
    

Likewise for the `crypto_ethereum` dataset. Want a leaderboard of accounts that have spent the most on transaction fees? Here you go:

    SELECT
      from_address,
      SUM(CAST(receipt_gas_used AS numeric) * CAST(COALESCE(gas_price, receipt_effective_gas_price) AS numeric)) / 1e18 AS total_fees
    FROM
      `bigquery-public-data.crypto_ethereum.transactions`
    GROUP BY
      from_address
    ORDER BY
      total_fees DESC
    LIMIT
      100;
    

Tips
====

Aggregating Deltas
------------------

A lot of Ethereum event data has a common format: each event describes a change to some underlying state. Often what you want, however, is the state itself, rather than the delta. For example, ERC20 transfers are logged as a `Transfer` event, which includes the from address, to address, and amount transferred, but often what you want is a balance sheet of all account balances. For this sort of query, simple aggregates generally suffice. This query, for example, returns a balance sheet for $ENS:

    SELECT
      address,
      SUM(value) AS balance
    FROM (
      SELECT
        token_address,
        from_address AS address,
        -CAST(value AS bigdecimal) / 1e18 AS value
      FROM
        `bigquery-public-data.crypto_ethereum.token_transfers`
      UNION ALL
      SELECT
        token_address,
        to_address AS address,
        CAST(value AS bigdecimal) / 1e18 AS value
      FROM
        `bigquery-public-data.crypto_ethereum.token_transfers` )
    WHERE
      token_address = '0xc18360217d8f7ab5e7c516566761ea12ce7f9d72'
    GROUP BY
      address
    ORDER BY
      balance DESC;
    

Note we’re including the `token_transfers` table here twice, once with negative values for the sender, and once with positive values for the recipient.

Creating a time-series from deltas
----------------------------------

What if you want a balance history - the balance of an address after each transaction? This is where window functions (also called analytic functions) come in incredibly useful:

    SELECT
      block_timestamp AS timestamp,
      address,
      SUM(value) OVER (PARTITION BY address ORDER BY block_timestamp ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS balance
    FROM (
      SELECT
        block_timestamp,
        token_address,
        from_address AS address,
        -CAST(value AS bigdecimal) / 1e18 AS value
      FROM
        `bigquery-public-data.crypto_ethereum.token_transfers`
      UNION ALL
      SELECT
        block_timestamp,
        token_address,
        to_address AS address,
        CAST(value AS bigdecimal) / 1e18 AS value
      FROM
        `bigquery-public-data.crypto_ethereum.token_transfers` )
    WHERE
      token_address = '0xc18360217d8f7ab5e7c516566761ea12ce7f9d72'
    ORDER BY
      balance DESC;
    

Note the similarities to the previous query - but now we’re no longer grouping the results. Here, the SUM function isn’t aggregating records, it’s summing over other rows in the input set, totalling up the values of all previous transactions for that account. Note how the OVER clause specifies how to partition the rows (by address) and how to order them (by timestamp) and which rows to sum (from the start to the current row) - all resulting in a correct total for our use-case. The result will be one row for every token transaction, but with the balance after the transaction rather than the amount transacted.

Creating deltas from a time series
----------------------------------

What if you want to do the reverse - you have the new value, but instead you want the delta? We see this with ENS registration and renewal events; the contract outputs the expiry time, and we may want to know how long the name was registered or renewed for. Analytics queries come in useful here too, this time with the LAG function, which references a previous row in the input set:

    SELECT
      block_timestamp,
      name,
      CAST(expires AS int64) - COALESCE(LAG(CAST(expires AS int64), 1) OVER(PARTITION BY name ORDER BY block_timestamp), UNIX_SECONDS(block_timestamp)) AS duration
    FROM (
      SELECT
        block_timestamp,
        name,
        expires
      FROM
        `blockchain-etl.ethereum_ens.ETHRegistrarController3_event_NameRegistered`
      UNION ALL
      SELECT
        block_timestamp,
        name,
        expires
      FROM
        `blockchain-etl.ethereum_ens.ETHRegistrarController3_event_NameRenewed`)
    

LAG takes the expression to retrieve and how many rows back to retrieve it - and the OVER expression specifies how the rows should be partitioned and ordered. We also use COALESCE to provide a dynamic value for when a row is the first in the result set (eg, it’s being registered for the first time).

It’s worth noting that this won’t give totally accurate results; if a name expires and is reregistered, it will show the previous registration taking over from when the name expired, not from when it was reregistered. Fixing this is left as an exercise for the reader.

JavaScript user-defined functions
---------------------------------

Finally, what if you want to do something there isn’t a built-in function for - such as converting between a 256 bit int and its hex representation? Fortunately, BigQuery has you covered, with support for JavaScript user-defined functions, which can even import libraries:

    CREATE OR REPLACE FUNCTION `project.dataset.int_str_to_hash`(data STRING) RETURNS STRING LANGUAGE js
    OPTIONS (library=["gs://blockchain-etl-bigquery/ethers.js"]) AS R"""
    return '0x' + ethers.utils.bigNumberify(data).toHexString().slice(2).padStart(64, '0');
    """;
    

Extracting your own event data
------------------------------

All of this is great, but what if you want to do analytics on events that haven’t been conveniently extracted to a table and decoded by the Blockchain ETL team? If your dataset is widely useful, you may want to follow the instructions in [this post](https://medium.com/@ASvanevik/how-to-get-any-ethereum-smart-contract-into-bigquery-in-8-mins-bab5db1fdeee), to get your dataset added to the blockchain\_etl project. If you don’t have time to wait, or you’re working on something more bespoke, however, it’s possible to extract this data yourself using JavaScript UDFs - something I’ll go into in detail in my next post.

DIY
===

If you want your own hosted version of the Ethereum ETL data, you’re in luck: The team has posted the source publicly on GitHub, so you can reproduce the datasets yourself:

[

GitHub - blockchain-etl/ethereum-etl: Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

Python scripts for ETL (extract, transform and load) jobs for Ethereum blocks, transactions, ERC20 / ERC721 tokens, transfers, receipts, logs, contracts, internal transactions. Data is available in Google BigQuery https://goo.gl/oY5BCQ - blockchain-etl/ethereum-etl

https://github.com

![](https://storage.googleapis.com/papyrus_images/c5d67c2db692a954148395684cb5352cf82b13e0a42eab21dd2d9cd485e6899f.png)

](https://github.com/blockchain-etl/ethereum-etl)

---

*Originally published on [Nick Johnson](https://paragraph.com/@nick-2/ethereum-analytics-with-bigquery)*
