In this article, we'll explore how to build a Solana data pipeline in Python to analyze Jito fees. We'll use Solana-py to examine block and transaction-level data for a single slot using a free Solana RPC. A full notebook example with extensive comments is available to accompany this post. I highly encourage you to read through the notebook and experiment with the code.
The Solana data ecosystem is still in its early stages and offers few tools and resources for data analysis. Analyzing Solana data presents unique challenges: transactions are not standardized, and sifting through the data generated at Solana's 1,000+ TPS is resource-intensive.
Before diving into the technical details, it's crucial to understand the importance of planning your ETL (Extract, Transform, Load) pipeline. A well-thought-out pipeline not only simplifies the data extraction process but also makes the subsequent analysis more straightforward and efficient. Careful planning helps to identify the necessary data, the transformations required, and how the data will be loaded into the analysis environment. This preparation can save significant time and effort, especially when dealing with large datasets like those on Solana.
Solana data is stored in various sections of the Solana block object. For example, fees, compute units, and account balances are found in transactions.meta, while account keys are stored in transaction.message. The block number and timestamp are located in blockHeight and blockTime, respectively. Instead of storing each part of the block object separately and then combining them, we extract and combine all this data into a single dataframe, creating a unified dataset.
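As a rough illustration of this flattening step, the sketch below walks a block in the JSON shape returned by the `getBlock` RPC method and emits one row per transaction, repeating the block-level fields on every row. The function name `flatten_block`, the output column names, and the sample values are hypothetical and chosen for illustration; this is not necessarily how the accompanying notebook does it.

```python
from typing import Any


def flatten_block(block: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one row per transaction, with block-level fields repeated on each row."""
    rows = []
    for tx_index, tx in enumerate(block.get("transactions", [])):
        meta = tx.get("meta") or {}
        message = tx["transaction"]["message"]
        rows.append(
            {
                # Block-level fields
                "block_height": block.get("blockHeight"),
                "block_time": block.get("blockTime"),
                # Position of the transaction inside the block
                "tx_index": tx_index,
                "signature": tx["transaction"]["signatures"][0],
                # Fields from the transaction's meta section
                "fee": meta.get("fee"),
                "compute_units": meta.get("computeUnitsConsumed"),
                "pre_balances": meta.get("preBalances", []),
                "post_balances": meta.get("postBalances", []),
                # Fields from the transaction's message section
                "account_keys": message["accountKeys"],
            }
        )
    return rows


# Made-up sample data in the getBlock JSON shape (not real chain data).
sample_block = {
    "blockHeight": 100,
    "blockTime": 1_700_000_000,
    "transactions": [
        {
            "meta": {"fee": 5000, "computeUnitsConsumed": 450,
                     "preBalances": [10_000, 0], "postBalances": [5_000, 0]},
            "transaction": {"signatures": ["sigA"],
                            "message": {"accountKeys": ["payerA", "programA"]}},
        },
        {
            "meta": {"fee": 7000, "computeUnitsConsumed": 32_000,
                     "preBalances": [20_000, 1], "postBalances": [13_000, 1]},
            "transaction": {"signatures": ["sigB"],
                            "message": {"accountKeys": ["payerB", "programB"]}},
        },
    ],
}

tx_rows = flatten_block(sample_block)
```

The resulting list of dictionaries can be passed directly to `pandas.DataFrame(tx_rows)` to obtain the single unified dataframe.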
One challenge in extracting Solana data is dealing with multiple signatures, locked account states, and balance changes within each transaction. To analyze Jito fees, we need to isolate the set of Jito tip accounts within the transactions. To keep this manageable, we apply a transformation at extraction time that writes each account key, together with its pre- and post-transaction balance, as a separate row. The tradeoff is that the row count for each block grows roughly fivefold. This technique is effective for analyzing a single block with a few thousand rows but may be less feasible for analyzing 10 million blocks with billions of rows.
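The per-account transformation could look something like the sketch below. It assumes the `getBlock` JSON shape, where `preBalances` and `postBalances` line up with the message's account keys followed by any addresses loaded from lookup tables (`meta.loadedAddresses`, writable first, then readonly). The function names are hypothetical, and `JITO_TIP_ACCOUNTS` holds placeholder strings only: substitute the tip account addresses published by Jito. Treating any positive balance change on a tip account as a tip is also a simplifying assumption.

```python
from typing import Any

# PLACEHOLDERS, not real addresses: replace with the tip accounts published by Jito.
JITO_TIP_ACCOUNTS = {"JitoTipPlaceholderA", "JitoTipPlaceholderB"}


def explode_accounts(block: dict[str, Any]) -> list[dict[str, Any]]:
    """Return one row per (transaction, account) with its balance change in lamports."""
    rows = []
    for tx_index, tx in enumerate(block.get("transactions", [])):
        meta = tx.get("meta") or {}
        keys = list(tx["transaction"]["message"]["accountKeys"])
        # Versioned transactions append lookup-table addresses after the static keys.
        loaded = meta.get("loadedAddresses") or {}
        keys += loaded.get("writable", []) + loaded.get("readonly", [])
        pre = meta.get("preBalances", [])
        post = meta.get("postBalances", [])
        for account_index, (key, pre_bal, post_bal) in enumerate(zip(keys, pre, post)):
            rows.append(
                {
                    "tx_index": tx_index,
                    "account_index": account_index,
                    "account": key,
                    "pre_balance": pre_bal,
                    "post_balance": post_bal,
                    "balance_change": post_bal - pre_bal,
                }
            )
    return rows


def jito_tips_by_tx(account_rows: list[dict[str, Any]], tip_accounts: set[str]) -> dict[int, int]:
    """Sum positive balance changes on tip accounts, keyed by transaction index."""
    tips: dict[int, int] = {}
    for row in account_rows:
        if row["account"] in tip_accounts and row["balance_change"] > 0:
            tips[row["tx_index"]] = tips.get(row["tx_index"], 0) + row["balance_change"]
    return tips


# Made-up sample block: the first transaction pays a 100,000 lamport tip.
sample_block = {
    "transactions": [
        {
            "meta": {"fee": 5000,
                     "preBalances": [1_000_000, 500, 1],
                     "postBalances": [895_000, 100_500, 1]},
            "transaction": {"message": {
                "accountKeys": ["payerA", "JitoTipPlaceholderA", "programA"]}},
        },
        {
            "meta": {"fee": 5000,
                     "preBalances": [2_000_000, 0, 7],
                     "postBalances": [1_995_000, 0, 7],
                     "loadedAddresses": {"writable": ["lookupLoaded"], "readonly": []}},
            "transaction": {"message": {"accountKeys": ["payerB", "otherAccount"]}},
        },
    ],
}

account_rows = explode_accounts(sample_block)
tips = jito_tips_by_tx(account_rows, JITO_TIP_ACCOUNTS)
```

Filtering on the tip account set after exploding keeps the extraction step generic, at the cost of the larger row count described above.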
The results show some transaction-level insights for a single block (277533216), which had 2,588 transactions. On Ethereum, the top of the block is heavily contested; this block shows no such pattern for a Solana slot. The chart below illustrates that some of the highest fees are paid in the 500-1000 transaction index range.
The next chart shows that higher fees do not correlate with higher compute usage. In fact, most of the higher fees are clustered in transactions that consume few compute units. This indicates that the most highly contested state accounts for very little of the overall compute in Solana blocks. The majority of transactions consume fewer than 200,000 compute units.
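One simple way to inspect this relationship without a chart is to bucket transactions by compute units consumed and compare the median fee per bucket. The sketch below assumes per-transaction rows with hypothetical `fee` and `compute_units` fields; the function name, the bucket size, and the sample values are made up for illustration and are not figures from the analyzed block.

```python
from statistics import median


def median_fee_by_cu_bucket(tx_rows, bucket_size=50_000):
    """Group transactions into compute-unit buckets and return the median fee of each."""
    buckets = {}
    for row in tx_rows:
        cu = row["compute_units"]
        if cu is None:  # some transactions may not report compute units
            continue
        bucket_start = (cu // bucket_size) * bucket_size
        buckets.setdefault(bucket_start, []).append(row["fee"])
    return {start: median(fees) for start, fees in sorted(buckets.items())}


# Made-up sample rows (fees in lamports).
sample_rows = [
    {"compute_units": 12_000, "fee": 105_000},
    {"compute_units": 30_000, "fee": 5_000},
    {"compute_units": 48_000, "fee": 25_000},
    {"compute_units": 180_000, "fee": 5_000},
    {"compute_units": 420_000, "fee": 6_000},
]

summary = median_fee_by_cu_bucket(sample_rows)
```

Each key in `summary` is the lower bound of a compute-unit bucket, so a pattern like the one described above would show up as higher median fees in the lowest buckets.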
The Jito tips for this block are distributed throughout the block. The next chart suggests a small cluster of Jito tips towards the end of the block (the highest transaction indexes), but the high outlier tips are dispersed throughout the block.
Finally, we compare Jito tips across transactions with varying compute usage. This chart shows a skew towards higher Jito tips for transactions that use fewer compute units, with a large cluster of tips paid by transactions consuming fewer than 25,000 compute units. Jito tips are also widely distributed within the block, but this sample alone does not make clear why.
Initial findings reveal interesting transaction-level insights for a single block. For instance, higher transaction fees in Solana are not necessarily correlated with higher compute usage, and Jito tips are distributed throughout the block with a notable cluster towards the end. It remains to be seen whether these transaction-level insights hold across a larger sample of blocks, but it is unlikely that this pseudo-randomly selected block is an outlier.
These insights point to an ETL workflow for further understanding the dynamics of Solana transactions, fees paid (including Jito tips), and compute usage. With the current data landscape, it is no small feat to analyze even a single week's worth of blocks (216,000 × 7 ≈ 1.5 million blocks). With an average block size of 1,000 transactions, this equates to roughly 1.5 billion transactions.
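The back-of-the-envelope numbers can be reproduced with a few lines of arithmetic. The sketch below assumes a 400 ms slot time with no skipped slots (which gives the 216,000 blocks per day used above), the 1,000-transaction average from this post, and the roughly fivefold row increase from writing one row per account.

```python
# Assumptions: ~400 ms per slot, every slot produces a block.
SLOT_TIME_MS = 400
MS_PER_DAY = 24 * 60 * 60 * 1000

AVG_TX_PER_BLOCK = 1_000   # average block size used in this post
ROWS_PER_TX = 5            # approximate per-account row expansion

blocks_per_day = MS_PER_DAY // SLOT_TIME_MS             # 216,000
blocks_per_week = blocks_per_day * 7                    # 1,512,000
tx_per_week = blocks_per_week * AVG_TX_PER_BLOCK        # 1,512,000,000
account_rows_per_week = tx_per_week * ROWS_PER_TX       # 7,560,000,000

print(f"{blocks_per_week:,} blocks, {tx_per_week:,} transactions, "
      f"{account_rows_per_week:,} per-account rows per week")
```

At one row per account, a single week already lands in the billions of rows, which is why the per-account layout suits single-block analysis better than bulk historical work.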