# Streaming Embedded OLAP Data Pipeline with LanceDB + DuckDB in Python

> Build a streaming embedded OLAP data pipeline locally from historical blockchain data in Python

**Published by:** [Defi, Data, Degen](https://paragraph.com/@evandekim/)
**Published on:** 2024-05-12
**Categories:** defi, data engineering, embedded olap, pipeline, lancedb, duckdb, ethereum, blockchain, on-chain data, polars, python, data streaming, sql, hypersync
**URL:** https://paragraph.com/@evandekim/streaming-embedded-olap-data-pipeline-with-lancedb-duckdb-in-python

## Content

### Intro

This post walks through a simple way to build an embedded streaming OLAP data pipeline that combines historical blockchain data from Hypersync with Lance, a mutable columnar storage format. Using the new Python package lancedb-tables, it is easy to create LanceDB tables, update them without a complex file management system, and stream real-time data.

### Python Data Ecosystem

Python's open source data ecosystem has taken massive leaps forward through the Apache Arrow standard. Apache Arrow plays a significant role in standardizing the Python data ecosystem by providing a unified, language-agnostic columnar memory format for data. This standardization is crucial for several reasons:

- **System Interoperability:** Arrow's language-agnostic format enhances data exchange across diverse systems and languages, minimizing serialization costs between different data processing tools.
For example, Arrow enables seamless transitions between dataframe and SQL APIs.

- **Community and Ecosystem Growth:** wide adoption across languages fosters a robust developer community, driving continuous enhancements in Arrow and its libraries.
- **Future-Proofing Data Engineering:** Arrow standardizes data formats and processing methods, supporting large-scale analytics and future data engineering innovations in Python.

Apache Arrow is what makes seamless integration between LanceDB, DuckDB, and Polars possible.

### Data Pipeline Stack

- **Hypersync** - query historical blockchain data
- **LanceDB** - write data to a LanceDB table
- **DuckDB** - run OLAP queries on the LanceDB table

### Hypersync

Envio is a real-time indexing solution designed specifically for EVM-compatible blockchains. It provides developers with Hypersync, a more efficient way to sync and aggregate blockchain data than JSON-RPC, offering up to 20-100x faster syncing on historical data. The following example uses the lancedb-tables package to sync a sample of 50,000 blocks from Hypersync into a local LanceDB table; example code and instructions can be found in this repository. The Hypersync client queries blocks and transactions from blocks 19,825,000 to 19,850,000, and running the backfill table example then backfills the LanceDB table from blocks 19,800,000 to 19,825,000. Polars serves as the analytical preprocessing API of choice, performing a bit of data cleaning before the data is written to the LanceDB table.

### LanceDB

LanceDB includes a powerful writer that makes it simple to read and write Lance datasets. The key reason is that LanceDB decouples read and write operations: all Lance files are read-only by design, so new writes create new fragments without touching existing data, allowing read and write nodes to operate independently in a distributed environment. This design ensures that reads of current data remain unaffected by concurrent writes.
This makes the Lance format behave like a mutable columnar storage format. Parquet, in contrast, is an immutable columnar storage format, so managing streaming data on top of it requires additional overhead and code complexity.

### DuckDB

The example notebook analytics.ipynb shows how easy it is to add OLAP functionality by running DuckDB on top of a LanceDB table. Notably, using DuckDB to count the number of unique blocks is roughly 10x faster than the equivalent Polars query.

### Conclusion

This was a short example of using a modern data engineering stack to build an embedded OLAP streaming data pipeline with ease in a local environment. Follow-up posts will show how to run this pipeline in a cloud environment and add more data sources.

## Publication Information

- [Defi, Data, Degen](https://paragraph.com/@evandekim/): Publication homepage
- [All Posts](https://paragraph.com/@evandekim/): More posts from this publication
- [RSS Feed](https://api.paragraph.com/blogs/rss/@evandekim): Subscribe to updates