Dagster & Delta Lake
About this integration
Delta Lake is a table storage format that pairs well with Dagster workflows. With this integration, you can use the Delta Lake I/O Manager to read and write your Dagster assets as Delta tables.
Here are some of the benefits that Delta Lake provides Dagster users:
- Native PyArrow integration for lazy computation of large datasets
- More efficient querying through file skipping, enabled by Z-ordering and liquid clustering
- Built-in vacuuming to remove unnecessary files and versions
- ACID transactions for reliable writes
- Smooth versioning integration (versions can be used to trigger downstream updates)
- Table statistics surfaced from the underlying file statistics
Installation
pip install dagster-deltalake
pip install dagster-deltalake-pandas
pip install dagster-deltalake-polars
About Delta Lake
Delta Lake is an open-source storage framework for building a Lakehouse architecture. It works with compute engines including Spark, PrestoDB, Flink, Trino, and Hive, and provides APIs for Scala, Java, Rust, and Python.