Dagster & lakeFS
lakeFS is a version control system for data at big-data scale. Integrating it with Dagster lets you use lakeFS versioning to track changes to your data. The integration gives you a complete lineage of your data, from the initial raw data to the transformed and processed data, which makes data transformations easier to understand and reproduce.
With the lakeFS and Dagster integration, you can ensure that data flowing through your Dagster jobs is easily reproducible. lakeFS provides a consistent view of your data across different versions, allowing you to troubleshoot pipeline runs and ensure consistent results.
Furthermore, with lakeFS branching, Dagster jobs can run on separate branches without additional storage costs. This isolates each run and lets you promote only high-quality data to production, giving you a CI/CD pipeline for your data.
Installation
pip install lakefs-client
Example
import lakefs_client
from lakefs_client import models
from lakefs_client.client import LakeFSClient

import dagster as dg

logger = dg.get_dagster_logger()

# Placeholder credentials and host; replace with your own lakeFS access key,
# secret key, and endpoint.
configuration = lakefs_client.Configuration()
configuration.username = "AAAA"
configuration.password = "BBBBB"
configuration.host = "https://my-org.us-east-1.lakefscloud.io"


@dg.asset
def create_branch(client: dg.ResourceParam[LakeFSClient]):
    # Create the "experiment" branch from "main" in the "test-repo" repository.
    branch_id = client.branches.create_branch(
        repository="test-repo",
        branch_creation=models.BranchCreation(name="experiment", source="main"),
    )
    logger.info(branch_id)


@dg.asset(deps=[create_branch])
def list_branches(client: dg.ResourceParam[LakeFSClient]):
    # Runs after create_branch, so the listing includes the new branch.
    branches = client.branches.list_branches(repository="test-repo")
    logger.info(branches)


defs = dg.Definitions(
    assets=[create_branch, list_branches],
    resources={"client": LakeFSClient(configuration)},
)
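The introduction mentions promoting only high-quality data to production. The sketch below shows one way that step might look once work on the experiment branch is done. It is illustrative rather than part of the integration: `promote_branch` and its `is_valid` hook are hypothetical names, and the client is assumed to expose `refs.merge_into_branch(...)` as the `lakefs_client` package's `LakeFSClient` does. A stand-in client is used so the snippet runs without a lakeFS server.

```python
from types import SimpleNamespace
from typing import Any, Callable


def promote_branch(
    client: Any,
    repository: str,
    source_branch: str,
    destination_branch: str,
    is_valid: Callable[[], bool],
) -> bool:
    """Merge source_branch into destination_branch only if is_valid() passes.

    `client` is assumed to expose `refs.merge_into_branch(...)`;
    `is_valid` is a hypothetical hook for the pipeline's data-quality checks.
    """
    if not is_valid():
        # Leave the destination branch untouched when checks fail.
        return False
    client.refs.merge_into_branch(
        repository=repository,
        source_ref=source_branch,
        destination_branch=destination_branch,
    )
    return True


# Stand-in client that records merge calls instead of contacting lakeFS.
calls = []
fake_client = SimpleNamespace(
    refs=SimpleNamespace(merge_into_branch=lambda **kwargs: calls.append(kwargs))
)

promoted = promote_branch(fake_client, "test-repo", "experiment", "main", lambda: True)
rejected = promote_branch(fake_client, "test-repo", "experiment", "main", lambda: False)
```

In a Dagster project, a function like this could be called from an asset that depends on the assets writing to the experiment branch, with the real `LakeFSClient` resource passed in place of the stand-in.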
About lakeFS
lakeFS is on a mission to simplify the lives of data engineers, data scientists, and analysts by providing a data version control platform at scale.