Ingesting data with Dagster

note

This guide focuses on batch data ingestion, as streaming data ingestion doesn't typically rely on an orchestrator to kick off or coordinate computations. However, streaming data assets can still be represented in Dagster for lineage purposes.

Dagster is often used to orchestrate the ingestion of data into a data warehouse or data lake, where it can be queried and transformed. To ingest data with Dagster, you can use pre-built connectors or write your own custom code.

Prerequisites

To follow this guide, you'll need:

How Dagster supports data ingestion

As a data orchestrator, Dagster helps with data ingestion because it can:

  • Automatically kick off computations that ingest data, thus removing the need for manual intervention
  • Coordinate data ingestion with downstream data transformation, such as rebuilding a set of dbt models after upstream data is updated
  • Represent ingested data assets in an asset graph, which enables understanding what ingested data exists, how ingested data is used, and where data is ingested from
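For instance, the coordination point above might look like the following minimal sketch, where an ingestion asset is paired with a downstream transformation that Dagster automatically re-materializes whenever the upstream data is updated. The asset names and ingestion logic here are illustrative, not part of any specific integration:

```python
from dagster import AutomationCondition, asset


@asset
def raw_orders() -> None:
    # Ingest data from a source system into the warehouse.
    # (Ingestion logic omitted; this could also be an asset produced
    # by a pre-built integration.)
    ...


@asset(
    deps=[raw_orders],
    automation_condition=AutomationCondition.eager(),
)
def orders_cleaned() -> None:
    # Downstream transformation that Dagster automatically re-runs
    # after raw_orders is materialized.
    ...
```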

Orchestrating data ingestion tools

Dagster integrates with data ingestion tools such as the following, enabling you to sync diverse data sources into data warehouse tables using pre-built connectors:

  • Airbyte
  • dlt (data load tool)
  • Fivetran
  • Sling
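As a rough sketch of what this looks like in practice, the dagster-airbyte integration can load every connection on an Airbyte instance as a set of Dagster assets. The host, port, and credentials below are placeholders for a self-hosted Airbyte deployment; consult the integration's documentation for your setup:

```python
from dagster import Definitions, EnvVar
from dagster_airbyte import AirbyteResource, load_assets_from_airbyte_instance

# Placeholder connection details for a self-hosted Airbyte instance.
airbyte_instance = AirbyteResource(
    host="localhost",
    port="8000",
    username="airbyte",
    password=EnvVar("AIRBYTE_PASSWORD"),
)

# Each Airbyte connection's destination tables become Dagster assets.
airbyte_assets = load_assets_from_airbyte_instance(airbyte_instance)

defs = Definitions(assets=[airbyte_assets])
```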

Writing custom data ingestion pipelines

Writing custom code in a language like Python to ingest data into your platform is also a common approach. This is useful when you have unique data ingestion requirements that aren't addressed by existing tools, or when you prefer to keep your platform streamlined without adding new tools.

For example, imagine there's a CSV file of counties on the internet and you want to load it into your Snowflake data warehouse as a table. To do this, you might define a Dagster asset that represents that table in your warehouse. The asset's materialization function fetches data from the internet and loads it into that table.

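The snippet below is one way to write that asset, assuming the dagster-snowflake-pandas integration is installed. The CSV URL, table name, and Snowflake connection details are placeholders:

```python
import pandas as pd
from dagster import Definitions, EnvVar, asset
from dagster_snowflake_pandas import SnowflakePandasIOManager


@asset
def counties() -> pd.DataFrame:
    # Fetch the CSV from the internet; the I/O manager writes the
    # returned DataFrame to a Snowflake table named "counties".
    return pd.read_csv("https://example.com/counties.csv")  # placeholder URL


defs = Definitions(
    assets=[counties],
    resources={
        # The I/O manager handles loading the DataFrame into Snowflake.
        "io_manager": SnowflakePandasIOManager(
            account=EnvVar("SNOWFLAKE_ACCOUNT"),
            user=EnvVar("SNOWFLAKE_USER"),
            password=EnvVar("SNOWFLAKE_PASSWORD"),
            database="MY_DATABASE",
            schema="MY_SCHEMA",
        )
    },
)
```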

Next steps