Managing stored data with I/O managers

I/O managers in Dagster allow you to keep the code for data processing separate from the code for reading and writing data. This reduces repetitive code and makes it easier to change where your data is stored.

In many Dagster pipelines, assets can be broken down into the following steps:

  1. Reading data from a data store into memory
  2. Applying an in-memory transform
  3. Writing the transformed data to a data store

For assets that follow this pattern, an I/O manager can take over steps 1 and 3, streamlining the code that reads and writes data.

Prerequisites

To follow the steps in this guide, you'll need familiarity with:

  • Dagster assets
  • Dagster resources

Before you begin

I/O managers aren't required to use Dagster, nor are they the best option in all scenarios. If you find yourself writing the same code at the start and end of each asset to load and store data, an I/O manager may be useful. For example:

  • You have assets that are stored in the same location and follow a consistent set of rules to determine the storage path
  • You have assets that are stored differently in local, staging, and production environments
  • You have assets that load upstream dependencies into memory to do the computation

I/O managers may not be the best fit if:

  • You want to run SQL queries that create or update a table in a database
  • Your pipeline manages I/O on its own by using other libraries/tools that write to storage
  • Your assets won't fit in memory, such as a database table with billions of rows

As a general rule, if your pipeline becomes more complicated in order to use I/O managers, it's likely that I/O managers aren't a good fit. In these cases you should use deps to define dependencies.
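
For example, here's a minimal sketch of wiring up dependencies with deps instead of an I/O manager (the asset names and bodies here are hypothetical placeholders, not from the original docs):

```python
import dagster as dg


@dg.asset
def orders_table() -> None:
    # This asset manages its own I/O, e.g. by running a
    # CREATE TABLE ... AS SELECT statement against a database.
    ...


@dg.asset(deps=[orders_table])
def orders_summary() -> None:
    # deps only establishes execution order; Dagster does not load
    # orders_table into memory or pass data between the assets.
    ...
```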

Using I/O managers in assets

Consider the following example, which contains assets that construct a DuckDB connection object, read data from an upstream table, apply some in-memory transform, and write the result to a new table in DuckDB:

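A sketch of what such assets might look like. The CSV path, database file, and cleaning logic are illustrative assumptions, not the original example:

```python
import duckdb
import pandas as pd

import dagster as dg


@dg.asset
def raw_sales_data() -> None:
    # Read the source CSV, then write it to a DuckDB table by hand
    raw_df = pd.read_csv("data/raw_sales_data.csv")  # hypothetical path
    with duckdb.connect("sales.duckdb") as conn:
        conn.execute("CREATE OR REPLACE TABLE raw_sales_data AS SELECT * FROM raw_df")


@dg.asset(deps=[raw_sales_data])
def clean_sales_data() -> None:
    # Open another connection, load the upstream table, transform, write back
    with duckdb.connect("sales.duckdb") as conn:
        df = conn.execute("SELECT * FROM raw_sales_data").fetch_df()
        clean_df = df.dropna()
        conn.execute("CREATE OR REPLACE TABLE clean_sales_data AS SELECT * FROM clean_df")
```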

Using an I/O manager would remove the code that reads and writes data from the assets themselves, instead delegating it to the I/O manager. The assets would be left only with the code that applies transformations or retrieves the initial CSV file.

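A sketch of the same pipeline rewritten to use DuckDBPandasIOManager. The database path and cleaning logic are again illustrative:

```python
import pandas as pd

import dagster as dg
from dagster_duckdb_pandas import DuckDBPandasIOManager


@dg.asset
def raw_sales_data() -> pd.DataFrame:
    # The I/O manager stores the returned DataFrame as the raw_sales_data table
    return pd.read_csv("data/raw_sales_data.csv")  # hypothetical path


@dg.asset
def clean_sales_data(raw_sales_data: pd.DataFrame) -> pd.DataFrame:
    # The I/O manager loads the raw_sales_data table and passes it in as a DataFrame
    return raw_sales_data.dropna()


defs = dg.Definitions(
    assets=[raw_sales_data, clean_sales_data],
    resources={"io_manager": DuckDBPandasIOManager(database="sales.duckdb")},
)
```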

To load upstream assets using an I/O manager, specify the asset as an input parameter to the asset function. In this example, the DuckDBPandasIOManager I/O manager will read the DuckDB table with the same name as the upstream asset (raw_sales_data) and pass the data to clean_sales_data as a Pandas DataFrame.

To store data using an I/O manager, return the data from the asset function. The returned data must be a type that the configured I/O manager supports. This example uses Pandas DataFrames, which the DuckDBPandasIOManager will write to a DuckDB table with the same name as the asset.

Refer to the individual I/O manager documentation for details on valid types and how they store data.

Swapping data stores

With I/O managers, swapping data stores consists of changing the implementation of the I/O manager. The asset definitions, which only contain transformational logic, won't need to change.

In the following example, a Snowflake I/O manager replaces the DuckDB I/O manager.

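A sketch of the swap, where only the resource definition changes. The Snowflake account and database values are placeholders:

```python
import dagster as dg
from dagster_snowflake_pandas import SnowflakePandasIOManager

# raw_sales_data and clean_sales_data are the asset definitions
# from the previous example; they don't need to change.

defs = dg.Definitions(
    assets=[raw_sales_data, clean_sales_data],
    resources={
        "io_manager": SnowflakePandasIOManager(
            account="abc12345",  # placeholder account identifier
            user=dg.EnvVar("SNOWFLAKE_USER"),
            password=dg.EnvVar("SNOWFLAKE_PASSWORD"),
            database="SALES",  # placeholder database
        )
    },
)
```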

Built-in I/O managers

Dagster provides built-in I/O manager implementations for popular data stores and in-memory formats.

  • FilesystemIOManager: The default I/O manager. Stores outputs as pickle files on the local file system.
  • InMemoryIOManager: Stores outputs in memory. Primarily useful for unit testing.
  • S3PickleIOManager: Stores outputs as pickle files in Amazon Web Services S3.
  • ConfigurablePickledObjectADLS2IOManager: Stores outputs as pickle files in Azure ADLS2.
  • GCSPickleIOManager: Stores outputs as pickle files in Google Cloud Platform GCS.
  • BigQueryPandasIOManager: Stores Pandas DataFrame outputs in Google Cloud Platform BigQuery.
  • BigQueryPySparkIOManager: Stores PySpark DataFrame outputs in Google Cloud Platform BigQuery.
  • SnowflakePandasIOManager: Stores Pandas DataFrame outputs in Snowflake.
  • SnowflakePySparkIOManager: Stores PySpark DataFrame outputs in Snowflake.
  • DuckDBPandasIOManager: Stores Pandas DataFrame outputs in DuckDB.
  • DuckDBPySparkIOManager: Stores PySpark DataFrame outputs in DuckDB.
  • DuckDBPolarsIOManager: Stores Polars DataFrame outputs in DuckDB.
