Managing stored data with I/O managers
I/O managers in Dagster allow you to keep the code for data processing separate from the code for reading and writing data. This reduces repetitive code and makes it easier to change where your data is stored.
In many Dagster pipelines, assets can be broken down into the following steps:
- Reading data from a data store into memory
- Applying an in-memory transformation
- Writing the transformed data to a data store
For assets that follow this pattern, an I/O manager can streamline the code that handles reading and writing data to and from a source.
Before you begin
I/O managers aren't required to use Dagster, nor are they the best option in all scenarios. If you find yourself writing the same code at the start and end of each asset to load and store data, an I/O manager may be useful. For example:
- You have assets that are stored in the same location and follow a consistent set of rules to determine the storage path
- You have assets that are stored differently in local, staging, and production environments
- You have assets that load upstream dependencies into memory to do the computation
I/O managers may not be the best fit if:
- You want to run SQL queries that create or update a table in a database
- Your pipeline manages I/O on its own by using other libraries/tools that write to storage
- Your assets won't fit in memory, such as a database table with billions of rows
As a general rule, if your pipeline becomes more complicated in order to use I/O managers, it's likely that I/O managers aren't a good fit. In these cases, you should use deps to define dependencies.
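For instance, here is a minimal sketch of two assets that handle their own storage and are connected with deps. The orders.duckdb database, the orders.csv file, and its customer_id and amount columns are all hypothetical:

```python
import duckdb

import dagster as dg


@dg.asset
def raw_orders() -> None:
    # This asset manages its own I/O: it writes directly to DuckDB
    conn = duckdb.connect("orders.duckdb")
    conn.execute("CREATE OR REPLACE TABLE raw_orders AS SELECT * FROM 'orders.csv'")
    conn.close()


@dg.asset(deps=[raw_orders])
def orders_summary() -> None:
    # deps declares the dependency without loading raw_orders into memory
    conn = duckdb.connect("orders.duckdb")
    conn.execute(
        "CREATE OR REPLACE TABLE orders_summary AS "
        "SELECT customer_id, SUM(amount) AS total FROM raw_orders GROUP BY customer_id"
    )
    conn.close()
```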
Using I/O managers in assets
Consider the following example, which contains assets that construct a DuckDB connection object, read data from an upstream table, apply some in-memory transform, and write the result to a new table in DuckDB:
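A minimal sketch of such assets, assuming a local sales.duckdb database and a hypothetical sales.csv source file:

```python
import duckdb
import pandas as pd

import dagster as dg


@dg.asset
def raw_sales_data() -> None:
    # Read the source CSV and write it to a DuckDB table by hand
    raw_df = pd.read_csv("sales.csv")
    conn = duckdb.connect("sales.duckdb")
    conn.execute("CREATE OR REPLACE TABLE raw_sales_data AS SELECT * FROM raw_df")
    conn.close()


@dg.asset(deps=[raw_sales_data])
def clean_sales_data() -> None:
    # Read the upstream table, transform it in memory, and write the result back
    conn = duckdb.connect("sales.duckdb")
    df = conn.execute("SELECT * FROM raw_sales_data").df()
    clean_df = df.dropna()
    conn.execute("CREATE OR REPLACE TABLE clean_sales_data AS SELECT * FROM clean_df")
    conn.close()
```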
Using an I/O manager would remove the code that reads and writes data from the assets themselves, instead delegating it to the I/O manager. The assets would be left only with the code that applies transformations or retrieves the initial CSV file.
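A sketch of the same pipeline using the DuckDBPandasIOManager from the dagster-duckdb-pandas package (the sales.csv source file is again hypothetical):

```python
import pandas as pd
from dagster_duckdb_pandas import DuckDBPandasIOManager

import dagster as dg


@dg.asset
def raw_sales_data() -> pd.DataFrame:
    # Returning the DataFrame hands it to the I/O manager,
    # which writes it to a raw_sales_data table in DuckDB
    return pd.read_csv("sales.csv")


@dg.asset
def clean_sales_data(raw_sales_data: pd.DataFrame) -> pd.DataFrame:
    # Naming the parameter after the upstream asset tells the I/O manager
    # to load that table and pass it in as a DataFrame
    return raw_sales_data.dropna()


defs = dg.Definitions(
    assets=[raw_sales_data, clean_sales_data],
    resources={"io_manager": DuckDBPandasIOManager(database="sales.duckdb")},
)
```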
To load upstream assets using an I/O manager, specify the asset as an input parameter to the asset function. In this example, the DuckDBPandasIOManager I/O manager will read the DuckDB table with the same name as the upstream asset (raw_sales_data) and pass the data to clean_sales_data as a Pandas DataFrame.
To store data using an I/O manager, return the data from the asset function. The returned data must be a valid type. This example uses Pandas DataFrames, which the DuckDBPandasIOManager will write to a DuckDB table with the same name as the asset.
Refer to the individual I/O manager documentation for details on valid types and how they store data.
Swapping data stores
With I/O managers, swapping data stores consists of changing the implementation of the I/O manager. The asset definitions, which only contain transformational logic, won't need to change.
In the following example, a Snowflake I/O manager replaces the DuckDB I/O manager.
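A sketch of this swap, assuming the SnowflakePandasIOManager from the dagster-snowflake-pandas package and hypothetical connection details supplied via environment variables:

```python
import pandas as pd
from dagster_snowflake_pandas import SnowflakePandasIOManager

import dagster as dg


@dg.asset
def raw_sales_data() -> pd.DataFrame:
    return pd.read_csv("sales.csv")  # hypothetical source file


@dg.asset
def clean_sales_data(raw_sales_data: pd.DataFrame) -> pd.DataFrame:
    return raw_sales_data.dropna()


# Only the resource binding changes; the asset bodies stay the same
defs = dg.Definitions(
    assets=[raw_sales_data, clean_sales_data],
    resources={
        "io_manager": SnowflakePandasIOManager(
            account=dg.EnvVar("SNOWFLAKE_ACCOUNT"),
            user=dg.EnvVar("SNOWFLAKE_USER"),
            password=dg.EnvVar("SNOWFLAKE_PASSWORD"),
            database="SALES",  # hypothetical database name
            warehouse="COMPUTE_WH",  # hypothetical warehouse name
        )
    },
)
```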
Built-in I/O managers
Dagster offers built-in I/O manager implementations for popular data stores and in-memory formats.
| Name | Description |
| --- | --- |
| FilesystemIOManager | Default I/O manager. Stores outputs as pickle files on the local file system. |
| InMemoryIOManager | Stores outputs in memory. Primarily useful for unit testing. |
| S3PickleIOManager | Stores outputs as pickle files in Amazon Web Services S3. |
| ConfigurablePickledObjectADLS2IOManager | Stores outputs as pickle files in Azure ADLS2. |
| GCSPickleIOManager | Stores outputs as pickle files in Google Cloud Platform GCS. |
| BigQueryPandasIOManager | Stores Pandas DataFrame outputs in Google Cloud Platform BigQuery. |
| BigQueryPySparkIOManager | Stores PySpark DataFrame outputs in Google Cloud Platform BigQuery. |
| SnowflakePandasIOManager | Stores Pandas DataFrame outputs in Snowflake. |
| SnowflakePySparkIOManager | Stores PySpark DataFrame outputs in Snowflake. |
| DuckDBPandasIOManager | Stores Pandas DataFrame outputs in DuckDB. |
| DuckDBPySparkIOManager | Stores PySpark DataFrame outputs in DuckDB. |
| DuckDBPolarsIOManager | Stores Polars DataFrame outputs in DuckDB. |
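As an illustration, here is a minimal sketch that wires the default FilesystemIOManager into a project with a custom base directory; the my_numbers asset and the /tmp/dagster_storage path are hypothetical:

```python
import dagster as dg


@dg.asset
def my_numbers() -> list[int]:
    # The returned value is stored as a pickle file under the configured base_dir
    return [1, 2, 3]


defs = dg.Definitions(
    assets=[my_numbers],
    resources={"io_manager": dg.FilesystemIOManager(base_dir="/tmp/dagster_storage")},
)
```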
Next steps
- Learn to connect databases with resources
- Learn to connect APIs with resources