Creating domain-specific languages with asset factories
Often in data engineering, you'll find yourself needing to create a large number of similar assets. For example:
- A set of database tables all have the same schema
- A set of files in a directory all have the same format
It's also possible you're serving stakeholders who aren't familiar with Python or Dagster. They may prefer interacting with assets using a domain-specific language (DSL) built on top of a configuration language such as YAML.
The asset factory pattern can solve both of these problems.
Prerequisites
Building an asset factory in Python
Let's imagine a team that often has to perform the same repetitive ETL task: download a CSV file from S3, run a basic SQL query on it, and then upload the result as a new file back to S3.
To automate this process, you might define an asset factory in Python like the following:
The asset factory pattern is essentially a function that takes in some configuration and returns dg.Definitions.
Configuring an asset factory with YAML
Now, the team wants to be able to configure the asset factory using YAML instead of Python, with a file like this:
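As an illustration, such a file might look like the sketch below. The key names (`aws`, `etl_jobs`, `bucket`, `source`, `target`, `sql`), the bucket and object names, and the queries are all hypothetical; the credentials are placeholders.

```yaml
aws:
  access_key_id: "YOUR_ACCESS_KEY_ID"
  secret_access_key: "YOUR_SECRET_ACCESS_KEY"

etl_jobs:
  - bucket: my_bucket
    source: raw_transactions.csv
    target: cleaned_transactions.csv
    sql: "SELECT * FROM source WHERE amount IS NOT NULL"
  - bucket: my_bucket
    source: all_customers.csv
    target: risky_customers.csv
    sql: "SELECT * FROM source WHERE risk_score > 0.8"
```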
To implement this, parse the YAML file and use it to create the S3 resource and ETL jobs:
Improving usability with Pydantic and Jinja
There are a few problems with the current approach:
- The YAML file isn't type-checked, so it's easy to make mistakes that will cause cryptic KeyErrors.
- The YAML file contains secrets. Instead, it should reference environment variables.
To solve these problems, you can use Pydantic to define a schema for the YAML file and Jinja to template the YAML file with environment variables.
Here's what the new YAML file might look like. Note how Jinja templating is used to reference environment variables:
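A sketch of such a file follows. It assumes the template is rendered with a variable named `env` that holds the process environment; that name, the key names, and the environment variable names are illustrative choices.

```yaml
aws:
  access_key_id: "{{ env.AWS_ACCESS_KEY_ID }}"
  secret_access_key: "{{ env.AWS_SECRET_ACCESS_KEY }}"

etl_jobs:
  - bucket: my_bucket
    source: raw_transactions.csv
    target: cleaned_transactions.csv
    sql: "SELECT * FROM source WHERE amount IS NOT NULL"
  - bucket: my_bucket
    source: all_customers.csv
    target: risky_customers.csv
    sql: "SELECT * FROM source WHERE risk_score > 0.8"
```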
And the Python implementation:
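The templating and validation steps could be sketched as below. The model names (`AwsConfig`, `JobConfig`, `EtlJobsConfig`), the `load_config` helper, and the field names are assumptions for illustration. The sketch stops at a validated config object and leaves out the hand-off to the asset factory.

```python
import os
from typing import Mapping

import jinja2
import pydantic
import yaml


class AwsConfig(pydantic.BaseModel):
    access_key_id: str
    secret_access_key: str


class JobConfig(pydantic.BaseModel):
    bucket: str
    source: str
    target: str
    sql: str


class EtlJobsConfig(pydantic.BaseModel):
    aws: AwsConfig
    etl_jobs: list[JobConfig]


def load_config(
    yaml_template: str, env: Mapping[str, str] = os.environ
) -> EtlJobsConfig:
    """Render the Jinja template, parse the YAML, and validate the result."""
    # StrictUndefined turns a reference to a missing environment variable
    # into an error instead of silently rendering an empty string.
    jinja_env = jinja2.Environment(undefined=jinja2.StrictUndefined)
    rendered = jinja_env.from_string(yaml_template).render(env=env)
    # Pydantic raises a ValidationError naming any missing or mistyped field.
    return EtlJobsConfig(**yaml.safe_load(rendered))
```

Given a validated `EtlJobsConfig`, each `JobConfig` entry can be passed to the asset factory. A typo in the YAML now produces a validation error that names the offending field rather than a bare `KeyError`.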
Next steps
TODO