Dagster & AWS Glue

The dagster-aws integration library provides the PipesGlueClient resource, which launches AWS Glue jobs directly from Dagster assets and ops. You can pass parameters to the Glue code, and Dagster receives real-time events from the launched job, such as logs, asset checks, and asset materializations. The job script itself needs only minimal code changes.

Installation

pip install dagster-aws

Examples

import boto3
from dagster_aws.pipes import (
    PipesGlueClient,
    PipesS3ContextInjector,
    PipesS3MessageReader,
)

import dagster as dg


@dg.asset
def glue_pipes_asset(
    context: dg.AssetExecutionContext, pipes_glue_client: PipesGlueClient
):
    return pipes_glue_client.run(
        context=context,
        job_name="Example Job",
        arguments={"some_parameter_value": "1"},
    ).get_materialize_result()


defs = dg.Definitions(
    assets=[glue_pipes_asset],
    resources={
        "pipes_glue_client": PipesGlueClient(
            client=boto3.client("glue", region_name="us-east-1"),
            context_injector=PipesS3ContextInjector(
                client=boto3.client("s3"),
                bucket="my-bucket",
            ),
            message_reader=PipesS3MessageReader(
                client=boto3.client("s3"), bucket="my-bucket"
            ),
        )
    },
)

About AWS Glue

AWS Glue is a fully managed cloud service for discovering, preparing, and integrating data for analytics, machine learning, and application development. It supports a wide range of data sources and formats and integrates with other AWS services. AWS Glue provides the tools to create, run, and manage ETL (Extract, Transform, Load) jobs, which makes complex data workflows easier to handle. Because it is serverless, it scales without any infrastructure to manage, which suits data engineers and analysts who need to process and prepare data efficiently.