Dagster & GCP Dataproc
This integration lets you work with Google Cloud Platform's Dataproc service directly from Dagster. You can create, manage, and delete Dataproc clusters, and submit and monitor jobs that run on them.
Installation
pip install dagster-gcp
Examples
import dagster as dg
from dagster_gcp import DataprocResource

# Resource describing the Dataproc cluster that jobs are submitted to.
dataproc_resource = DataprocResource(
    project_id="your-gcp-project-id",
    region="your-gcp-region",
    cluster_name="your-cluster-name",
    cluster_config_yaml_path="path/to/your/cluster/config.yaml",
)


@dg.asset
def my_dataproc_asset(dataproc: DataprocResource):
    client = dataproc.get_client()
    # Minimal payload: only the cluster placement is set here. A real job
    # also needs a job-type entry describing what to run.
    job_details = {
        "job": {
            "placement": {"clusterName": dataproc.cluster_name},
        }
    }
    client.submit_job(job_details)


defs = dg.Definitions(
    assets=[my_dataproc_asset], resources={"dataproc": dataproc_resource}
)
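The job payload in the example above only sets the cluster placement. As a minimal sketch, assuming the job is a PySpark script stored in Cloud Storage, a small helper could assemble a fuller payload. The field names (`placement`, `pysparkJob`, `mainPythonFileUri`, `args`) follow the Job resource of the Dataproc REST API. The helper name `build_pyspark_job_details`, the bucket path, and the arguments are hypothetical placeholders and are not part of dagster-gcp.

```python
from typing import Any, Dict, List, Optional


def build_pyspark_job_details(
    cluster_name: str,
    main_python_file_uri: str,
    args: Optional[List[str]] = None,
) -> Dict[str, Any]:
    """Build a job payload for a PySpark job (hypothetical helper)."""
    return {
        "job": {
            # Which cluster the job should run on.
            "placement": {"clusterName": cluster_name},
            # Job-type entry: the script to run and its command-line arguments.
            "pysparkJob": {
                "mainPythonFileUri": main_python_file_uri,
                "args": list(args or []),
            },
        }
    }


# Placeholder values; replace them with your own cluster and script location.
job_details = build_pyspark_job_details(
    cluster_name="your-cluster-name",
    main_python_file_uri="gs://your-bucket/path/to/job.py",
    args=["--date", "2024-01-01"],
)
```

Inside the asset, the resulting dictionary could be passed to `client.submit_job(job_details)` in place of the hand-written one, with `dataproc.cluster_name` supplied as the cluster name.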
About Google Cloud Platform Dataproc
Google Cloud Platform's Dataproc is a fully managed and highly scalable service for running Apache Spark, Apache Hadoop, and other open source data processing frameworks. Dataproc simplifies the process of setting up and managing clusters, allowing you to focus on your data processing tasks without worrying about the underlying infrastructure. With Dataproc, you can quickly create clusters, submit jobs, and monitor their progress, all while benefiting from the scalability and reliability of Google Cloud Platform.