Add optional OTel service to the Airflow Helm Chart #64902
jason810496
left a comment
I didn’t look into this thoroughly, but there might be some concerns based on a high-level overview.
Just FYI that we had a discussion about whether to use Kustomize for this kind of optional feature for better long-term maintainability in https://apache-airflow.slack.com/archives/C027H098M1C/p1770794021001679.
Though we haven’t settled on the release process and the concrete structure if we go with the Kustomize approach.
@jason810496 Thank you, I wasn't aware of this. Airflow needs to talk directly to the But the 3 observability backends don't need to interact with Airflow. They are very good example candidates for As I understand from the Slack discussion, there was a consensus on using After this PR, I would like to add integration tests that use OTel and the backends. I don't think setting them up via I can move forward with the changes.
jscheffl
left a comment
I do not think it is a good idea to add more components to the Airflow chart. There are better, well-known charts for Prometheus/Grafana and the like. You should reference them rather than adding more complexity to ours.
@jscheffl What about the |
Miretpl
left a comment
I only checked the Helm chart-related part. I would recommend splitting it from this PR, too.
Additionally, though not mentioned in the comments, the whole Helm chart part adds no tests, when it should.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds OpenTelemetry-driven observability (collector + optional backends) to the dev/Kubernetes workflow and Helm chart configuration, with flags to enable traces and/or metrics.
Changes:
- Extend Helm chart values/schema and templates to support an optional OpenTelemetry Collector and Airflow OTel configuration.
- Add CI/dev Kubernetes manifests for Jaeger/Prometheus/Grafana and expose them via NodePorts in kind.
- Update Breeze and test utilities to manage additional forwarded ports and to deploy observability backends based on --set flags.
Reviewed changes
Copilot reviewed 17 out of 17 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| scripts/ci/prek/lint_json_schema.py | Update YAML loading/validation to support multi-document YAML. |
| scripts/ci/kubernetes/observability/kustomization.yaml | Add kustomize entrypoint for observability manifests. |
| scripts/ci/kubernetes/observability/jaeger.yaml | Add Jaeger all-in-one Deployment/Service for dev/CI. |
| scripts/ci/kubernetes/observability/prometheus.yaml | Add Prometheus config + Deployment/Service for dev/CI scraping. |
| scripts/ci/kubernetes/observability/grafana.yaml | Add Grafana provisioning + Deployment/Service for dev/CI dashboards. |
| scripts/ci/kubernetes/nodeport.yaml | Expose Jaeger/Prometheus/Grafana via NodePorts. |
| scripts/ci/kubernetes/kind-cluster-conf.yaml | Map additional NodePorts to localhost via kind port mappings. |
| kubernetes-tests/tests/kubernetes_tests/test_base.py | Switch API server port env var name used by k8s tests. |
| dev/breeze/tests/test_kubernetes_commands.py | Add unit tests for parsing OTel --set flags. |
| dev/breeze/src/airflow_breeze/utils/kubernetes_utils.py | Allocate/propagate new forwarded ports and print backend URLs. |
| dev/breeze/src/airflow_breeze/commands/kubernetes_commands.py | Apply observability manifests after deploy based on parsed OTel flags. |
| chart/values.yaml | Add OTel collector values and wire OTel vs statsd settings in airflow.cfg. |
| chart/values.schema.json | Extend JSON schema for OTel collector image and ports. |
| chart/templates/otel-collector/otel-collector-service.yaml | Add Service for the optional OTel collector. |
| chart/templates/otel-collector/otel-collector-deployment.yaml | Add Deployment for the optional OTel collector. |
| chart/templates/configmaps/otel-collector-configmap.yaml | Add OTel collector config (receivers/exporters/pipelines). |
| chart/templates/_helpers.yaml | Add OTel env vars and helper for OTel collector image string. |
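The flag-driven deployment described in the summary above (Breeze parsing OTel --set flags and deploying backends accordingly) could look roughly like the following. This is only a sketch, not the PR's code: the value keys `otel.tracesEnabled` and `otel.metricsEnabled`, the function names, and the mapping of traces to Jaeger and metrics to Prometheus/Grafana are all assumptions made for illustration. The defaults mirror the PR description (traces on, metrics off). Escaped commas inside a --set value are not handled.

```python
from __future__ import annotations

# Hypothetical value keys, chosen for illustration; the real chart keys may differ.
TRACES_KEY = "otel.tracesEnabled"
METRICS_KEY = "otel.metricsEnabled"


def parse_set_flags(extra_options: list[str]) -> dict[str, str]:
    """Collect key=value pairs from `--set k=v` and `--set=k=v` arguments."""
    values: dict[str, str] = {}
    i = 0
    while i < len(extra_options):
        arg = extra_options[i]
        if arg == "--set" and i + 1 < len(extra_options):
            payload = extra_options[i + 1]
            i += 2
        elif arg.startswith("--set="):
            payload = arg[len("--set="):]
            i += 1
        else:
            # Not a --set flag: skip it.
            i += 1
            continue
        # A single --set value may carry several comma-separated pairs.
        for pair in payload.split(","):
            key, sep, value = pair.partition("=")
            if sep:
                values[key.strip()] = value.strip()
    return values


def backends_to_deploy(extra_options: list[str]) -> set[str]:
    """Decide which backends to deploy (assumed mapping, see lead-in)."""
    values = parse_set_flags(extra_options)
    traces = values.get(TRACES_KEY, "true").lower() == "true"
    metrics = values.get(METRICS_KEY, "false").lower() == "true"
    backends: set[str] = set()
    if traces:
        backends.add("jaeger")
    if metrics:
        backends.update({"prometheus", "grafana"})
    return backends
```

Under these assumptions, passing `["--set", "otel.metricsEnabled=true"]` would select all three backends, while passing no flags would select only the tracing backend.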
d1da7d0 to 89f2712
@Miretpl Thank you for the review! I removed all the I'm going to address your comments and also add tests. |
Green CI.
@jscheffl I can look into the backport. |
Backport failed to create: chart/v1-2x-test. View the failure log. Run details.
Note: As of Merging PRs targeted for Airflow 3.X In matter of doubt please ask in #release-management Slack channel.
You can attempt to backport this manually by running:
cherry_picker 535e3cc chart/v1-2x-test
This should apply the commit to the chart/v1-2x-test branch and leave the commit in conflict state marking
After you have resolved the conflicts, you can continue the backport process by running:
cherry_picker --continue
If you don't have cherry-picker installed, see the installation guide.
@xBis7 no problem. Feel free to mention me when the backport is ready |
* add otel to helm chart
* use Kustomize for grafana, jaeger, prometheus
* enable specific service per flag + unit test
* remove grafana, jaeger and prometheus kustomization logic
* traces enabled and metrics disabled, by default
* remove otelCollector.enabled flag
* add statsd comments about otel metrics overriding the config
* make OTEL_METRIC_EXPORT_INTERVAL configurable and provide default value + entry in the values.schema.json
* remove hardcoded value for metrics otel_port in values.yaml
* add option to override the configmap
* add otelCollector.args and make the config.yml file as the default argument
* rename extraAnnotations to annotations in otel-collector-service.yaml
* parameterize the readiness and liveness probe values
* remove prometheus from the configmap
* update the default value for OTEL_TRACES_EXPORTER
* fix tests in airflow_aux + otel-collector-serviceaccount.yaml
* fix spellcheck errors in docs
* fix tests in security
* otel collector unit tests + networkpolicy file
* values.schema.json cleanup
* add a minimum to all integer configs in values.schema.json
* fix heading comments
* change config default to ~ from empty string
* fix static check error
This patch adds an otel-collector to the Helm chart.

I've added 2 separate flags for enabling traces and metrics. OTel is the only supported backend for traces, so the traces flag is enabled by default. That's not the case with metrics, which need to be manually enabled. When the user enables the OTel metrics, statsd is disabled in the Airflow config so that OTel is used instead.

Was generative AI tooling used to co-author this PR?
Claude Sonnet 4.6 Extended
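
The flag behaviour described in this PR description (traces on by default, metrics opt-in, and statsd switched off once OTel metrics are enabled) can be sketched as a small resolution function. This is an illustrative model only, not the chart's template logic: the `TelemetryFlags` class, the `resolve_airflow_config` function, and the section and option names in the returned dictionary are assumptions chosen for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TelemetryFlags:
    """Hypothetical user-facing flags; defaults follow the PR description."""

    traces_enabled: bool = True    # OTel is the only traces backend, so on by default
    metrics_enabled: bool = False  # OTel metrics must be enabled manually
    statsd_enabled: bool = True    # assumed pre-existing statsd setting


def resolve_airflow_config(flags: TelemetryFlags) -> dict[str, dict[str, bool]]:
    """Derive illustrative config switches from the flags.

    Enabling OTel metrics overrides statsd, so only one metrics
    backend is active at a time.
    """
    return {
        "traces": {"otel_on": flags.traces_enabled},
        "metrics": {
            "otel_on": flags.metrics_enabled,
            "statsd_on": flags.statsd_enabled and not flags.metrics_enabled,
        },
    }
```

With the defaults, this sketch leaves statsd active and OTel metrics off; setting `metrics_enabled=True` flips both, which is the override the description refers to.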