Fix OTel metrics lost in forked task subprocesses #64692

Closed
MichaelRBlack wants to merge 1 commit into apache:main from MichaelRBlack:fix/otel-fork-meter-provider-reset


@MichaelRBlack

Contributor

Summary

Task-level OTel metrics (e.g. ti.finish) are silently dropped in forked task subprocesses because the OTel Python SDK's Once() guard on set_meter_provider() survives fork().

Root cause: stats.py correctly detects PID mismatches after fork and calls otel_logger.get_otel_logger() to re-initialize. This creates a fresh MeterProvider and calls metrics.set_meter_provider(), but the SDK's _METER_PROVIDER_SET_ONCE._done = True flag inherited from the parent blocks the call. The child ends up with the parent's stale provider whose PeriodicExportingMetricReader background thread is dead after fork.
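
The guard behavior described above can be modeled without the SDK. The sketch below is a simplified stand-in, not the SDK's actual code: `Once`, `ProviderRegistry`, and the string providers are hypothetical names chosen for illustration. It shows why a second registration is refused once the flag is set, which is the state a forked child starts from.

```python
import threading


class Once:
    """Simplified stand-in for a one-shot guard (not the SDK's real class)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = False

    def do_once(self, func):
        # Run func only on the first call; report whether it ran.
        with self._lock:
            if self._done:
                return False
            func()
            self._done = True
            return True


class ProviderRegistry:
    """Hypothetical model of module-level provider state."""

    def __init__(self):
        self.guard = Once()
        self.provider = None

    def set_provider(self, provider):
        def _set():
            self.provider = provider

        return self.guard.do_once(_set)


registry = ProviderRegistry()

# Parent process: the first registration succeeds and flips the flag.
parent_ok = registry.set_provider("parent-provider")

# fork() copies process memory, so a child would start from exactly this
# state. Its attempt to register a fresh provider is therefore refused.
child_ok = registry.set_provider("child-provider")

print(parent_ok, child_ok, registry.provider)
```

In this model the refused call leaves the old provider in place, which mirrors the stale-provider symptom reported in this PR.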

Fix: Reset the SDK's provider state in get_otel_logger() before calling set_meter_provider(). Since stats.py only calls the factory after detecting a PID mismatch, this reset only runs in forked children that need a fresh provider.

Closes #64690

Test plan

  • Added unit test that simulates Once._done = True (forked child state) and verifies get_otel_logger() successfully sets a new MeterProvider
  • Manual: Deploy and confirm ti.finish metrics appear in Grafana
  • Manual: Confirm "Overriding of current MeterProvider is not allowed" warning no longer appears in task logs

🤖 Generated with Claude Code

The OTel Python SDK uses a `Once()` guard on `set_meter_provider()` that
only allows it to succeed once per process. When Airflow forks a subprocess
for task execution, the child inherits the parent's `Once._done = True`
state, so the Stats re-initialization (which correctly detects the PID
mismatch) silently fails to set a new MeterProvider. The child ends up
with the parent's stale provider whose export thread is dead after fork,
causing task-level metrics like `ti.finish` to never reach the OTel
collector.

The fix resets the SDK's provider guard in `get_otel_logger()` before
calling `set_meter_provider()`. Since `stats.py` only calls the factory
after detecting a PID mismatch, this reset only runs in forked children
that need a fresh provider.

Closes: apache#64690
@potiuk

potiuk commented Apr 6, 2026

Member

@MichaelRBlack This PR has a few issues that need to be addressed before it can be reviewed — please see our Pull Request quality criteria.

Issues found:

  • Merge conflicts: This PR has merge conflicts with the main branch. Your branch is 406 commits behind main. Please rebase your branch (git fetch origin && git rebase origin/main), resolve the conflicts, and push again. See contributing quick start.

Note: Your branch is 406 commits behind main. Some check failures may be caused by changes in the base branch rather than by your PR. Please rebase your branch and push again to get up-to-date CI results.

What to do next:

  • The comment informs you what you need to do.
  • Fix each issue, then mark the PR as "Ready for review" in the GitHub UI - but only after making sure that all the issues are fixed.
  • There is no rush — take your time and work at your own pace. We appreciate your contribution and are happy to wait for updates.
  • Maintainers will then proceed with a normal review.

There is no rush — take your time and work at your own pace. We appreciate your contribution and are happy to wait for updates. If you have questions, feel free to ask on the Airflow Slack.


Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer — a real person — will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.

@eladkal added this to the Airflow 3.2.1 milestone on Apr 7, 2026

@MichaelRBlack

Contributor Author

Closing this PR — the fix was already merged via #64703. Thanks!

Development

Successfully merging this pull request may close these issues.

OTel task-level metrics (ti.finish, ti.start) lost — forked processes and KubernetesExecutor

3 participants