Fix OTel metrics lost in forked task subprocesses #64692
The OTel Python SDK uses a `Once()` guard on `set_meter_provider()` that only allows it to succeed once per process. When Airflow forks a subprocess for task execution, the child inherits the parent's `Once._done = True` state, so the Stats re-initialization (which correctly detects the PID mismatch) silently fails to set a new MeterProvider. The child ends up with the parent's stale provider whose export thread is dead after fork, causing task-level metrics like `ti.finish` to never reach the OTel collector. The fix resets the SDK's provider guard in `get_otel_logger()` before calling `set_meter_provider()`. Since `stats.py` only calls the factory after detecting a PID mismatch, this reset only runs in forked children that need a fresh provider. Closes: apache#64690
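The set-once behaviour described above can be illustrated with a small model. This is a sketch only: the `Once` class and the module-level `_SET_ONCE` / `_PROVIDER` names below are simplified stand-ins written for this example, not the SDK's actual code, and the fork is simulated by calling the setter a second time (a forked child starts from a copy of the parent's memory, so it sees the same `_done = True`).

```python
import threading


class Once:
    """Simplified stand-in for a run-once guard; not the SDK's actual class."""

    def __init__(self):
        self._lock = threading.Lock()
        self._done = False

    def do_once(self, func):
        # Run func only on the first call and report whether it ran.
        with self._lock:
            if self._done:
                return False
            func()
            self._done = True
            return True


# Module-level state: fork() copies it into the child unchanged.
_SET_ONCE = Once()
_PROVIDER = None


def set_meter_provider(provider):
    """Model of a set-once global: every call after the first is ignored."""

    def _set():
        global _PROVIDER
        _PROVIDER = provider

    return _SET_ONCE.do_once(_set)


def get_meter_provider():
    return _PROVIDER


# Parent process: the first call wins.
parent_ok = set_meter_provider("parent-provider")

# Simulated forked child: _done is already True in the inherited state,
# so the attempt to install a fresh provider does nothing.
child_ok = set_meter_provider("child-provider")

print(parent_ok, child_ok, get_meter_provider())
# prints: True False parent-provider
```

In this model the second call fails without raising, which matches the "silently fails" behaviour reported above: the caller has no exception to react to and keeps the stale provider.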
@MichaelRBlack This PR has a few issues that need to be addressed before it can be reviewed — please see our Pull Request quality criteria. Issues found:
What to do next:
There is no rush: take your time and work at your own pace. We appreciate your contribution and are happy to wait for updates. If you have questions, feel free to ask on the Airflow Slack. Note: This comment was drafted by an AI-assisted triage tool and may contain mistakes. Once you have addressed the points above, an Apache Airflow maintainer (a real person) will take the next look at your PR. We use this two-stage triage process so that our maintainers' limited time is spent where it matters most: the conversation with you.
Closing this PR: the fix was already merged via #64703. Thanks!
Summary
Task-level OTel metrics (e.g. `ti.finish`) are silently dropped in forked task subprocesses because the OTel Python SDK's `Once()` guard on `set_meter_provider()` survives `fork()`.

Root cause: `stats.py` correctly detects PID mismatches after fork and calls `otel_logger.get_otel_logger()` to re-initialize. This creates a fresh `MeterProvider` and calls `metrics.set_meter_provider()`, but the SDK's `_METER_PROVIDER_SET_ONCE._done = True` flag inherited from the parent blocks the call. The child ends up with the parent's stale provider whose `PeriodicExportingMetricReader` background thread is dead after fork.

Fix: Reset the SDK's provider state in `get_otel_logger()` before calling `set_meter_provider()`. Since `stats.py` only calls the factory after detecting a PID mismatch, this reset only runs in forked children that need a fresh provider.

Closes #64690
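The shape of the fix (reset the guard, then set a new provider, and do so only when the PID has changed) might look roughly like the sketch below. All names here (`ProviderRegistry`, `Stats`, `ensure_provider`, the `current_pid` parameter) are hypothetical and exist only for illustration; they are not Airflow's or the SDK's actual API, and the registry is a stand-in for the SDK's module-level provider state.

```python
import os


class ProviderRegistry:
    """Hypothetical stand-in for the SDK's set-once provider state."""

    def __init__(self):
        self.done = False
        self.provider = None

    def set_provider(self, provider):
        # Set-once semantics: ignore every call after the first.
        if self.done:
            return False
        self.provider = provider
        self.done = True
        return True

    def reset(self):
        # Clear the guard so the next set_provider() call succeeds.
        self.done = False
        self.provider = None


class Stats:
    """Hypothetical caller that re-initializes only after a PID change."""

    def __init__(self, registry):
        self.registry = registry
        self.owner_pid = None

    def ensure_provider(self, make_provider, current_pid=None):
        # current_pid is injectable so the example runs without forking.
        pid = os.getpid() if current_pid is None else current_pid
        if pid == self.owner_pid:
            # Same process: keep the existing provider untouched.
            return self.registry.provider
        # New process (first use, or a forked child): drop the inherited
        # guard state, then install a provider owned by this process.
        self.registry.reset()
        self.registry.set_provider(make_provider(pid))
        self.owner_pid = pid
        return self.registry.provider


registry = ProviderRegistry()
stats = Stats(registry)

first = stats.ensure_provider(lambda pid: f"provider-{pid}", current_pid=100)
same = stats.ensure_provider(lambda pid: "unused", current_pid=100)
child = stats.ensure_provider(lambda pid: f"provider-{pid}", current_pid=200)
print(first, same, child)
# prints: provider-100 provider-100 provider-200
```

Gating the reset on the PID check keeps the set-once guarantee intact within a single process, so only a process that has never installed its own provider clears the guard.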
Test plan
- `Once._done = True` (forked child state) and verifies `get_otel_logger()` successfully sets a new `MeterProvider`
- `ti.finish` metrics appear in Grafana

🤖 Generated with Claude Code