Add logging to detect try number race #62703

Merged
ephraimbuddy merged 2 commits into apache:main from astronomer:add-logging-to-detect-try-number-race on Mar 3, 2026

Conversation

@ephraimbuddy

This adds more logging at select places where a try_number mismatch could happen, which should help us detect and fix the issue.

Related: #57618


Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
    GPT-5.3-codex

@boring-cyborg boring-cyborg Bot added the area:Scheduler including HA (high availability) scheduler label Mar 2, 2026
@ephraimbuddy ephraimbuddy requested a review from Copilot March 2, 2026 12:41

Copilot AI left a comment

Pull request overview

This PR adds targeted logging (and unit tests) to help detect try_number mismatches/races in the scheduler flow, particularly around TI scheduling and executor event processing (related to #57618).

Changes:

  • Add a debug-gated post-update DB read in DagRun.schedule_tis() to warn when the persisted try_number differs from the expected value.
  • Add additional scheduler logs/warnings around queueing workloads and handling executor events with mismatched/multiple try_numbers.
  • Add/extend unit tests to assert the new warnings/logging behavior.
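
The debug-gated post-update read described in the first bullet can be pictured roughly as follows. This is only a minimal sketch, not the code from this PR: the `task_instance` table layout, the `verify_try_numbers` helper, and the `try_number_check` logger name are all assumptions made for illustration, and plain `sqlite3` stands in for Airflow's SQLAlchemy session.

```python
import logging
import sqlite3

log = logging.getLogger("try_number_check")  # hypothetical logger name


def verify_try_numbers(conn, expected):
    """Return (ti_id, expected, persisted) for rows whose try_number differs.

    `expected` maps a task-instance id to the try_number the caller just wrote.
    The extra SELECT only runs when DEBUG logging is enabled, so the normal
    scheduling path pays no additional query cost.
    """
    if not expected or not log.isEnabledFor(logging.DEBUG):
        return []
    placeholders = ",".join("?" for _ in expected)
    rows = conn.execute(
        f"SELECT id, try_number FROM task_instance "
        f"WHERE id IN ({placeholders}) ORDER BY id",
        list(expected),
    ).fetchall()
    mismatches = [
        (ti_id, expected[ti_id], persisted)
        for ti_id, persisted in rows
        if persisted != expected[ti_id]
    ]
    for ti_id, want, got in mismatches:
        # A mismatch here suggests another writer touched the row in between.
        log.warning(
            "try_number mismatch for TI %s: expected %s, persisted %s",
            ti_id, want, got,
        )
    return mismatches
```

Gating on `log.isEnabledFor(logging.DEBUG)` keeps the diagnostic read out of production hot paths unless an operator opts in, while the mismatch itself is still reported at WARNING level so it stands out once the check is active.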

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| airflow-core/src/airflow/models/dagrun.py | Adds debug-gated DB verification and warning logging for try_number mismatches after scheduling. |
| airflow-core/src/airflow/jobs/scheduler_job_runner.py | Adds more context-rich logs for queueing/scheduling and warnings for executor events with conflicting try_numbers. |
| airflow-core/tests/unit/models/test_dagrun.py | Adds tests validating warning behavior for schedule_tis() try-number mismatch checks. |
| airflow-core/tests/unit/jobs/test_scheduler_job.py | Extends/adds tests asserting new scheduler warnings via caplog. |
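
As a rough illustration of the "executor events with conflicting try_numbers" warning and of a caplog-based assertion, here is a small sketch. It does not reproduce the scheduler code or the tests in this PR: `TIKey`, `find_conflicting_try_numbers`, and the `executor_events` logger name are hypothetical stand-ins, and the event buffer is simplified to a plain dict keyed by task-instance key. Only `caplog.at_level` and `caplog.text` are real pytest fixture APIs.

```python
import logging
from collections import defaultdict
from typing import NamedTuple

log = logging.getLogger("executor_events")  # hypothetical logger name


class TIKey(NamedTuple):
    """Simplified, hypothetical stand-in for a task-instance key."""
    dag_id: str
    task_id: str
    run_id: str
    try_number: int
    map_index: int = -1


def find_conflicting_try_numbers(event_buffer):
    """Group events by task instance and warn when several try_numbers appear."""
    seen = defaultdict(set)
    for key in event_buffer:
        ti = (key.dag_id, key.task_id, key.run_id, key.map_index)
        seen[ti].add(key.try_number)
    conflicts = {ti: sorted(tries) for ti, tries in seen.items() if len(tries) > 1}
    for ti, tries in conflicts.items():
        log.warning(
            "Executor reported multiple try_numbers %s for task instance %s",
            tries, ti,
        )
    return conflicts


def test_warns_on_conflicting_try_numbers(caplog):
    # Run under pytest; `caplog` is pytest's built-in log-capture fixture.
    events = {
        TIKey("dag", "task", "run", 1): "success",
        TIKey("dag", "task", "run", 2): "failed",
    }
    with caplog.at_level(logging.WARNING, logger="executor_events"):
        conflicts = find_conflicting_try_numbers(events)
    assert conflicts == {("dag", "task", "run", -1): [1, 2]}
    assert "multiple try_numbers" in caplog.text
```

Asserting on both the return value and the captured log text keeps the test meaningful even if the message wording changes slightly, since the structured result still pins down which task instance was flagged.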

Comment thread airflow-core/tests/unit/models/test_dagrun.py
Comment thread airflow-core/tests/unit/models/test_dagrun.py
Comment thread airflow-core/tests/unit/models/test_dagrun.py
Comment thread airflow-core/src/airflow/jobs/scheduler_job_runner.py
@ephraimbuddy ephraimbuddy added the type:misc/internal Changelog: Misc changes that should appear in change log label Mar 2, 2026
@ephraimbuddy ephraimbuddy added this to the Airflow 3.1.8 milestone Mar 2, 2026
@ephraimbuddy ephraimbuddy force-pushed the add-logging-to-detect-try-number-race branch from 2c3047e to 3bfd673 Compare March 3, 2026 13:21
This adds more logging to select places that try_number mismatch
could happen and would help us detect and fix the issue.

Related: apache#57618
@ephraimbuddy ephraimbuddy force-pushed the add-logging-to-detect-try-number-race branch from a7a5e17 to e0431ad Compare March 3, 2026 16:12
@ephraimbuddy ephraimbuddy merged commit 95784d9 into apache:main Mar 3, 2026
72 checks passed
@ephraimbuddy ephraimbuddy deleted the add-logging-to-detect-try-number-race branch March 3, 2026 20:21
@github-actions

github-actions Bot commented Mar 3, 2026

Backport failed to create: v3-1-test. View the failure log Run details

Note: As of "Merging PRs targeted for Airflow 3.X", the committer who merges the PR is responsible for backporting the PRs that are bug fixes (generally speaking) to the maintenance branches.

In case of doubt, please ask in the #release-management Slack channel.

Status Branch Result
v3-1-test Commit Link

You can attempt to backport this manually by running:

cherry_picker 95784d9 v3-1-test

This should apply the commit to the v3-1-test branch and leave the commit in conflict state marking
the files that need manual conflict resolution.

After you have resolved the conflicts, you can continue the backport process by running:

cherry_picker --continue

If you don't have cherry-picker installed, see the installation guide.

ephraimbuddy added a commit that referenced this pull request Mar 3, 2026
* Log try_number mismatches during TI scheduling for HA race diagnosis

This adds more logging to select places that try_number mismatch
could happen and would help us detect and fix the issue.

Related: #57618

* Add tests

(cherry picked from commit 95784d9)
@vatsrahul1001 vatsrahul1001 removed this from the Airflow 3.1.8 milestone Mar 4, 2026
vatsrahul1001 pushed a commit that referenced this pull request Mar 4, 2026
* Add logging to detect try number race (#62703)

* Log try_number mismatches during TI scheduling for HA race diagnosis

This adds more logging to select places that try_number mismatch
could happen and would help us detect and fix the issue.

Related: #57618

* Add tests

(cherry picked from commit 95784d9)

* fixup! Add logging to detect try number race (#62703)

* fixup! fixup! Add logging to detect try number race (#62703)
vatsrahul1001 pushed a commit that referenced this pull request Mar 4, 2026
* Add logging to detect try number race (#62703)

* Log try_number mismatches during TI scheduling for HA race diagnosis

This adds more logging to select places that try_number mismatch
could happen and would help us detect and fix the issue.

Related: #57618

* Add tests

(cherry picked from commit 95784d9)

* fixup! Add logging to detect try number race (#62703)

* fixup! fixup! Add logging to detect try number race (#62703)
dominikhei pushed a commit to dominikhei/airflow that referenced this pull request Mar 11, 2026
* Log try_number mismatches during TI scheduling for HA race diagnosis

This adds more logging to select places that try_number mismatch
could happen and would help us detect and fix the issue.

Related: apache#57618

* Add tests
Labels

area:Scheduler including HA (high availability) scheduler type:misc/internal Changelog: Misc changes that should appear in change log

5 participants