Tovacinni/fix averager memory leak #131

Open
tovacinni wants to merge 3 commits into hyrex-labs:develop from outerport:tovacinni/fix-averager-memory-leak

@tovacinni

No description provided.

tovacinni and others added 3 commits January 25, 2026 14:05
The executor's three TimeSeriesAverager instances
(num_distinct_queues, refresh_queue_duration, dequeue_duration) were
appending a DataPoint on every executor poll iteration and never pruning.
Because run_round_robin_loop has no sleep when the queue is empty, an
idle worker busy-polls the database hundreds of times per second, and
each of those iterations submits to 2-3 averagers. On a quiet dev
Postgres this leaked roughly 40 MB/min per worker, which at -p 8
produced ~320 MB/min of RSS growth and eventually OOM-killed unrelated
processes on the host (dbus, wireplumber, etc.), which in turn took down
NetworkManager as collateral damage.

Swap data_points for a bounded collections.deque with maxlen=10_000 so
append is O(1), old points are auto-evicted, and worst-case memory is
capped at ~2 MB per averager regardless of poll rate. clear() and
prune_data_older_than() are updated to preserve the deque's maxlen.
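
The bounded-buffer change described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the `DataPoint` fields, the `submit()` method name, and the constructor signature are assumptions, and only `data_points`, `clear()`, `prune_data_older_than()`, `get_time_series()`, and `maxlen=10_000` come from the commit message.

```python
from collections import deque
from dataclasses import dataclass
from typing import List, Optional
import time


@dataclass(frozen=True)
class DataPoint:
    # Hypothetical shape; the real DataPoint fields are not shown in this PR.
    timestamp: float
    value: float


class TimeSeriesAverager:
    """Illustrative stand-in for the averager, not the project's class."""

    def __init__(self, maxlen: int = 10_000) -> None:
        # A bounded deque evicts the oldest point automatically once full.
        self.data_points = deque(maxlen=maxlen)

    def submit(self, value: float, timestamp: Optional[float] = None) -> None:
        # submit() is an assumed name for the per-poll append.
        ts = time.time() if timestamp is None else timestamp
        self.data_points.append(DataPoint(ts, value))  # O(1) append

    def clear(self) -> None:
        # deque.clear() empties in place, so maxlen is preserved.
        self.data_points.clear()

    def prune_data_older_than(self, cutoff: float) -> None:
        # Assumes points arrive in time order, so stale ones sit on the left.
        while self.data_points and self.data_points[0].timestamp < cutoff:
            self.data_points.popleft()

    def get_time_series(self) -> List[DataPoint]:
        return list(self.data_points)
```

Mutating the deque in place matters here: reassigning `data_points` to a plain list or to a fresh `deque()` without `maxlen` would silently remove the bound.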

At ~100 submits/sec the 10,000-point buffer still holds about 100
seconds of stats, which is enough for the existing get_time_series()
consumers.
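
For reference, the retention window of a full buffer is just `maxlen` divided by the submit rate. The helper below is a hypothetical illustration of that arithmetic, not something from the PR.

```python
def retention_seconds(maxlen: int, submits_per_sec: float) -> float:
    """Seconds of history a full bounded buffer covers at a steady rate."""
    return maxlen / submits_per_sec


# With the commit's numbers: 10_000 points at ~100 submits/sec.
print(retention_seconds(10_000, 100))  # 100.0
```

At higher poll rates the window shrinks proportionally, so consumers that need a fixed time span would have to account for that.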

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The WorkerRootProcess message-listener thread was started without
daemon=True, and its body blocks indefinitely on a multiprocessing
Queue.get(). The shutdown path tries to wake it with a put(None)
sentinel, but after executor processes are SIGKILL'd their queue
feeder threads can die mid-send and leave the parent receiver wedged.
When join(timeout=5.0) returns with the thread still alive, the code
emits a warning and continues — but Python won't exit the interpreter
while a non-daemon thread is alive, so the worker process hangs
forever after a Ctrl+C / SIGTERM.

Mark the listener thread as a daemon at creation time so the
interpreter can reap it on exit regardless of queue state. Also
update the misleading comment in stop() that referred to promoting
the thread to a daemon during shutdown — that's impossible once the
thread has started (Thread.daemon setter raises RuntimeError).
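
The thread behaviour described above can be reproduced in isolation. The sketch below is an assumption-laden simplification: it uses a `queue.Queue` instead of a multiprocessing queue, and the `listen` function and variable names are hypothetical rather than taken from WorkerRootProcess.

```python
import queue
import threading


def listen(q: queue.Queue) -> None:
    # Blocks until a message arrives; None is the shutdown sentinel.
    while True:
        msg = q.get()
        if msg is None:
            break


q = queue.Queue()
# daemon=True must be set before start(); it cannot be changed afterwards.
listener = threading.Thread(target=listen, args=(q,), daemon=True)
listener.start()

# Trying to change the flag on a started thread raises RuntimeError.
try:
    listener.daemon = False
    setter_error = None
except RuntimeError as exc:
    setter_error = exc

# Simulate a lost sentinel: nothing is put on the queue, so join times out.
listener.join(timeout=0.1)
still_alive = listener.is_alive()
print(still_alive, type(setter_error).__name__)
```

Because the listener is a daemon, the interpreter can still exit even though the thread is blocked in `get()`. With a non-daemon thread the same script would hang at exit.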

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>