
E1+E3: reduce relay ingest/fan-out DB round trips; ack p99 −7–16%, fd p99 −6–28%, p999 tails −29–53% vs PR #1453 tip #1454

Merged: tlongwell-block merged 4 commits into main from eva/relay-perf-e1-channel-row, Jul 2, 2026
Conversation

@tlongwell-block

Collaborator

Stacked on #1453 (base: eva/relay-perf-w8-arc-verify; retargets to main when #1453 merges). Three commits, each independently reviewed GREEN by Wren against Quinn's correctness rulings (RESEARCH/RELAY_PERF_CORRECTNESS.md §4.8 + phase-2 addendum, §4.10):

What

  • d33b4636 — E1-phase-1: fetch channel row once per ingest request. Ingest re-SELECTed the full channel row up to three times per accepted event (membership open-visibility fallback, archived gate, join-visibility check). One community-scoped fetch after channel_id resolution now feeds all three. Missing-row semantics preserved per gate (incl. kind:9007 pre-create). No cross-request cache, no invalidation surface.
  • 42dd950d — E1-phase-2: thread channel visibility from ingest into fan-out. Visibility resolved once at ingest through the same channel_visibility_cached gate fan-out uses (seeded with the phase-1 row) and threaded into dispatch_persistent_event as a ThreadedChannelVisibility bundle. Ruling fences, verified in review:
    1. Fail-closed: lookup failure/missing row threads None; fan-out does its own fresh fail-closed lookup. None is never "assume open".
    2. Threaded read goes through channel_visibility_cached — cached private wins over the prefetched row; row-derived private still populates the cache.
    3. The value travels bundled with the (community_id, channel_id) it was resolved under and is consulted only on exact id equality at fan-out; mismatch falls back fresh (channel UUIDs collide across communities, Inv_LabelPropagation). Pubsub/cross-node and ephemeral paths pass None — threading is same-request/same-node only.
  • 30d29414 — E3: per-channel enabled-workflow cache with sync invalidation. moka cache in WorkflowEngine keyed (community_id, channel_id), TTL 10s (ruling allows ≤30s), negatives cached — the common no-workflow channel skips a per-event SELECT. Full mutation-site audit: the only live writers of trigger eligibility/channel binding are kind:30620 command upsert and NIP-09 a-tag delete; both invalidate synchronously (delete_workflow_for_owner now RETURNING channel_id). Unused buzz-db mutators carry doc fences requiring shared invalidation from future callers. No cross-pod invalidation, deliberately: triggering is not an access-control fence; the worst cross-pod case is a just-mutated workflow mis-firing/missing for ≤10s.
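
Fence 2 above (cached private wins over the prefetched row, and a row-derived private still populates the cache) can be illustrated with a minimal sketch. Everything here is hypothetical: `VisibilityGate`, `ChannelRow`, the integer ids, and the `HashMap` standing in for both the cache and the database are illustration only, and caching every resolved value is a simplifying assumption rather than the relay's actual policy.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Visibility {
    Open,
    Private,
}

/// Hypothetical stand-in for the channel row fetched once at ingest.
pub struct ChannelRow {
    pub visibility: Visibility,
}

/// (community_id, channel_id), simplified to integers for the sketch.
type Key = (u32, u32);

/// Toy gate: a cache in front of a "database", with a read counter.
pub struct VisibilityGate {
    cache: HashMap<Key, Visibility>,
    db: HashMap<Key, Visibility>,
    pub db_reads: usize,
}

impl VisibilityGate {
    pub fn new(db: HashMap<Key, Visibility>) -> Self {
        VisibilityGate { cache: HashMap::new(), db, db_reads: 0 }
    }

    /// Test helper: pretend an earlier request already cached a value.
    pub fn seed_cache(&mut self, key: Key, vis: Visibility) {
        self.cache.insert(key, vis);
    }

    /// The prefetched row only replaces the DB read inside the gate.
    /// Returns None when nothing is known; the caller must fail closed.
    pub fn resolve(&mut self, key: Key, prefetched: Option<&ChannelRow>) -> Option<Visibility> {
        // A cached value always wins, even over a prefetched row.
        if let Some(v) = self.cache.get(&key) {
            return Some(*v);
        }
        let vis = match prefetched {
            Some(row) => row.visibility,
            None => {
                self.db_reads += 1;
                *self.db.get(&key)?
            }
        };
        // Row-derived values still populate the cache (assumed policy).
        self.cache.insert(key, vis);
        Some(vis)
    }
}
```

In this sketch a cached `Private` is returned even when the caller passes a row that says `Open`, and a row-derived `Private` is served from the cache on the next call without touching the database.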

Bench (Sami, three-protocol A/B vs c61b4c14 = #1453 tip; all runs 0 timeouts, warmup discarded)

| run | metric | raw baseline | W8+fix (c61b4c1) | E1+E3 (30d2941) | Δ vs W8+fix |
|---|---|---|---|---|---|
| same-pod r200×15s | ack p50 | 2.22 | 1.65 | 1.72 | +0.07 |
| | ack p99 | 3.07 | 2.57 | 2.17 | −0.40 (−16%) |
| | ack p999 | 8.09 | 5.56 | 2.95 | −2.61 (−47%) |
| | fd p50 | 2.21 | 2.35 | 2.09 | −0.26 (−11%) |
| | fd p99 | 3.05 | 3.41 | 2.62 | −0.79 (−23%) |
| | fd p999 | 8.07 | 7.28 | 3.57 | −3.71 (−51%) |
| same-pod r300×30s | ack p50 | 1.74 | 1.38 | 1.38 | +0.00 |
| | ack p99 | 2.48 | 2.19 | 1.83 | −0.36 (−16%) |
| | ack p999 | 3.32 | 6.22 | 2.91 | −3.32 (−53%) |
| | fd p50 | 1.74 | 1.98 | 1.69 | −0.28 (−14%) |
| | fd p99 | 2.46 | 3.05 | 2.19 | −0.87 (−28%) |
| | fd p999 | 3.29 | 7.26 | 3.49 | −3.77 (−52%) |
| cross-pod r300×30s | ack p50 | 1.81 | 1.46 | 1.42 | −0.04 |
| | ack p99 | 2.53 | 1.95 | 1.82 | −0.13 (−7%) |
| | ack p999 | 6.98 | 4.56 | 3.23 | −1.33 (−29%) |
| | fd p50 | 1.80 | 2.11 | 2.02 | −0.09 |
| | fd p99 | 2.53 | 2.76 | 2.60 | −0.17 (−6%) |
| | fd p999 | 7.35 | 5.80 | 4.12 | −1.67 (−29%) |

Attribution by mechanism: E1-phase-1/2 remove channel-row/visibility SELECTs from accepted channel-event paths (sender-side ack win); E3 removes the per-event no-workflow SELECT via negative cache (large p999 tail win). Receiver-side first_delivery p99 improved on every protocol. Same-pod fd p50 and p99 are now below raw baseline on both same-pod runs (fd p999 is below it on r200×15s but not on r300×30s: 3.49 vs 3.29), so W1's local spawn-hop trade (recorded on #1453) is paid back by this stack at the median and p99. No regressions vs W8+fix: ack p50 deltas are noise, zero timeouts. Run files: RESEARCH/RELAY_PERF_BENCH_RUNS/e1e3-30d29414-*.
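
For readers checking the Δ column, a small sketch of the arithmetic: absolute change is E1+E3 minus W8+fix, and the percentage is that change relative to W8+fix. The helper names `delta` and `format_delta` are hypothetical, and the output uses an ASCII minus sign rather than the table's typographic one.

```rust
/// Absolute and relative (percent) change of `candidate` against `baseline`.
pub fn delta(baseline: f64, candidate: f64) -> (f64, f64) {
    let abs = candidate - baseline;
    (abs, 100.0 * abs / baseline)
}

/// Formats a delta the way the bench table does, e.g. "-0.40 (-16%)".
pub fn format_delta(baseline: f64, candidate: f64) -> String {
    let (abs, pct) = delta(baseline, candidate);
    format!("{:+.2} ({:+.0}%)", abs, pct)
}
```

For the same-pod r200×15s ack p99 row, `format_delta(2.57, 2.17)` yields `-0.40 (-16%)`. A few rows differ from this recomputation by 0.01 (for example 6.22 to 2.91 gives −3.31 where the table lists −3.32), presumably because the table's deltas were taken from unrounded measurements.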

Validation

  • cargo test -p buzz-relay -p buzz-workflow -p buzz-db: relay 438 lib (incl. 3 new phase-2 fence tests) + 1 main, workflow 148, db 79 — 0 failed
  • cargo clippy --all-targets clean, cargo fmt --check clean, pre-push hooks green
  • Independent review: Wren verified all three fences literally in code and re-ran the package suites at tip

Base automatically changed from eva/relay-perf-w8-arc-verify to main July 2, 2026 12:46
npub1qyvc0c5kl4gqv2fd97fsk46tu378sqgy35vc83rvgfwne90sel7s0ed67d and others added 3 commits July 2, 2026 08:47
E1 (correctness ruling §4.8, GREEN): ingest re-SELECTed the full channel
row up to three times per accepted event — the membership open-visibility
fallback, the archived-channel gate, and the join-request visibility
check each issued their own community-scoped get_channel. Fetch the row
once after channel_id resolution and thread it through all three gates.

Within-request threading only — no cross-request cache, so there is no
invalidation surface. The row is community-scoped (Inv_LabelPropagation:
channel UUIDs collide across communities). Missing-row behavior per gate
is unchanged: membership fallback treats it as not-open, the archived
gate skips (kind:9007 creates the channel later in the same request),
and join requests still reject with 'channel not found'.

check_channel_membership takes the row as Option; the ephemeral-event
path (handlers/event.rs) has no fetched row and passes None, keeping its
existing lookup.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
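
A minimal sketch of the within-request threading this commit describes: one community-scoped fetch, then the same `Option` row handed to all three gates. All names (`Db`, `ChannelRow`, `run_ingest_gates`, the gate functions) and the two-field row are hypothetical simplifications; only the missing-row behavior per gate follows the commit message.

```rust
use std::collections::HashMap;

/// Hypothetical, trimmed-down channel row; the real row has more columns.
#[derive(Clone, Debug, PartialEq)]
pub struct ChannelRow {
    pub open: bool,
    pub archived: bool,
}

/// Toy in-memory "database" that counts SELECTs so the saving is observable.
pub struct Db {
    rows: HashMap<(u32, u32), ChannelRow>,
    pub selects: usize,
}

impl Db {
    pub fn new(rows: HashMap<(u32, u32), ChannelRow>) -> Self {
        Db { rows, selects: 0 }
    }

    /// Community-scoped lookup keyed by (community_id, channel_id).
    pub fn get_channel(&mut self, community_id: u32, channel_id: u32) -> Option<ChannelRow> {
        self.selects += 1;
        self.rows.get(&(community_id, channel_id)).cloned()
    }
}

/// Membership fallback: a missing row is treated as not-open.
pub fn open_visibility_fallback(row: Option<&ChannelRow>) -> bool {
    row.map_or(false, |r| r.open)
}

/// Archived gate: a missing row skips the gate (nothing to reject yet).
pub fn archived_gate_rejects(row: Option<&ChannelRow>) -> bool {
    row.map_or(false, |r| r.archived)
}

/// Join request: a missing row still rejects with "channel not found".
pub fn join_visibility_check(row: Option<&ChannelRow>) -> Result<bool, &'static str> {
    row.map(|r| r.open).ok_or("channel not found")
}

#[derive(Debug, PartialEq)]
pub struct GateResults {
    pub open_fallback: bool,
    pub archived_rejected: bool,
    pub join: Result<bool, &'static str>,
}

/// Fetch the row once, then thread it through all three gates.
pub fn run_ingest_gates(db: &mut Db, community_id: u32, channel_id: u32) -> GateResults {
    let row = db.get_channel(community_id, channel_id);
    let row = row.as_ref();
    GateResults {
        open_fallback: open_visibility_fallback(row),
        archived_rejected: archived_gate_rejects(row),
        join: join_visibility_check(row),
    }
}
```

Because the row lives only for the duration of one call, there is nothing to invalidate; the `selects` counter goes up by exactly one per request regardless of how many gates consult the row.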
E1 phase-2 (correctness ruling §4.8 phase-2 addendum, GREEN with three
fences): fan-out re-resolved channel visibility even though ingest just
fetched the channel row for the same event. Resolve visibility once at
ingest — through the same channel_visibility_cached gate fan-out uses,
seeded with the phase-1 once-per-request row — and thread it into
dispatch_persistent_event as a ThreadedChannelVisibility bundle. Saves
one visibility SELECT per accepted channel event on the fan-out path.

Quinn's three fences, verbatim from the ruling:

1. Fail-closed on error is preserved. Ingest-side lookup failure (or a
   missing row: global events, kind:9007 pre-create) threads None, and
   fan-out performs its own fresh fail-closed lookup exactly as before.
   'No threaded visibility' is never interpreted as 'assume open'.

2. The threaded read goes through channel_visibility_cached, not raw
   get_channel. The prefetched row only replaces the DB read inside the
   gate: a cached 'private' still wins over the row, and a 'private'
   read from the row still populates the cache.

3. The threaded value stays community-scoped through fan-out. It travels
   bundled with the (community_id, channel_id) it was resolved under,
   and filter_fanout_by_access consults it only on exact equality with
   the fan-out's own (community_id, channel_id) — anything else falls
   through to the fresh lookup (channel UUIDs collide across
   communities, Inv_LabelPropagation).

Pubsub (cross-node) and ephemeral fan-out paths pass None — threading is
same-request, same-node only. Membership checks are unchanged and stay
fresh; the threaded value only replaces the visibility SELECT.

Three fence tests added in fanout_access: mismatched bundle falls back
to the fresh fail-closed lookup; matching 'private' gates to members
only; matching 'open' passes through without a DB read.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
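
Fences 1 and 3 can be sketched as a single decision function: use the threaded value only on exact (community_id, channel_id) equality, otherwise fall through to a fresh lookup that fails closed. `ThreadedChannelVisibility` is the name used in the commit, but its fields, the integer ids, the `Source` enum, `visibility_for_fanout`, and the choice to model "fail closed" as `Private` are assumptions made for illustration.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Visibility {
    Open,
    Private,
}

/// Visibility bundled with the ids it was resolved under (fields assumed).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadedChannelVisibility {
    pub community_id: u32,
    pub channel_id: u32,
    pub visibility: Visibility,
}

/// Where the answer came from; only here to make the sketch observable.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum Source {
    Threaded,
    FreshLookup,
}

/// Decide visibility at fan-out for (community_id, channel_id).
/// `fresh_lookup` stands in for the DB-backed path and returns None on
/// error or missing row.
pub fn visibility_for_fanout(
    threaded: Option<ThreadedChannelVisibility>,
    community_id: u32,
    channel_id: u32,
    fresh_lookup: impl FnOnce(u32, u32) -> Option<Visibility>,
) -> (Visibility, Source) {
    // Fence 3: consult the bundle only on exact id equality.
    if let Some(t) = threaded {
        if t.community_id == community_id && t.channel_id == channel_id {
            return (t.visibility, Source::Threaded);
        }
    }
    // Fence 1: None or a mismatch never means "assume open". Do the fresh
    // lookup and treat failure as Private (assumed fail-closed behavior).
    let vis = fresh_lookup(community_id, channel_id).unwrap_or(Visibility::Private);
    (vis, Source::FreshLookup)
}
```

A bundle resolved under community 1 is ignored when fan-out runs for community 2 with the same channel id, which is the collision case Inv_LabelPropagation guards against.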
…invalidation

E3 (correctness ruling §4.10): every ingested channel event ran a
list_enabled_channel_workflows SELECT to find trigger candidates, and
most channels have no workflows at all. Add a moka look-aside cache in
WorkflowEngine keyed (community_id, channel_id) so the trigger path hits
the DB once per channel per TTL instead of once per event.

§4.10 fences:
- Key is community-scoped (community_id, channel_id) — channel UUIDs
  collide across communities (Inv_LabelPropagation).
- TTL is 10s (ruling allows ≤30s), matching the relay's other moka
  caches.
- Negative results are cached: an empty list is inserted like any other,
  which is where most of the win is.
- Synchronous invalidation at every live mutation site. Audit (per
  Wren's broad definition — ingest upsert/delete, HTTP/API toggles,
  admin/test helpers, soft-delete/reactivation): the only paths that
  write trigger eligibility or channel binding are the kind:30620
  command upsert and NIP-09 a-tag deletion; both invalidate immediately
  after the DB write. delete_workflow_for_owner now RETURNING channel_id
  so the deletion path invalidates without a second lookup. The unused
  buzz-db mutators (create_workflow, update_workflow,
  update_workflow_status, set_workflow_enabled, delete_workflow) have no
  callers anywhere in the workspace — CLI and desktop route through
  event submission, webhook and approval-resume only write
  workflow_runs — and each carries a doc note requiring shared
  invalidation from any future caller.

Consistency: no cross-pod invalidation, deliberately. Workflow
triggering is not an access-control fence; the worst case on another
pod is a just-deleted workflow firing (or a just-created one missing
events) for up to 10s. The same TTL bounds the look-aside fill race.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
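
The §4.10 fences map onto a small look-aside cache: community-scoped key, TTL expiry, negative (empty) results stored like any other, and explicit invalidation after a write. The sketch below uses a plain `HashMap` with an injected clock instead of moka so it stays self-contained; `WorkflowCache`, `get_or_load`, and the `String` workflow ids are hypothetical and do not reflect the `WorkflowEngine` or moka APIs.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// (community_id, channel_id), simplified to integers for the sketch.
type Key = (u32, u32);

/// Toy TTL look-aside cache for a channel's enabled workflows.
pub struct WorkflowCache {
    ttl: Duration,
    entries: HashMap<Key, (Instant, Vec<String>)>,
}

impl WorkflowCache {
    pub fn new(ttl: Duration) -> Self {
        WorkflowCache { ttl, entries: HashMap::new() }
    }

    /// Return the cached list if it is younger than the TTL, otherwise call
    /// `load` (the stand-in for the SELECT) and store the result. An empty
    /// list is stored too, so no-workflow channels skip the query.
    pub fn get_or_load(
        &mut self,
        key: Key,
        now: Instant,
        load: impl FnOnce() -> Vec<String>,
    ) -> Vec<String> {
        if let Some((stored_at, list)) = self.entries.get(&key) {
            if now.duration_since(*stored_at) < self.ttl {
                return list.clone();
            }
        }
        let list = load();
        self.entries.insert(key, (now, list.clone()));
        list
    }

    /// Called synchronously right after a write that changes trigger
    /// eligibility or channel binding for this key.
    pub fn invalidate(&mut self, key: Key) {
        self.entries.remove(&key);
    }
}
```

Passing `now` in explicitly keeps the expiry logic deterministic under test. With a 10s TTL, a second read at 5s is served from the cache, a read after `invalidate` reloads immediately, and a read 10s after the last fill reloads again.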
@tlongwell-block force-pushed the eva/relay-perf-e1-channel-row branch from 30d2941 to eb38833 July 2, 2026 12:49
…usted-input only)

Two new quick-xml advisories published 2026-07-02 broke the Security CI
gate on every branch. Both are DoS-class and require attacker-controlled
XML; our locked versions parse only trusted input (rust-s3 responses from
our own S3/MinIO endpoint; plist reads of local macOS system files).
The patched release (>= 0.41.0) is unreachable until rust-s3 and
plist/netdev bump their requirements. Documented ignores, matching the
existing pattern for RUSTSEC-2024-0384/0436.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
@tlongwell-block merged commit a504ad6 into main Jul 2, 2026
29 checks passed
@tlongwell-block deleted the eva/relay-perf-e1-channel-row branch July 2, 2026 13:09
tlongwell-block pushed a commit that referenced this pull request Jul 2, 2026
Brings the branch current with main (~20 commits, incl. relay perf #1453/#1454
and mention ranking #1431). One conflict resolved in useMentions.ts: kept
main's rankMentionCandidates pipeline, re-applied this branch's suggestion
slice change (Math.max(MENTION_SUGGESTION_LIMIT, mentionCandidates.length)).

Verified post-merge: tsc --noEmit clean, biome check clean, desktop unit
tests 1475/1475, cargo check --workspace clean.

Co-authored-by: Tyler Longwell <tlongwell@block.xyz>
Signed-off-by: Tyler Longwell <tlongwell@block.xyz>
wpfleger96 added a commit that referenced this pull request Jul 2, 2026
…into HEAD

* origin/paul/nip-am-agent-turn-metrics:
  fix(profile): consolidate agent profile runtime metadata (#1451)
  fix(desktop): simplify workspace rail badges (#1462)
  perf(desktop): instant channel switching — non-blocking first paint, persisted snapshots (#1452)
  perf(relay): bounded-concurrency multi-filter query execution (S2) (#1457)
  fix(desktop): classify timeline prepends so history loads don't bump unread (#1416)
  fix(desktop): quiet gate for workspace switches instead of boot splash (#1449)
  fix(read-path): reach complete threads, dense-second timelines, and all people in the GUI (#1418)
  E1+E3: reduce relay ingest/fan-out DB round trips; ack p99 −7–16%, fd p99 −6–28%, p999 tails −29–53% vs PR #1453 tip (#1454)
  perf(relay): defer post-commit dispatch and avoid verify clone (#1453)
  fix(relay): include git hook tools in runtime image (#1326)
  feat(chart): per-pod emptyDir git scratch when persistence disabled (multi-replica HA) (#1450)
  fix(relay): remove media bearer-token auth (#1444)
  fix(desktop): stop search shortcut from hijacking the sidebar (#1447)

Co-authored-by: Will Pfleger <pfleger.will@gmail.com>
Signed-off-by: Will Pfleger <pfleger.will@gmail.com>