Chunk columnar retarget #769
Draft
frankmcsherry wants to merge 7 commits into
Reimplement the columnar trace on the `Chunk` trait (`trace/chunk/`) instead of the bespoke batcher + `OrdValBatch`-backed spine.

`ColChunk<U>` backs a chunk with the `UpdatesTyped` trie, resident or paged. Its four transducers delegate to the reused trie-native survey merge (`trie_merger`); the harness supplies the batch, straddle cursor, batcher, builder, and spine. `ValSpine`/`ValBatcher`/`ValBuilder` now re-export that harness over `ColChunk`, and trace merges are trie-native (they ran through ord_neu's row-oriented merger before).

Deletes the machinery the harness replaces: `batcher.rs` (the bespoke `MergeBatcher`), `ValMirror`, and `trie_merger`'s driver shell (`merge_batches`/`ChainBuilder`/`form_chunks`/`merge_iterator`), keeping the survey/merge core that `ColChunk` drives.

Spill moves onto `Chunk::settle`: `ColChunk` is `Resident | Paged`, where a paged chunk keeps resident bounds + len and a byte handle, materializing through a `OnceCell` cache on read. `settle` pages committed chunks out via the new per-worker `spill` controller once over a record budget; the backend (`BytesStore`/`BytesSource`) is pluggable and lives with the caller. Adds an `Updates` byte codec and rewires `columnar_spill` onto it.

Known limitation: `settle` sees only its local output, not timely's whole batcher queue, so eviction is an approximate per-worker budget rather than the old exact head-reserve. It still bounds memory and round-trips correctly.

Adds `chunk_bench` (layout microbenchmark) and a `col` mode to `chunks`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
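The `Resident | Paged` shape described above can be illustrated with a small sketch. This is not the PR's `ColChunk`: the payload is a plain sorted `Vec<u64>` instead of a trie, and `BytesSource`, `CountingSource`, `encode`, `decode`, and `page_out` are hypothetical stand-ins chosen for the example. It only shows the idea that a paged chunk answers `len` and bounds from resident metadata and decodes its bytes at most once through a `OnceCell`.

```rust
use std::cell::{Cell, OnceCell};

/// Hypothetical stand-in for the pluggable byte backend.
trait BytesSource {
    fn read(&self) -> Vec<u8>;
}

/// Toy payload: sorted `u64` keys (the real chunk holds a columnar trie).
type Updates = Vec<u64>;

fn encode(updates: &Updates) -> Vec<u8> {
    updates.iter().flat_map(|u| u.to_le_bytes()).collect()
}

fn decode(bytes: &[u8]) -> Updates {
    bytes
        .chunks_exact(8)
        .map(|b| {
            let mut word = [0u8; 8];
            word.copy_from_slice(b);
            u64::from_le_bytes(word)
        })
        .collect()
}

/// In-memory source that counts reads, so the caching is observable.
struct CountingSource {
    bytes: Vec<u8>,
    reads: Cell<usize>,
}

impl BytesSource for CountingSource {
    fn read(&self) -> Vec<u8> {
        self.reads.set(self.reads.get() + 1);
        self.bytes.clone()
    }
}

enum Chunk<S: BytesSource> {
    Resident(Updates),
    /// Bounds and length stay resident; the body is fetched lazily.
    Paged { lower: u64, upper: u64, len: usize, handle: S, cache: OnceCell<Updates> },
}

impl<S: BytesSource> Chunk<S> {
    /// Answered from metadata; never touches the byte handle.
    fn len(&self) -> usize {
        match self {
            Chunk::Resident(u) => u.len(),
            Chunk::Paged { len, .. } => *len,
        }
    }

    /// First and last key, also without materializing a paged chunk.
    fn bounds(&self) -> Option<(u64, u64)> {
        match self {
            Chunk::Resident(u) => Some((*u.first()?, *u.last()?)),
            Chunk::Paged { lower, upper, .. } => Some((*lower, *upper)),
        }
    }

    /// Materializes a paged chunk on first read and reuses the cache afterwards.
    fn updates(&self) -> &Updates {
        match self {
            Chunk::Resident(u) => u,
            Chunk::Paged { handle, cache, .. } => cache.get_or_init(|| decode(&handle.read())),
        }
    }
}

/// Page a non-empty resident payload out to the counting source.
fn page_out(updates: Updates) -> Chunk<CountingSource> {
    let lower = *updates.first().expect("non-empty chunk");
    let upper = *updates.last().expect("non-empty chunk");
    Chunk::Paged {
        lower,
        upper,
        len: updates.len(),
        handle: CountingSource { bytes: encode(&updates), reads: Cell::new(0) },
        cache: OnceCell::new(),
    }
}
```

Calling `updates()` twice on a chunk built by `page_out` should read the backing bytes only once, while `len()` and `bounds()` never read them at all.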
Both `Chunk` impls ran the same maximal-packing `settle` (carry, coalesce adjacent sub-TARGET chunks, peel over-TARGET ones); only three operations differed. Lift the algorithm into a free `pack` helper on the harness that `settle` can delegate to, parameterized by closures for those operations:
- coalesce two adjacent chain chunks,
- split a chunk at `n` updates,
- seal a committed chunk (compress / spill; identity to keep).

`settle` stays a required trait method, so nothing is forced on implementors: they opt in by calling `pack`. `vec`'s `settle` passes make_mut-extend / split_off / identity; `col`'s passes meld / split_at / a `seal_chunk` that pages via the spiller. The packing lives once; any impl gets it by delegating, or writes its own `settle` and ignores `pack`.

Also simplify the spill policy to a high-water mark (keep the first `budget` records resident, page the rest): the old running inc/dec double-counted once coalescing re-counted the growing carry, so the example stopped paging. The high-water mark is monotonic and robust; the default example run spills again (~35M records, 2x lz4) and round-trips.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
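A closure-parameterized packer along the lines described above might look like the following sketch. The signature, the return type, and the `HighWater` struct are assumptions made for illustration, not the harness's actual `pack`. In particular, this version coalesces the carry with whatever chunk comes next and then peels, which simplifies the "coalesce sub-TARGET, peel over-TARGET" rule, and `split` is assumed to keep the first `n` updates in place and return the tail (matching `Vec::split_off`).

```rust
/// Monotonic high-water mark: the first `budget` sealed records stay resident,
/// everything sealed after that is paged. (Assumed shape, not the PR's code.)
struct HighWater {
    budget: usize,
    seen: usize,
}

impl HighWater {
    /// Record `len` newly sealed records; returns true if they should be paged.
    fn should_page(&mut self, len: usize) -> bool {
        self.seen += len;
        self.seen > self.budget
    }
}

/// Pack `chain` into committed chunks of exactly `target` updates, returning
/// them together with a trailing sub-target carry, if any.
fn pack<C>(
    chain: Vec<C>,
    target: usize,
    len: impl Fn(&C) -> usize,
    mut coalesce: impl FnMut(&mut C, C),
    mut split: impl FnMut(&mut C, usize) -> C,
    mut seal: impl FnMut(C) -> C,
) -> (Vec<C>, Option<C>) {
    assert!(target > 0, "target must be positive");
    let mut committed = Vec::new();
    let mut carry: Option<C> = None;
    for chunk in chain {
        // Fold the incoming chunk into the carry, if there is one.
        let mut current = match carry.take() {
            Some(mut c) => {
                coalesce(&mut c, chunk);
                c
            }
            None => chunk,
        };
        // Peel full-size chunks off the front while over target.
        while len(&current) > target {
            let tail = split(&mut current, target);
            committed.push(seal(current));
            current = tail;
        }
        match len(&current) {
            0 => {}
            n if n == target => committed.push(seal(current)),
            _ => carry = Some(current),
        }
    }
    (committed, carry)
}
```

For a `Vec`-backed chunk, the three closures would be `extend`, `split_off`, and the identity; a spilling implementation could consult something like `HighWater::should_page` inside its `seal` closure. Because the counter only ever grows at seal time, coalescing a carry cannot be counted twice.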
`ColumnarUpdate` required each field's columnar `Container: Debug`, but nothing needs it: `BatchContainer`'s only supertrait is `'static`, `Layout` puts no `Debug` on its containers, neither `Coltainer`/`ChunkBatch` nor the old `OrdValBatch` derives it, and the lone `Debug for UpdatesTyped` impl prints just the type name.

The bound was the sole reason `isize`/`usize`-keyed columnar arrangements didn't compile: their `Isizes`/`Usizes` containers (which re-encode pointer-width ints as `i64` for portable serialization) don't implement `Debug`, though the values themselves are `Debug + Ord`.

Drop `+ Debug` from the container bounds (keeping the value-level `Debug` on K/V/T/R), and restore the `chunks` example's `col` mode to `isize` + `InputSession::insert`; it had been switched to `i64` + `update(_, 1i64)` only to satisfy the bound.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
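To see why the container-level bound was unnecessary, consider a miniature version of the situation. The `Columnar` trait and `Isizes` struct below are hypothetical stand-ins written for this sketch (they are not the `columnar` crate's definitions): the container deliberately does not implement `Debug`, yet generic code that needs `Debug + Ord` only on the values compiles and runs.

```rust
use std::fmt::Debug;

/// Hypothetical miniature of a columnar type: each value type names the
/// container that stores it. Note there is no `Debug` bound on `Container`.
trait Columnar: Sized {
    type Container: Default + 'static;
    fn push(container: &mut Self::Container, value: Self);
    fn get(container: &Self::Container, index: usize) -> Self;
}

/// Stand-in container that re-encodes pointer-width ints as `i64`.
/// It intentionally does not derive `Debug`.
#[derive(Default)]
struct Isizes {
    values: Vec<i64>,
}

impl Columnar for isize {
    type Container = Isizes;

    fn push(container: &mut Isizes, value: isize) {
        container.values.push(value as i64);
    }

    fn get(container: &Isizes, index: usize) -> isize {
        container.values[index] as isize
    }
}

/// Sorts keys, stores them in the container, reads them back, and formats
/// them. Only the values need `Debug + Ord`; the container is never printed.
fn round_trip<K: Columnar + Debug + Ord>(mut keys: Vec<K>) -> String {
    keys.sort();
    let count = keys.len();
    let mut container = K::Container::default();
    for key in keys {
        K::push(&mut container, key);
    }
    let back: Vec<K> = (0..count).map(|i| K::get(&container, i)).collect();
    format!("{:?}", back)
}
```

Adding `+ Debug` to `type Container` in this sketch would reject the `isize` impl even though nothing ever formats the container, which mirrors the compile failure the commit removes.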
`spines.rs` and `chunks.rs` were near-duplicate end-to-end `arrange`+`join` benchmarks. Fold `chunks` into `spines` by adding a `vec` mode (the `Vec`-backed `Chunk` trace, via `ContainerChunker<VecChunk>`, reusing the row workload). `spines` now covers all four backends (`key`, `val`, `vec`, `col`) on `String` keys; `chunks.rs` is removed. (`col` keeps the native columnar input path; the `ContainerChunker<ColChunk>` path it had is still exercised by the `chunk_bench` benchmark.)

`chunk_bench` is a microbenchmark, not an example: it isolates per-`Chunk` build/merge/scan and resident memory (with its own counting allocator), which is a different instrument from an end-to-end spine benchmark and can't share a binary (one global allocator). Move it to `benches/chunk_bench.rs` with `harness = false`; run via `cargo bench --bench chunk_bench -- [updates]`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
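The "one global allocator" constraint comes from how a counting allocator is installed: `#[global_allocator]` applies to the whole binary. The following is a generic sketch of such an allocator built on `std::alloc::System`; the names `Counting`, `LIVE`, `PEAK`, `live_bytes`, and `peak_bytes` are made up here and are not taken from `chunk_bench`.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Bytes currently allocated, and the largest value `LIVE` has reached.
static LIVE: AtomicUsize = AtomicUsize::new(0);
static PEAK: AtomicUsize = AtomicUsize::new(0);

/// Wraps the system allocator and tracks live and peak bytes.
struct Counting;

unsafe impl GlobalAlloc for Counting {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            let live = LIVE.fetch_add(layout.size(), Ordering::Relaxed) + layout.size();
            PEAK.fetch_max(live, Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        LIVE.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

// Only one of these may exist per binary, which is why a benchmark that
// measures resident memory this way wants its own bench target.
#[global_allocator]
static GLOBAL: Counting = Counting;

/// Bytes currently allocated through the global allocator.
fn live_bytes() -> usize {
    LIVE.load(Ordering::Relaxed)
}

/// High-water mark of `live_bytes` since program start.
fn peak_bytes() -> usize {
    PEAK.load(Ordering::Relaxed)
}
```

With `harness = false`, the bench file supplies its own `main`, so it can read counters like these around each build/merge/scan phase without the default test harness allocating in between.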
`col` reached the (now chunk-backed) columnar trace via `columnar/`'s input plumbing: `ValColBuilder` / `ValPact` / `ValChunker` and a bespoke `ColWorkload` that formatted into a reusable `String` buffer. Switch it to the same generic `Chunk` path `vec` uses: `ContainerChunker<ColChunk>` + `trace::chunk::col`, fed by the shared input harness.

`vec` and `col` are now symmetric (identical input path, differing only in chunk layout), so the comparison is apples-to-apples, and `spines` no longer depends on `columnar/`'s input stack (still exercised by the `columnar` and `columnar_spill` examples).

With every mode now sharing one input type, the `Workload` trait + `Box<dyn>` were vestigial; both are collapsed into a single concrete `Workload` struct.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`columnar/arrangement` held a confusing mix: substrate (`Coltainer`,
`trie_merger`), container-glue (`TrieChunker`), and trace-name aliases
(`Val*`) that re-exported `chunk::col` types — so naming a columnar trace
meant reaching "around" into `arrangement`, even though the trace lives in
`trace/chunk/col`. Dissolve it, sorting each piece to where it belongs:
- `Coltainer` → `columnar/layout.rs` (it's the `BatchContainer` half of
`ColumnarLayout`; pure substrate).
- `trie_merger` → `columnar/trie_merger.rs` (the trie-native merge core, beside
the `updates` trie it operates on).
- `TrieChunker` → `columnar/chunker.rs` (the `RecordedUpdates → ColChunk`
container-glue chunker).
- The `Val*` aliases move to `columnar/mod.rs` as thin re-exports of the
canonical `trace::chunk::col` harness types.
`chunk/col.rs` now imports substrate directly (`columnar::{layout, trie_merger,
updates, spill}`) with no lateral hop through `arrangement`. Pure motion: no
behavior change, build/tests clean, spill smoke unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ace faces
Rework the columnar work into a self-contained storage subsystem rooted at
`columnar/`, organized by the function DD needs rather than by implementation:
  columnar/
    layout · updates · trie_merger   DATA: the shared columnar core
    collection/                      COLLECTION face: RecordedUpdates (+ Negate/
                                     Enter/Leave/ResultsIn), Builder, Pact, and
                                     the columnar operators (join_function, ...)
    trace/                           TRACE face: ColChunk (a `Chunk` impl) with
                                     Spine / Batcher / Builder / Chunker, + spill
Both faces derive from the same data; they are linked only by two bridge
operators (the trace `Chunker`, collection->batch; and `as_recorded_updates`,
trace->collection). The generic `Chunk` trait + harness stay in `trace/chunk/`
(layout-agnostic, shared with `vec`); `columnar::trace::ColChunk` implements it.
This supersedes the interim placement under `trace/chunk/col/`: the collection
surface (RecordedUpdates, join_function, ...) was mis-homed inside the
trace/chunk module when it is collection — not chunk — machinery. Rooting
everything under `columnar/` makes it a storage variant that borrows DD's
abstractions but can be sheared off without changing DD's own structure — a
template for future storage variants.
Exports use BATCH/TRACE vocabulary (Spine/Batcher/Builder/Chunker), dropping the
historical `Val*` prefix; `ColChunk` keeps its name as the honest `Chunk` impl.
Also folds in the wire/spill codec dedup (`RecordedUpdates` delegates its body to
`Updates::write_to`/`read_from`) and its round-trip test.
Examples (`columnar`, `columnar_spill`, `spines`) and `chunk_bench` updated to
the new paths. Pure motion otherwise: build/tests clean, runtime smokes unchanged.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
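The codec dedup mentioned in this commit (a wrapper that writes its own header and delegates the body to `Updates::write_to`/`read_from`) can be sketched as follows. The types, field names, wire layout, and method signatures here are assumptions for illustration; the PR's actual `Updates` and `RecordedUpdates` are columnar tries and may differ in every detail.

```rust
use std::io::{self, Read, Write};

/// Hypothetical stand-in for the columnar `Updates`: two parallel columns.
#[derive(Debug, Default, PartialEq)]
struct Updates {
    keys: Vec<u64>,
    diffs: Vec<i64>,
}

fn read_word<R: Read>(reader: &mut R) -> io::Result<[u8; 8]> {
    let mut word = [0u8; 8];
    reader.read_exact(&mut word)?;
    Ok(word)
}

impl Updates {
    /// Layout (assumed): row count, then the key column, then the diff column.
    fn write_to<W: Write>(&self, writer: &mut W) -> io::Result<()> {
        assert_eq!(self.keys.len(), self.diffs.len(), "columns must align");
        writer.write_all(&(self.keys.len() as u64).to_le_bytes())?;
        for key in &self.keys {
            writer.write_all(&key.to_le_bytes())?;
        }
        for diff in &self.diffs {
            writer.write_all(&diff.to_le_bytes())?;
        }
        Ok(())
    }

    fn read_from<R: Read>(reader: &mut R) -> io::Result<Self> {
        let rows = u64::from_le_bytes(read_word(reader)?);
        let mut updates = Updates::default();
        for _ in 0..rows {
            updates.keys.push(u64::from_le_bytes(read_word(reader)?));
        }
        for _ in 0..rows {
            updates.diffs.push(i64::from_le_bytes(read_word(reader)?));
        }
        Ok(updates)
    }
}

/// Wrapper with its own metadata; the body is delegated to `Updates`.
#[derive(Debug, PartialEq)]
struct RecordedUpdates {
    records: u64,
    updates: Updates,
}

impl RecordedUpdates {
    fn write_to<W: Write>(&self, writer: &mut W) -> io::Result<()> {
        writer.write_all(&self.records.to_le_bytes())?;
        self.updates.write_to(writer) // single codec for wire and spill
    }

    fn read_from<R: Read>(reader: &mut R) -> io::Result<Self> {
        let records = u64::from_le_bytes(read_word(reader)?);
        let updates = Updates::read_from(reader)?;
        Ok(RecordedUpdates { records, updates })
    }
}
```

A round-trip test in this style writes a value into a `Vec<u8>`, reads it back from a byte slice, and checks equality, so any drift between the wire and spill formats shows up in one place.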
944a74c to fef2ac3
Demonstration of retargeting the `columnar/` container/batch/trace stack onto the `chunk/` framework.