[10125] [encode path] Minor optimizations to arrow-flight#10137
[10125] [encode path] Minor optimizations to arrow-flight#10137Rich-T-kid wants to merge 7 commits into
Conversation
| } | ||
|
|
||
| /// Place the `FlightData` in the queue to send | ||
| #[inline] |
There was a problem hiding this comment.
The compiler very likely could have inlined this, but I think its work adding this explicitly.
|
run benchmarks flight |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-T-kid/minor-arrow-flight-opt (d02e297) to 826b808 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
seems like its mostly noise |
its interesting that this seems to always regress |
2c00600 to
337abd5
Compare
…mory to fill array data
337abd5 to
094579b
Compare
|
@Jefffrey I meant to ping you on this PR . Sorry about that! |
|
run benchmarks flight |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-T-kid/minor-arrow-flight-opt (505fb20) to 826b808 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
Nice, regressions are gone. should re-run when 54faeda gets merged. I expected a larger improvement for larger rows/columns batches. I'll profile & update the PR |
7c6f70f to
37c7231
Compare
37c7231 to
166e2e6
Compare
|
@alamb could you run the benchmarks for arrow-flight? 🚀 |
|
run benchmark flight |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing rich-T-kid/minor-arrow-flight-opt (166e2e6) to 826b808 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
|
similar story here. encode is path shows good improvements but the roundtrip is the same or slightly worse. |
Which issue does this PR close?
starting small 😄
Rationale for this change
The arrow-flight encode path was allocating intermediate Vecs to hold data that was immediately iterated and discarded. Replacing these with lazy iterators and inlining the one helper that existed only to loop removes allocations that served no purpose beyond bridging two adjacent lines of code.
What changes are included in this PR?
[commit #1]
Impl<Iterator>.[commit #2]
[commit #3]
CompressionContexttoIpcWriteContextand added anfbb: FlatBufferBuilder<'static>field to it[commit #4]
IpcWriteContextgains a scratch:Vec<u8>field. When set before a call toIpcDataGenerator::encode(), the existing allocation is reused instead of allocating a fresh buffer for each batch's arrow data body.FlightIpcEncodermaintains anArrowDataPool, a small pool of Arc<Mutex<Vec<Vec>>> buffers pre-sized to the gRPC message limit (2 MiB). Before eachencode()call, a buffer is acquired from the pool and placed inIpcWriteContext::scratch. After encoding, the buffer is wrapped inPooledBufand handed to Bytes::from_owner; when the Bytes is dropped (after the gRPC frame is sent), the buffer is automatically returned to the pool rather than freed.[commit #5 & 6]
acquiremethod to also pre-allocate 2MB of space in the vectoripc::encode()calls. the buffers are pre-allocated to themax_flight_data_sizeas an estimate. This means no intermediate vector copies to larger vectorscommit [ #7] final commit
arrow-ipc::encode()sink fromIpcBodySink::Write()toIpcBodySink::collect()memcpyto new vectors.memcpy. this is also why the profile shows alot of memcpy or _platform_memmove on mac.output buffer size changes
This PR also changes the size of buffers being output by
split_batch_for_grpc_response()The old algorithm computed n_batches first via ceiling division, then derived rows_per_batch from that:
This evenly distributes rows across chunks, meaning each buffer ends up smaller than max on average. Thus leaving capacity unused.
The new algorithm works directly from the target size:
This packs each buffer as close to max as possible before moving on.
This matters because the output buffers are pre-allocated to max_flight_data_size. Since the allocation cost is already paid upfront, the only cost of filling a buffer is the memcpy itself. As the profiles show, most time is spent serializing RecordBatches to IPC format, doing that for as many rows as possible in one pass, followed by a single large memcpy, is faster than multiple smaller serializations and copies. Leaving pre-allocated capacity unused means splitting work across more messages than necessary, each carrying its own network and serialization overhead.
note: I expect to have to tune this a bit. this is because the size that is used to determine the total size of the record batch isn't exact. strings vary from row to row, so its hard to get the math 100% correct. but the closer we can get to the max_size the better
Are these changes tested?
yes
Are there any user-facing changes?
no