
[10125] [encode path] Minor optimizations to arrow-flight#10137

Open
Rich-T-kid wants to merge 7 commits into apache:main from Rich-T-kid:rich-T-kid/minor-arrow-flight-opt

Conversation

@Rich-T-kid

@Rich-T-kid Rich-T-kid commented Jun 12, 2026

Contributor

Which issue does this PR close?

starting small 😄

Rationale for this change

The arrow-flight encode path was allocating intermediate Vecs to hold data that was immediately iterated and discarded. Replacing these with lazy iterators and inlining the one helper that existed only to loop removes allocations that served no purpose beyond bridging two adjacent lines of code.

What changes are included in this PR?

[commit #1]

  • Remove intermediate Vec allocations in the encode path and replace them with impl Iterator.
  • Cache num_rows before the split closure.
  • Remove queue_messages, inline its call site, and mark queue_message #[inline].
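
As a rough illustration of the first bullet, here is a minimal sketch of swapping an eagerly built Vec for a lazy iterator. The `Message` type and the `split_eager` / `split_lazy` functions are hypothetical stand-ins, not the arrow-flight code; they only show the shape of the change.

```rust
/// Hypothetical stand-in for an encoded chunk; not an arrow-flight type.
#[derive(Debug, PartialEq)]
struct Message(usize);

/// Before: every chunk is pushed into a Vec that the caller drains right away.
fn split_eager(num_rows: usize, rows_per_chunk: usize) -> Vec<Message> {
    assert!(rows_per_chunk > 0, "rows_per_chunk must be non-zero");
    let mut out = Vec::new();
    let mut offset = 0;
    while offset < num_rows {
        let len = rows_per_chunk.min(num_rows - offset);
        out.push(Message(len));
        offset += len;
    }
    out
}

/// After: chunks are produced on demand, with no intermediate Vec.
/// `num_rows` is captured once by the closure instead of being re-read.
fn split_lazy(num_rows: usize, rows_per_chunk: usize) -> impl Iterator<Item = Message> {
    assert!(rows_per_chunk > 0, "rows_per_chunk must be non-zero");
    (0..num_rows)
        .step_by(rows_per_chunk)
        .map(move |offset| Message(rows_per_chunk.min(num_rows - offset)))
}
```

Both functions yield the same chunks; the lazy version simply defers the work until the caller iterates.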

[commit #2]

  • Pre-allocate the vector used to hold uncompressed data.
    • This avoids the vector growing through a series of reallocations [64k, 512k, 4MB, 12MB...].
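
The effect of pre-allocating can be sketched with plain `Vec<u8>`. The helper below is hypothetical (it is not part of arrow-ipc); it just counts how often a vector's capacity changes while chunks are appended.

```rust
/// Append every chunk to `buf` and count how many times the Vec had to grow,
/// detected as a change in capacity after an append.
fn fill_counting_growth(mut buf: Vec<u8>, chunks: &[&[u8]]) -> (Vec<u8>, usize) {
    let mut growths = 0;
    for chunk in chunks {
        let before = buf.capacity();
        buf.extend_from_slice(chunk);
        if buf.capacity() != before {
            growths += 1;
        }
    }
    (buf, growths)
}
```

Starting from `Vec::new()` the vector grows at least once as data arrives, while starting from `Vec::with_capacity(total)` it never grows, because the capacity already covers the final length.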

[commit #3]

  • Renamed CompressionContext to IpcWriteContext and added an fbb: FlatBufferBuilder<'static> field to it
  • This avoids repeated heap allocations by reusing the same FlatBufferBuilder across writes, using its reset() method to clear state without deallocating
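
The reuse pattern described here can be sketched without the flatbuffers crate. `ScratchBuilder` and `WriteContext` below are hypothetical stand-ins (they are not `FlatBufferBuilder` or `IpcWriteContext`); the point is only that a reset which clears state while keeping the allocation lets later writes skip the heap allocation.

```rust
/// Hypothetical stand-in for a reusable builder; NOT the flatbuffers API.
struct ScratchBuilder {
    buf: Vec<u8>,
}

impl ScratchBuilder {
    fn new() -> Self {
        Self { buf: Vec::new() }
    }

    /// Clear the contents but keep the allocation, analogous to the
    /// reset-without-deallocating behaviour the PR relies on.
    fn reset(&mut self) {
        self.buf.clear();
    }

    fn write(&mut self, payload: &[u8]) -> &[u8] {
        self.buf.extend_from_slice(payload);
        &self.buf
    }
}

/// Hypothetical write context that owns one builder across many writes.
struct WriteContext {
    builder: ScratchBuilder,
}

impl WriteContext {
    fn new() -> Self {
        Self { builder: ScratchBuilder::new() }
    }

    fn encode(&mut self, payload: &[u8]) -> Vec<u8> {
        self.builder.reset();
        // Copied out only so this example can return an owned value.
        self.builder.write(payload).to_vec()
    }

    fn capacity(&self) -> usize {
        self.builder.buf.capacity()
    }
}
```

After a large first write, a smaller second write reuses the same capacity instead of allocating again.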

[commit #4]

  • IpcWriteContext gains a scratch: Vec<u8> field. When set before a call to IpcDataGenerator::encode(), the existing allocation is reused instead of allocating a fresh buffer for each batch's arrow data body.
  • arrow-flight's FlightIpcEncoder maintains an ArrowDataPool, a small pool (Arc<Mutex<Vec<Vec<u8>>>>) of buffers pre-sized to the gRPC message limit (2 MiB). Before each encode() call, a buffer is acquired from the pool and placed in IpcWriteContext::scratch. After encoding, the buffer is wrapped in PooledBuf and handed to Bytes::from_owner; when the Bytes is dropped (after the gRPC frame is sent), the buffer is automatically returned to the pool rather than freed.
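
A minimal, std-only sketch of the return-on-drop idea follows. `BufferPool` and this `PooledBuf` are hypothetical illustrations, not the PR's `ArrowDataPool` or its `PooledBuf`, and the hand-off to `Bytes::from_owner` is left out so the example has no external dependencies. Note that commit #7 later removes the pool entirely.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical pool of reusable byte buffers.
#[derive(Clone)]
struct BufferPool {
    free: Arc<Mutex<Vec<Vec<u8>>>>,
    buf_capacity: usize,
}

impl BufferPool {
    fn new(buf_capacity: usize) -> Self {
        Self { free: Arc::new(Mutex::new(Vec::new())), buf_capacity }
    }

    /// Take an idle buffer, or allocate one pre-sized to `buf_capacity`.
    fn acquire(&self) -> PooledBuf {
        let buf = self
            .free
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(|| Vec::with_capacity(self.buf_capacity));
        PooledBuf { buf: Some(buf), pool: self.clone() }
    }

    /// Number of buffers currently sitting idle in the pool.
    fn idle(&self) -> usize {
        self.free.lock().unwrap().len()
    }
}

/// Guard that hands its buffer back to the pool when dropped.
struct PooledBuf {
    buf: Option<Vec<u8>>,
    pool: BufferPool,
}

impl PooledBuf {
    fn as_mut_vec(&mut self) -> &mut Vec<u8> {
        self.buf.as_mut().expect("buffer is present until drop")
    }
}

impl Drop for PooledBuf {
    fn drop(&mut self) {
        if let Some(mut buf) = self.buf.take() {
            buf.clear(); // keep the allocation, discard the contents
            self.pool.free.lock().unwrap().push(buf);
        }
    }
}
```

Every acquire and release takes the mutex, which is one plausible source of the overhead reported for the pool in the final commit.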

[commit #5 & 6]

  • Tune the buffer pool: the acquire method now also pre-allocates 2MB of space in the vector.
  • Keep the scratch buffer across multiple ipc::encode() calls. The buffers are pre-allocated to max_flight_data_size as an estimate, so no intermediate copies into larger vectors are needed.

[commit #7] final commit

  • Remove the buffer pool. It was actually causing more overhead than letting the memory allocator handle pooling.
  • Replace the arrow-ipc::encode() sink: IpcBodySink::Write() becomes IpcBodySink::collect().
  • Ideally all RecordBatch buffers are written in O(1) time with no need to re-allocate and memcpy to new vectors.
    • extend_from_slice boils down to a very fast memcpy. This is also why the profile shows a lot of memcpy, or _platform_memmove on Mac.

output buffer size changes

This PR also changes the size of the buffers output by split_batch_for_grpc_response().
The old algorithm computed n_batches first via ceiling division, then derived rows_per_batch from that:

n_batches    = ceil(size / max)
rows_per_batch = num_rows / n_batches

This distributes rows evenly across chunks, so each buffer ends up smaller than max on average, leaving capacity unused.

The new algorithm works directly from the target size:

rows_per_batch = max * num_rows / size

This packs each buffer as close to max as possible before moving on.
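
To make the difference concrete, the two formulas can be written out as a small sketch. The function names are hypothetical, the guards against zero are additions for safety, and the code assumes every row has the same size; it is not the body of split_batch_for_grpc_response().

```rust
/// Old approach: number of chunks first, then rows per chunk.
fn rows_per_batch_old(num_rows: usize, size: usize, max: usize) -> usize {
    assert!(max > 0, "max must be non-zero");
    let n_batches = ((size + max - 1) / max).max(1); // ceil(size / max)
    (num_rows / n_batches).max(1)
}

/// New approach: how many rows fit into `max` bytes, assuming uniform rows.
fn rows_per_batch_new(num_rows: usize, size: usize, max: usize) -> usize {
    assert!(max > 0, "max must be non-zero");
    if size == 0 {
        return num_rows.max(1);
    }
    // u128 avoids overflow in `max * num_rows` for large inputs.
    let rows = (max as u128 * num_rows as u128) / size as u128;
    rows.clamp(1, num_rows.max(1) as u128) as usize
}
```

For 1000 rows totalling 5 MiB with a 2 MiB limit, the old formula gives 333 rows per chunk (about 1.67 MiB each, and a fourth chunk holding the one leftover row), while the new formula gives 400 rows per chunk (2 MiB each, three chunks in total).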

This matters because the output buffers are pre-allocated to max_flight_data_size. Since the allocation cost is already paid upfront, the only cost of filling a buffer is the memcpy itself. As the profiles show, most time is spent serializing RecordBatches to IPC format. Doing that for as many rows as possible in one pass, followed by a single large memcpy, is faster than multiple smaller serializations and copies. Leaving pre-allocated capacity unused means splitting work across more messages than necessary, each carrying its own network and serialization overhead.

Note: I expect to have to tune this a bit, because the size used to determine the total size of the record batch isn't exact. Strings vary from row to row, so it's hard to get the math 100% correct. But the closer we can get to max_size, the better.

Are these changes tested?

yes

Are there any user-facing changes?

no

@github-actions github-actions Bot added arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate labels Jun 12, 2026
}

/// Place the `FlightData` in the queue to send
#[inline]

@Rich-T-kid Rich-T-kid Jun 12, 2026

Contributor Author


The compiler very likely could have inlined this, but I think it's worth adding this explicitly.

@gabotechs

Contributor

run benchmarks flight

@adriangbot


🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4691665801-559-vg5z6 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux

CPU Details (lscpu)
Architecture:                            aarch64
CPU op-mode(s):                          64-bit
Byte Order:                              Little Endian
CPU(s):                                  16
On-line CPU(s) list:                     0-15
Vendor ID:                               ARM
Model name:                              Neoverse-V2
Model:                                   1
Thread(s) per core:                      1
Core(s) per cluster:                     16
Socket(s):                               -
Cluster(s):                              1
Stepping:                                r0p1
BogoMIPS:                                2000.00
Flags:                                   fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bf16 dgh rng bti
L1d cache:                               1 MiB (16 instances)
L1i cache:                               1 MiB (16 instances)
L2 cache:                                32 MiB (16 instances)
L3 cache:                                80 MiB (1 instance)
NUMA node(s):                            1
NUMA node0 CPU(s):                       0-15
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Not affected
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; CSV2, BHB
Vulnerability Srbds:                     Not affected
Vulnerability Tsa:                       Not affected
Vulnerability Tsx async abort:           Not affected
Vulnerability Vmscape:                   Not affected

Comparing rich-T-kid/minor-arrow-flight-opt (d02e297) to 826b808 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot


🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                         main                                   rich-T-kid_minor-arrow-flight-opt
-----                         ----                                   ---------------------------------
encode/dict/65536x1           1.02    283.2±1.04µs   887.9 MB/sec    1.00    278.5±1.31µs   902.7 MB/sec
encode/dict/65536x8           1.01      8.7±0.07ms   232.0 MB/sec    1.00      8.5±0.18ms   235.3 MB/sec
encode/dict/8192x1            1.00     35.2±0.02µs   928.0 MB/sec    1.02     35.8±0.03µs   913.1 MB/sec
encode/dict/8192x8            1.02    301.6±1.68µs   866.3 MB/sec    1.00    296.2±1.29µs   882.1 MB/sec
encode/fixed/65536x1          1.03     10.2±0.02µs    47.8 GB/sec    1.00      9.9±0.02µs    49.2 GB/sec
encode/fixed/65536x8          1.02   1121.6±1.92µs     3.5 GB/sec    1.00   1099.7±2.33µs     3.6 GB/sec
encode/fixed/8192x1           1.01      3.2±0.01µs    19.2 GB/sec    1.00      3.1±0.01µs    19.5 GB/sec
encode/fixed/8192x8           1.00     17.7±0.04µs    27.6 GB/sec    1.03     18.2±0.02µs    26.8 GB/sec
encode/nested/65536x1         1.01     38.9±0.29µs    31.4 GB/sec    1.00     38.4±0.17µs    31.8 GB/sec
encode/nested/65536x8         1.03      3.1±0.01ms     3.2 GB/sec    1.00      3.0±0.01ms     3.3 GB/sec
encode/nested/8192x1          1.00      5.7±0.01µs    26.9 GB/sec    1.01      5.8±0.01µs    26.5 GB/sec
encode/nested/8192x8          1.00     48.9±0.13µs    25.0 GB/sec    1.00     48.8±0.08µs    25.0 GB/sec
encode/variable/65536x1       1.00     73.4±0.26µs    29.9 GB/sec    1.01     73.9±0.31µs    29.7 GB/sec
encode/variable/65536x8       1.00      5.2±0.06ms     3.4 GB/sec    1.00      5.2±0.07ms     3.4 GB/sec
encode/variable/8192x1        1.00      6.9±0.01µs    40.1 GB/sec    1.02      7.0±0.01µs    39.1 GB/sec
encode/variable/8192x8        1.01     89.4±0.15µs    24.6 GB/sec    1.00     88.9±0.22µs    24.7 GB/sec
roundtrip/dict/65536x1        1.00  1275.9±46.22µs   197.0 MB/sec    1.01  1284.9±45.94µs   195.7 MB/sec
roundtrip/dict/65536x8        1.00     14.4±0.63ms   140.0 MB/sec    1.14     16.3±0.56ms   123.2 MB/sec
roundtrip/dict/8192x1         1.00    205.6±5.43µs   158.8 MB/sec    1.01    208.7±5.77µs   156.5 MB/sec
roundtrip/dict/8192x8         1.00  1313.8±42.83µs   198.9 MB/sec    1.00  1315.5±50.14µs   198.6 MB/sec
roundtrip/fixed/65536x1       1.00    305.2±3.84µs  1638.6 MB/sec    1.02    310.5±4.65µs  1610.4 MB/sec
roundtrip/fixed/65536x8       1.01      2.2±0.07ms  1855.0 MB/sec    1.00      2.1±0.04ms  1870.2 MB/sec
roundtrip/fixed/8192x1        1.02     90.3±1.35µs   693.3 MB/sec    1.00     88.9±1.07µs   703.7 MB/sec
roundtrip/fixed/8192x8        1.00    323.9±3.75µs  1545.8 MB/sec    1.02    330.9±5.18µs  1513.4 MB/sec
roundtrip/nested/65536x1      1.00   843.8±41.42µs  1481.6 MB/sec    1.00   841.6±41.74µs  1485.6 MB/sec
roundtrip/nested/65536x8      1.00      9.4±0.67ms  1066.8 MB/sec    1.12     10.5±0.37ms   949.0 MB/sec
roundtrip/nested/8192x1       1.00    156.6±5.36µs   999.1 MB/sec    1.01    157.9±4.96µs   990.6 MB/sec
roundtrip/nested/8192x8       1.00   889.4±42.46µs  1407.3 MB/sec    1.01   896.2±45.08µs  1396.6 MB/sec
roundtrip/variable/65536x1    1.00  1203.2±34.81µs  1870.1 MB/sec    1.04  1254.1±70.01µs  1794.3 MB/sec
roundtrip/variable/65536x8    1.03     16.4±0.51ms  1094.7 MB/sec    1.00     16.0±0.43ms  1124.1 MB/sec
roundtrip/variable/8192x1     1.00    204.6±5.86µs  1375.8 MB/sec    1.01    206.5±5.97µs  1362.6 MB/sec
roundtrip/variable/8192x8     1.00  1204.0±33.06µs  1869.9 MB/sec    1.01  1217.2±28.50µs  1849.6 MB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 340.1s
Peak memory 98.5 MiB
Avg memory 36.7 MiB
CPU user 345.0s
CPU sys 73.4s
Peak spill 0 B

branch

Metric Value
Wall time 335.1s
Peak memory 99.5 MiB
Avg memory 36.6 MiB
CPU user 339.9s
CPU sys 76.9s
Peak spill 0 B

File an issue against this benchmark runner

@Rich-T-kid

Contributor Author

Seems like it's mostly noise.

@Rich-T-kid

Contributor Author
roundtrip/nested/65536x8      1.00      9.4±0.67ms  1066.8 MB/sec    1.12     10.5±0.37ms   949.0 MB/sec

It's interesting that this seems to always regress.

@Rich-T-kid Rich-T-kid changed the title [10125] Minor optimizations to arrow-flight [10125] [encode path] Minor optimizations to arrow-flight Jun 12, 2026
@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/minor-arrow-flight-opt branch 2 times, most recently from 2c00600 to 337abd5 Compare June 12, 2026 18:02
Comment thread arrow-ipc/src/compression.rs Outdated
@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/minor-arrow-flight-opt branch from 337abd5 to 094579b Compare June 12, 2026 21:03
@Rich-T-kid

Rich-T-kid commented Jun 13, 2026

Contributor Author

@Jefffrey I meant to ping you on this PR. Sorry about that!

@Jefffrey

Contributor

run benchmarks flight

@adriangbot


🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4697178377-565-qnbtn 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux


Comparing rich-T-kid/minor-arrow-flight-opt (505fb20) to 826b808 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot


🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)

Details

group                          main                                   rich-T-kid_minor-arrow-flight-opt
-----                          ----                                   ---------------------------------
encode/dict/65536x1            1.01    273.2±1.44µs   920.4 MB/sec    1.00    271.3±0.45µs   926.7 MB/sec
encode/dict/65536x16                                                  1.00     17.3±0.22ms   232.8 MB/sec
encode/dict/65536x4                                                   1.00   1180.2±4.95µs   852.1 MB/sec
encode/dict/65536x8            1.42      8.9±0.19ms   225.1 MB/sec    1.00      6.3±0.11ms   318.6 MB/sec
encode/dict/8192x1             1.00     35.2±0.03µs   928.7 MB/sec    1.00     35.2±0.04µs   927.0 MB/sec
encode/dict/8192x16                                                   1.00    630.1±2.04µs   829.3 MB/sec
encode/dict/8192x4                                                    1.00    143.2±0.12µs   912.3 MB/sec
encode/dict/8192x8             1.00    298.5±2.75µs   875.4 MB/sec    1.00    298.1±0.83µs   876.5 MB/sec
encode/fixed/65536x1           1.08     10.6±0.02µs    46.0 GB/sec    1.00      9.8±0.01µs    49.7 GB/sec
encode/fixed/65536x16                                                 1.00      2.4±0.03ms     3.3 GB/sec
encode/fixed/65536x4                                                  1.00     49.8±0.17µs    39.3 GB/sec
encode/fixed/65536x8           1.00   1110.2±5.22µs     3.5 GB/sec    1.02   1135.8±3.38µs     3.4 GB/sec
encode/fixed/8192x1            1.00      3.2±0.01µs    19.0 GB/sec    1.03      3.3±0.01µs    18.5 GB/sec
encode/fixed/8192x16                                                  1.00     36.2±0.18µs    27.0 GB/sec
encode/fixed/8192x4                                                   1.00      8.8±0.01µs    27.8 GB/sec
encode/fixed/8192x8            1.04     17.4±0.05µs    28.1 GB/sec    1.00     16.7±0.02µs    29.3 GB/sec
encode/nested/65536x1          1.00     28.1±0.20µs    43.5 GB/sec    1.04     29.3±0.30µs    41.7 GB/sec
encode/nested/65536x16                                                1.00      7.1±0.18ms     2.8 GB/sec
encode/nested/65536x4                                                 1.00  1485.8±19.84µs     3.3 GB/sec
encode/nested/65536x8          1.00      3.2±0.06ms     3.0 GB/sec    1.00      3.2±0.08ms     3.0 GB/sec
encode/nested/8192x1           1.16      6.8±0.01µs    22.6 GB/sec    1.00      5.8±0.01µs    26.2 GB/sec
encode/nested/8192x16                                                 1.00    148.7±0.41µs    16.4 GB/sec
encode/nested/8192x4                                                  1.00     21.3±0.03µs    28.7 GB/sec
encode/nested/8192x8           1.00     46.2±0.23µs    26.4 GB/sec    1.06     48.8±0.11µs    25.0 GB/sec
encode/variable/65536x1        1.59     81.4±0.51µs    27.0 GB/sec    1.00     51.2±0.22µs    42.9 GB/sec
encode/variable/65536x16                                              1.00     11.2±0.14ms     3.1 GB/sec
encode/variable/65536x4                                               1.00      2.4±0.05ms     3.6 GB/sec
encode/variable/65536x8        1.05      5.4±0.08ms     3.2 GB/sec    1.00      5.1±0.10ms     3.4 GB/sec
encode/variable/8192x1         1.17      7.0±0.01µs    39.1 GB/sec    1.00      6.0±0.01µs    45.8 GB/sec
encode/variable/8192x16                                               1.00   1171.6±7.63µs     3.8 GB/sec
encode/variable/8192x4                                                1.00     24.9±0.04µs    44.2 GB/sec
encode/variable/8192x8         1.06     80.7±0.13µs    27.2 GB/sec    1.00     76.0±0.22µs    28.9 GB/sec
roundtrip/dict/65536x1         1.01  1330.0±45.25µs   189.0 MB/sec    1.00  1315.9±45.27µs   191.1 MB/sec
roundtrip/dict/65536x16                                               1.00     29.5±1.10ms   136.5 MB/sec
roundtrip/dict/65536x4                                                1.00      6.7±0.23ms   150.9 MB/sec
roundtrip/dict/65536x8         1.06     15.3±0.72ms   131.9 MB/sec    1.00     14.3±0.54ms   140.2 MB/sec
roundtrip/dict/8192x1          1.00    212.8±5.92µs   153.4 MB/sec    1.00    212.4±6.06µs   153.8 MB/sec
roundtrip/dict/8192x16                                                1.00      2.4±0.05ms   216.8 MB/sec
roundtrip/dict/8192x4                                                 1.00   687.6±23.18µs   190.0 MB/sec
roundtrip/dict/8192x8          1.00  1355.1±49.83µs   192.8 MB/sec    1.00  1357.6±52.34µs   192.5 MB/sec
roundtrip/fixed/65536x1        1.01    319.7±3.74µs  1564.3 MB/sec    1.00    315.2±4.71µs  1586.4 MB/sec
roundtrip/fixed/65536x16                                              1.00      7.0±0.22ms  1142.9 MB/sec
roundtrip/fixed/65536x4                                               1.00  1306.1±82.37µs  1531.6 MB/sec
roundtrip/fixed/65536x8        1.00      2.3±0.08ms  1733.1 MB/sec    1.00      2.3±0.06ms  1727.3 MB/sec
roundtrip/fixed/8192x1         1.04     95.5±1.40µs   655.5 MB/sec    1.00     92.2±1.00µs   678.7 MB/sec
roundtrip/fixed/8192x16                                               1.00    654.1±8.15µs  1531.1 MB/sec
roundtrip/fixed/8192x4                                                1.00    197.5±3.38µs  1267.8 MB/sec
roundtrip/fixed/8192x8         1.00    339.6±4.53µs  1474.5 MB/sec    1.00    338.7±5.18µs  1478.5 MB/sec
roundtrip/nested/65536x1       1.03   882.8±43.55µs  1416.1 MB/sec    1.00   859.9±42.06µs  1453.9 MB/sec
roundtrip/nested/65536x16                                             1.00     19.3±0.68ms  1036.5 MB/sec
roundtrip/nested/65536x4                                              1.00      3.8±0.23ms  1305.6 MB/sec
roundtrip/nested/65536x8       1.24     10.7±0.73ms   931.4 MB/sec    1.00      8.7±0.28ms  1152.5 MB/sec
roundtrip/nested/8192x1        1.03    162.9±5.47µs   960.6 MB/sec    1.00    158.7±5.99µs   986.1 MB/sec
roundtrip/nested/8192x16                                              1.00  1628.2±41.63µs  1537.4 MB/sec
roundtrip/nested/8192x4                                               1.00   470.5±21.40µs  1330.1 MB/sec
roundtrip/nested/8192x8        1.00   930.5±41.73µs  1345.1 MB/sec    1.00   926.5±44.00µs  1350.9 MB/sec
roundtrip/variable/65536x1     1.01  1249.7±39.83µs  1800.5 MB/sec    1.00  1236.9±36.21µs  1819.2 MB/sec
roundtrip/variable/65536x16                                           1.00     31.3±1.17ms  1150.4 MB/sec
roundtrip/variable/65536x4                                            1.00      8.1±0.31ms  1115.6 MB/sec
roundtrip/variable/65536x8     1.04     17.0±0.50ms  1059.5 MB/sec    1.00     16.4±0.70ms  1100.1 MB/sec
roundtrip/variable/8192x1      1.03    214.7±5.60µs  1310.8 MB/sec    1.00    208.7±6.21µs  1348.2 MB/sec
roundtrip/variable/8192x16                                            1.00      3.3±0.27ms  1367.4 MB/sec
roundtrip/variable/8192x4                                             1.00   680.5±24.00µs  1654.1 MB/sec
roundtrip/variable/8192x8      1.03  1267.2±30.87µs  1776.7 MB/sec    1.00  1228.6±32.30µs  1832.4 MB/sec

Resource Usage

base (merge-base)

Metric Value
Wall time 340.1s
Peak memory 100.2 MiB
Avg memory 38.1 MiB
CPU user 338.8s
CPU sys 75.8s
Peak spill 0 B

branch

Metric Value
Wall time 660.1s
Peak memory 146.3 MiB
Avg memory 47.0 MiB
CPU user 620.5s
CPU sys 187.9s
Peak spill 0 B

File an issue against this benchmark runner

@Rich-T-kid

Contributor Author

Nice, regressions are gone. Should re-run when 54faeda gets merged. I expected a larger improvement for batches with more rows/columns. I'll profile & update the PR.

@Rich-T-kid Rich-T-kid force-pushed the rich-T-kid/minor-arrow-flight-opt branch from 37c7231 to 166e2e6 Compare June 18, 2026 17:45
@Rich-T-kid

Contributor Author

@alamb could you run the benchmarks for arrow-flight? 🚀

@alamb

alamb commented Jun 18, 2026

Contributor

run benchmark flight

@adriangbot


🤖 Arrow criterion benchmark running (GKE) | trigger
Instance: c4a-highmem-16 (12 vCPU / 65 GiB) | Linux bench-c4745385483-591-gfbs4 6.12.68+ #1 SMP Sat May 2 07:49:07 UTC 2026 aarch64 GNU/Linux


Comparing rich-T-kid/minor-arrow-flight-opt (166e2e6) to 826b808 (merge-base) diff
BENCH_NAME=flight
BENCH_COMMAND=cargo bench --features=arrow,async,test_common,experimental,object_store --bench flight
BENCH_FILTER=
Results will be posted here when complete


File an issue against this benchmark runner

@adriangbot


🤖 Arrow criterion benchmark completed (GKE) | trigger

Instance: c4a-highmem-16 (12 vCPU / 65 GiB)


group                          main                                   rich-T-kid_minor-arrow-flight-opt
-----                          ----                                   ---------------------------------
encode/dict/65536x1            1.06    286.7±2.38µs   876.8 MB/sec    1.00    269.5±1.65µs   932.9 MB/sec
encode/dict/65536x16                                                  1.00     19.4±0.30ms   207.7 MB/sec
encode/dict/65536x4                                                   1.00      4.0±0.31ms   252.0 MB/sec
encode/dict/65536x8            1.00      8.1±0.73ms   248.1 MB/sec    1.24     10.1±0.86ms   199.9 MB/sec
encode/dict/8192x1             1.02     36.2±0.04µs   902.3 MB/sec    1.00     35.5±0.08µs   921.2 MB/sec
encode/dict/8192x16                                                   1.00    606.9±5.60µs   861.1 MB/sec
encode/dict/8192x4                                                    1.00    139.2±0.19µs   938.4 MB/sec
encode/dict/8192x8             1.08    311.4±3.10µs   839.1 MB/sec    1.00    287.3±2.07µs   909.6 MB/sec
encode/fixed/65536x1           1.04     10.3±0.03µs    47.5 GB/sec    1.00      9.8±0.03µs    49.6 GB/sec
encode/fixed/65536x16                                                 1.00      2.2±0.01ms     3.6 GB/sec
encode/fixed/65536x4                                                  1.00     50.4±1.04µs    38.8 GB/sec
encode/fixed/65536x8           8.56  1087.5±30.99µs     3.6 GB/sec    1.00    127.1±9.93µs    30.7 GB/sec
encode/fixed/8192x1            1.02      3.1±0.02µs    19.5 GB/sec    1.00      3.1±0.01µs    19.9 GB/sec
encode/fixed/8192x16                                                  1.00     37.7±0.59µs    26.0 GB/sec
encode/fixed/8192x4                                                   1.00      8.5±0.02µs    28.8 GB/sec
encode/fixed/8192x8            1.03     17.1±0.03µs    28.5 GB/sec    1.00     16.6±0.08µs    29.5 GB/sec
encode/nested/65536x1          1.00     28.8±0.21µs    42.5 GB/sec    1.13     32.5±0.59µs    37.5 GB/sec
encode/nested/65536x16                                                1.00      5.7±0.44ms     3.4 GB/sec
encode/nested/65536x4                                                 1.00   178.4±10.14µs    27.4 GB/sec
encode/nested/65536x8          1.26      2.9±0.11ms     3.3 GB/sec    1.00      2.3±0.14ms     4.2 GB/sec
encode/nested/8192x1           1.15      6.7±0.02µs    22.9 GB/sec    1.00      5.8±0.01µs    26.4 GB/sec
encode/nested/8192x16                                                 1.00    101.4±5.36µs    24.1 GB/sec
encode/nested/8192x4                                                  1.00     20.0±0.11µs    30.5 GB/sec
encode/nested/8192x8           1.06     46.9±0.21µs    26.1 GB/sec    1.00     44.2±0.32µs    27.6 GB/sec
encode/variable/65536x1        1.44     64.0±2.37µs    34.3 GB/sec    1.00     44.6±1.05µs    49.3 GB/sec
encode/variable/65536x16                                              1.00     11.2±0.98ms     3.1 GB/sec
encode/variable/65536x4                                               1.00   279.5±29.03µs    31.5 GB/sec
encode/variable/65536x8        1.63      5.4±0.50ms     3.3 GB/sec    1.00      3.3±0.03ms     5.3 GB/sec
encode/variable/8192x1         1.23      7.4±0.01µs    37.3 GB/sec    1.00      6.0±0.01µs    45.8 GB/sec
encode/variable/8192x16                                               1.00   159.3±13.58µs    27.6 GB/sec
encode/variable/8192x4                                                1.00     26.2±0.32µs    42.0 GB/sec
encode/variable/8192x8         1.53     83.3±1.90µs    26.4 GB/sec    1.00     54.5±2.36µs    40.3 GB/sec
roundtrip/dict/65536x1         1.00  1284.5±49.82µs   195.7 MB/sec    1.02  1306.8±57.29µs   192.4 MB/sec
roundtrip/dict/65536x16                                               1.00     29.9±4.15ms   134.5 MB/sec
roundtrip/dict/65536x4                                                1.00      7.2±0.35ms   138.8 MB/sec
roundtrip/dict/65536x8         1.00     15.2±0.88ms   132.4 MB/sec    1.09     16.5±1.06ms   121.7 MB/sec
roundtrip/dict/8192x1          1.00    208.4±6.37µs   156.7 MB/sec    1.01    211.2±5.66µs   154.7 MB/sec
roundtrip/dict/8192x16                                                1.00      2.7±0.09ms   197.1 MB/sec
roundtrip/dict/8192x4                                                 1.00   687.5±25.80µs   190.0 MB/sec
roundtrip/dict/8192x8          1.01  1325.5±65.68µs   197.1 MB/sec    1.00  1308.0±45.54µs   199.8 MB/sec
roundtrip/fixed/65536x1        1.01    308.2±3.29µs  1622.9 MB/sec    1.00    305.6±4.94µs  1636.5 MB/sec
roundtrip/fixed/65536x16                                              1.00      6.7±0.18ms  1192.1 MB/sec
roundtrip/fixed/65536x4                                               1.00  1227.5±60.18µs  1629.6 MB/sec
roundtrip/fixed/65536x8        1.02      2.1±0.03ms  1880.7 MB/sec    1.00      2.1±0.10ms  1922.6 MB/sec
roundtrip/fixed/8192x1         1.00     89.1±1.29µs   702.8 MB/sec    1.03     91.6±1.07µs   683.2 MB/sec
roundtrip/fixed/8192x16                                               1.00   666.1±17.36µs  1503.4 MB/sec
roundtrip/fixed/8192x4                                                1.00    198.5±2.38µs  1261.2 MB/sec
roundtrip/fixed/8192x8         1.00    328.2±4.59µs  1525.5 MB/sec    1.02    333.5±3.15µs  1501.5 MB/sec
roundtrip/nested/65536x1       1.02   854.9±53.14µs  1462.3 MB/sec    1.00   841.7±44.42µs  1485.2 MB/sec
roundtrip/nested/65536x16                                             1.00     20.2±0.98ms   990.8 MB/sec
roundtrip/nested/65536x4                                              1.00      3.1±0.19ms  1593.3 MB/sec
roundtrip/nested/65536x8       1.00     10.8±0.67ms   924.3 MB/sec    1.07     11.6±0.75ms   863.1 MB/sec
roundtrip/nested/8192x1        1.00    159.2±6.06µs   983.0 MB/sec    1.00    159.2±5.22µs   982.8 MB/sec
roundtrip/nested/8192x16                                              1.00  1797.0±85.29µs  1393.0 MB/sec
roundtrip/nested/8192x4                                               1.00   479.2±28.79µs  1306.0 MB/sec
roundtrip/nested/8192x8        1.03   930.8±63.29µs  1344.7 MB/sec    1.00   906.9±53.55µs  1380.1 MB/sec
roundtrip/variable/65536x1     1.00  1299.0±83.27µs  1732.3 MB/sec    1.13  1466.0±112.68µs  1534.9 MB/sec
roundtrip/variable/65536x16                                           1.00     31.3±1.00ms  1149.4 MB/sec
roundtrip/variable/65536x4                                            1.00      7.6±0.35ms  1185.0 MB/sec
roundtrip/variable/65536x8     1.00     15.9±0.76ms  1130.8 MB/sec    1.11     17.6±0.97ms  1021.0 MB/sec
roundtrip/variable/8192x1      1.00    203.5±5.47µs  1383.1 MB/sec    1.01    206.3±6.08µs  1364.1 MB/sec
roundtrip/variable/8192x16                                            1.00      2.7±0.15ms  1669.1 MB/sec
roundtrip/variable/8192x4                                             1.00   662.3±33.08µs  1699.8 MB/sec
roundtrip/variable/8192x8      1.00  1209.5±25.57µs  1861.5 MB/sec    1.10  1331.0±46.97µs  1691.5 MB/sec
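
For readers unfamiliar with this table layout: the number in front of each timing appears to be that side's mean time divided by the faster of the two means, so the faster side reads 1.00. The sketch below recomputes that ratio for a few rows copied from the table. The `ROWS` constant and the `relative` helper are illustrative names, not part of the benchmark runner, and the formula is an assumption inferred from the printed values. Rows reported in milliseconds are rounded more coarsely, so recomputing them from the printed figures can differ in the last digit.

```rust
/// Mean times in microseconds copied from the table above:
/// (benchmark name, main, branch).
const ROWS: [(&str, f64, f64); 4] = [
    ("encode/fixed/65536x8", 1087.5, 127.1),
    ("encode/variable/8192x8", 83.3, 54.5),
    ("roundtrip/variable/65536x1", 1299.0, 1466.0),
    ("roundtrip/variable/8192x8", 1209.5, 1331.0),
];

/// Divide each side by the faster of the two means, so the faster side
/// comes out as exactly 1.0 (assumed to match the table's ratio columns).
fn relative(main_us: f64, branch_us: f64) -> (f64, f64) {
    let fastest = main_us.min(branch_us);
    (main_us / fastest, branch_us / fastest)
}

fn main() {
    for (name, main_us, branch_us) in ROWS {
        let (m, b) = relative(main_us, branch_us);
        println!("{name:<28} main {m:.2}  branch {b:.2}");
    }
}
```

A ratio above 1.00 in the `main` column means the branch is faster for that benchmark. A ratio above 1.00 in the branch column means the branch is slower.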

Resource Usage

base (merge-base)

Metric         Value
------         -----
Wall time      335.1s
Peak memory    95.4 MiB
Avg memory     37.8 MiB
CPU user       341.0s
CPU sys        72.3s
Peak spill     0 B

branch

Metric         Value
------         -----
Wall time      675.1s
Peak memory    203.1 MiB
Avg memory     54.2 MiB
CPU user       667.9s
CPU sys        148.2s
Peak spill     0 B


@Rich-T-kid

Contributor Author

Similar story here. The encode path shows good improvements, but roundtrip is the same or slightly worse.
I'm going to start looking at the decode path. If the scope of that grows too big, I'll split it into a separate PR.


Labels

arrow Changes to the arrow crate arrow-flight Changes to the arrow-flight crate
