Bug report
Bug description:
The pyperformance `unpack_sequence` benchmark is the worst-performing JIT benchmark on https://www.doesjitgobrrr.com — geomean speedup 0.586× (JIT ~1.7× slower).
The benchmark is a single function with 400 inlined `a,b,c,d,e,f,g,h,i,j = to_unpack` statements inside a `for` loop. `to_unpack = tuple(range(10))` is reused (refcount > 1, so `_UNPACK_SEQUENCE_UNIQUE_TUPLE` never fires).
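For reference, a minimal sketch of the benchmark's shape (illustrative only; the real pyperformance version inlines 400 copies of the statement):

```python
def do_unpacking(loops, to_unpack):
    for _ in range(loops):
        # The actual benchmark repeats this statement 400 times, inline.
        a, b, c, d, e, f, g, h, i, j = to_unpack
        a, b, c, d, e, f, g, h, i, j = to_unpack
        # ... 398 more copies ...

do_unpacking(20000, tuple(range(10)))
```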
Analysis of the slowdown
The JIT covers all 400 unpacks via ~22 sequential traces (~775 uops, ~19 unpacks each), linked tail-to-tail through `_EXIT_TRACE → _START_EXECUTOR`, totalling ~18 MB of JIT code. This is a consequence of trace length / fitness limits ([`UOP_MAX_TRACE_LENGTH`](https://github.com/python/cpython/blob/main/Include/internal/pycore_uop.h#L42), `EXIT_QUALITY_*` from gh-146073) and is arguably fine: 400 inline statements isn't realistic code. Inter-trace transition cost alone would not produce a 1.7× slowdown. Nevertheless, longer traces would help here.
Per unpack: ~33 uops vs tier 1's ~12 bytecodes — ~2.7× more dispatches. The trace recorder unconditionally emits a `_CHECK_VALIDITY` + `_SET_IP` pair before every source bytecode at `Python/optimizer.c:902-906`. Per unpack:
```
LOAD_FAST_BORROW t         (1)
UNPACK_SEQUENCE 10         (1)
STORE_FAST_STORE_FAST × 5  (5; pair-fused by the compiler)
──────────────────────────────
7 source bytecodes → 6 _CHECK_VALIDITY+_SET_IP pairs steady state
```
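The tier-1 sequence can be verified with `dis` (a quick sketch; exact opcode names depend on the CPython build and specialization state):

```python
import dis

def one_unpack(to_unpack):
    a, b, c, d, e, f, g, h, i, j = to_unpack

dis.dis(one_unpack)  # UNPACK_SEQUENCE 10 plus the fused STORE_FAST pairs
```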
Tier 1 has no analog. Each `_CHECK_VALIDITY` issues a load+branch on `current_executor->vm_data.valid`; each `_SET_IP` issues a store to `frame->instr_ptr`. 12 such uops × 400 unpacks × 20000 iterations account for the bulk of the regression.
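Back-of-envelope, that overhead alone is roughly 10⁸ extra uop dispatches per run (figures assumed from the analysis above):

```python
pairs_per_unpack = 6      # _CHECK_VALIDITY + _SET_IP pairs, steady state
uops_per_pair = 2
unpacks = 400
iterations = 20000

extra_uops = pairs_per_unpack * uops_per_pair * unpacks * iterations
print(f"{extra_uops:,}")  # 96,000,000
```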
A naive elimination pass cannot drop them, because every gap between consecutive pairs contains at least one uop with `HAS_ESCAPES_FLAG` — typically `_POP_TOP`, conservatively flagged as escaping (its `Py_DECREF` could run `__del__`).
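To illustrate why the flag must be conservative: dropping the last reference can run arbitrary Python code via a finalizer, which may observe frame state or invalidate executors. A toy example (hypothetical class, not from the benchmark):

```python
class Chatty:
    def __del__(self):
        # Arbitrary Python code runs here during Py_DECREF; it could
        # inspect frames or invalidate the current executor, which is
        # why _POP_TOP has to be treated as potentially escaping.
        print("finalizer ran")

def f():
    Chatty()  # the value is popped immediately; the DECREF runs __del__

f()
```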
CPython versions tested on:
CPython main branch
Operating systems tested on:
Linux