
ffi: add experimental fast FFI call API #63068

Draft
ShogunPanda wants to merge 1 commit into nodejs:main from ShogunPanda:fast-ffi

Conversation

@ShogunPanda (Contributor) commented May 1, 2026

Review Guide: Fast FFI

TODO

  • Add trampolines for all other platforms

Summary

This PR adds a V8 Fast API-backed call path for the experimental node:ffi module.

Fast FFI is not a separate user-facing feature or flag. It is used automatically
for eligible signatures when --experimental-ffi is enabled. Unsupported
signatures continue through the SharedBuffer or generic libffi paths.
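
The selection order described above can be sketched as follows. The helper names are illustrative only, not the actual lib/ffi.js internals:

```javascript
// Hypothetical sketch of the per-signature call-path selection
// (names are illustrative, not the real lib/ffi.js code).
function selectCallPath(signature, supportsFastApi, supportsSharedBuffer) {
  if (supportsFastApi(signature)) return 'fast-api';
  if (supportsSharedBuffer(signature)) return 'shared-buffer';
  return 'libffi'; // generic fallback, always available
}

// Example: pretend only single-i32 signatures are fast-eligible.
const fastOk = (sig) => sig === 'i32(i32)';
const sharedOk = (sig) => sig.startsWith('i32');
console.log(selectCallPath('i32(i32)', fastOk, sharedOk)); // fast-api
console.log(selectCallPath('i32(i64)', fastOk, sharedOk)); // shared-buffer
console.log(selectCallPath('void()', fastOk, sharedOk));   // libffi
```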

For implementation details, see:

  • doc/contributing/ffi-fast-api-internals.md

What Changed

Native implementation:

  • src/ffi/fast.h
  • src/ffi/fast.cc
  • src/ffi/platforms/*.cc
  • src/node_ffi.cc
  • src/node_ffi.h
  • src/env_properties.h

JavaScript routing:

  • lib/ffi.js
  • lib/internal/ffi/fast-api.js
  • lib/internal/ffi-shared-buffer.js

Tests, benchmarks, docs:

  • test/ffi/test-ffi-fast-buffer.js
  • test/ffi/test-ffi-shared-buffer.js
  • test/ffi/test-ffi-calls.js
  • benchmark/ffi/*.js
  • doc/api/ffi.md
  • doc/contributing/ffi-fast-api-internals.md

Key Design Points

  • Fast API metadata is generated from runtime FFI signatures.
  • Native trampoline codegen is handled by specialized platform generators in src/ffi/platforms/*.cc.
  • Generated trampolines adapt V8 Fast API calls to native FFI target calls.
  • Fast API is tried first; unsupported signatures fall back.
  • SharedBuffer remains a separate optimized fallback.
  • lib/ffi.js owns wrapper orchestration.
  • lib/internal/ffi-shared-buffer.js owns only SharedBuffer wrapping.
  • lib/internal/ffi/fast-api.js owns Fast API pointer/string/buffer conversions.
  • Fast API and SharedBuffer use separate internal Symbols.
  • Signatures with a single pointer-like argument keep both a scalar pointer fast call and a secondary buffer-shaped fast call.

Reviewer Focus

Please pay particular attention to:

  • signature eligibility and fallback behavior
  • specialized platform trampoline ABI correctness
  • lifetime of FastFFIMetadata
  • executable memory allocation and cleanup
  • pointer, Buffer, ArrayBuffer, and string conversion behavior
  • separation between Fast API and SharedBuffer metadata
  • wrapper behavior in lib/ffi.js
  • preservation of name, length, and pointer on wrappers
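
The last item can be illustrated with plain JavaScript. This is a generic sketch of wrapper property preservation, not the actual lib/ffi.js code:

```javascript
// Wrapping a function normally loses its name and length; a wrapper
// must copy them across explicitly. Both properties are configurable,
// so Object.defineProperty works. Generic sketch, not lib/ffi.js.
function wrap(target) {
  function wrapper(...args) {
    return target(...args);
  }
  Object.defineProperty(wrapper, 'name', { value: target.name });
  Object.defineProperty(wrapper, 'length', { value: target.length });
  return wrapper;
}

function add(a, b) { return a + b; }
const wrapped = wrap(add);
console.log(wrapped.name);   // add
console.log(wrapped.length); // 2
console.log(wrapped(2, 3));  // 5
```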

Correctness Checklist

  • Unsupported signatures never partially enter the Fast API path.
  • Generic fallback remains available.
  • Closed libraries are detected before native invocation.
  • i64, u64, and pointer values preserve BigInt behavior.
  • Narrow integer sign/zero extension is correct.
  • f32/f64 preserve NaN, infinities, and -0.
  • Buffer views account for byteOffset.
  • Invalid buffer inputs throw coded Node errors.
  • Strings reject embedded NUL bytes.
  • Temporary string buffers remain alive for the call.
  • Internal Symbol metadata does not leak onto wrappers.
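
The byteOffset item can be demonstrated with plain Node.js APIs, independent of FFI:

```javascript
// A Buffer view over a larger ArrayBuffer does not start at byte 0 of
// its backing store, so deriving a native pointer from the view must
// add view.byteOffset. Plain Node.js, nothing FFI-specific.
const backing = new ArrayBuffer(16);
const view = Buffer.from(backing, 4, 8); // starts 4 bytes in
view[0] = 42;

const raw = new Uint8Array(backing);
console.log(view.byteOffset);       // 4
console.log(raw[0]);                // 0: the view's data is NOT at offset 0
console.log(raw[view.byteOffset]);  // 42
```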

Disclaimer

Assisted-By: OpenAI:GPT-5.5 <openai/gpt-5.5>

@nodejs-github-bot (Collaborator)

Review requested:

  • @nodejs/gyp
  • @nodejs/security-wg

@ShogunPanda marked this pull request as draft May 1, 2026 15:39
@nodejs-github-bot added the build (Issues and PRs related to build files or the CI), dependencies (Pull requests that update a dependency file), and needs-ci (PRs that need a full CI run) labels May 1, 2026
@ShogunPanda changed the title from "Fast ffi" to "ffi: add experimental fast FFI call API" May 1, 2026
@ShogunPanda added the ffi (Issues and PRs related to experimental Foreign Function Interface support) label May 1, 2026
@bengl (Member) commented May 1, 2026

At first glance, some observations:

  1. There are two big changes happening here. One is using V8 Fast API, the other is using cranelift. These probably can and probably should be two separate PRs, since they're not particularly dependent on each other.
  2. Why bother with the --experimental-fast-ffi flag? No one wants slow FFI. If cranelift etc. is already compiled in, and it's compatible with the platform, why treat it as a separate experimental feature from FFI itself? It's just an implementation detail.
  3. In my old experiments from 4 years ago, I had some trouble with V8 Fast FFI. In particular, paradoxically, I found it not particularly fast for my benchmarks compared to non-fast. YMMV, and things have changed dramatically since then, so let's see how this shakes out with benchmarks.
  4. Cranelift is not small, and you're using it here to do functionally the same thing as libffi does, but with more code on the Node.js side to maintain. It seems like we ought to pick one or the other, not optionally both. If we want to go down the road of using a JIT compiler to build trampolines, I'm curious how this compares against something like cjit/TCC. It would be good to see benchmarks there, also comparing against the current libffi approach.

Comment thread src/node_ffi.cc Outdated
ffi_args_heap.resize(nargs);
values = values_heap.data();
ffi_args = ffi_args_heap.data();
}
This is exactly what MaybeStackBuffer is there for

Comment thread src/ffi/fast.cc
}

return true;
}
C++ style: This should return std::optional<std::pair<FastFFIType, CTypeInfo>>

Comment thread src/node_ffi.h Outdated
Comment on lines +34 to +37
std::shared_ptr<void> fast_code;
std::vector<v8::CTypeInfo> fast_arg_info;
std::unique_ptr<v8::CFunctionInfo> fast_function_info;
std::unique_ptr<v8::CFunction> fast_c_function;
Feel free to leave a TODO for me to clean up the allocation management here, having 10+ separate heap allocations for each function seems like a lot

Comment thread src/node_options.h Outdated
@@ -129,6 +129,9 @@ class EnvironmentOptions : public Options {
bool experimental_addon_modules = EXPERIMENTALS_DEFAULT_VALUE;
bool experimental_eventsource = EXPERIMENTALS_DEFAULT_VALUE;
bool experimental_ffi = EXPERIMENTALS_DEFAULT_VALUE;
#if HAVE_FAST_FFI
bool experimental_fast_ffi = EXPERIMENTALS_DEFAULT_VALUE;
#endif
Just to echo what @bengl said – It seems like having the flag available unconditionally would not break anything and just make things easier (e.g. save you the file re-execution hoops you're jumping through in the tests).

Comment thread deps/crates/src/node_fast_ffi.rs Outdated
This is first-party Node.js core code, right? It probably shouldn't live in deps/ in the long run

Comment thread doc/api/ffi.md Outdated
allocate a temporary UTF-8 copy. For performance-sensitive C string APIs, encode
the string before invoking the native function, for example with `TextEncoder`,
and declare the parameter as `buffer` or `arraybuffer`. Include the trailing
`\0` byte when the native API expects a NUL-terminated string.
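
A minimal sketch of that pre-encoding pattern, using only standard APIs (the FFI declaration itself is omitted):

```javascript
// Encode once, reuse the bytes across calls, and append the trailing
// NUL byte explicitly when the native API expects a C string.
const text = 'hello';
const encoded = new TextEncoder().encode(text);
const withNul = new Uint8Array(encoded.length + 1);
withNul.set(encoded);
// Typed arrays are zero-filled, so withNul[encoded.length] is already 0.
console.log(withNul.length);              // 6
console.log(withNul[withNul.length - 1]); // 0
```
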
... but that's also a temporary UTF-8 copy, just like passing a string directly would have been?

Comment thread src/ffi/fast.cc Outdated
kBuffer = 12,
};

bool ToToFastFFIType(ffi_type* type,
Is the double To intentional?

Comment thread src/ffi/fast.cc Outdated
};

bool ToToFastFFIType(ffi_type* type,
const std::string& type_name,
Suggested change
const std::string& type_name,
std::string_view type_name,

Comment thread src/node_ffi.cc Outdated
#if HAVE_FAST_FFI
PrepareFastFunction(env, fn.get());
const CFunction* fast_c_function = fn->fast_c_function.get();
#endif
Suggested change
#endif
#else
const CFunction* fast_c_function = nullptr;
#endif

that lets you get rid of the much larger conditional below here

@ShogunPanda (Contributor, Author)

At first glance, some observations:

  1. There are two big changes happening here. One is using V8 Fast API, the other is using cranelift. These probably can and probably should be two separate PRs, since they're not particularly dependent on each other.

Unfortunately that's not the case, as far as I understood this problem.

The V8 Fast API optimizes the JS -> C++ entry; Cranelift generates the native wrapper that performs the ABI-correct call to the FFI target.

FFI signatures are declared at runtime, while V8 Fast API requires a concrete native signature for each fast callable. Cranelift is what turns the runtime FFI signature into such a concrete callable.

A libffi-only Fast API path is possible, but only for a finite set of predefined C++ wrapper signatures, and it would still route through ffi_call().

That would not provide the universal fast path this PR is trying to introduce.

  2. Why bother with the --experimental-fast-ffi flag? No one wants slow FFI. If cranelift etc. is already compiled in, and it's compatible with the platform, why treat it as a separate experimental feature from FFI itself? It's just an implementation detail.

@addaleax Also concurred on this below. I'll remove it.

  3. In my old experiments from 4 years ago, I had some trouble with V8 Fast FFI. In particular, paradoxically, I found it not particularly fast for my benchmarks compared to non-fast. YMMV, and things have changed dramatically since then, so let's see how this shakes out with benchmarks.

I'll attach some benchmarks tomorrow so we can compare.

  4. Cranelift is not small, and you're using it here to do functionally the same thing as libffi does, but with more code on the Node.js side to maintain. It seems like we ought to pick one or the other, not optionally both. If we want to go down the road of using a JIT compiler to build trampolines, I'm curious how this compares against something like cjit/TCC. It would be good to see benchmarks there, also comparing against the current libffi approach.

As far as I understand, TCC is LGPL-licensed, which is not usable in Node.js. Am I wrong?

@addaleax (Member) commented May 2, 2026

while V8 Fast API requires a concrete native signature for each fast callable

Does it? I haven't tried it out myself, but there are

CFunction(const void* address, const CFunctionInfo* type_info);
CFunctionInfo(const CTypeInfo& return_info, unsigned int arg_count,
              const CTypeInfo* arg_info,
              Int64Representation repr = Int64Representation::kNumber);

constructors available, which should allow constructing CFunction instances with runtime-supplied type information, no?

@ShogunPanda (Contributor, Author)

I'm getting a little confused here. I guess you're right, but what are they invoking? How are the target functions built?

@addaleax (Member) commented May 3, 2026

@ShogunPanda Yeah, so, looking at the code in fast.cc, we're already using those as I would expect ... I guess my question is, do we think the complexity introduced by the cranelift wrapper is justified, given that we can already easily cover a fairly broad range directly through V8's own fast API support?

Like @bengl said, the wrapper logic and its (massive) scaffolding is fairly independent from the core V8 fast call integration, and making these separate PRs (and separate decisions) seems wise.

@addaleax (Member) commented May 3, 2026

I guess you're right, but what are they invoking? How are the target functions built?

As for your questions – they are invoking native functions living in the process's memory, and they are typically built with a compiler. But these don't seem like actual answers to your questions, so I'm not sure I understand what you're saying here

@ShogunPanda (Contributor, Author) commented May 4, 2026

@addaleax

After a brainstorming session with @bengl I finally got confirmation of my interpretation of your request and ran a local spike.

I checked the direct CFunction(address, CFunctionInfo*) path. It does not work for plain FFI symbols because V8 Fast API signatures include the JS receiver as the first C argument.

A native FFI symbol such as int32_t(int32_t) therefore does not match a JS call with one argument; V8 expects a fast callback shaped like int32_t(Local<Value> receiver, int32_t).

So direct V8 Fast API can use runtime type info, but it still requires an embedder-compatible wrapper. For runtime FFI signatures that means either a finite set of predefined wrappers or generated trampolines.

Since I want the "most universal solution" possible, I don't want to introduce predefined wrappers. I've evaluated other possible solutions, but so far Cranelift seems to be the only viable one.

Do you concur on this or am I missing anything?

@addaleax (Member) commented May 4, 2026

@ShogunPanda

It does not work for plain FFI symbols because V8 Fast API signatures include the JS receiver as the first C argument.

Is that requirement made explicit or documented anywhere? I did try manually to remove the receiver argument from some of the Node.js built-in fast API call functions, and it didn't seem to make a difference (obviously this only works if the second argument isn't also a Local<>).

Since I want the "most universal solution" possible, I don't want to introduce predefined wrappers. I've evaluated other possible solutions, but so far Cranelift seems to be the only viable one.

I think it's still worth thinking about ways in which to remove restrictions on the V8 side (which, yes, that has annoying implications around timelines because it's non-trivial upstream work), but it seems like something that would be significantly cleaner in the medium term

@ShogunPanda (Contributor, Author)

@addaleax Can you point me to where you successfully removed it? I can try something similar.

@addaleax (Member) commented May 5, 2026

@ShogunPanda Hm, it looks like this just worked "silently" because without Local<Value> receiver as the first parameter, the C++ source would compile fine and V8 would accept the signature, but it would not actually invoke the fast call variant.

You're right that "shifting" away the first argument cannot really be done without some runtime/JIT compilation. I don't know if Cranelift is worth the overhead, since we're using it for a very very specific use case, and it would be possible to implement this for x64/arm64 ourselves, but I also see how an actual compiler library is something worth thinking about in that case.

I think I'd still have a mild preference for seeing if there are ways to achieve the same goals by collaborating with the V8 team. Removing the need for a receiver argument, for example, should not be too complex (other than the requirement to modify V8 for this).

@bengl (Member) commented May 5, 2026

I think given all that, for now, the best move is still to start with a PR that adds V8 Fast Calls alone, with no other changes, and then hold off on the rest until the modifying V8 is explored. @ShogunPanda SGTY?

@ShogunPanda (Contributor, Author)

@bengl @addaleax I'm currently exploring using MIR instead of Cranelift, which is WAY smaller.

Adding V8 Fast API now would be useless unless we only enabled a very narrow set of specialized, handwritten helpers, which is not something I would like to do.

Signed-off-by: Paolo Insogna <paolo@cowtech.it>
Assisted-By: OpenAI:GPT-5.5 <openai/gpt-5.5>
@ShogunPanda (Contributor, Author)

Benchmarks on my machine (Apple M2 Max on macOS 26):

ffi/add-64.js n=10000000                                   ***   3055.59 %      ±24.62%  ±33.18%  ±44.05%
ffi/add-f32.js n=10000000                                  ***   3294.28 %      ±20.39%  ±27.48%  ±36.48%
ffi/add-i16.js n=10000000                                  ***   2695.44 %      ±25.29%  ±34.08%  ±45.25%
ffi/add-i32.js n=10000000                                  ***   3064.68 %      ±21.53%  ±29.02%  ±38.52%
ffi/add-i64.js n=10000000                                  ***   3302.07 %      ±47.30%  ±63.75%  ±84.62%
ffi/add-i8.js n=10000000                                   ***   2615.40 %      ±41.19%  ±55.51%  ±73.70%
ffi/add-u16.js n=10000000                                  ***   2652.67 %      ±96.77% ±130.42% ±173.15%
ffi/add-u64.js n=10000000                                  ***   3520.18 %      ±25.17%  ±33.92%  ±45.03%
ffi/add-u8.js n=10000000                                   ***   2566.88 %      ±88.23% ±118.91% ±157.87%
ffi/buffer-first-byte-direct.js n=10000000                 ***    866.43 %       ±7.53%  ±10.13%  ±13.39%
ffi/buffer-first-byte.js n=10000000                        ***    795.50 %       ±4.54%   ±6.11%   ±8.10%
ffi/buffer-sum-direct.js n=10000000                        ***    789.80 %       ±7.46%  ±10.04%  ±13.30%
ffi/buffer-sum.js n=10000000                               ***    802.62 %       ±5.58%   ±7.52%   ±9.98%
ffi/getpid.js n=10000000                                   ***   1351.43 %      ±44.12%  ±59.45%  ±78.92%
ffi/identity-i32.js n=10000000                             ***   2420.69 %      ±16.71%  ±22.50%  ±29.84%
ffi/many-args.js n=10000000                                ***   4880.80 %      ±35.29%  ±47.56%  ±63.14%
ffi/noop-void.js n=10000000                                ***    790.98 %       ±9.68%  ±13.04%  ±17.31%
ffi/pointer-bigint.js n=10000000                           ***   1082.11 %       ±6.52%   ±8.78%  ±11.64%
ffi/pointer-buffer-direct.js n=10000000                    ***    882.63 %       ±2.52%   ±3.39%   ±4.49%
ffi/pointer-buffer.js n=10000000                           ***    858.21 %       ±6.12%   ±8.25%  ±10.93%
ffi/pointer-null.js n=10000000                             ***   1991.44 %       ±6.58%   ±8.87%  ±11.76%
ffi/string-equals-hello-buffer-direct.js n=10000000        ***    680.82 %       ±2.55%   ±3.44%   ±4.55%
ffi/string-equals-hello-buffer.js n=10000000               ***    699.65 %       ±3.37%   ±4.52%   ±5.94%
ffi/string-first-char-buffer-direct.js n=10000000          ***    861.77 %       ±4.74%   ±6.37%   ±8.44%
ffi/string-first-char-buffer.js n=10000000                 ***    790.11 %       ±3.91%   ±5.26%   ±6.98%
ffi/string-length-buffer-direct.js n=10000000              ***    808.66 %       ±2.40%   ±3.23%   ±4.27%
ffi/string-length-buffer.js n=10000000                     ***    871.50 %       ±4.70%   ±6.33%   ±8.38%
ffi/string-length-string-direct.js n=10000000              ***    809.19 %       ±1.14%   ±1.52%   ±1.98%
ffi/string-length-string.js n=10000000                     ***   1184.05 %      ±14.64%  ±19.73%  ±26.19%
ffi/sum-3-i32.js n=10000000                                ***   3732.15 %      ±24.49%  ±33.01%  ±43.82%
ffi/sum-5-i32.js n=10000000                                ***   4582.52 %      ±24.00%  ±32.34%  ±42.93%
ffi/sum-8-i32.js n=10000000                                        -0.06 %       ±0.37%   ±0.49%   ±0.64%

@ShogunPanda (Contributor, Author)

@bengl @addaleax Please re-evaluate this. I followed Anna's suggestion and applied direct argument removal.
So far I have only implemented it for arm64 (my machine's native arch), but we can easily extend it to other archs as well.
