Releases: raullenchai/Rapid-MLX

v0.6.4 — Day-0 DeepSeek V4 Flash

29 Apr 13:34


🚀 Day-0 DeepSeek-V4-Flash support

First MLX backend with day-0 DeepSeek-V4-Flash (158B-A13B, 1M context) support — released by DeepSeek 2026-04-24, runnable on Mac Studio Ultra today.

What's new

  • rapid-mlx serve deepseek-v4-flash — points to the 8-bit variant (155 GB on disk, ~136 GB peak RAM, fits 192 GB+ Macs). Also deepseek-v4-flash-2bit for 128 GB Macs and deepseek-v4-flash-4bit for distributed setups.
  • Vendored architecture from mlx-lm PR #1192 (Prince Canuma / Blaizzy) — registered transparently via sys.modules['mlx_lm.models.deepseek_v4']; see the sketches after this list. mlx-lm 0.32+ will eventually merge native support; we'll drop the vendor at that point.
  • chat_template.jinja fallback in the tokenizer loader — DeepSeek V4 ships its template as a separate file rather than embedding in tokenizer_config.json. UTF-8 BOM-safe.
  • Vendored-arch routing: detects model_type in our vendored set upfront and bypasses transformers' AutoConfig/PreTrainedConfig paths (which trip on the unknown architecture's RoPE standardization).
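
A minimal sketch of the vendoring mechanism, assuming a hypothetical `rapid_mlx.vendor.deepseek_v4` module path (the real location may differ):

```python
import importlib
import sys

def register_vendored_arch() -> None:
    # mlx-lm resolves a checkpoint's model_type by importing
    # mlx_lm.models.<model_type>. Pre-seeding sys.modules makes that
    # import land on our vendored copy until upstream ships its own.
    vendored = importlib.import_module("rapid_mlx.vendor.deepseek_v4")
    sys.modules["mlx_lm.models.deepseek_v4"] = vendored
```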
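
And a sketch of the template fallback, with a hypothetical `load_chat_template` helper; reading with `utf-8-sig` transparently strips a UTF-8 BOM if one is present:

```python
import json
from pathlib import Path

def load_chat_template(model_dir: str) -> str | None:
    root = Path(model_dir)
    # Preferred source: template embedded in tokenizer_config.json.
    config = root / "tokenizer_config.json"
    if config.exists():
        template = json.loads(config.read_text(encoding="utf-8-sig")).get("chat_template")
        if template:
            return template
    # Fallback: DeepSeek V4 ships the template as a standalone file.
    jinja = root / "chat_template.jinja"
    if jinja.exists():
        return jinja.read_text(encoding="utf-8-sig")
    return None
```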

Performance — Mac Studio M3 Ultra (256 GB)

| Variant | Decode | Prefill | TTFT (cold) | Peak RAM | Disk |
|---|---|---|---|---|---|
| 2-bit DQ | 56 tok/s | 443 tok/s | 0.78 s | 91 GB | 90 GB |
| 8-bit | 31 tok/s | 415 tok/s | 2.61 s | 136 GB | 145 GB |

Stress test: 7/8 scenarios pass on both quants (sustained throughput, concurrent load, long generation, rapid fire, mixed workload, disconnect resilience, memory stability).

Known limitation — tool calling

Tool calling currently scores 0/30 on our 30-scenario eval (evals/run_eval.py --suite tool_calling). Root cause is upstream: the chat template that mlx-community ships for DeepSeek V4 only handles system/user/assistant — no tool role rendering, no tools array iteration, no <tool_call> markers. Tools are silently dropped before reaching the model.

Plain chat works perfectly. For agentic use today, we recommend Qwen3.6-35B (100% tool-calling rate per SCORECARD). DeepSeek V4 tool-template work is tracked as a follow-up.
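
You can probe the root cause yourself by checking whether a repo's chat template renders tools at all. A sketch using transformers' `apply_chat_template` (`template_handles_tools` and `probe_fn` are hypothetical names; point `model_repo` at the repo you serve):

```python
from transformers import AutoTokenizer

def template_handles_tools(model_repo: str) -> bool:
    tok = AutoTokenizer.from_pretrained(model_repo)
    tools = [{"type": "function",
              "function": {"name": "probe_fn",
                           "parameters": {"type": "object", "properties": {}}}}]
    rendered = tok.apply_chat_template(
        [{"role": "user", "content": "hi"}],
        tools=tools, tokenize=False, add_generation_prompt=True,
    )
    # A template that ignores the tools array never emits the function
    # name, so tools are silently dropped before the model sees them.
    return "probe_fn" in rendered
```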

⚠️ Upgrade note for existing Homebrew users

`brew upgrade` from 0.6.3 → 0.6.4 may fail with a link conflict because the 0.6.3 install left a non-symlink file at `/opt/homebrew/bin/rapid-mlx`. If you hit `Error: rapid-mlx ... already exists`, run:

```bash
brew link --overwrite raullenchai/rapid-mlx/rapid-mlx
```

New brew installs (`brew install raullenchai/rapid-mlx/rapid-mlx`) are unaffected.

Other

  • Tracking PR: #168 (squashed)
  • Vendored arch will be removed once mlx-lm 0.32+ ships native deepseek_v4
  • 4 new unit tests covering vendoring + Metal kernel forward pass
  • Onboarding-tested: clean install via pip install rapid-mlx==0.6.4 and via Homebrew tap

Install:

```bash
pip install --upgrade rapid-mlx==0.6.4
# or
brew upgrade raullenchai/rapid-mlx/rapid-mlx
```

v0.6.3 — clean shutdown bundle

29 Apr 00:14


Highlights

Single-fix patch release. Shutdown noise on hybrid models is gone.

Recommended for users on 0.6.2 who saw `Stream(gpu, N)` warnings or trailing tracebacks at `Application shutdown complete`.

Fixes

  • #167 — Route prefix-cache persistence and `BatchGenerator` teardown through the mlx-step worker thread that owns the per-thread `generation_stream` (see the sketch after this list). Eliminates two flavors of `RuntimeError: There is no Stream(gpu, N) in current thread.` left over after the #161 hot-path fix:
    1. Prefix-cache `save_to_disk` on shutdown was running on the asyncio loop thread; ~5/119 entries dropped on Qwen3.6-27B 4-bit. Now routed through the worker; 117/117 entries persist.
    2. `BatchGenerator.__del__` (mlx-lm) called `mx.synchronize` from arbitrary GC threads after the worker had already exited. Now closed explicitly on the worker before executor shutdown.
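
A minimal sketch of the pattern (`cache` and `generator` stand in for the real objects): every stream-bound MLX call, including shutdown work, is funneled through the one thread that created the stream:

```python
from concurrent.futures import ThreadPoolExecutor

import mlx.core as mx

# A single worker thread owns the generation stream; nothing that
# touches the stream runs on the asyncio loop or a GC thread.
worker = ThreadPoolExecutor(max_workers=1, thread_name_prefix="mlx-step")
generation_stream = worker.submit(
    lambda: mx.new_stream(mx.default_device())
).result()

def shutdown(cache, generator):
    # Teardown must run on the owning thread *before* the executor
    # exits: both calls evaluate arrays tied to the worker's stream.
    worker.submit(cache.save_to_disk).result()
    worker.submit(generator.close).result()
    worker.shutdown(wait=True)
```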

Discovered during 0.6.2 onboarding QA — running the exact #166 user repro on a fresh PyPI install surfaced both bugs.

Verification (Qwen3.6-27B 4-bit)

| Metric | 0.6.2 | 0.6.3 |
|---|---|---|
| Generation tok/s | 37 | 37 (no change) |
| Cache entries persisted on SIGINT | 114/119 (96%) | 117/117 (100%) |
| `Stream(gpu, N)` warnings during persist | 5 | 0 |
| Trailing `RuntimeError` from `BatchGenerator.__del__` | 1 | 0 |

Upgrade

```bash
pip install -U rapid-mlx
```

Full changelog

v0.6.2...v0.6.3

v0.6.2 — mlx-lm 0.31+ correctness bundle

28 Apr 23:09


Highlights

Four correctness fixes for users on mlx-lm 0.31+. Hybrid models (Qwen3.5 / Qwen3.6) were silently missing the prefix cache on every request, and the BatchedEngine worker could hit `RuntimeError: There is no Stream(gpu, N) in current thread` on first generation. Both are fixed.

Recommended upgrade for everyone on 0.6.1.

Fixes

  • #161 — MLX generation stream thread ownership (fixes #160, #166). The dedicated mlx-step worker thread now creates and owns its own mx.new_stream(...) so mlx_lm.generate.generation_stream resolves correctly. Without this, every generation call returned 1 token at 0.0 tok/s and silently logged the stream error. Thanks @samuelfaj.
  • #163 / #165 — Restore the prompt-boundary cache snapshot for hybrid models on mlx-lm 0.31+. PR #105 dropped the snapshot when gating out the legacy `_install_chunked_prefill` monkey-patch, so hybrid models (Qwen3.5/3.6 DeltaNet + attention) missed the prefix cache on every request. Reimplemented using only the public mlx-lm API: `PromptProcessingBatch.Response.end_of_prompt` + `BatchGenerator.extract_cache(uids)`. Verified end-to-end: cold 91 ms → warm 24 ms (3.81× speedup) on Qwen3.5-9B. Thanks @luizribeiro for the report.
  • #164 — Cleanup of #161 (move import sys to module top, drop dead defensive rebind).
  • Thread-safe `MemoryAwarePrefixCache` — a pre-existing race between the asyncio fetch path and the worker-thread store/evict path could raise `KeyError` on `_entries[k]` when an `_evict_lru` ran between the lookup and the read. Added a `threading.Lock` and reordered eviction (`_remove_from_sorted` before `_entries.pop`); see the sketch below.
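
A simplified sketch of the locking fix, using an `OrderedDict` in place of the cache's real sorted index: one lock guards fetch, store, and evict, so a reader can never observe a half-evicted key:

```python
import threading
from collections import OrderedDict

class MemoryAwarePrefixCache:  # simplified
    def __init__(self) -> None:
        self._entries: OrderedDict = OrderedDict()
        self._lock = threading.Lock()  # guards every path below

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                self._entries.move_to_end(key)  # LRU touch
            return entry

    def put(self, key, value) -> None:
        with self._lock:
            self._entries[key] = value
            self._entries.move_to_end(key)

    def _evict_lru(self) -> None:
        with self._lock:
            if self._entries:
                # Drop the LRU entry atomically, under the same lock
                # the fetch path takes, so no KeyError race remains.
                self._entries.popitem(last=False)
```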

Upgrade

```bash
pip install -U rapid-mlx
```

Full changelog

v0.6.1...v0.6.2

v0.6.1 — Qwen3.6 Day 0 Support

22 Apr 18:13


Qwen3.6 Day 0 Support

Run Qwen3.6 on your Mac in one command:

```bash
pip install -U rapid-mlx
rapid-mlx serve qwen3.6-27b    # dense 27B, 14.9 GB, 36.5 tok/s
rapid-mlx serve qwen3.6-35b    # MoE 35B-A3B, 19 GB, 92 tok/s
```

New Aliases

| Alias | Model | RAM | Speed |
|---|---|---|---|
| qwen3.6-27b | mlx-community/Qwen3.6-27B-4bit | 14.9 GB | 36.5 tok/s |
| qwen3.6-27b-8bit | unsloth/Qwen3.6-27B-MLX-8bit | 32.3 GB | 18.9 tok/s |
| qwen3.6-35b-6bit | mlx-community/Qwen3.6-35B-A3B-6bit | ~28 GB | ~72 tok/s |

Highlights

  • Qwen3.6-35B: 12% faster than Qwen3.5-35B (92 vs 82 tok/s)
  • Qwen3.6-27B: Dense hybrid (64 layers, DeltaNet + Attention), 262K native context, vision
  • Auto-detected parser: qwen3_coder_xml — just rapid-mlx serve qwen3.6-27b, parsers auto-configured
  • Coding: 100% on eval suite
  • Stress test: 8/8 PASS

Also in this release

  • TurboQuant KV cache compression (#157)

v0.6.0 — Gemma 4 fix, logprobs, server decomposition

20 Apr 03:00


What's New in v0.6.0

Bug Fixes

  • Fix Gemma 4 degeneration — All Gemma 4 models now produce correct output (was infinite repetition). Root cause: custom model wrapper incompatible with mlx-lm 0.31. (#148)
  • Streaming JSON mode thinking leak — `response_format` with `stream=true` no longer leaks the thinking preamble (#46)
  • Per-request parser instances — Concurrent BatchedEngine requests no longer corrupt each other's reasoning/tool parser state (#P1)
  • Per-request sampler — Temperature/top_p now correctly applied per request (was using default argmax)

New Features

  • logprobs support — `logprobs: true` + `top_logprobs: N` now returns per-token log-probabilities in both streaming and non-streaming modes (#44)
  • rapid-mlx doctor — User-facing self-diagnostic (Metal GPU, imports, CLI, model load check). Works from pip install, no dev tools needed.
  • Dev test suite — `make lint/test/smoke/stress/soak` for developers. 2100+ unit tests.
  • Pipeline architecture — `DecodeStrategy` / `DecodePlugin` interfaces for future optimization plugins (TurboQuant, speculative decode)

Architecture

  • server.py decomposition — 4025 → 1047 lines (-74%) across 4 PRs
    • Routes extracted: routes/chat.py, routes/completions.py, routes/anthropic.py
    • Helpers extracted: service/helpers.py
    • Cache extracted: runtime/cache.py
    • PostProcessor: 100% test coverage (253 statements)
  • Architecture docs — `docs/architecture.md` with module map and pipeline design

Testing

  • Stress tested across 7 model architectures: Gemma 4, Qwen3.6, Qwen3.5, Phi-4, Devstral, Llama, Mistral
  • 10-minute agent soak tests: 287+ requests, 0 errors
  • 2100+ unit tests, 0 regressions

Full Changelog

v0.5.3...v0.6.0

v0.5.10 — Qwen 3.6 Day-0 (stable)

17 Apr 18:02


Qwen 3.6-35B-A3B — Day-0 Support (Stable)

This release wraps up comprehensive testing and fixes for Qwen 3.6 support. 6 rounds of persona onboarding tests, stress tests, and chaos tests — all passing.

```bash
pip install -U rapid-mlx
rapid-mlx serve qwen3.6-35b
```

Benchmark (M3 Ultra)

| Model | Decode | TTFT | Tool Calling | VRAM |
|---|---|---|---|---|
| Qwen3.6-35B-A3B (4-bit) | 95 tok/s | 219 ms | 100% | 20 GB |
| Qwen3.5-4B (4-bit) | 160 tok/s | 187 ms | 100% | 2.4 GB |

Changes since v0.5.2

Qwen 3.6 support:

  • Auto-config: qwen3_coder_xml parser (XML tool format, different from Qwen3.5)
  • Alias: rapid-mlx serve qwen3.6-35b
  • Added to all 11 agent profiles + doctor full tier

Stability & UX (8 releases of fixes):

  • Thinking token budget: small max_tokens no longer truncated by reasoning
  • reasoning_content field (OpenAI-compatible) with backward-compat reasoning alias
  • model field now optional (defaults to loaded model)
  • Agent setup auto-detects running model (rapid-mlx agents hermes)
  • Concurrent request serialization (prevents Metal crash in SimpleEngine)
  • __version__ syncs with PyPI automatically
  • Improved JSON mode compliance
  • Docs fully rebranded to rapid-mlx
  • rapid-mlx models shows size/speed/recommended Mac

Full Changelog: v0.5.2...v0.5.10

v0.5.9 — Performance regression fix

17 Apr 16:59


Fixes

Thinking budget now respects enable_thinking: false

Previously, the +2048 thinking token budget was always applied when a reasoning parser was active, even when enable_thinking: false was set. This caused:

  • Requests with enable_thinking: false to still generate 100+ tokens instead of the requested 30
  • Progressive slowdown: 1s → 21s over 50 requests (tokens accumulated in prompt cache)
  • 6.6x total time wasted (675s → 102s for 50 requests)

Before: max_tokens=30 + enable_thinking=false → internal budget 2078, avg 13.5s
After: max_tokens=30 + enable_thinking=false → internal budget 30, avg 2.0s
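
In sketch form (`effective_budget` is a hypothetical name), the headroom is now gated on the flag rather than on parser presence alone:

```python
THINKING_HEADROOM = 2048

def effective_budget(max_tokens: int, has_reasoning_parser: bool,
                     enable_thinking: bool) -> int:
    # Before: headroom applied whenever a reasoning parser was active.
    # After: only when the request actually enables thinking.
    if has_reasoning_parser and enable_thinking:
        return max_tokens + THINKING_HEADROOM
    return max_tokens

assert effective_budget(30, True, True) == 2078
assert effective_budget(30, True, False) == 30  # was 2078
```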

Truncated reasoning handled correctly

  • <think> without </think> (high temp, max_tokens truncation) → classified as reasoning_content, not leaked into content

Full Changelog: v0.5.8...v0.5.9

v0.5.8 — Stability & reasoning fixes

17 Apr 15:57


Fixes

Reasoning parser handles truncated thinking

  • When <think> appears without </think> (high temperature, max_tokens truncation), content is now correctly classified as reasoning instead of leaking into the response content field
  • Previously: content: "<think>thinking process..." (broken)
  • Now: reasoning_content: "thinking process...", content: null (correct)
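
A sketch of the classification rule (the real parser works incrementally over the stream):

```python
def split_thinking(text: str) -> tuple[str | None, str | None]:
    if not text.startswith("<think>"):
        return None, text
    body = text[len("<think>"):]
    if "</think>" in body:
        reasoning, content = body.split("</think>", 1)
        return reasoning.strip(), (content.strip() or None)
    # No closing tag: generation was truncated mid-thought, so the
    # whole body is reasoning and content stays null.
    return body.strip(), None

assert split_thinking("<think>partial...") == ("partial...", None)
```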

Concurrent request safety (SimpleEngine)

  • Added HTTP middleware to serialize /v1/chat requests when using SimpleEngine
  • Prevents Metal GPU crash (memory corruption of free block) when multiple clients (e.g., Hermes + Cursor) hit the server simultaneously
  • BatchedEngine is unaffected (handles concurrency via Scheduler)
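
A sketch of that middleware in FastAPI terms (path matching simplified):

```python
import asyncio

from fastapi import FastAPI, Request

app = FastAPI()
_gpu_lock = asyncio.Lock()  # at most one in-flight generation

@app.middleware("http")
async def serialize_simple_engine(request: Request, call_next):
    # Only chat traffic drives SimpleEngine's Metal work, so only it
    # is serialized; other routes pass through untouched.
    if request.url.path.startswith("/v1/chat"):
        async with _gpu_lock:
            return await call_next(request)
    return await call_next(request)
```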

Improved JSON mode compliance

  • response_format: json_object prompt now instructs model to follow requested structure (arrays, multiple items, exact keys)

Full Changelog: v0.5.7...v0.5.8

v0.5.7 — API cleanup + thinking budget

17 Apr 14:26


Fixes

Unified reasoning_content field (breaking API change)

  • Renamed `reasoning` → `reasoning_content` in all API responses (matches OpenAI's o1/o3 format)
  • Applies to both non-streaming (message.reasoning_content) and streaming (delta.reasoning_content)
  • Removed duplicate reasoning alias — responses are now ~50% smaller for thinking models

Larger thinking token budget

  • Budget increased from +1024 to +2048 tokens
  • Threshold raised from 2048 to 4096 — covers more requests
  • Complex coding prompts (fibonacci, debugging) no longer truncate

Version sync

  • `__version__` now reads from package metadata instead of a hardcoded string
  • `python -c "import vllm_mlx; print(vllm_mlx.__version__)"` now matches PyPI
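
A sketch of the standard-library pattern this relies on (the dev fallback value is an assumption):

```python
from importlib.metadata import PackageNotFoundError, version

try:
    # Single source of truth: the version pip actually installed.
    __version__ = version("rapid-mlx")
except PackageNotFoundError:
    # Source checkout without installed package metadata.
    __version__ = "0.0.0.dev0"
```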

Full Changelog: v0.5.6...v0.5.7

v0.5.6 — Codex review fixes

17 Apr 14:02


Fixes

  • Response model field: No longer returns literal "default" — resolves to actual model name (e.g., mlx-community/Qwen3.6-35B-A3B-4bit)
  • max_tokens=0: No longer silently treated as None (was becoming 4096); see the sketch below
  • Thinking budget: Now applies based on the resolved value, covering the edge case where the CLI --max-tokens is small and the user omits the field
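
The max_tokens=0 bug class, sketched — `or` conflates 0 with None, while an explicit `is None` check preserves the caller's zero (`resolve_max_tokens` is a hypothetical name):

```python
DEFAULT_MAX_TOKENS = 4096

def resolve_max_tokens(requested: int | None) -> int:
    # Buggy form: `requested or DEFAULT_MAX_TOKENS` maps an explicit
    # max_tokens=0 to 4096, because 0 is falsy.
    return DEFAULT_MAX_TOKENS if requested is None else requested

assert resolve_max_tokens(None) == 4096
assert resolve_max_tokens(0) == 0  # previously became 4096
```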

Full Changelog: v0.5.5...v0.5.6