Releases: raullenchai/Rapid-MLX
v0.6.4 — Day-0 DeepSeek V4 Flash
🚀 Day-0 DeepSeek-V4-Flash support
First MLX backend with day-0 support for DeepSeek-V4-Flash (158B-A13B, 1M context) — released by DeepSeek on 2026-04-24 and runnable on a Mac Studio Ultra today.
What's new
- `rapid-mlx serve deepseek-v4-flash` — points to the 8-bit variant (155 GB on disk, ~136 GB peak RAM, fits 192 GB+ Macs). Also `deepseek-v4-flash-2bit` for 128 GB Macs and `deepseek-v4-flash-4bit` for distributed setups.
- Vendored architecture from mlx-lm PR #1192 (Prince Canuma / Blaizzy) — registered transparently via `sys.modules['mlx_lm.models.deepseek_v4']`. mlx-lm 0.32+ will eventually merge native support; we'll drop the vendor at that point.
- `chat_template.jinja` fallback in the tokenizer loader — DeepSeek V4 ships its template as a separate file rather than embedding it in `tokenizer_config.json`. UTF-8 BOM-safe (see the sketch below).
- Vendored-arch routing: detects `model_type` in our vendored set upfront and bypasses transformers' `AutoConfig`/`PreTrainedConfig` paths (which trip on the unknown architecture's RoPE standardization).
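For illustration, a minimal sketch of what such a fallback loader can look like — `load_chat_template` is a hypothetical helper, not rapid-mlx's actual function; only the `chat_template.jinja` filename and the BOM concern come from the notes above:

```python
# Hypothetical helper illustrating the chat_template.jinja fallback.
# "utf-8-sig" decodes UTF-8 and strips a leading BOM if one is present.
from pathlib import Path

def load_chat_template(model_dir: str, tokenizer_config: dict) -> str | None:
    # Prefer a template embedded in tokenizer_config.json when present.
    template = tokenizer_config.get("chat_template")
    if template:
        return template
    # DeepSeek V4 ships the template as a standalone file instead.
    jinja_path = Path(model_dir) / "chat_template.jinja"
    if jinja_path.is_file():
        return jinja_path.read_text(encoding="utf-8-sig")
    return None
```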
Performance — Mac Studio M3 Ultra (256 GB)
| Variant | Decode | Prefill | TTFT cold | Peak RAM | Disk |
|---|---|---|---|---|---|
| 2-bit DQ | 56 tok/s | 443 tok/s | 0.78 s | 91 GB | 90 GB |
| 8-bit | 31 tok/s | 415 tok/s | 2.61 s | 136 GB | 145 GB |
Stress test: 7/8 scenarios pass on both quants (sustained throughput, concurrent load, long generation, rapid fire, mixed workload, disconnect resilience, memory stability).
Known limitation — tool calling
Tool calling currently scores 0/30 on our 30-scenario eval (evals/run_eval.py --suite tool_calling). Root cause is upstream: the chat template that mlx-community ships for DeepSeek V4 only handles system/user/assistant — no tool role rendering, no tools array iteration, no <tool_call> markers. Tools are silently dropped before reaching the model.
Plain chat is unaffected. For agentic use today, we recommend Qwen3.6-35B (100% tool-calling rate per SCORECARD). DeepSeek V4 tool-template work is tracked as a follow-up.
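As a stopgap, a server can at least detect a tools-unaware template instead of dropping tools silently. A minimal sketch — the heuristic below is ours, not an mlx-lm or rapid-mlx API; it only assumes the `tools` array iteration and `<tool_call>` markers mentioned above:

```python
# Heuristic check (illustrative): a tools-aware Jinja chat template should
# reference the tools variable and emit <tool_call> markers.
def template_supports_tools(template_source: str) -> bool:
    return "tools" in template_source and "<tool_call>" in template_source

# A server could then warn instead of silently dropping tools:
# if request.tools and not template_supports_tools(template_source):
#     logger.warning("chat template has no tool rendering; tools ignored")
```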
⚠️ Upgrade note for existing Homebrew users
`brew upgrade` from 0.6.3 → 0.6.4 may fail with a link conflict because the 0.6.3 install left a non-symlink file at `/opt/homebrew/bin/rapid-mlx`. If you hit `Error: rapid-mlx ... already exists`, run:
```bash
brew link --overwrite raullenchai/rapid-mlx/rapid-mlx
```
New brew installs (`brew install raullenchai/rapid-mlx/rapid-mlx`) are unaffected.
Other
- Tracking PR: #168 (squashed)
- Vendored arch will be removed once mlx-lm 0.32+ ships native `deepseek_v4`
- 4 new unit tests covering vendoring + the Metal kernel forward pass
- Onboarding-tested: clean install via `pip install rapid-mlx==0.6.4` and via the Homebrew tap
Install:
```bash
pip install --upgrade rapid-mlx==0.6.4
# or
brew upgrade raullenchai/rapid-mlx/rapid-mlx
```
v0.6.3 — clean shutdown bundle
Highlights
Single-fix patch release. Shutdown noise on hybrid models is gone.
Recommended for users on 0.6.2 who saw `Stream(gpu, N)` warnings or trailing tracebacks at `Application shutdown complete`.
Fixes
- #167 — Route prefix-cache persistence and `BatchGenerator` teardown through the mlx-step worker thread that owns the per-thread `generation_stream` (ownership pattern sketched below). Eliminates two flavors of `RuntimeError: There is no Stream(gpu, N) in current thread.` left over after the #161 hot-path fix:
  - Prefix-cache `save_to_disk` on shutdown was running on the asyncio loop thread; ~5/119 entries were dropped on Qwen3.6-27B 4-bit. Now routed through the worker; 117/117 entries persist.
  - `BatchGenerator.__del__` (mlx-lm) called `mx.synchronize` from arbitrary GC threads after the worker had already exited. Now closed explicitly on the worker before executor shutdown.
Discovered during 0.6.2 onboarding QA — running the exact #166 user repro on a fresh PyPI install surfaced both bugs.
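For readers unfamiliar with the pattern, here is a minimal sketch of single-thread stream ownership. The class is illustrative, not rapid-mlx's actual worker; only `mx.new_stream`, `mx.stream`, and `mx.synchronize` are real MLX calls:

```python
import queue
import threading

import mlx.core as mx

class MLXWorker:
    """Illustrative worker that owns the GPU stream it creates."""

    def __init__(self) -> None:
        self._jobs: queue.Queue = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self) -> None:
        stream = mx.new_stream(mx.gpu)  # created on, and owned by, this thread
        while True:
            job = self._jobs.get()
            if job is None:             # shutdown sentinel
                mx.synchronize(stream)  # drain pending GPU work first
                return
            with mx.stream(stream):     # all MLX ops run on the owned stream
                job()

    def submit(self, fn) -> None:
        self._jobs.put(fn)

    def close(self) -> None:
        # Teardown work (e.g. prefix-cache persistence, generator close) must
        # be submitted *before* the sentinel so it runs on this thread too.
        self._jobs.put(None)
        self._thread.join()
```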
Verification (Qwen3.6-27B 4-bit)
| Metric | 0.6.2 | 0.6.3 |
|---|---|---|
| Generation tok/s | 37 | 37 (no change) |
| Cache entries persisted on SIGINT | 114/119 (96%) | 117/117 (100%) |
| `Stream(gpu, N)` warnings during persist | 5 | 0 |
| Trailing `RuntimeError` from `BatchGenerator.__del__` | 1 | 0 |
Upgrade
```bash
pip install -U rapid-mlx
```
Full changelog
v0.6.2 — mlx-lm 0.31+ correctness bundle
Highlights
Four correctness fixes for users on mlx-lm 0.31+. Hybrid models (Qwen3.5 / Qwen3.6) were silently missing the prefix cache on every request, and the BatchedEngine worker could hit `RuntimeError: There is no Stream(gpu, N) in current thread` on first generation. Both are fixed.
Recommended upgrade for everyone on 0.6.1.
Fixes
- #161 — MLX generation stream thread ownership (fixes #160, #166). The dedicated mlx-step worker thread now creates and owns its own `mx.new_stream(...)` so `mlx_lm.generate.generation_stream` resolves correctly. Without this, every generation call returned 1 token at 0.0 tok/s and silently logged the stream error. Thanks @samuelfaj.
- #163 / #165 — Restore the prompt-boundary cache snapshot for hybrid models on mlx-lm 0.31+. PR #105 dropped the snapshot when gating out the legacy `_install_chunked_prefill` monkey-patch; hybrid models (Qwen3.5/3.6, DeltaNet + attention) consequently missed the prefix cache on every request. Reimplemented using only the public mlx-lm API: `PromptProcessingBatch.Response.end_of_prompt` + `BatchGenerator.extract_cache(uids)`. Verified end-to-end: cold 91 ms → warm 24 ms (3.81× speedup) on Qwen3.5-9B. Thanks @luizribeiro for the report.
- #164 — Cleanup of #161 (move `import sys` to module top, drop a dead defensive rebind).
- Thread-safe `MemoryAwarePrefixCache` — a pre-existing race between the asyncio `fetch` path and the worker-thread `store`/`evict` path could `KeyError` on `_entries[k]` when an `_evict_lru` reordered before the read. Added a `threading.Lock` and reordered eviction (`_remove_from_sorted` before `_entries.pop`); see the sketch below.
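A minimal sketch of the locking and eviction-ordering fix; the names mirror the notes above (`_entries`, `_evict_lru`) but the class body is illustrative, not the real `MemoryAwarePrefixCache`:

```python
import threading

class PrefixCacheSketch:
    """Illustrative subset of a thread-safe LRU prefix cache."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries: dict[str, object] = {}
        self._sorted_keys: list[str] = []  # LRU order, oldest first

    def fetch(self, key: str):
        # The asyncio-loop reader takes the same lock as the worker-thread
        # store/evict path, so it never observes a half-evicted entry.
        with self._lock:
            return self._entries.get(key)

    def _evict_lru(self) -> None:
        with self._lock:
            key = self._sorted_keys.pop(0)  # remove from the sorted index first...
            self._entries.pop(key, None)    # ...then drop the entry itself
```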
Upgrade
```bash
pip install -U rapid-mlx
```
Full changelog
v0.6.1 — Qwen3.6 Day 0 Support
Qwen3.6 Day 0 Support
Run Qwen3.6 on your Mac in one command:
```bash
pip install -U rapid-mlx
rapid-mlx serve qwen3.6-27b   # dense 27B, 14.9 GB, 36.5 tok/s
rapid-mlx serve qwen3.6-35b   # MoE 35B-A3B, 19 GB, 92 tok/s
```
New Aliases
| Alias | Model | RAM | Speed |
|---|---|---|---|
| `qwen3.6-27b` | mlx-community/Qwen3.6-27B-4bit | 14.9 GB | 36.5 tok/s |
| `qwen3.6-27b-8bit` | unsloth/Qwen3.6-27B-MLX-8bit | 32.3 GB | 18.9 tok/s |
| `qwen3.6-35b-6bit` | mlx-community/Qwen3.6-35B-A3B-6bit | ~28 GB | ~72 tok/s |
Highlights
- Qwen3.6-35B: 12% faster than Qwen3.5-35B (92 vs 82 tok/s)
- Qwen3.6-27B: dense hybrid (64 layers, DeltaNet + Attention), 262K native context, vision
- Auto-detected parser: `qwen3_coder_xml` — just `rapid-mlx serve qwen3.6-27b`; parsers are auto-configured
- Coding: 100% on the eval suite
- Stress test: 8/8 PASS
Also in this release
- TurboQuant KV cache compression (#157)
v0.6.0 — Gemma 4 fix, logprobs, server decomposition
What's New in v0.6.0
Bug Fixes
- Fix Gemma 4 degeneration — all Gemma 4 models now produce correct output (was infinite repetition). Root cause: a custom model wrapper incompatible with mlx-lm 0.31. (#148)
- Streaming JSON mode thinking leak — `response_format` with `stream=true` no longer leaks the thinking preamble (#46)
- Per-request parser instances — concurrent BatchedEngine requests no longer corrupt each other's reasoning/tool parser state (#P1)
- Per-request sampler — temperature/top_p are now correctly applied per request (was using the default argmax sampler); see the sketch below
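A sketch of the per-request sampler fix, using mlx-lm's public `make_sampler`; the `request` shape is an assumption, not rapid-mlx's actual type:

```python
from mlx_lm.sample_utils import make_sampler

def sampler_for(request):
    # Build a fresh sampler from this request's parameters instead of sharing
    # one module-level default (temp=0.0 in make_sampler means greedy/argmax).
    return make_sampler(
        temp=request.temperature if request.temperature is not None else 1.0,
        top_p=request.top_p if request.top_p is not None else 1.0,
    )
```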
New Features
- logprobs support — `logprobs: true` + `top_logprobs: N` now returns per-token log-probabilities in both streaming and non-streaming modes (#44); usage example below
- `rapid-mlx doctor` — user-facing self-diagnostic (Metal GPU, imports, CLI, model load check). Works from a pip install; no dev tools needed.
- Dev test suite — `make lint/test/smoke/stress/soak` for developers. 2100+ unit tests.
- Pipeline architecture — `DecodeStrategy`/`DecodePlugin` interfaces for future optimization plugins (TurboQuant, speculative decode)
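A usage sketch for logprobs against a local server with the OpenAI Python client; the base URL/port are assumptions for a default local setup:

```python
from openai import OpenAI

# Any API key works for a local server; base_url/port are assumptions.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="sk-local")

resp = client.chat.completions.create(
    model="default",  # rapid-mlx resolves this to the loaded model
    messages=[{"role": "user", "content": "Say hi"}],
    logprobs=True,
    top_logprobs=3,  # per-token alternatives, OpenAI-style
)
# Top alternatives for the first generated token:
print(resp.choices[0].logprobs.content[0].top_logprobs)
```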
Architecture
- server.py decomposition — 4025 → 1047 lines (-74%) across 4 PRs
  - Routes extracted: `routes/chat.py`, `routes/completions.py`, `routes/anthropic.py`
  - Helpers extracted: `service/helpers.py`
  - Cache extracted: `runtime/cache.py`
  - PostProcessor: 100% test coverage (253 statements)
- Architecture docs — `docs/architecture.md` with module map and pipeline design
Testing
- Stress tested across 7 model architectures: Gemma 4, Qwen3.6, Qwen3.5, Phi-4, Devstral, Llama, Mistral
- 10-minute agent soak tests: 287+ requests, 0 errors
- 2100+ unit tests, 0 regressions
Full Changelog
v0.5.10 — Qwen 3.6 Day-0 (stable)
Qwen 3.6-35B-A3B — Day-0 Support (Stable)
This release wraps up comprehensive testing and fixes for Qwen 3.6 support. 6 rounds of persona onboarding tests, stress tests, and chaos tests — all passing.
```bash
pip install -U rapid-mlx
rapid-mlx serve qwen3.6-35b
```
Benchmark (M3 Ultra)
| Model | Decode | TTFT | Tool Calling | VRAM |
|---|---|---|---|---|
| Qwen3.6-35B-A3B (4-bit) | 95 tok/s | 219ms | 100% | 20 GB |
| Qwen3.5-4B (4-bit) | 160 tok/s | 187ms | 100% | 2.4 GB |
Changes since v0.5.2
Qwen 3.6 support:
- Auto-config: `qwen3_coder_xml` parser (XML tool format, different from Qwen3.5)
- Alias: `rapid-mlx serve qwen3.6-35b`
- Added to all 11 agent profiles + the doctor full tier

Stability & UX (8 releases of fixes):
- Thinking token budget: small `max_tokens` is no longer truncated by reasoning
- `reasoning_content` field (OpenAI-compatible) with a backward-compat `reasoning` alias
- `model` field is now optional (defaults to the loaded model)
- Agent setup auto-detects the running model (`rapid-mlx agents hermes`)
- Concurrent request serialization (prevents a Metal crash in SimpleEngine)
- `__version__` syncs with PyPI automatically
- Improved JSON mode compliance
- Docs fully rebranded to `rapid-mlx`
- `rapid-mlx models` shows size/speed/recommended Mac
Full Changelog: v0.5.2...v0.5.10
v0.5.9 — Performance regression fix
Fixes
Thinking budget now respects `enable_thinking: false`
Previously, the +2048 thinking token budget was always applied whenever a reasoning parser was active, even when `enable_thinking: false` was set. This caused:
- Requests with `enable_thinking: false` to still generate 100+ tokens instead of the requested 30
- Progressive slowdown: 1 s → 21 s over 50 requests (tokens accumulated in the prompt cache)
- 6.6× total time wasted (675 s → 102 s for 50 requests)
Before: `max_tokens=30` + `enable_thinking=false` → internal budget 2078, avg 13.5 s
After: `max_tokens=30` + `enable_thinking=false` → internal budget 30, avg 2.0 s
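The implied budget rule, as an illustrative sketch (function and parameter names are hypothetical):

```python
def effective_max_tokens(max_tokens: int,
                         has_reasoning_parser: bool,
                         enable_thinking: bool,
                         thinking_budget: int = 2048) -> int:
    # Only pad the budget when thinking can actually be emitted; with
    # enable_thinking=false the request's max_tokens is honored as-is.
    if has_reasoning_parser and enable_thinking:
        return max_tokens + thinking_budget  # e.g. 30 -> 2078, as above
    return max_tokens                        # e.g. 30 -> 30
```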
Truncated reasoning handled correctly
`<think>` without `</think>` (high temp, max_tokens truncation) → classified as `reasoning_content`, not leaked into `content`
Full Changelog: v0.5.8...v0.5.9
v0.5.8 — Stability & reasoning fixes
Fixes
Reasoning parser handles truncated thinking
- When `<think>` appears without `</think>` (high temperature, max_tokens truncation), content is now correctly classified as reasoning instead of leaking into the response `content` field
- Previously: `content: "<think>thinking process..."` (broken)
- Now: `reasoning_content: "thinking process..."`, `content: null` (correct)
Concurrent request safety (SimpleEngine)
- Added HTTP middleware to serialize `/v1/chat` requests when using SimpleEngine (sketched below)
- Prevents a Metal GPU crash (`memory corruption of free block`) when multiple clients (e.g., Hermes + Cursor) hit the server simultaneously
- BatchedEngine is unaffected (it handles concurrency via the Scheduler)
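A minimal sketch of such middleware in Starlette/FastAPI terms — illustrative, not rapid-mlx's actual implementation:

```python
import asyncio
from starlette.middleware.base import BaseHTTPMiddleware

class SerializeChatRequests(BaseHTTPMiddleware):
    """Serialize /v1/chat requests so SimpleEngine never runs two at once."""

    def __init__(self, app):
        super().__init__(app)
        self._lock = asyncio.Lock()

    async def dispatch(self, request, call_next):
        if request.url.path.startswith("/v1/chat"):
            async with self._lock:  # one in-flight generation at a time
                return await call_next(request)
        return await call_next(request)

# Registration on a FastAPI/Starlette app:
# app.add_middleware(SerializeChatRequests)
```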
Improved JSON mode compliance
`response_format: json_object` prompt now instructs the model to follow the requested structure (arrays, multiple items, exact keys)
Full Changelog: v0.5.7...v0.5.8
v0.5.7 — API cleanup + thinking budget
Fixes
Unified reasoning_content field (breaking API change)
- Renamed `reasoning` → `reasoning_content` in all API responses (matches OpenAI's o1/o3 format)
- Applies to both non-streaming (`message.reasoning_content`) and streaming (`delta.reasoning_content`)
- Removed the duplicate `reasoning` alias — responses are now ~50% smaller for thinking models
Larger thinking token budget
- Budget increased from +1024 to +2048 tokens
- Threshold raised from 2048 to 4096 — covers more requests
- Complex coding prompts (fibonacci, debugging) no longer truncate
Version sync
- `__version__` now reads from package metadata instead of a hardcoded string
- `python -c "import vllm_mlx; print(vllm_mlx.__version__)"` now matches PyPI
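A common sketch of this pattern, assuming the distribution name `rapid-mlx`:

```python
from importlib.metadata import PackageNotFoundError, version

try:
    # Read the installed distribution's version so __version__ always
    # matches what pip/PyPI report.
    __version__ = version("rapid-mlx")
except PackageNotFoundError:
    # Source checkout without an installed distribution.
    __version__ = "0.0.0+unknown"
```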
Full Changelog: v0.5.6...v0.5.7
v0.5.6 — Codex review fixes
Fixes
- Response `model` field: no longer returns the literal `"default"` — resolves to the actual model name (e.g., `mlx-community/Qwen3.6-35B-A3B-4bit`)
- `max_tokens=0`: no longer silently treated as `None` (was becoming 4096)
- Thinking budget: now applies based on the resolved value, covering the edge case where the CLI `--max-tokens` is small and the user omits the field
Full Changelog: v0.5.5...v0.5.6