Releases: raullenchai/Rapid-MLX

v0.6.4 — Day-0 DeepSeek V4 Flash

29 Apr 13:34


🚀 Day-0 DeepSeek-V4-Flash support

First MLX backend with day-0 DeepSeek-V4-Flash (158B-A13B, 1M context) support — released by DeepSeek 2026-04-24, runnable on Mac Studio Ultra today.

What's new

  • rapid-mlx serve deepseek-v4-flash — points to the 8-bit variant (155 GB on disk, ~136 GB peak RAM, fits 192 GB+ Macs). Also deepseek-v4-flash-2bit for 128 GB Macs and deepseek-v4-flash-4bit for distributed setups.
  • Vendored architecture from mlx-lm PR #1192 (Prince Canuma / Blaizzy) — registered transparently via sys.modules['mlx_lm.models.deepseek_v4']; see the sketches after this list. mlx-lm 0.32+ will eventually merge native support; we'll drop the vendor at that point.
  • chat_template.jinja fallback in the tokenizer loader — DeepSeek V4 ships its template as a separate file rather than embedding in tokenizer_config.json. UTF-8 BOM-safe.
  • Vendored-arch routing: detects model_type in our vendored set upfront and bypasses transformers' AutoConfig/PreTrainedConfig paths (which trip on the unknown architecture's RoPE standardization).
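
A minimal sketch of the vendoring mechanism, assuming a hypothetical `rapid_mlx.vendor.deepseek_v4` module path (the real location may differ):

```python
import importlib
import sys

def register_vendored_arch() -> None:
    # mlx-lm resolves a checkpoint's model_type by importing
    # mlx_lm.models.<model_type>. Pre-seeding sys.modules makes that
    # import land on our vendored copy until upstream ships its own.
    vendored = importlib.import_module("rapid_mlx.vendor.deepseek_v4")
    sys.modules["mlx_lm.models.deepseek_v4"] = vendored
```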
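
And a sketch of the template fallback, with a hypothetical `load_chat_template` helper; reading with `utf-8-sig` transparently strips a UTF-8 BOM if one is present:

```python
import json
from pathlib import Path

def load_chat_template(model_dir: str) -> str | None:
    root = Path(model_dir)
    # Preferred source: template embedded in tokenizer_config.json.
    config = root / "tokenizer_config.json"
    if config.exists():
        template = json.loads(config.read_text(encoding="utf-8-sig")).get("chat_template")
        if template:
            return template
    # Fallback: DeepSeek V4 ships the template as a standalone file.
    jinja = root / "chat_template.jinja"
    if jinja.exists():
        return jinja.read_text(encoding="utf-8-sig")
    return None
```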

Performance — Mac Studio M3 Ultra (256 GB)

| Variant | Decode | Prefill | TTFT (cold) | Peak RAM | Disk |
|---|---|---|---|---|---|
| 2-bit DQ | 56 tok/s | 443 tok/s | 0.78 s | 91 GB | 90 GB |
| 8-bit | 31 tok/s | 415 tok/s | 2.61 s | 136 GB | 145 GB |

Stress test: 7/8 scenarios pass on both quants (sustained throughput, concurrent load, long generation, rapid fire, mixed workload, disconnect resilience, memory stability).

Known limitation — tool calling

Tool calling currently scores 0/30 on our 30-scenario eval (evals/run_eval.py --suite tool_calling). Root cause is upstream: the chat template that mlx-community ships for DeepSeek V4 only handles system/user/assistant — no tool role rendering, no tools array iteration, no <tool_call> markers. Tools are silently dropped before reaching the model.

Plain chat works perfectly. For agentic use today, we recommend Qwen3.6-35B (100% tool-calling rate per SCORECARD). DeepSeek V4 tool-template work is tracked as a follow-up.
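
You can probe the root cause yourself by checking whether a repo's chat template renders tools at all. A sketch using transformers' `apply_chat_template` (`template_handles_tools` and `probe_fn` are hypothetical names; point `model_repo` at the repo you serve):

```python
from transformers import AutoTokenizer

def template_handles_tools(model_repo: str) -> bool:
    tok = AutoTokenizer.from_pretrained(model_repo)
    tools = [{"type": "function",
              "function": {"name": "probe_fn",
                           "parameters": {"type": "object", "properties": {}}}}]
    rendered = tok.apply_chat_template(
        [{"role": "user", "content": "hi"}],
        tools=tools, tokenize=False, add_generation_prompt=True,
    )
    # A template that ignores the tools array never emits the function
    # name, so tools are silently dropped before the model sees them.
    return "probe_fn" in rendered
```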

⚠️ Upgrade note for existing Homebrew users

`brew upgrade` from 0.6.3 → 0.6.4 may fail with a link conflict because the 0.6.3 install left a non-symlink file at `/opt/homebrew/bin/rapid-mlx`. If you hit `Error: rapid-mlx ... already exists`, run:

```bash
brew link --overwrite raullenchai/rapid-mlx/rapid-mlx
```

New brew installs (`brew install raullenchai/rapid-mlx/rapid-mlx`) are unaffected.

Other

  • Tracking PR: #168 (squashed)
  • Vendored arch will be removed once mlx-lm 0.32+ ships native deepseek_v4
  • 4 new unit tests covering vendoring + Metal kernel forward pass
  • Onboarding-tested: clean install via pip install rapid-mlx==0.6.4 and via Homebrew tap

Install:

```bash
pip install --upgrade rapid-mlx==0.6.4
# or
brew upgrade raullenchai/rapid-mlx/rapid-mlx
```

v0.6.3 — clean shutdown bundle

29 Apr 00:14


Highlights

Single-fix patch release. Shutdown noise on hybrid models is gone.

Recommended for users on 0.6.2 who saw `Stream(gpu, N)` warnings or trailing tracebacks at `Application shutdown complete`.

Fixes

  • #167 — Route prefix-cache persistence and `BatchGenerator` teardown through the mlx-step worker thread that owns the per-thread `generation_stream` (see the sketch after this list). Eliminates two flavors of `RuntimeError: There is no Stream(gpu, N) in current thread.` left over after the #161 hot-path fix:
    1. Prefix-cache `save_to_disk` on shutdown was running on the asyncio loop thread; ~5/119 entries dropped on Qwen3.6-27B 4-bit. Now routed through the worker; 117/117 entries persist.
    2. `BatchGenerator.__del__` (mlx-lm) called `mx.synchronize` from arbitrary GC threads after the worker had already exited. Now closed explicitly on the worker before executor shutdown.
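
A minimal sketch of the pattern (`cache` and `generator` stand in for the real objects): every stream-bound MLX call, including shutdown work, is funneled through the one thread that created the stream:

```python
from concurrent.futures import ThreadPoolExecutor

import mlx.core as mx

# A single worker thread owns the generation stream; nothing that
# touches the stream runs on the asyncio loop or a GC thread.
worker = ThreadPoolExecutor(max_workers=1, thread_name_prefix="mlx-step")
generation_stream = worker.submit(
    lambda: mx.new_stream(mx.default_device())
).result()

def shutdown(cache, generator):
    # Teardown must run on the owning thread *before* the executor
    # exits: both calls evaluate arrays tied to the worker's stream.
    worker.submit(cache.save_to_disk).result()
    worker.submit(generator.close).result()
    worker.shutdown(wait=True)
```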

Discovered during 0.6.2 onboarding QA — running the exact #166 user repro on a fresh PyPI install surfaced both bugs.

Verification (Qwen3.6-27B 4-bit)

| Metric | 0.6.2 | 0.6.3 |
|---|---|---|
| Generation tok/s | 37 | 37 (no change) |
| Cache entries persisted on SIGINT | 114/119 (96%) | 117/117 (100%) |
| `Stream(gpu, N)` warnings during persist | 5 | 0 |
| Trailing `RuntimeError` from `BatchGenerator.__del__` | 1 | 0 |

Upgrade

```bash
pip install -U rapid-mlx
```

Full changelog

v0.6.2...v0.6.3

v0.6.2 — mlx-lm 0.31+ correctness bundle

28 Apr 23:09


Highlights

Four correctness fixes for users on mlx-lm 0.31+. Hybrid models (Qwen3.5 / Qwen3.6) were silently missing the prefix cache on every request, and the BatchedEngine worker could hit `RuntimeError: There is no Stream(gpu, N) in current thread` on first generation. Both are fixed.

Recommended upgrade for everyone on 0.6.1.

Fixes

  • #161 — MLX generation stream thread ownership (fixes #160, #166). The dedicated mlx-step worker thread now creates and owns its own mx.new_stream(...) so mlx_lm.generate.generation_stream resolves correctly. Without this, every generation call returned 1 token at 0.0 tok/s and silently logged the stream error. Thanks @samuelfaj.
  • #163 / #165 — Restore the prompt-boundary cache snapshot for hybrid models on mlx-lm 0.31+. PR #105 dropped the snapshot when gating out the legacy `_install_chunked_prefill` monkey-patch, so hybrid models (Qwen3.5/3.6 DeltaNet + attention) missed the prefix cache on every request. Reimplemented using only the public mlx-lm API: `PromptProcessingBatch.Response.end_of_prompt` + `BatchGenerator.extract_cache(uids)`. Verified end-to-end: cold 91 ms → warm 24 ms (3.81× speedup) on Qwen3.5-9B. Thanks @luizribeiro for the report.
  • #164 — Cleanup of #161 (move import sys to module top, drop dead defensive rebind).
  • Thread-safe `MemoryAwarePrefixCache` — a pre-existing race between the asyncio fetch path and the worker-thread store/evict path could raise `KeyError` on `_entries[k]` when an `_evict_lru` ran between the lookup and the read. Added a `threading.Lock` and reordered eviction (`_remove_from_sorted` before `_entries.pop`); see the sketch below.
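
A simplified sketch of the locking fix, using an `OrderedDict` in place of the cache's real sorted index: one lock guards fetch, store, and evict, so a reader can never observe a half-evicted key:

```python
import threading
from collections import OrderedDict

class MemoryAwarePrefixCache:  # simplified
    def __init__(self) -> None:
        self._entries: OrderedDict = OrderedDict()
        self._lock = threading.Lock()  # guards every path below

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                self._entries.move_to_end(key)  # LRU touch
            return entry

    def put(self, key, value) -> None:
        with self._lock:
            self._entries[key] = value
            self._entries.move_to_end(key)

    def _evict_lru(self) -> None:
        with self._lock:
            if self._entries:
                # Drop the LRU entry atomically, under the same lock
                # the fetch path takes, so no KeyError race remains.
                self._entries.popitem(last=False)
```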

Upgrade

```bash
pip install -U rapid-mlx
```

Full changelog

v0.6.1...v0.6.2

v0.6.1 — Qwen3.6 Day 0 Support

22 Apr 18:13


Qwen3.6 Day 0 Support

Run Qwen3.6 on your Mac in one command:

```bash
pip install -U rapid-mlx
rapid-mlx serve qwen3.6-27b    # dense 27B, 14.9 GB, 36.5 tok/s
rapid-mlx serve qwen3.6-35b    # MoE 35B-A3B, 19 GB, 92 tok/s
```

New Aliases

| Alias | Model | RAM | Speed |
|---|---|---|---|
| qwen3.6-27b | mlx-community/Qwen3.6-27B-4bit | 14.9 GB | 36.5 tok/s |
| qwen3.6-27b-8bit | unsloth/Qwen3.6-27B-MLX-8bit | 32.3 GB | 18.9 tok/s |
| qwen3.6-35b-6bit | mlx-community/Qwen3.6-35B-A3B-6bit | ~28 GB | ~72 tok/s |

Highlights

  • Qwen3.6-35B: 12% faster than Qwen3.5-35B (92 vs 82 tok/s)
  • Qwen3.6-27B: Dense hybrid (64 layers, DeltaNet + Attention), 262K native context, vision
  • Auto-detected parser: qwen3_coder_xml — just rapid-mlx serve qwen3.6-27b, parsers auto-configured
  • Coding: 100% on eval suite
  • Stress test: 8/8 PASS

Also in this release

  • TurboQuant KV cache compression (#157)

v0.6.0 — Gemma 4 fix, logprobs, server decomposition

20 Apr 03:00


What's New in v0.6.0

Bug Fixes

  • Fix Gemma 4 degeneration — All Gemma 4 models now produce correct output (was infinite repetition). Root cause: custom model wrapper incompatible with mlx-lm 0.31. (#148)
  • Streaming JSON mode thinking leak — `response_format` with `stream=true` no longer leaks the thinking preamble (#46)
  • Per-request parser instances — Concurrent BatchedEngine requests no longer corrupt each other's reasoning/tool parser state (#P1)
  • Per-request sampler — Temperature/top_p now correctly applied per request (was using default argmax)

New Features

  • logprobs support — `logprobs: true` + `top_logprobs: N` now returns per-token log-probabilities in both streaming and non-streaming modes (#44)
  • rapid-mlx doctor — User-facing self-diagnostic (Metal GPU, imports, CLI, model load check). Works from pip install, no dev tools needed.
  • Dev test suite — `make lint/test/smoke/stress/soak` for developers. 2100+ unit tests.
  • Pipeline architecture — `DecodeStrategy` / `DecodePlugin` interfaces for future optimization plugins (TurboQuant, speculative decode)

Architecture

  • server.py decomposition — 4025 → 1047 lines (-74%) across 4 PRs
    • Routes extracted: routes/chat.py, routes/completions.py, routes/anthropic.py
    • Helpers extracted: service/helpers.py
    • Cache extracted: runtime/cache.py
    • PostProcessor: 100% test coverage (253 statements)
  • Architecture docs — `docs/architecture.md` with module map and pipeline design

Testing

  • Stress tested across 7 model architectures: Gemma 4, Qwen3.6, Qwen3.5, Phi-4, Devstral, Llama, Mistral
  • 10-minute agent soak tests: 287+ requests, 0 errors
  • 2100+ unit tests, 0 regressions

Full Changelog

v0.5.3...v0.6.0

v0.5.10 — Qwen 3.6 Day-0 (stable)

17 Apr 18:02


Qwen 3.6-35B-A3B — Day-0 Support (Stable)

This release wraps up comprehensive testing and fixes for Qwen 3.6 support. 6 rounds of persona onboarding tests, stress tests, and chaos tests — all passing.

```bash
pip install -U rapid-mlx
rapid-mlx serve qwen3.6-35b
```

Benchmark (M3 Ultra)

| Model | Decode | TTFT | Tool Calling | VRAM |
|---|---|---|---|---|
| Qwen3.6-35B-A3B (4-bit) | 95 tok/s | 219 ms | 100% | 20 GB |
| Qwen3.5-4B (4-bit) | 160 tok/s | 187 ms | 100% | 2.4 GB |

Changes since v0.5.2

Qwen 3.6 support:

  • Auto-config: qwen3_coder_xml parser (XML tool format, different from Qwen3.5)
  • Alias: rapid-mlx serve qwen3.6-35b
  • Added to all 11 agent profiles + doctor full tier

Stability & UX (8 releases of fixes):

  • Thinking token budget: small max_tokens no longer truncated by reasoning
  • reasoning_content field (OpenAI-compatible) with backward-compat reasoning alias
  • model field now optional (defaults to loaded model)
  • Agent setup auto-detects running model (rapid-mlx agents hermes)
  • Concurrent request serialization (prevents Metal crash in SimpleEngine)
  • __version__ syncs with PyPI automatically
  • Improved JSON mode compliance
  • Docs fully rebranded to rapid-mlx
  • rapid-mlx models shows size/speed/recommended Mac

Full Changelog: v0.5.2...v0.5.10

v0.5.9 — Performance regression fix

17 Apr 16:59


Fixes

Thinking budget now respects enable_thinking: false

Previously, the +2048 thinking token budget was always applied when a reasoning parser was active, even when enable_thinking: false was set. This caused:

  • Requests with enable_thinking: false to still generate 100+ tokens instead of the requested 30
  • Progressive slowdown: 1s → 21s over 50 requests (tokens accumulated in prompt cache)
  • 6.6x total time wasted (675s → 102s for 50 requests)

Before: max_tokens=30 + enable_thinking=false → internal budget 2078, avg 13.5s
After: max_tokens=30 + enable_thinking=false → internal budget 30, avg 2.0s
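
In sketch form (`effective_budget` is a hypothetical name), the headroom is now gated on the flag rather than on parser presence alone:

```python
THINKING_HEADROOM = 2048

def effective_budget(max_tokens: int, has_reasoning_parser: bool,
                     enable_thinking: bool) -> int:
    # Before: headroom applied whenever a reasoning parser was active.
    # After: only when the request actually enables thinking.
    if has_reasoning_parser and enable_thinking:
        return max_tokens + THINKING_HEADROOM
    return max_tokens

assert effective_budget(30, True, True) == 2078
assert effective_budget(30, True, False) == 30  # was 2078
```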

Truncated reasoning handled correctly

  • <think> without </think> (high temp, max_tokens truncation) → classified as reasoning_content, not leaked into content

Full Changelog: v0.5.8...v0.5.9

v0.5.8 — Stability & reasoning fixes

17 Apr 15:57


Fixes

Reasoning parser handles truncated thinking

  • When <think> appears without </think> (high temperature, max_tokens truncation), content is now correctly classified as reasoning instead of leaking into the response content field
  • Previously: content: "<think>thinking process..." (broken)
  • Now: reasoning_content: "thinking process...", content: null (correct)
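
A sketch of the classification rule (the real parser works incrementally over the stream):

```python
def split_thinking(text: str) -> tuple[str | None, str | None]:
    if not text.startswith("<think>"):
        return None, text
    body = text[len("<think>"):]
    if "</think>" in body:
        reasoning, content = body.split("</think>", 1)
        return reasoning.strip(), (content.strip() or None)
    # No closing tag: generation was truncated mid-thought, so the
    # whole body is reasoning and content stays null.
    return body.strip(), None

assert split_thinking("<think>partial...") == ("partial...", None)
```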

Concurrent request safety (SimpleEngine)

  • Added HTTP middleware to serialize /v1/chat requests when using SimpleEngine
  • Prevents Metal GPU crash (memory corruption of free block) when multiple clients (e.g., Hermes + Cursor) hit the server simultaneously
  • BatchedEngine is unaffected (handles concurrency via Scheduler)
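
A sketch of that middleware in FastAPI terms (path matching simplified):

```python
import asyncio

from fastapi import FastAPI, Request

app = FastAPI()
_gpu_lock = asyncio.Lock()  # at most one in-flight generation

@app.middleware("http")
async def serialize_simple_engine(request: Request, call_next):
    # Only chat traffic drives SimpleEngine's Metal work, so only it
    # is serialized; other routes pass through untouched.
    if request.url.path.startswith("/v1/chat"):
        async with _gpu_lock:
            return await call_next(request)
    return await call_next(request)
```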

Improved JSON mode compliance

  • response_format: json_object prompt now instructs model to follow requested structure (arrays, multiple items, exact keys)

Full Changelog: v0.5.7...v0.5.8

v0.5.7 — API cleanup + thinking budget

17 Apr 14:26


Fixes

Unified reasoning_content field (breaking API change)

  • Renamed `reasoning` → `reasoning_content` in all API responses (matches OpenAI's o1/o3 format)
  • Applies to both non-streaming (message.reasoning_content) and streaming (delta.reasoning_content)
  • Removed duplicate reasoning alias — responses are now ~50% smaller for thinking models

Larger thinking token budget

  • Budget increased from +1024 to +2048 tokens
  • Threshold raised from 2048 to 4096 — covers more requests
  • Complex coding prompts (fibonacci, debugging) no longer truncate

Version sync

  • `__version__` now reads from package metadata instead of a hardcoded string
  • `python -c "import vllm_mlx; print(vllm_mlx.__version__)"` now matches PyPI
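
A sketch of the standard-library pattern this relies on (the dev fallback value is an assumption):

```python
from importlib.metadata import PackageNotFoundError, version

try:
    # Single source of truth: the version pip actually installed.
    __version__ = version("rapid-mlx")
except PackageNotFoundError:
    # Source checkout without installed package metadata.
    __version__ = "0.0.0.dev0"
```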

Full Changelog: v0.5.6...v0.5.7

v0.5.6 — Codex review fixes

17 Apr 14:02


Fixes

  • Response model field: No longer returns literal "default" — resolves to actual model name (e.g., mlx-community/Qwen3.6-35B-A3B-4bit)
  • max_tokens=0: No longer silently treated as None (was becoming 4096); see the sketch below
  • Thinking budget: Now applies based on the resolved value, covering the edge case where the CLI --max-tokens is small and the user omits the field
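
The max_tokens=0 bug class, sketched — `or` conflates 0 with None, while an explicit `is None` check preserves the caller's zero (`resolve_max_tokens` is a hypothetical name):

```python
DEFAULT_MAX_TOKENS = 4096

def resolve_max_tokens(requested: int | None) -> int:
    # Buggy form: `requested or DEFAULT_MAX_TOKENS` maps an explicit
    # max_tokens=0 to 4096, because 0 is falsy.
    return DEFAULT_MAX_TOKENS if requested is None else requested

assert resolve_max_tokens(None) == 4096
assert resolve_max_tokens(0) == 0  # previously became 4096
```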

Full Changelog: v0.5.5...v0.5.6