Orthrus-Qwen3: up to 7.8× tokens/forward on Qwen3, identical output distribution
243 points
• 3 days ago
Orthrus is a new framework designed to make large language model (LLM) inference significantly faster without sacrificing output quality. It introduces a dual-architecture approach that combines the precise, token-by-token generation of traditional autoregressive models with the high-speed parallel capabilities of diffusion models. This hybrid method allows Orthrus to break through the sequential bottleneck that typically limits how fast LLMs can generate text, achieving speedups of up to 7.8 times while maintaining strictly lossless generation.
The system runs two "views" of the same model: an autoregressive view and a diffusion view. Both views share the same high-fidelity key-value (KV) cache, so the additional memory is limited to an O(1) extra cache. This shared cache is a key advantage over speculative decoding methods such as EAGLE-3 or DFlash, which require separate draft models and thus consume more memory. By avoiding this redundancy, Orthrus achieves higher token acceptance rates and better performance, especially as the input context grows.
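As a rough sanity check on the O(1) cache claim, the arithmetic below estimates the KV footprint of a fixed block of draft positions. The layer count, KV-head count, and head dimension are assumptions (chosen to resemble a Qwen3-8B-style configuration with a 2-byte cache dtype), and the block size of 32 is taken from the comment summary further down. The project does not confirm that this is how its overhead figure is derived.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """Bytes of KV cache one token occupies: a key and a value vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


# Assumed configuration (not stated in the source): 36 layers, 8 KV heads,
# head dimension 128, 2 bytes per element.
per_token_bytes = kv_bytes_per_token(num_layers=36, num_kv_heads=8, head_dim=128)

# Assumed: the extra cache holds one block of 32 draft positions.
draft_block_tokens = 32
draft_block_mib = per_token_bytes * draft_block_tokens / 2**20

print(f"KV per token:      {per_token_bytes / 1024:.0f} KiB")
print(f"Draft block cache: {draft_block_mib:.1f} MiB (independent of context length)")
```

Under these assumptions the block costs 4.5 MiB regardless of how long the context is, which lines up with the constant overhead quoted in the comments, whereas a separate draft model's cache would grow with every token of context.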
A major strength of Orthrus is its parameter efficiency. The parallel generation capabilities are added by fine-tuning only 16% of the model's total parameters, while the base LLM remains completely frozen. This makes it a practical and efficient upgrade path for existing models. The framework has been implemented with a Qwen3 backbone, and several model checkpoints are available, including versions at 1.7B, 4B, and 8B parameters, all of which guarantee that the output matches the original base model's exact predictive distribution.
In performance benchmarks, Orthrus consistently outperforms existing speculative decoding techniques. It achieves a higher average number of verified tokens per forward pass and scales more efficiently with longer contexts. When compared to other diffusion-based language models (dLLMs), which often suffer from accuracy drops on complex reasoning tasks, Orthrus maintains strict fidelity. For example, on the MATH-500 benchmark, it delivers a roughly 6x speedup over the Qwen3-8B baseline with no loss in accuracy, whereas other methods like Fast-dLLM-v2 show significant degradation.
The project provides a straightforward installation process and a quickstart guide for users to begin generating text with the available HuggingFace models. It also notes that native integration with popular serving frameworks like vLLM and SGLang is coming soon. The research paper detailing the Orthrus architecture has been published on arXiv, and the code and models are released under an MIT license, making it accessible for both research and commercial applications.
44 comments
• Despite seeming like a logical idea, the technique had not been implemented before, and standard decision tree (DTree) tricks also apply to it.
• The method functions as a speculative decoding variant: multiple tokens are predicted in parallel and then verified, bringing token generation speed closer to prompt processing speed. It produces exactly the same output distribution as the base model with negligible additional memory overhead. The main limitation is that it offers little benefit where prompt processing is already slow, as on M-series Macs, whose generation speed is high relative to their prompt processing speed. The M5, with its 4x faster prompt processing, should see significant gains.
• Rather than reducing compute, this approach actually increases it by computing more tokens and discarding the invalid ones. The benefit comes from better exploiting GPU compute: processing multiple tokens in parallel instead of one by one reduces the number of times the weights must be loaded from VRAM. For autoregressive LLMs at low batch sizes, the bottleneck is memory bandwidth rather than compute, because more time is spent streaming weights from memory than computing on them.
• For agentic workloads like Claude Code with large context windows (150k+), the bottleneck is tokens per second per user rather than compute, which commenters cite as the reason Nvidia acquired Groq and Cerebras is pursuing similar approaches. With prefix caching, prefill is rarely the bottleneck compared to decoding reasoning tokens, especially during exploration phases involving directory traversal and file grepping.
• The approach injects a trainable diffusion attention module into each layer of a frozen autoregressive Transformer, with both heads sharing one KV cache. The diffusion head predicts 32 tokens in parallel, the AR head verifies them in a second pass, and the longest matching prefix is accepted. The output distribution is provably identical to the base model's. Results show up to 7.8x tokens per forward pass and ~6x wall-clock speedup on MATH-500, with only 16% of parameters trained on fewer than 1B tokens in 24 hours on 8x H200 GPUs.
• Compared to other diffusion LMs like Dream, Fast-dLLM-v2, and Mercury, which modify base weights and lose accuracy, Orthrus freezes the backbone and matches Qwen3-8B accuracy exactly. Unlike speculative decoding methods like EAGLE-3 and DFlash, it requires no external drafter, no separate cache, and has zero time-to-first-token penalty. KV overhead is constant at approximately 4.5 MiB, and acceptance length on MATH-500 is 11.7 versus 7.9 for DFlash and 3.5 for EAGLE-3.
• Converting the model to GGUF would be trivial, but running it would require creating a new architecture derived from Qwen3 and adapting the speculative decoding functionality, as even multi-token prediction (MTP) hasn't been merged into llama.cpp yet.
• The method could potentially scale to larger models like Qwen 3.6 27B, with the training process resembling LoRA training or distillation. Validation could start with smaller models like Qwen3.5 0.8B on consumer GPUs before scaling up. Qwen 3.6 already supports multi-token generation but uses token-at-a-time speculation rather than the diffusion-based approach described here.
• The technique is conceptually similar to DFlash but operates at each transformer layer while sharing the original model's KV cache. The core insight is that a 95% accurate predictor in latent space can yield a 7x speedup when implemented correctly, though whether that predictive accuracy holds at larger layer sizes remains an open question for scaling.
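The draft-then-verify loop described in the comments above can be sketched in a few lines. This is a toy illustration only, not the project's implementation: `ar_next` and `draft_block` are hypothetical stand-ins for the frozen AR head and the diffusion head, verification is shown with greedy token matching (the lossless guarantee under sampling needs a proper acceptance rule), and appending the verifier's own token at the first mismatch is a common speculative decoding convention assumed here rather than something the source states.

```python
from typing import Callable, List, Tuple

Token = int


def accept_longest_prefix(context: List[Token], draft: List[Token],
                          ar_next: Callable[[List[Token]], Token]) -> List[Token]:
    """Keep draft tokens while they match the verifier; stop at the first mismatch.

    A real system scores all draft positions in one parallel forward pass;
    the per-position calls here are only for readability.
    """
    accepted: List[Token] = []
    for tok in draft:
        target = ar_next(context + accepted)
        if tok != target:
            accepted.append(target)  # assumed: verifier's token replaces the bad draft
            return accepted
        accepted.append(tok)
    return accepted


def generate(prompt: List[Token],
             draft_block: Callable[[List[Token], int], List[Token]],
             ar_next: Callable[[List[Token]], Token],
             max_new_tokens: int, block_size: int = 32) -> Tuple[List[Token], int]:
    """Run draft/verify cycles; return the sequence and the number of cycles used."""
    out = list(prompt)
    cycles = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)
        out.extend(accept_longest_prefix(out, draft, ar_next))
        cycles += 1
    return out[:len(prompt) + max_new_tokens], cycles


# Toy models (hypothetical): the "base model" counts upward, and the drafter
# agrees with it except at the fifth position of every block.
def toy_ar_next(ctx: List[Token]) -> Token:
    return (ctx[-1] + 1) % 100


def toy_draft_block(ctx: List[Token], k: int) -> List[Token]:
    return [(ctx[-1] + 1 + i + (50 if i == 4 else 0)) % 100 for i in range(k)]


out, cycles = generate([0], toy_draft_block, toy_ar_next, max_new_tokens=20)

# Plain one-token-at-a-time decoding for comparison.
baseline = [0]
for _ in range(20):
    baseline.append(toy_ar_next(baseline))

print(out == baseline, cycles)
```

In this toy run the output matches plain autoregressive decoding while using 4 verify cycles instead of 20 single-token steps. The rejected tail of each draft block is the wasted compute the commenters mention.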
The discussion centers on a novel approach to accelerating LLM inference through parallel token prediction with guaranteed output fidelity. The technique addresses the fundamental memory bandwidth bottleneck in autoregressive models by reducing VRAM weight loading operations, though it increases total compute. While promising for consumer hardware and agentic workloads with long contexts, practical adoption depends on implementation support in popular inference frameworks and validation across larger model architectures. The method's key advantage over alternatives is maintaining exact output distribution matching with minimal memory overhead, though questions remain about scaling to larger models and compatibility with quantization formats.
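To make the memory-bandwidth argument concrete, here is a back-of-the-envelope model of batch-1 decoding. It assumes each forward pass is limited purely by streaming the weights once from VRAM, and the numbers (8B parameters at 2 bytes each, a hypothetical 1 TB/s accelerator) are illustrative rather than measured. The model ignores KV-cache reads, the extra compute for discarded tokens, and any drafting overhead, so it gives an upper bound rather than a prediction.

```python
def decode_tokens_per_second(weight_bytes: float, bandwidth_bytes_per_s: float,
                             tokens_per_forward: float = 1.0) -> float:
    """Throughput if every forward pass costs one full read of the weights."""
    seconds_per_forward = weight_bytes / bandwidth_bytes_per_s
    return tokens_per_forward / seconds_per_forward


# Illustrative assumptions, not figures from the source.
weight_bytes = 8e9 * 2   # 8B parameters stored in 2 bytes each
bandwidth = 1e12         # hypothetical 1 TB/s of memory bandwidth

baseline_tps = decode_tokens_per_second(weight_bytes, bandwidth)
parallel_tps = decode_tokens_per_second(weight_bytes, bandwidth, tokens_per_forward=7.8)

print(f"one token per forward:   {baseline_tps:.1f} tok/s")
print(f"7.8 tokens per forward:  {parallel_tps:.1f} tok/s")
print(f"ideal speedup:           {parallel_tps / baseline_tps:.1f}x")
```

In this idealized model the speedup equals the number of tokens accepted per forward pass. The roughly 6x wall-clock figure reported for MATH-500 sits below the 7.8 tokens-per-forward peak, as one would expect once the ignored overheads are counted.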