DeepSeek-V4-Flash means LLM steering is interesting again
273 points
• 2 days ago
LLM steering is experiencing renewed interest thanks to DeepSeek-V4-Flash, a new open model that's powerful enough to compete with lower-end frontier models for agentic coding tasks. Since steering requires direct access to a local model's internal activations, it has historically been impractical for most engineers. DeepSeek-V4-Flash changes that equation, and developer antirez has already built steering support into DwarfStar 4, a stripped-down llama.cpp fork designed specifically for this model. While the current implementation is basic, the project is only eight days old and worth watching.
Steering works by extracting a concept from a model's internal activations and then boosting it during inference. The simplest approach feeds the same prompts through the model twice, once normally and once with a modifier like "respond tersely," then measures the difference in activations between the two runs. This difference becomes a "steering vector" that can be added to activations at any layer to produce the desired effect. More sophisticated methods use sparse autoencoders to identify deeper patterns in the model's behavior, similar to research Anthropic has published. The appeal of steering is that it feels like finding a control panel for the model's brain, with sliders for traits like verbosity or conscientiousness that can be adjusted directly instead of fiddling with prompt wording.
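The difference-of-activations recipe can be sketched in a few lines of Python. This is a minimal illustration, not DwarfStar 4's implementation: the activations are invented 4-dimensional toy vectors, and the helper names (`mean_vector`, `steering_vector`, `apply_steering`) and the `strength` parameter are assumptions made for this example. A real setup would capture hidden states from a chosen layer of the model.

```python
from statistics import fmean

def mean_vector(rows):
    """Average a list of equal-length activation vectors, component by component."""
    return [fmean(column) for column in zip(*rows)]

def steering_vector(plain_acts, modified_acts):
    """Mean activation with the modifier minus mean activation without it."""
    plain_mean = mean_vector(plain_acts)
    modified_mean = mean_vector(modified_acts)
    return [m - p for m, p in zip(modified_mean, plain_mean)]

def apply_steering(hidden, vector, strength=1.0):
    """Add the scaled steering vector to one hidden state at the chosen layer."""
    return [h + strength * v for h, v in zip(hidden, vector)]

# Toy activations (assumed values): one row per prompt, same prompts in both runs.
plain_acts = [[1.0, 0.0, 2.0, 1.0], [3.0, 0.0, 2.0, 1.0]]
terse_acts = [[1.0, 2.0, 2.0, 0.0], [3.0, 2.0, 2.0, 0.0]]

terse = steering_vector(plain_acts, terse_acts)
steered = apply_steering([0.5, 0.5, 0.5, 0.5], terse, strength=2.0)
print(terse)    # [0.0, 2.0, 0.0, -1.0]
print(steered)  # [0.5, 4.5, 0.5, -1.5]
```

The `strength` multiplier plays the role of the "slider": larger values push the hidden state further along the extracted direction, and a negative value would push away from it.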
Despite its appeal, steering hasn't seen widespread adoption for several reasons. Major AI labs can manipulate their models directly through training rather than awkward mid-inference surgery. Regular users lack access to model weights and activations when using APIs. Most basic steering applications are already outcompeted by simply prompting the model more effectively, since prompt tokens already provide extremely fine-grained control over model behavior. Steering occupies an awkward middle ground that's too complex for most users but unnecessary for the labs with full model access.
The most promising potential for steering lies in cases where prompting fails. One example is "intelligence" itself, which used to be promptable with phrases like "you are an expert" but is now baked into current-generation models. However, the author is skeptical that an "intelligence" steering vector exists in any practical sense, since such a complex concept likely spans nearly the entire model's weights, making the problem equivalent to training a smarter model. Another possibility is using steering as data compression, extracting concepts that would otherwise require many tokens to express, like deep knowledge of a specific codebase. This seems marginally more plausible but still faces the same fundamental challenge.
The author remains fascinated but ultimately pessimistic about steering's practical applications, believing most gains can be more efficiently achieved through prompting or fine-tuning. However, the open-source community hasn't explored steering extensively yet, and that may be changing. If steering does have hidden practical value, the next six months should reveal it. It's possible that future open-weight model releases will come with community-extracted "libraries" of boostable features, similar to how quantized versions and wrappers currently proliferate.
A notable update from the Hacker News comments pointed out that steering can modify trained-in behaviors in ways prompting cannot, most significantly by removing model refusals. This is already how some uncensoring or "abliteration" is done for open models. As antirez noted, permanently modifying the weights can damage model capabilities more than runtime steering, which can be applied only when needed, making the lighter-touch approach preferable.
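To make the contrast between weight modification and runtime steering concrete, here is a small sketch of directional ablation in both forms. It assumes that refusal is captured by a single direction `r` and that removal means projecting that direction out, which is one common description of abliteration and not necessarily what any particular tool does. The 3x2 weight matrix, the direction, and all function names are invented for illustration.

```python
import math

def unit(v):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(weight, x):
    """Multiply a row-major matrix by a vector."""
    return [dot(row, x) for row in weight]

def ablate_activation(hidden, direction):
    """Runtime variant: subtract the component of `hidden` along `direction`."""
    r = unit(direction)
    coeff = dot(hidden, r)
    return [h - coeff * ri for h, ri in zip(hidden, r)]

def ablate_weights(weight, direction):
    """One-time variant: W' = W - r (r^T W), so the layer output has no component along r."""
    r = unit(direction)
    cols = range(len(weight[0]))
    proj = [sum(r[i] * weight[i][j] for i in range(len(r))) for j in cols]
    return [[weight[i][j] - r[i] * proj[j] for j in cols] for i in range(len(weight))]

# Hypothetical refusal direction and toy layer weights.
refusal_dir = [3.0, 0.0, 4.0]
weight = [[1.0, 2.0], [0.0, 1.0], [2.0, 0.0]]
x = [1.0, 1.0]

runtime = ablate_activation(matvec(weight, x), refusal_dir)
baked = matvec(ablate_weights(weight, refusal_dir), x)
print(runtime)
print(baked)
```

For this single layer the two variants produce the same output up to floating-point rounding. The practical difference is that the baked version changes every forward pass permanently, while the runtime version is an extra step that can be skipped whenever it is not wanted.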
75 comments
• DwarfStar 4's steering features allow complete removal of refusal behavior in DeepSeek V4 at runtime, which is superior to modifying GGUFs because it minimizes damage to model capabilities by applying steering only when needed, such as during specific moments or when refusal-direction energy exceeds a threshold.
• DeepSeek V4 already exhibits minimal refusal behavior compared to Western AI models, but the anti-refusal steering vector enables it to answer even seemingly inappropriate requests, highlighting the model's inherent lack of censorship.
• Steering vectors offer a dynamic alternative to releasing permanently uncensored models, allowing users to selectively disable refusals for legitimate purposes like cybersecurity research without compromising accuracy on unrelated tasks.
• Anthropic has deliberately made Opus 4.7 worse at cybersecurity tasks despite improving general intelligence, illustrating the tension between capability and safety in frontier models.
• Uncensored models can exhibit unexpected behaviors, such as decompiling binaries to answer questions, which underscores the need for better sandboxing as less restricted models become more common.
• Steering can be used to shift a model's political ideology, demonstrating the technique's broad potential beyond just removing refusals.
• Soft prompts (virtual tokens) enable finding non-linguistic areas of meaning that change model behavior in complex ways, offering another dimension of control beyond traditional steering.
• GitHub Copilot's "steer with message" feature is a different kind of steering that injects text into the model's output, whereas activation-level steering operates directly on the model's internal representations.
• DwarfStar 4 is not a stripped-down version of llama.cpp but a separate project that builds on llama.cpp's innovations, with minimal code overlap limited to a few kernels and quantization code.
• DeepSeek V4 Flash can run on 96-128GB MacBooks with large context windows, making it a quasi-frontier model accessible for local inference, though some users report higher hallucination rates compared to alternatives like Minimax M2.7.
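The first comment above mentions applying steering only when "refusal-direction energy" exceeds a threshold. One possible reading is sketched below. It assumes that energy means the squared projection of the hidden state onto a unit refusal direction, which the comments do not define. The function name `gated_ablation`, the threshold value, and the vectors are all hypothetical.

```python
import math

def gated_ablation(hidden, direction, threshold):
    """Remove the refusal component only when its energy exceeds `threshold`.

    Returns the (possibly modified) hidden state and whether steering fired.
    Energy is taken to be the squared projection onto the unit direction,
    which is an assumption made for this sketch.
    """
    norm = math.sqrt(sum(d * d for d in direction))
    r = [d / norm for d in direction]
    coeff = sum(h * ri for h, ri in zip(hidden, r))
    energy = coeff * coeff
    if energy <= threshold:
        return list(hidden), False  # leave the activation untouched
    return [h - coeff * ri for h, ri in zip(hidden, r)], True

refusal_dir = [0.0, 1.0, 0.0]  # hypothetical direction

quiet, fired_quiet = gated_ablation([1.0, 0.1, 2.0], refusal_dir, threshold=1.0)
loud, fired_loud = gated_ablation([1.0, 3.0, 2.0], refusal_dir, threshold=1.0)
print(quiet, fired_quiet)  # [1.0, 0.1, 2.0] False
print(loud, fired_loud)    # [1.0, 0.0, 2.0] True
```

Gating like this means activations that barely touch the refusal direction pass through unchanged, which matches the stated goal of keeping the effect on unrelated behavior small.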
The discussion reveals a growing interest in fine-grained control over model behavior through techniques like steering vectors and soft prompts, particularly for removing refusals and customizing model responses. While DeepSeek V4 is noted for its minimal inherent censorship, the ability to dynamically adjust refusals at runtime is seen as superior to permanently uncensored models. The conversation also touches on the tension between safety and capability, with some frontier models being deliberately weakened in certain areas. Local inference of large models is becoming more feasible, with DeepSeek V4 Flash and Minimax M2.7 both capable of running on high-end consumer hardware, though trade-offs in hallucination rates and efficiency exist.