Δ-Mem: Efficient Online Memory for Large Language Models
238 points
• 3 days ago
The paper introduces δ-mem, a lightweight memory mechanism designed to help large language models (LLMs) accumulate and reuse historical information in long-term assistant and agent systems. Rather than expanding the context window, which is computationally expensive and often ineffective, δ-mem augments a frozen full-attention backbone with a compact online associative memory state. Past information is compressed into a fixed-size state matrix that is updated with delta-rule learning. During text generation, a readout from this memory produces low-rank corrections to the backbone's attention computation.
Despite using only an 8×8 online memory state, δ-mem significantly boosts performance. Its average score is 1.10× that of the frozen backbone and 1.15× that of the strongest non-δ-mem memory baseline. The gains are larger on memory-intensive benchmarks, reaching 1.31× on MemoryAgentBench and 1.20× on LoCoMo. Importantly, these improvements come without compromising the model's general capabilities.
A key advantage of δ-mem is that it operates without requiring full fine-tuning, backbone replacement, or explicit context extension. This makes it a practical and efficient solution for enhancing memory in LLMs. The results demonstrate that effective memory can be achieved through a compact online state that is directly coupled with the attention computation.
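The delta-rule update mentioned in the summary can be illustrated with a generic associative memory. This is a minimal sketch of the textbook delta rule, S ← S + β (v − S k) kᵀ, on an 8×8 state; it is not the paper's implementation, and the names `delta_update`, `read`, and `beta` are assumptions made for illustration. How the readout is turned into a low-rank attention correction is not modelled here.

```python
import math

def _unit(x):
    """Return x scaled to unit L2 norm (unchanged if it is the zero vector)."""
    norm = math.sqrt(sum(c * c for c in x))
    return [c / norm for c in x] if norm > 0 else list(x)

def delta_update(S, k, v, beta=1.0):
    """One delta-rule write: S <- S + beta * (v - S k) k^T, with k normalized."""
    k = _unit(k)
    pred = [sum(row[j] * k[j] for j in range(len(k))) for row in S]  # S k
    err = [vi - pi for vi, pi in zip(v, pred)]                        # v - S k
    return [[row[j] + beta * e * k[j] for j in range(len(k))]
            for row, e in zip(S, err)]

def read(S, q):
    """Read out the value currently associated with query q, i.e. S q."""
    q = _unit(q)
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]

D = 8  # 8x8 state, matching the size quoted in the summary
S = [[0.0] * D for _ in range(D)]

# Illustrative one-hot keys and arbitrary values (not taken from the paper).
k1 = [1.0] + [0.0] * (D - 1)
k2 = [0.0, 1.0] + [0.0] * (D - 2)
v1 = [float(i) for i in range(D)]
v2 = [1.0] * D
v3 = [-1.0] * D

S = delta_update(S, k1, v1)  # store k1 -> v1
S = delta_update(S, k2, v2)  # store k2 -> v2
S = delta_update(S, k1, v3)  # overwrite k1 -> v3; the state size stays 8x8

print(read(S, k1))  # now returns v3
print(read(S, k2))  # still returns v2, since k2 is orthogonal to k1
```

The point of the sketch is that a write corrects only the error along the written key, so the state stays fixed-size however many writes occur. With non-orthogonal keys, writes interfere with each other, which is the capacity concern raised in the discussion.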
60 comments
• The title was altered by Hacker News' automatic casing rules, which changed the lowercase delta (δ) to an uppercase one (Δ) and so altered the intended meaning. This highlights a broader issue where automated systems can distort technical nomenclature, especially in math and physics where case sensitivity is critical.
• There is a strong call for standardized reporting of the minimum RAM required to run a model in bytes, as parameter count alone is misleading without specifying precision (e.g., FP16 vs. INT4). This would also clarify trade-offs for Mixture-of-Experts (MoE) models, where larger memory requirements may not justify performance gains if memory is the constraint.
• δ-Mem compresses past information into a fixed-size state matrix using delta-rule learning, but this does not solve the fundamental capacity problem of memory. Slight input variations cause vastly different activations, making effective caching difficult. True memory improvement requires semantic retrieval, where semantically similar inputs trigger the same cached response.
• Despite theoretical limits, a fixed-size state with 300M parameters (like Llama 3 8B's KV cache at 10K context) could encode up to 100M tokens of information under ideal conditions. While real models fall short of this ceiling, it shows promise for future research in efficient memory compression.
• Alternative approaches to memory management include using dynamically generated regex to filter relevant context blocks, avoiding attention degradation from redundant information. This bridges probabilistic LLM behavior with deterministic pattern matching, improving efficiency.
• Current memory systems degrade over time as more data is added, similar to FIFO but with increasing loss or mangling of details. This erratic behavior emerges when context limits are reached and compaction is attempted.
• For coding agents, traditional memory frameworks are often unnecessary. Agent skills, rules, git history, and documentation are more efficient and transparent. Memory systems are more suited for consumer-facing agents with managed context and limited capabilities.
• CLI-based tools like Beads or ticket offer a practical alternative to LLM memory, using existing Unix utilities and file systems. This avoids reinventing interfaces and leverages tools already usable by both humans and LLMs.
• Git history is underutilized by agents, despite its value in documenting bug fixes and architecture decisions. Explicitly instructing agents to consult git history and maintain structured documentation (e.g., in Claude.md) improves performance and reduces repeated mistakes.
• A task execution harness that logs all user messages and agent actions enables judge agents to review decisions effectively. This multi-agent workflow—plan, judge, execute, judge—consumes more tokens but significantly improves outcomes by preserving context and enabling external review.
• Reusing solutions from similar past tasks could save significant energy and computation. Platforms like PushRealm aim to create a StackOverflow-like knowledge base where agents share solutions to avoid redundant problem-solving.
• Neuromorphic computing is suggested as a more energy-efficient path forward, mimicking the human brain's ability to store only useful memories and generalize from past experiences. However, others argue that preserving essential memories in plain text is simpler and more practical.
• Many LLM tasks could be more efficiently handled by simple scripts or Unix tools. Agents often default to writing code when piping together existing utilities (e.g., sed, grep) would be faster and more reliable, especially for text processing.
• Commenters were skeptical about novelty: δ-Mem integrates DeltaNet hypernetworks into existing LLMs, which is of moderate interest but not a groundbreaking advance. Questions remain about computational cost and whether the method risks overfitting or data leakage.
• The paper's visibility on Hugging Face (#3 of the day) is unremarkable given the weekly volume of high-profile submissions. This neutral reception suggests the work is solid but not exceptional within the current research landscape.
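To make the point about reporting memory in bytes concrete, here is a minimal sketch of the arithmetic, assuming weights-only storage with no quantization metadata, KV cache, or activation memory. The function name `weight_bytes`, the precision table, and the mixture-of-experts figures are hypothetical and chosen only for illustration.

```python
# Bits used to store one parameter at each precision (weights only).
BITS_PER_PARAM = {"fp32": 32, "fp16": 16, "int8": 8, "int4": 4}

def weight_bytes(n_params: int, precision: str) -> int:
    """Lower bound on bytes needed to hold n_params weights at a precision."""
    total_bits = n_params * BITS_PER_PARAM[precision]
    return (total_bits + 7) // 8  # round up to whole bytes

# The same 8-billion-parameter model differs 4x between FP16 and INT4.
for precision in ("fp16", "int4"):
    print(precision, weight_bytes(8_000_000_000, precision), "bytes")

# Hypothetical MoE model: 40B total parameters, 10B active per token.
# All experts must be resident, so memory follows the total count,
# while per-token compute follows the active count.
moe_total, moe_active = 40_000_000_000, 10_000_000_000
print("MoE resident weights:", weight_bytes(moe_total, "fp16"), "bytes")
print("MoE active weights:  ", weight_bytes(moe_active, "fp16"), "bytes")
```

A byte figure like this is a floor rather than a full requirement, but it already shows why a bare parameter count says little without the precision attached.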
The discussion reveals skepticism toward δ-Mem's claims, with participants emphasizing that fixed-size memory compression does not inherently solve the core challenge of semantic retrieval and contextual stability. There is broad consensus that practical memory solutions for agents should prioritize transparency and efficiency, favoring structured documentation, git history, and CLI tools over opaque learned memory systems. While theoretical limits of information encoding are acknowledged, real-world performance gaps remain significant. The community values reproducibility, standardization (e.g., reporting memory in bytes), and reuse of past solutions, reflecting a preference for deterministic, auditable methods over purely probabilistic approaches.
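The regex-filtering idea raised in the discussion can be sketched as follows. This is a minimal illustration, not any commenter's actual tool: the pattern would be produced by the model at run time, whereas here it is hard-coded, and the function name `filter_blocks` and the sample blocks are hypothetical.

```python
import re

def filter_blocks(blocks, pattern):
    """Keep only the context blocks that match a (model-generated) regex.

    If the pattern does not compile, fall back to returning every block,
    so a malformed pattern never silently drops context.
    """
    try:
        rx = re.compile(pattern, re.IGNORECASE)
    except re.error:
        return list(blocks)
    return [block for block in blocks if rx.search(block)]

# Hypothetical stored context blocks.
blocks = [
    "Fixed the KV cache eviction bug in the scheduler.",
    "Lunch options near the office.",
    "Benchmark: kv-cache size at 10K context.",
    "Refactored logging setup.",
]

# Stand-in for a pattern the model would generate for a KV-cache question.
relevant = filter_blocks(blocks, r"kv[ -]?cache")
for block in relevant:
    print(block)
```

The filtering step itself is deterministic and auditable, since the pattern can be logged and replayed; only the choice of pattern depends on the model.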