Show HN: Needle: We Distilled Gemini Tool Calling into a 26M Model
773 points
• 6 days ago
Needle is a 26-million-parameter "Simple Attention Network" developed by Cactus Compute, distilled from Gemini 3.1 and designed to run efficiently on consumer devices like phones, watches, and glasses. The model specializes in single-shot function calling for personal AI applications, outperforming larger models like FunctionGemma-270m and Qwen-0.6B on this specific task. It features a compact architecture with an encoder-decoder structure, using techniques like GQA, RoPE, and ZCRMSNorm, and was pretrained on 200 billion tokens before being fine-tuned on a 2-billion-token function call dataset.
The model is optimized for on-device deployment, with reported speeds of 6000 tokens per second for prefill and 1200 tokens per second for decode when running on Cactus infrastructure. Weights and dataset generation code are open-source and available on Hugging Face under the Cactus-Compute/needle repository. The architecture has 12 encoder layers and 8 decoder layers, with tied embeddings and a shared vocabulary of 8192 BPE tokens.
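To put the quoted throughput in perspective, here is a rough back-of-the-envelope sketch in Python. It assumes prefill and decode run one after the other at the stated rates and ignores any fixed overhead. The prompt and output token counts are made-up illustrative values, not figures from the project.

```python
# Throughput figures quoted in the summary (tokens per second).
PREFILL_TOKENS_PER_SECOND = 6000
DECODE_TOKENS_PER_SECOND = 1200


def estimate_latency_ms(prompt_tokens: int, output_tokens: int) -> float:
    """Estimate end-to-end latency, assuming prefill then decode at constant rates."""
    prefill_seconds = prompt_tokens / PREFILL_TOKENS_PER_SECOND
    decode_seconds = output_tokens / DECODE_TOKENS_PER_SECOND
    return 1000.0 * (prefill_seconds + decode_seconds)


# Hypothetical request: 300 tokens of query plus tool definitions, 30 tokens of output.
# Prefill takes 300 / 6000 = 50 ms and decode takes 30 / 1200 = 25 ms.
print(f"{estimate_latency_ms(300, 30):.0f} ms")
```

Under these assumptions a single function call would complete in roughly 75 ms, which is the kind of budget that makes on-device voice commands plausible.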
Getting started with Needle is straightforward. Users can clone the repository and run the setup script to launch a web playground at localhost:7860, where they can test the model and fine-tune it on custom tools. The Python API provides functions for loading checkpoints, tokenizing inputs, and generating function call outputs. The model accepts a text query and tool definitions and returns structured function call arguments.
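The summary does not show Needle's actual API, so the following Python sketch only illustrates the shape of the task: a set of tool definitions, and a check that a model's structured output names a known tool and supplies well-typed arguments. The tool schema, the JSON output format, and the `validate_call` helper are all assumptions made for illustration, not Needle's real interface.

```python
import json

# Hypothetical tool definitions; the real schema Needle expects may differ.
TOOLS = [
    {"name": "set_alarm",
     "parameters": {"time": "string", "label": "string"},
     "required": ["time"]},
    {"name": "toggle_light",
     "parameters": {"room": "string", "on": "boolean"},
     "required": ["room", "on"]},
]

PYTHON_TYPES = {"string": str, "boolean": bool}


def validate_call(raw_output: str, tools: list) -> dict:
    """Parse a JSON function call and check it against the tool definitions."""
    call = json.loads(raw_output)
    if not isinstance(call, dict):
        raise ValueError("function call must be a JSON object")
    tools_by_name = {tool["name"]: tool for tool in tools}
    tool = tools_by_name.get(call.get("name"))
    if tool is None:
        raise ValueError(f"unknown tool: {call.get('name')!r}")
    arguments = call.get("arguments", {})
    missing = [key for key in tool["required"] if key not in arguments]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    for key, value in arguments.items():
        expected = tool["parameters"].get(key)
        if expected is None:
            raise ValueError(f"unexpected argument: {key!r}")
        if not isinstance(value, PYTHON_TYPES[expected]):
            raise ValueError(f"argument {key!r} should be of type {expected}")
    return call


# Assumed model output for the query "wake me up at 7".
print(validate_call('{"name": "set_alarm", "arguments": {"time": "07:00"}}', TOOLS))
```

Validating the output before executing it is a cheap safeguard with a small model, since a malformed or hallucinated call is rejected instead of being dispatched to a device.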
For customization, Needle provides both a web interface and CLI tools for fine-tuning on custom datasets. The playground can generate synthetic training data using Gemini, train the model, evaluate performance, and bundle the results. The CLI supports various operations including full training runs, pretraining on the PleIAs/SYNTH dataset, evaluation of checkpoints, and TPU management for those with access to Google Cloud infrastructure.
While Needle excels at function calling, the developers note that it's experimental and focused specifically on redefining tiny AI for edge devices. Larger models still have advantages in conversational settings and general scope. The project encourages users to test the model with their own tools and fine-tune as needed, acknowledging that small models can sometimes be finicky in their behavior.
211 comments
• Needle's small size (14MB, INT4) opens up the possibility of natural-language command-line interfaces, where users describe actions in plain English. Commenters were excited, but some remained concerned about the extra 14MB of storage and the computational cost.
• The model has been deployed to a HuggingFace Space with a simple Dockerfile, making it accessible for experimentation. A video demo of the playground was also suggested and is being created.
• Some users report confusion: one initially misread model size as 26B instead of 0.026B. The M vs B notation was deemed too subtle, with 0.026B suggested for clarity.
• Distillation compresses large model intelligence into smaller models requiring less disk space, memory, and compute. The tradeoff is lower benchmark performance compared to the source model.
• Some commenters questioned the choice of Gemini for comparison when other models may have better tool-calling capabilities. One reply explained that Gemini was chosen partly for its cheaper API pricing. Another user noted that Gemini and Kimi could serve similar purposes.
• The primary use case is deploying AI on resource-constrained devices like phones, watches, earbuds, and glasses. Concrete examples include smart home control (e.g., toggling lights via voice) and enhancing voice assistants on custom hardware like Raspberry Pi-based retro radios.
• Performance anecdotes are positive: one user reported it outperformed Siri for setting alarms and managing shopping lists. Questions about failure modes remain, such as how it handles unrecognized requests, ambiguous tools, or multi-step tool chaining.
• The model could serve as a first-pass tool caller in a larger agent pipeline, passing results to a more capable model. In-context learning is not yet supported but planned. The dataset pipeline is included in the codebase, with potential full release.
• Technical issues running on CPU in unprivileged LXC containers were reported, though the model should ideally work on CPU-only devices. Access issues with the tokenizer repository on HuggingFace were promptly resolved.
• Open-source ethos and practical deployment details are emphasized, with encouragement for community experimentation. Copyright concerns around training data were raised by some community members, though the team clarified distillation did not access original weights.
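The size figures in the comments (26M versus 0.026B, and 14MB at INT4) can be checked with simple arithmetic. The sketch below counts only raw weight storage. The gap between the computed 13 MB and the quoted 14MB is not explained in the source; quantization scales, metadata, or a parameter count slightly above 26 million are plausible guesses, not confirmed facts.

```python
def weight_size_mb(num_parameters: int, bits_per_weight: int) -> float:
    """Raw weight storage in megabytes (1 MB = 1,000,000 bytes)."""
    return num_parameters * bits_per_weight / 8 / 1_000_000


NUM_PARAMETERS = 26_000_000  # "26M" as quoted in the title

print(NUM_PARAMETERS / 1_000_000_000)      # 0.026, i.e. 0.026B parameters
print(weight_size_mb(NUM_PARAMETERS, 4))   # 13.0 MB at INT4
print(weight_size_mb(NUM_PARAMETERS, 16))  # 52.0 MB at 16-bit precision
```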
The discussion reveals strong enthusiasm for Needle's minimal footprint enabling local AI on edge devices, with practical use cases around smart home control and voice assistants recurring throughout. Some skepticism persists around whether current voice assistant needs justify dedicated tiny models versus existing solutions, while others see clear latency and privacy advantages. The technical community responded warmly to the open deployment, with several users immediately testing or planning integrations. Copyright and ethical concerns around model training data surfaced but didn't dominate the conversation, reflecting broader ongoing tensions in the AI community.
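The idea raised in the comments of using the model as a first-pass tool caller in a larger pipeline could look something like the Python sketch below. This is one possible reading of that suggestion, not the project's design: the small model attempts the call, and anything it cannot map to a known tool is handed to a more capable model. Both model functions are stand-in stubs, and every name here is hypothetical.

```python
from typing import Callable, Optional, Tuple

ModelFn = Callable[[str, list], Optional[dict]]


def route_tool_call(query: str, tools: list,
                    small_model: ModelFn, large_model: ModelFn) -> Tuple[str, Optional[dict]]:
    """Try the small model first; fall back when it returns no known tool."""
    known_names = {tool["name"] for tool in tools}
    call = small_model(query, tools)
    if call is not None and call.get("name") in known_names:
        return "small", call
    return "large", large_model(query, tools)


def stub_small_model(query: str, tools: list) -> Optional[dict]:
    # Stand-in for an on-device model: only recognizes light commands.
    if "light" in query:
        return {"name": "toggle_light", "arguments": {"room": "kitchen", "on": True}}
    return None


def stub_large_model(query: str, tools: list) -> Optional[dict]:
    # Stand-in for a remote, more capable model.
    return {"name": "escalated", "arguments": {"query": query}}


TOOLS = [{"name": "toggle_light"}]
print(route_tool_call("turn on the kitchen light", TOOLS, stub_small_model, stub_large_model))
print(route_tool_call("plan my week", TOOLS, stub_small_model, stub_large_model))
```

A design like this keeps common commands local, which is where the latency and privacy advantages mentioned in the discussion would come from, while still covering requests the small model cannot handle.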