The last six months in LLMs in five minutes
Simon Willison's lightning talk at PyCon US 2026 covered the dramatic developments in large language models over the preceding six months, starting from what he calls the "November 2025 inflection point." During that period, the title of "best model" shifted five times among Anthropic, OpenAI, and Google, with models like Claude Sonnet 4.5, GPT-5.1, Gemini 3, GPT-5.1 Codex Max, and finally Claude Opus 4.5 taking the crown in quick succession. Willison uses his signature test, generating an SVG of a pelican riding a bicycle, to illustrate the varying capabilities of each model, noting that while Gemini 3 drew the best pelican, Opus 4.5 was generally considered the strongest overall for practical use.
The real breakthrough in November wasn't just model quality but the moment coding agents became genuinely useful. Thanks to extensive reinforcement learning from verifiable rewards (RLVR), tools like OpenAI's Codex and Anthropic's Claude Code crossed a threshold where they could be used as daily drivers for real software development without constant debugging of their mistakes. This shift enabled developers to build ambitious projects with AI assistance. Willison experienced this himself during the December-January holiday break, when he and others experimented wildly with the new capabilities. The result was a brief period of "LLM psychosis" in which he started projects like a JavaScript interpreter implemented in Python, running in WebAssembly in the browser.
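To make "verifiable rewards" concrete, here is a minimal sketch of what such a reward signal could look like for a coding task. It is an illustration only, not how any lab trains its models: the function name `verifiable_reward` and the pass/fail scoring are assumptions, and a real pipeline would run untrusted code in a sandbox rather than in-process.

```python
def verifiable_reward(candidate_source: str, test_source: str) -> float:
    """Return 1.0 if the candidate code passes the checks, else 0.0.

    Hypothetical helper: the reward is "verifiable" because it comes from
    executing tests, not from a human or model judging the output.
    """
    namespace: dict = {}
    try:
        # Define the candidate's functions, then run the assertions on them.
        # Assumption: trusted input. Real systems isolate this step.
        exec(candidate_source, namespace)
        exec(test_source, namespace)
    except Exception:
        # Syntax errors, runtime errors and failed assertions all score zero.
        return 0.0
    return 1.0


tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"
correct = "def add(a, b):\n    return a + b\n"
buggy = "def add(a, b):\n    return a - b\n"

print(verifiable_reward(correct, tests))  # 1.0
print(verifiable_reward(buggy, tests))    # 0.0
```

The binary score is the simplest possible choice. It also hints at why gains concentrate in code and math, where an automatic check like this exists.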
That holiday experimentation period produced both impressive demos and lessons about practical limits. Willison's micro-javascript project, while technically interesting as a multi-layered runtime stack, turned out to be something nobody actually needed. Many similar ambitious projects from that era were quietly retired once the initial excitement faded. However, one project that started with a commit in late November did survive and thrive, going through multiple name changes from Warelay to CLAWDIS to CLAWDBOT and finally becoming OpenClaw by February 2026.
OpenClaw emerged as a breakout hit, a "personal AI assistant" that sparked a new category generically called "Claws," which now includes similar projects such as NanoClaw and ZeroClaw. The phenomenon became so popular that Mac Minis started selling out in Silicon Valley as people bought them to run their Claw assistants locally. Willison offers two metaphors for these AI assistants: digital pets that need an aquarium, and Doc Ock's AI-powered claws from Spider-Man 2, which were safe as long as nothing damaged their inhibitor chip but could turn dangerous otherwise.
February also brought Google's Gemini 3.1 Pro, which produced Willison's best pelican illustration yet, complete with a fish in the basket, and even generated animated versions of various animals on vehicles. This led to speculation that AI labs might actually be training on his pelican test, though Willison maintains it's too ridiculous a task for deliberate training. The model's capabilities were further demonstrated when it successfully generated a North Virginia Opossum on an E-scooter with the caption "Cruising the commonwealth since dusk," a result that other models couldn't match.
April 2026 saw significant releases from both Google and Chinese AI labs. Google's Gemma 4 series became the most capable open-weight models from a US company, while China's GLM released GLM-5.1, a massive 1.5 terabyte open-weight model that delivers strong performance for those with sufficient hardware. Qwen's Qwen3.6-35B-A3B, at just 20.9 gigabytes, proved capable of running on a laptop and even drew a better pelican than Claude Opus 4.7, though Willison notes this suggests his pelican benchmark has outlived its usefulness as a serious evaluation tool.
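The two file sizes quoted above can be put in perspective with some back-of-the-envelope arithmetic. The sketch below assumes that "35B" in the Qwen model name means roughly 35 billion total parameters, that sizes are decimal gigabytes, and that the file is almost entirely weights. None of this is stated in the talk summary, and the helper name is hypothetical.

```python
def bits_per_parameter(file_size_gb: float, params_billion: float) -> float:
    """Average storage per weight, in bits, for a model file of the given size."""
    total_bits = file_size_gb * 1e9 * 8          # decimal GB -> bits
    return total_bits / (params_billion * 1e9)   # divide by parameter count


# Assumption: 35 billion total parameters, inferred from the model name.
qwen_bits = bits_per_parameter(file_size_gb=20.9, params_billion=35)
print(f"{qwen_bits:.2f} bits per parameter")  # 4.78 bits per parameter

# Size ratio between the 1.5 TB GLM-5.1 download and the 20.9 GB Qwen file.
size_ratio = 1500 / 20.9
print(f"{size_ratio:.0f}x larger")  # 72x larger
```

Under those assumptions the file averages under 5 bits per weight, which would be consistent with a quantized build rather than 16-bit weights and helps explain how it fits on a laptop.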
Willison concludes that the past six months have been defined by two major themes: coding agents becoming genuinely productive tools for daily development work, and local models that run on consumer hardware dramatically exceeding expectations. While these local models remain weaker than frontier cloud-based systems, their rapid improvement has made sophisticated AI capabilities accessible without expensive infrastructure, marking a significant shift in the landscape of practical AI deployment.
197 comments
• A corporate instructor describes being pressured to use AI for lesson planning and slide creation, viewing it as a bandwagon trend that strips teaching of personal expertise and experience. Most colleagues embrace it uncritically, though they use it only for preparation tasks.
• Enterprise AI deployment shows office workers are amazed by Copilot and ChatGPT for basic tasks, while agents that automate work at scale create a magical experience for nontechnical users.
• Claude in Office has become a tipping point for nontechnical workers, producing immaculate slide decks and reducing the need for BI help in finance departments.
• A detailed workflow for creating presentations uses VS Code with markdown templates and GitHub Copilot, converting to Word and PowerPoint with AI assistance at each step, preferring markdown for version control.
• Personal use cases include AI for email management and file organization, while a language tutor uses AI to generate fresh practice content for students based on school lesson plans, resulting in faster improvement.
• A former data scientist now uses code agents for nearly all document output tasks, having transitioned from simple chat completions three months ago.
• An editor in a non-tech industry reports no change in their work over the past four years despite AI advancements.
• Vibe coding with the latest models still struggles to produce fully fledged applications, though it can produce barebones versions quickly, suggesting that marketing may be outpacing actual capability improvements.
• Successful vibe coding requires extensive upfront design documentation, phased implementation plans, TDD, automated code reviews, and thorough QA, with agents reviewing each other's output iteratively.
• Opus 4.5 in November 2025 marked a genuine inflection point where hand-holding became unnecessary, with one developer stopping professional coding entirely after that release.
• Context window improvements, particularly 1 million tokens, significantly increase the scope of tasks AI can handle before degrading, though some find the optimal zone is around 100k-500k tokens.
• Non-coders report dramatic improvements in building tools like scrapers and data processing pipelines, with tasks that would have taken 10-100x longer now achievable with sufficient domain knowledge to guide the AI.
• Security researchers note a major inflection point with AI finding vulnerabilities at scale: fewer new vulnerabilities get introduced while old bugs are discovered faster, a combination expected to cause short-term chaos.
• The pelican-on-a-bicycle SVG benchmark has become trivial for modern models, leading to new benchmarks like opossum-on-e-scooter as the field rapidly advances.
• DeepSeek V4-Flash has made context caching virtually free, representing an underappreciated efficiency improvement.
• Differences between top models become most apparent at the threshold of complex tasks, with head-to-head comparisons revealing substantial performance gaps invisible in casual use.
• RLVR improvements have boosted easily verifiable tasks like code and math, with less impressive gains in other domains, raising questions about generalization limitations.
• Companies are already reducing engineering teams by a third and eliminating QA entirely, despite concerns that vibe-coded systems require more verification, not less.
• The best AI use cases currently involve pattern matching and vulnerability discovery with human help, while code and prose generation remains mediocre and unreliable for agentic tasks.
• A fundamental limitation emerges where AI excels at pattern synthesis but lacks understanding one level of abstraction above the code, suggesting models will get wider before they get higher in capability.
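The context-window bullet above (a 1 million token window, with a reported sweet spot of roughly 100k-500k tokens) can be turned into a rough budgeting check. This is a sketch only: the 4-characters-per-token ratio is a common rule of thumb for English text, not a real tokenizer, and the function names and zone labels are hypothetical.

```python
CHARS_PER_TOKEN = 4  # assumption: crude heuristic, real tokenizers vary by model


def estimate_tokens(text: str) -> int:
    """Approximate the token count of a prompt from its character length."""
    return len(text) // CHARS_PER_TOKEN


def context_zone(token_count: int,
                 window: int = 1_000_000,
                 sweet_spot_max: int = 500_000) -> str:
    """Classify a prompt size against the thresholds quoted by commenters."""
    if token_count > window:
        return "over window"
    if token_count > sweet_spot_max:
        return "fits, but past the reported sweet spot"
    return "within the reported sweet spot"


prompt = "x" * 400_000  # stand-in for a 400,000-character prompt
print(estimate_tokens(prompt))                # 100000
print(context_zone(estimate_tokens(prompt)))  # within the reported sweet spot
print(context_zone(750_000))                  # fits, but past the reported sweet spot
```

The thresholds are defaults rather than constants so they can be adjusted per model, since the commenters' figures are anecdotal.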
The discussion reveals a sharp divide between enthusiastic early adopters experiencing genuine productivity gains and skeptics questioning whether current capabilities justify the hype. Technical users report transformative improvements with proper methodology, emphasizing that success requires extensive upfront design, phased implementation, and rigorous testing rather than pure vibe coding. Meanwhile, nontechnical workers are experiencing their own inflection point with tools like Claude in Office, though concerns emerge about organizations mandating AI use without clear justification. The security community anticipates chaos as AI vulnerability discovery accelerates, while fundamental questions persist about whether pattern synthesis without true understanding represents a permanent limitation. Workforce reductions are already occurring, with QA teams eliminated and engineering staff cut by a third, creating anxiety about the transition period despite predictions of expanded markets and new opportunities.