SANA-WM, a 2.6B open-source world model for 1-minute 720p video
401 points
• 2 days ago
SANA-WM is a 2.6B-parameter open-source world model from NVIDIA researchers that generates high-fidelity 720p video up to one minute long from a single starting image and a camera trajectory. It is designed for both efficiency and quality: training takes 15 days on 64 H100 GPUs, and inference runs on a single GPU. A distilled variant with NVFP4 quantization denoises a 60-second 720p clip in 34 seconds on an RTX 5090, making minute-scale world modeling more accessible.
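To put the figures above in more familiar units, the following back-of-the-envelope sketch converts them into GPU-hours and a real-time factor. The function and constant names are illustrative only. Note that the 34-second figure covers denoising as stated in the summary, so the end-to-end ratio may differ.

```python
# Figures quoted in the summary above; everything else is illustrative.
TRAIN_GPUS = 64        # H100 GPUs used for training
TRAIN_DAYS = 15        # training duration in days
CLIP_SECONDS = 60      # length of the generated clip
DENOISE_SECONDS = 34   # denoising time on an RTX 5090 (distilled, NVFP4)


def gpu_hours(gpus: int, days: int) -> int:
    """Total training budget in GPU-hours."""
    return gpus * days * 24


def realtime_factor(clip_seconds: float, wall_seconds: float) -> float:
    """Seconds of video produced per second of denoising."""
    return clip_seconds / wall_seconds


print(gpu_hours(TRAIN_GPUS, TRAIN_DAYS))                          # 23040
print(round(realtime_factor(CLIP_SECONDS, DENOISE_SECONDS), 2))   # 1.76
```

In other words, the training budget is roughly 23,000 H100-hours, and the distilled model denoises about 1.76 seconds of video per second of wall-clock time.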
Four core innovations drive SANA-WM's performance. Hybrid Linear Attention combines frame-wise Gated DeltaNet with periodic softmax attention to keep long-range context coherent while staying memory-efficient, avoiding the out-of-memory failures that all-softmax models hit at 60-second durations. Dual-Branch Camera Control pairs a global pose branch with a fine, pixel-aligned geometric branch to follow 6-DoF camera paths precisely. A Two-Stage Generation Pipeline feeds stage-1 outputs into a dedicated 17B long-video refiner that sharpens texture, motion, and consistency across the full sequence. Finally, a Robust Annotation Pipeline extracts accurate metric-scale 6-DoF camera poses from public videos, producing about 213K training clips with high-quality, spatiotemporally consistent action labels.
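The memory argument behind the hybrid attention design can be illustrated with a toy version of a gated delta-rule update, the recurrence that Gated DeltaNet-style layers are built on. This is a minimal sketch under stated assumptions, not SANA-WM's implementation: the update form (decay the state, erase the old value stored under a key, write the new one) follows the published gated delta rule, while the function names, the tiny dimensions, and the "one softmax layer in every four" schedule are hypothetical choices made for illustration.

```python
# Hypothetical sketch: names, sizes, and the layer schedule are assumptions,
# not taken from the SANA-WM code.

def layer_kinds(num_layers, softmax_every):
    """Label layers so that every `softmax_every`-th one uses softmax attention."""
    return ["softmax" if (i + 1) % softmax_every == 0 else "gated_delta"
            for i in range(num_layers)]


def gated_delta_step(S, k, v, alpha, beta):
    """One gated delta-rule update of a d_v x d_k state matrix S.

    S_new = alpha * S (I - beta k k^T) + beta v k^T
    alpha in [0, 1] decays old memory; beta in [0, 1] is the write strength.
    """
    d_v, d_k = len(S), len(k)
    # S k: the value the current state would return for key k.
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    return [[alpha * (S[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
             for j in range(d_k)]
            for i in range(d_v)]


def read(S, q):
    """Query the state: o = S q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]


S = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]

S = gated_delta_step(S, key, [1.0, 2.0], alpha=1.0, beta=1.0)
first = read(S, key)     # [1.0, 2.0]

# Writing a new value under the same key replaces the old one.
S = gated_delta_step(S, key, [3.0, 4.0], alpha=1.0, beta=1.0)
second = read(S, key)    # [3.0, 4.0]

print(layer_kinds(8, 4))
print(first, second)
```

The state `S` keeps the same size no matter how many tokens have been processed, whereas a softmax layer has to retain keys and values for every past token. Interleaving a few softmax layers, as the summary describes, trades some of that memory saving for exact long-range lookups.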
The model excels at generating diverse, autonomously animated scenes from stationary first-person viewpoints. Demos showcase environments ranging from snowbound alpine trails and underwater ancient temples to alien swamps and post-apocalyptic highways, where independent motion like drifting snow particles, swaying vegetation, flickering flames, and flowing water continues naturally throughout the minute-long clips. SANA-WM also supports controllable camera trajectories, with examples showing different paths taken from the same starting frame across salt flats, frozen lakes, and jungle canyons, demonstrating precise adherence to specified 6-DoF movements.
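As an illustration of what "different paths from the same starting frame" means in terms of 6-DoF input, the sketch below builds two camera trajectories that share an initial pose. It is only a hedged example: the `Pose6DoF` type, the `forward_arc` helper, the Euler-angle convention, and the frame rate are all hypothetical, and the summary does not describe the trajectory format SANA-WM actually consumes.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose6DoF:
    """Hypothetical 6-DoF pose: 3 translation and 3 rotation components."""
    x: float      # metres, right
    y: float      # metres, up
    z: float      # metres, forward
    yaw: float    # radians, rotation about the up axis
    pitch: float  # radians
    roll: float   # radians


def forward_arc(start, speed, yaw_rate, seconds, fps):
    """Move forward at `speed` m/s while turning at `yaw_rate` rad/s.

    Returns one pose per frame, including the starting pose.
    """
    dt = 1.0 / fps
    poses = [start]
    p = start
    for _ in range(int(seconds * fps)):
        yaw = p.yaw + yaw_rate * dt
        p = Pose6DoF(
            p.x + speed * dt * math.sin(yaw),
            p.y,
            p.z + speed * dt * math.cos(yaw),
            yaw,
            p.pitch,
            p.roll,
        )
        poses.append(p)
    return poses


origin = Pose6DoF(0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
straight = forward_arc(origin, speed=1.0, yaw_rate=0.0, seconds=2, fps=4)
turning = forward_arc(origin, speed=1.0, yaw_rate=0.5, seconds=2, fps=4)

print(len(straight), straight[-1].z)   # 9 2.0
print(turning[-1].x > 0.0)             # True: the turning path drifts sideways
```

Both lists start from the same pose and diverge afterwards, which mirrors the demos where one starting frame is explored along several camera paths.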
On the authors' one-minute world-model benchmark, SANA-WM achieves stronger action-following accuracy than prior open-source baselines while matching the visual quality of large-scale industrial models such as LingBot-World and HY-WorldPlay at 36 times higher throughput. Compared with stage-1 outputs alone, the two-stage refinement notably improves quality in the later portions of a clip, along with texture detail and motion smoothness, addressing the degradation common in long-duration video generation. The combination of efficient training, single-GPU deployment, and open-source availability positions SANA-WM as a significant step toward practical, high-quality world modeling.
152 comments
• Hand-crafted intentionality in games, exemplified by FromSoftware's meticulous object placement, creates immersive experiences that feel alive. It is unclear whether world models can replicate this level of deliberate design, or whether human developers can use them modularly to create intentional experiences.
• AI-generated content risks flooding the world with superficially plausible but hollow experiences, making it harder for discerning audiences to find genuine quality, similar to how Amazon's marketplace dynamics push consumers toward top-listed products regardless of actual value.
• World models have significant potential beyond gaming, particularly for robotics training and simulation, where they can help robots predict consequences of actions and develop better spatial reasoning than current LLMs, which often fail at basic physical tasks.
• Procedural generation in games like Dwarf Fortress and Minecraft demonstrates that lack of hand-crafted intentionality can be central to a game's appeal, with carefully crafted systems producing emergent phenomena that even designers haven't seen before.
• The gaming market is diverse, with FromSoftware-quality titles representing less than 5% of it, so AI-assisted writing and design could improve the majority of games that are currently considered low quality.
• World models could enable new forms of interactive entertainment, such as narrative experiences where each scene is a carefully crafted world that behaves like a film when untouched but becomes an interactive narrative game when engaged with.
• Current world models face significant consistency issues: demo videos show glaring errors when the camera turns back to previously shown areas, and even the best closed-source video models struggle with long-form content involving humans.
• The commercial value of world models remains uncertain, with no meaningful revenue generated so far, though promising markets include robotics training, video interfaces for AI agents, and some entertainment applications.
• World models could be used in game development workflows to generate assets during development, create procedural experiences that incorporate temporal inconsistency into the setting, or enable level designers to prompt for details and iterate on generated content.
• The technology represents a step toward more advanced applications like digital twins and general-purpose robotics, where learned simulators would replace hand-coded ones, following the principle that data-driven approaches eventually outperform manually engineered solutions.
The discussion reveals a tension between appreciation for the technical achievements of world models and skepticism about their ability to replicate the intentionality that makes hand-crafted experiences meaningful. While some participants see potential for procedural generation and emergent gameplay, others worry about a future flooded with hollow, impersonal content. The most promising near-term applications appear to be in robotics simulation and training rather than entertainment, where the technology's limitations in physical consistency and long-term coherence remain significant barriers. There's also recognition that the gaming market is diverse enough to accommodate both meticulously designed experiences and procedurally generated worlds, suggesting these tools may expand creative possibilities rather than simply replacing human craftsmanship.