Kioxia and Dell cram 10 PB into slim 2RU server
138 points • 2 days ago
Kioxia and Dell have partnered to create an extremely dense storage server, packing 10 petabytes of flash capacity into a slim 2RU form factor. Dell's PowerEdge R7725xd server uses 40 of Kioxia's LC9 E3.L form factor NVMe SSDs, each with a capacity of 245.76 TB, to achieve a total of 9.8 PB. The system is powered by AMD EPYC 9005 processors and supports up to five 400 Gbps network interface cards for high-speed data transfer. Dell's Arun Narayanan highlighted that this combination delivers the storage density and power efficiency needed to scale AI infrastructure without compromising performance. A single rack containing twenty of these servers could hold a massive 196 PB of storage.
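The capacity figures above come from simple arithmetic. The Python sketch below reproduces them under a few assumptions. It uses decimal (SI) units, as drive vendors do (1 PB = 1000 TB). It counts raw capacity before any RAID or erasure-coding overhead. It also assumes a common 42U rack, which is not a figure from the article. The constant and function names are illustrative.

```python
# Capacity arithmetic for the PowerEdge R7725xd configuration described above.
# Assumes decimal (SI) units, as drive vendors use: 1 PB = 1000 TB.
# Raw capacity only: no RAID or erasure-coding overhead is subtracted.

DRIVES_PER_SERVER = 40       # Kioxia LC9 E3.L drives per server
DRIVE_CAPACITY_TB = 245.76   # capacity of one LC9 drive, in TB
SERVER_HEIGHT_U = 2          # chassis height in rack units
SERVERS_PER_RACK = 20        # the article's hypothetical full rack
RACK_HEIGHT_U = 42           # assumption: a common full-height rack


def server_capacity_pb() -> float:
    """Raw capacity of one fully populated server, in PB."""
    return DRIVES_PER_SERVER * DRIVE_CAPACITY_TB / 1000


def rack_capacity_pb() -> float:
    """Raw capacity of a rack holding SERVERS_PER_RACK servers, in PB."""
    return SERVERS_PER_RACK * server_capacity_pb()


if __name__ == "__main__":
    used_u = SERVERS_PER_RACK * SERVER_HEIGHT_U
    print(f"per server: {server_capacity_pb():.2f} PB")  # per server: 9.83 PB
    print(f"per rack:   {rack_capacity_pb():.1f} PB")    # per rack:   196.6 PB
    print(f"rack space: {used_u}U of {RACK_HEIGHT_U}U")  # rack space: 40U of 42U
```

The article's 196 PB figure multiplies the rounded 9.8 PB. Using the exact per-drive capacity gives about 196.6 PB.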
Kioxia's Neville Ichhaporia emphasized the benefits for customers, noting that these servers enable massive ingestion streams, effortless scaling of data lakes, and handling large backups in a fraction of the physical footprint, significantly improving total cost of ownership. The LC9 SSDs are central to this achievement, representing some of the highest-capacity drives available. Other major storage developers are also pushing into this ultra-high-capacity SSD space, including Micron with its 6600 ION, Sandisk with the UltraQLC SN670, and SK Hynix with its AIN D drive, along with its Solidigm subsidiary.
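For a rough sense of scale for "massive ingestion streams," the five 400 Gbps NICs mentioned above add up to 2 Tbps, or 250 GB/s. The Python sketch below estimates how long filling one server would take. It rests on hypothetical, idealized assumptions: the NICs are the only bottleneck, they run at full line rate, and protocol overhead is ignored. Real-world throughput would be lower. The names are illustrative.

```python
# Back-of-the-envelope ingest time for one ~9.83 PB server.
# Assumptions (not from the article): the NICs are the only bottleneck,
# they run at full line rate, and protocol overhead is ignored.

NIC_COUNT = 5
NIC_GBPS = 400                           # line rate per NIC, gigabits/second
SERVER_CAPACITY_BYTES = 40 * 245.76e12   # 40 drives x 245.76 TB (decimal)


def aggregate_bytes_per_second(nics: int = NIC_COUNT,
                               gbps: float = NIC_GBPS) -> float:
    """Combined network throughput in bytes per second (8 bits per byte)."""
    return nics * gbps * 1e9 / 8


def fill_time_hours(capacity_bytes: float = SERVER_CAPACITY_BYTES) -> float:
    """Hours needed to write capacity_bytes at the aggregate line rate."""
    return capacity_bytes / aggregate_bytes_per_second() / 3600


if __name__ == "__main__":
    print(f"aggregate: {aggregate_bytes_per_second() / 1e9:.0f} GB/s")  # aggregate: 250 GB/s
    print(f"fill time: {fill_time_hours():.1f} h")                     # fill time: 10.9 h
```

Even in this idealized case, filling one server takes roughly 11 hours. Any RAID or erasure-coding write overhead would stretch that further.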
Looking even further ahead, Scality has revealed that Samsung is developing nearline-class SSDs with capacities reaching up to 1 petabyte, which are seen as potential replacements for traditional hard disk drives. This indicates a continued industry trend toward vastly increasing solid-state storage density. The collaboration between Kioxia and Dell demonstrates a practical application of these high-capacity SSDs, targeting demanding workloads like AI infrastructure where both performance and density are critical.
110 comments
• Orbital CDNs are a more practical near-term application for high-density storage than orbital datacenters, as caching media content in LEO or GEO satellites could reduce bandwidth strain on constellations like Starlink, especially since higher latency is tolerable for non-real-time content delivery.
• Radiation susceptibility makes cutting-edge SSDs poorly suited for space deployment; the RAD750 CPU uses a 150nm process, while modern SSDs rely on much denser, radiation-vulnerable transistors, meaning rapid degradation would occur without extensive hardening.
• Although newer chips like AMD's 7nm Versal SoCs are used in Starlink satellites, FPGAs still need additional engineering for radiation tolerance, such as bitstream scrubbing and triple modular redundancy (TMR). This indicates that commercial off-the-shelf components still need significant adaptation to operate reliably in space.
• Flash storage may not maintain data integrity over a 5-year Starlink satellite lifespan without substantial redundancy, and it remains unclear whether LEO servers solve problems better addressed by terrestrial datacenters given cost and reliability constraints.
• The RAD750 is outdated and designed for extreme radiation environments, but LEO is far more forgiving, allowing commercial-grade hardware with error correction to function reliably for years, making high-density storage feasible in orbit with proper design.
• Transistor density has increased dramatically, from ~3 MTr/mm² at 45nm (2008) to ~220 MTr/mm² at 3nm (2023–2024). That is roughly a 70-fold rise, which makes modern chips far more vulnerable to radiation unless this is mitigated through design and redundancy.
• Modern GPUs can operate in satellites if some error rate is acceptable, as demonstrated by Starcloud-1, suggesting non-mission-critical applications can leverage advanced commercial hardware in space with appropriate fault tolerance.
• Redundancy and error correction are essential for large-scale storage arrays in any environment. Simply adding more of both can offset the higher failure rates expected in space due to radiation.
• In theory, tighter transistor packing could improve radiation resistance if shielded effectively, with water or lead serving dual roles as coolant and radiation barrier, though practical implementation remains complex.
• Lead-acid batteries might offer incidental radiation shielding due to their lead content, but metals can produce secondary particles when struck by cosmic rays, limiting their effectiveness as shields.
• Sending valuable electronics into space raises sustainability concerns, as recovery and recycling are unlikely, effectively removing rare materials from future terrestrial use and circular economies.
• Recycling on Earth is already inefficient, with most waste ending up in landfills or incinerators, so launching materials into space doesn't significantly worsen the problem from a resource recovery standpoint.
• High-density storage enables extreme consolidation—10 PB in a single 2U server—making it attractive for colocation in expensive environments like HFT, where space efficiency offsets high upfront costs.
• While HFT systems rarely access live storage during execution, backtesting algorithms could benefit from petabyte-scale datasets, though such workloads are typically not co-located with exchanges.
• PCIe bandwidth is a bottleneck for fully utilizing dense SSD arrays. Current systems top out at five 400 Gbps NICs when the NICs share PCIe lanes with the storage, but the upcoming PCIe 7.0 and 8.0 generations should help unlock higher throughput.
• The roadmap toward 1 PB SSDs signals the eventual demise of HDDs for bulk storage, with QLC NAND already outperforming HDDs in power-constrained hyperscale environments, though cost remains a barrier.
• Enterprise NVMe prices are extremely high—estimated at $500k–$1M+ for a 10 PB setup—limiting adoption to hyperscalers, defense, and research sectors for the foreseeable future.
• Dell's listed price for a fully loaded 40-drive chassis is around $40M. Even with the 30–40% discounts typical of enterprise deals, that works out to roughly $24M–$28M, well above $10M.
• Consumer SSD prices have stagnated or reversed, with 1 TB drives now costing what 4 TB drives did years ago, making high-capacity upgrades unaffordable for most individuals.
• Some users report being unable to purchase certain Kioxia drives due to supply shortages, indicating that high demand from datacenters may be constraining availability and driving up prices for smaller buyers.
• Enterprise SSDs are increasingly consumable items, with limited lifespans under heavy write loads, meaning secondary market drives may be worn out and unsuitable for reliable long-term use.
• Intel and Micron SSDs appear to fail more frequently than other brands in enterprise caching roles, often becoming read-only after heavy use, raising concerns about endurance in demanding applications.
• There is strong consumer demand for affordable, high-capacity SSDs—such as 4 TB TLC/QLC drives under $100—to enable local storage solutions and reduce reliance on cloud providers.
• High-capacity 3.5-inch SATA SSDs are unlikely to emerge. They offer no performance or economic advantage over PCIe-based form factors like M.2 or EDSFF, which provide better density and efficiency.
• Adapters exist to connect EDSFF drives (like E3.L) to USB or PCIe interfaces, enabling potential use with consumer devices, though no compact, portable solutions are currently available.
• Data retention on high-density QLC SSDs is likely insufficient for long-term archival without significant redundancy, limiting their usefulness for cold storage or backup purposes.
• Homelab enthusiasts hope that enterprise-grade storage will eventually become affordable, but current pricing and form factor incompatibility make it inaccessible for personal use.
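The triple modular redundancy (TMR) mentioned in the FPGA discussion can be illustrated with a bitwise majority voter. The same value is held in three independent copies, and each output bit takes the value that at least two copies agree on. A single radiation-induced bit flip in one copy is therefore outvoted. This is a minimal Python sketch of the voting idea only, not how any particular FPGA toolchain or Starlink design implements it. The function names `tmr_vote` and `flip_bit` are hypothetical.

```python
# Bitwise majority voting as used in triple modular redundancy (TMR).
# Three redundant copies of a word are compared; each output bit takes the
# value held by at least two of the three copies.


def tmr_vote(a: int, b: int, c: int) -> int:
    """Return the bitwise majority of three redundant integer words."""
    return (a & b) | (a & c) | (b & c)


def flip_bit(word: int, bit: int) -> int:
    """Simulate a single-event upset by inverting one bit of a word."""
    return word ^ (1 << bit)


if __name__ == "__main__":
    stored = 0b1011_0110
    # One copy suffers an upset in bit 3; the voter still recovers the value.
    corrupted = flip_bit(stored, 3)
    assert tmr_vote(stored, corrupted, stored) == stored
    # If two copies flip the same bit, the majority is wrong: TMR masks
    # only a single fault per bit position.
    assert tmr_vote(corrupted, corrupted, stored) == corrupted
    print("single upset masked")
```

This limitation is why TMR is paired with scrubbing. A voter masks one faulty copy, but if errors accumulate in a second copy they eventually defeat it. Scrubbing periodically rewrites configuration memory from a known-good source to prevent that buildup.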
The discussion reveals a tension between the exciting potential of ultra-high-density storage and the practical limitations imposed by physics, economics, and engineering. While orbital applications like CDNs are seen as more feasible than full datacenters, radiation, power, and thermal challenges in space remain significant barriers. On Earth, the extreme cost and rapid obsolescence of cutting-edge SSDs confine them to hyperscalers and specialized sectors, with little near-term benefit for consumers. There is widespread enthusiasm for the eventual democratization of dense, affordable storage, but current trends—rising prices, supply constraints, and technical trade-offs—suggest that goal remains distant. Sustainability concerns also loom, as both terrestrial and space-based disposal of electronics risks permanently losing valuable materials.