How to Write to SSDs [pdf]
209 points
• 3 days ago
This paper argues that database systems must adopt out-of-place writes to fully leverage SSD performance and extend SSD lifespan. The authors demonstrate that traditional in-place write designs, used by systems like MySQL and PostgreSQL, suffer from significant write amplification (WA) at both the DBMS and SSD layers. For example, a single 4 KiB page write in LeanStore results in 18.85 KiB of actual flash writes, a 4.7x amplification caused by DBMS-level doublewrite buffering and SSD-level garbage collection. This wastes bandwidth, increases latency, and drastically shortens SSD endurance, with the tested SSD reaching its write limit in just 1.5 months under load.
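The amplification figure above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the 4 KiB and 18.85 KiB values come from the summary, while the split into a DBMS factor and an SSD factor assumes that doublewrite buffering writes every page exactly twice, which the summary does not state explicitly.

```python
def write_amplification(flash_kib: float, logical_kib: float) -> float:
    """Write amplification factor: bytes hitting flash per logical byte written."""
    return flash_kib / logical_kib


PAGE_KIB = 4.0     # logical page write (from the summary)
FLASH_KIB = 18.85  # measured flash writes per page write (from the summary)

total_waf = write_amplification(FLASH_KIB, PAGE_KIB)

# ASSUMPTION: doublewrite buffering writes each page twice, so DBMS WAF = 2.
ASSUMED_DBMS_WAF = 2.0

# Total WAF is the product of the per-layer factors, so the SSD-level
# factor implied by the assumption above is the quotient.
implied_ssd_waf = total_waf / ASSUMED_DBMS_WAF

print(f"total WAF       = {total_waf:.2f}")
print(f"implied SSD WAF = {implied_ssd_waf:.2f}")
```

Because the layers multiply rather than add, removing either source of amplification shrinks the total by that layer's full factor.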
To address this, the authors propose a set of optimizations built on an out-of-place write architecture. At the DBMS level, they introduce page-wise compression combined with page packing to reduce write volume while maintaining efficient 4 KiB-aligned reads. They also propose Grouping by Death Time (GDT), which uses database semantics to estimate when pages will be invalidated and groups those with similar lifetimes together. This reduces DB-level write amplification during garbage collection by ensuring zones contain pages that become invalid around the same time.
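A minimal sketch of the grouping idea behind GDT, under stated assumptions: the `Page` type, the `est_death_time` field, the fixed-width bucketing, and the function names are all hypothetical and are not taken from ZLeanStore. The point is only that when pages with similar estimated death times share a zone, the zone tends to be fully invalid by the time it is garbage collected, so few valid pages need copying.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    page_id: int
    est_death_time: float  # hypothetical estimate of when the page is invalidated


def group_by_death_time(pages, bucket_width, zone_capacity):
    """Place pages into zones so each zone holds pages from one death-time bucket."""
    buckets = defaultdict(list)
    for page in pages:
        buckets[int(page.est_death_time // bucket_width)].append(page)

    zones = []
    for key in sorted(buckets):
        group = buckets[key]
        # Split a bucket into as many zones as its page count requires.
        for i in range(0, len(group), zone_capacity):
            zones.append(group[i:i + zone_capacity])
    return zones


def valid_pages_at(zone, now):
    """Pages that are still valid at time `now` and would be copied by GC."""
    return sum(1 for page in zone if page.est_death_time > now)


pages = [Page(i, t) for i, t in enumerate([1, 2, 101, 102, 3, 103])]
zones = group_by_death_time(pages, bucket_width=100, zone_capacity=3)
for zone in zones:
    print([p.page_id for p in zone], "valid at t=50:", valid_pages_at(zone, 50))
```

In this toy input the short-lived pages end up in one zone and the long-lived pages in another, so at time 50 the first zone can be reclaimed without copying anything. Arrival-order placement would have mixed both kinds in each zone.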
At the SSD level, the paper presents techniques to minimize internal SSD write amplification. For Zoned Namespace (ZNS) SSDs, the design naturally aligns with the host-managed zones, guaranteeing an SSD WAF of 1. For standard SSDs, the authors align the DBMS garbage collection unit with the SSD's internal superblock size, inferred either through FDP Reclaim Unit information or a ZNS-like write pattern. They also introduce the NoWA (No Write Amplification) pattern, which uses compensation writes to ensure the SSD always has fully invalidated superblocks available, eliminating the need for SSD-level garbage collection and achieving WAF = 1 even on commodity hardware.
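To illustrate why aligning the DBMS garbage collection unit with the superblock size matters, here is a toy model rather than the paper's implementation. It assumes pages are laid out sequentially, so page `i` lives in superblock `i // superblock_pages`, and that the SSD must copy every still-valid page out of any superblock it erases. The function names and the 1024-page superblock size are made up for the example, and compensation writes are not modeled.

```python
def align_gc_unit(desired_pages: int, superblock_pages: int) -> int:
    """Round a desired GC unit size up to a whole number of superblocks."""
    return -(-desired_pages // superblock_pages) * superblock_pages


def gc_copy_cost(invalid_start: int, invalid_len: int, superblock_pages: int) -> int:
    """Valid pages the SSD must relocate to erase every superblock that
    overlaps the invalidated page range [invalid_start, invalid_start + invalid_len)."""
    if invalid_len <= 0:
        return 0
    first_sb = invalid_start // superblock_pages
    last_sb = (invalid_start + invalid_len - 1) // superblock_pages
    touched_pages = (last_sb - first_sb + 1) * superblock_pages
    return touched_pages - invalid_len


SUPERBLOCK_PAGES = 1024  # hypothetical superblock size, in pages

aligned = gc_copy_cost(2048, 1024, SUPERBLOCK_PAGES)     # starts on a boundary
misaligned = gc_copy_cost(2560, 1024, SUPERBLOCK_PAGES)  # straddles two superblocks

print("aligned copies:", aligned)
print("misaligned copies:", misaligned)
print("misaligned SSD WAF:", (1024 + misaligned) / 1024)
```

In this model an aligned invalidation frees a whole superblock with no copying, which corresponds to WAF = 1. The same amount of invalid data spread across two superblocks forces the device to relocate as many pages as the host wrote.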
The authors implement these optimizations in ZLeanStore, a modified version of the B-tree-based LeanStore. Evaluation across diverse benchmarks and SSDs shows substantial improvements. On YCSB-A, throughput increases by 1.65–2.24x while flash writes per operation decrease by 6.2–9.8x. For TPC-C with 15,000 warehouses, throughput improves 2.45x with a 7.2x reduction in flash writes. The design also seamlessly supports modern SSD interfaces like ZNS and FDP, demonstrating a practical path toward more efficient and durable database storage.
32 comments
• The paper introduces a "No Write Amplification" (NoWA) pattern that brings the SSD Write Amplification Factor (WAF) close to 1, even when the device is full. Notably, the authors validated it on commodity SSDs from multiple vendors, which suggests it applies broadly to both consumer and enterprise hardware.
• The core insight behind NoWA is that aligning application-level garbage collection with the SSD's internal garbage collection minimizes the need for the SSD to move valid pages around, effectively reducing write amplification at the source and ensuring data slated for deletion remains grouped at the physical block level.
• Zoned storage standards like ZNS and the newer NVMe Flexible Data Placement (FDP) are seen as a critical advancement because they allow applications to tag writes with specific write-affinity identifiers, enabling the drive to co-locate related data and significantly reducing the overhead caused by fragmented garbage collection.
• FDP is highlighted as a standardized feature that is cheap to implement and offers large potential gains from data co-location and write affinity. Availability remains limited, though: it is mostly restricted to expensive enterprise drives that require special procurement, which prevents widespread experimentation or adoption by the broader developer community.
• Enterprise storage systems already use NVRAM buffers to absorb and consolidate random writes before flushing them to SSDs. This helps mask the performance penalty of slow writes, but it does little to reduce write amplification unless data is updated frequently enough to be absorbed entirely within the buffer before it is persisted.
• Some commenters compared the NoWA approach to similar mechanisms used by SMR hard drives or the Zoned XFS filesystem, which also attempt to optimize data placement at lower levels of the storage stack, suggesting that optimizing for these specific hardware characteristics can reduce amplification across different drive technologies.
• SQLite is expected to experience similar write amplification issues to PostgreSQL and MySQL because it relies on in-place updates despite its single-writer architecture, though its specific behavior depends on factors like write skew, fill factor, and the specific characteristics of the underlying SSD hardware.
• The paper is praised for providing a comprehensive framework that connects fragmented storage optimization techniques into a cohesive strategy, effectively bridging the gap between storage engineering expertise and database development, even if it doesn't necessarily create entirely new database architectures.
• The research serves as a foundation for future database storage engine optimizations, potentially leading to highly efficient Postgres extensions or pluggable storage layers that explicitly manage WAF, particularly for large-scale deployments where SSD longevity and write performance are critical bottlenecks.
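The comment above about NVRAM buffers can be made concrete with a small simulation. This is a sketch under simplifying assumptions, not a model of any real storage system: the buffer is a plain LRU write-back cache measured in pages, every eviction costs one SSD page write, and the function name and traces are invented for illustration.

```python
from collections import OrderedDict


def flushed_writes(write_trace, buffer_pages):
    """Count SSD page writes when a write-back LRU buffer absorbs overwrites."""
    if buffer_pages <= 0:
        return len(write_trace)  # no buffer: every write goes to the SSD

    buffer = OrderedDict()  # page_id -> dirty marker, least recently written first
    flushed = 0
    for page_id in write_trace:
        if page_id in buffer:
            buffer.move_to_end(page_id)  # overwrite absorbed in the buffer
            continue
        if len(buffer) >= buffer_pages:
            buffer.popitem(last=False)   # evict the oldest dirty page to the SSD
            flushed += 1
        buffer[page_id] = True
    return flushed + len(buffer)         # remaining dirty pages are flushed at the end


hot_trace = [1, 1, 1, 2, 2, 1]   # rewrites arrive while the page is still buffered
cold_trace = [1, 2, 3, 1, 2, 3]  # rewrites arrive after the page was evicted

print("hot:", flushed_writes(hot_trace, buffer_pages=2))
print("cold:", flushed_writes(cold_trace, buffer_pages=2))
```

With the same buffer size, the first trace reaches the SSD as two page writes while the second reaches it as six. This matches the commenters' point that buffering only cuts write volume when updates recur before eviction.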