When "idle" isn't idle: how a Linux kernel optimization became a QUIC bug
163 points
• 6 days ago
Cloudflare's open-source QUIC implementation, quiche, uses CUBIC as its default congestion control algorithm. CUBIC, which is also widely used for TCP, decides how much data a connection may keep in flight. Recently, engineers discovered a bug where CUBIC's congestion window (cwnd) would get permanently stuck at its minimum size after a congestion collapse, preventing recovery even when network conditions improved. The issue surfaced during integration tests involving heavy packet loss, where about 60% of tests failed to complete within the expected timeframe.
The problem was traced to a Linux kernel optimization from 2017 designed to prevent cwnd inflation after idle periods. The port of that change to quiche introduced a subtle bug: it measured idle time from the last packet send time rather than from the last ACK received time. At minimum cwnd (just two packets), the pipe drains every round trip, so the algorithm falsely detected an idle period each time and pushed the recovery start time forward again. The result was a self-perpetuating cycle in which the congestion window stayed pinned at its minimum, oscillating between recovery and congestion avoidance without ever growing.
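A minimal sketch of that loop, not quiche's actual code: the function and variable names are hypothetical, and the model assumes a two-packet window whose ACKs arrive exactly one RTT after each send. Because the pipe drains every round trip, measuring from the last send always yields a delta of one RTT, so the epoch start keeps pace with the clock and the time elapsed on the CUBIC curve never advances.

```python
RTT_MS = 50  # assumed round-trip time, chosen only for illustration


def shift_epoch_from_last_send(epoch_start_ms, last_sent_ms, now_ms, bytes_in_flight):
    """Hypothetical idle adjustment that measures idle time from the last send."""
    if bytes_in_flight == 0:
        delta = now_ms - last_sent_ms
        if delta > 0:
            # Shift the epoch by the apparent idle time, but never past "now"
            epoch_start_ms = min(epoch_start_ms + delta, now_ms)
    return epoch_start_ms


epoch_start_ms = 0
last_sent_ms = 0
elapsed_on_curve_ms = []
for round_no in range(1, 5):
    # One RTT after the last send, the whole window is ACKed and the pipe is empty
    now_ms = round_no * RTT_MS
    epoch_start_ms = shift_epoch_from_last_send(epoch_start_ms, last_sent_ms, now_ms, 0)
    last_sent_ms = now_ms
    elapsed_on_curve_ms.append(now_ms - epoch_start_ms)

print(elapsed_on_curve_ms)  # prints [0, 0, 0, 0]
```

Every round trip is treated as idle time, so the window is always evaluated at the very start of the curve.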
The fix involved a small but crucial change: measuring idle duration from when bytes_in_flight actually transitioned to zero (approximated by the last ACK time) instead of the last send time. This prevented the inflated delta calculation that was causing the recovery boundary to chase the send time. With this three-line change, the congestion window could properly grow along the expected CUBIC curve after loss events, restoring normal performance. The solution preserved the original optimization's benefits for genuinely idle connections while eliminating the death spiral at minimum cwnd.
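The effect of the changed measurement can be sketched as follows. This is an illustration under assumptions, not the actual three-line patch: the function name, the millisecond timestamps, and the two scenarios are all hypothetical.

```python
def shift_epoch_from_last_ack(epoch_start_ms, last_ack_ms, now_ms, bytes_in_flight):
    """Hypothetical idle adjustment that measures idle time from the last ACK,
    used here as an approximation of when bytes_in_flight dropped to zero."""
    if bytes_in_flight == 0:
        delta = now_ms - last_ack_ms
        if delta > 0:
            # Only time spent with an empty pipe shifts the epoch
            epoch_start_ms = min(epoch_start_ms + delta, now_ms)
    return epoch_start_ms


# Minimum cwnd: the next packet goes out as soon as the ACK arrives,
# so no idle time is measured and the epoch stays where it was.
busy_epoch_ms = shift_epoch_from_last_ack(0, last_ack_ms=50, now_ms=50, bytes_in_flight=0)

# Genuinely idle: the last ACK arrived at 100 ms and nothing is sent until 5100 ms,
# so the epoch still moves forward by the 5000 ms of real idle time.
idle_epoch_ms = shift_epoch_from_last_ack(0, last_ack_ms=100, now_ms=5100, bytes_in_flight=0)

print(busy_epoch_ms, idle_epoch_ms)  # prints 0 5000
```

In the first case the time on the curve keeps advancing from one round trip to the next. In the second, the idle gap is still excluded from the curve, which is the behavior the original optimization was meant to provide.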
This case highlights how defining "idle" is more complex than it appears, especially at small window sizes where pipeline delays can mimic idleness. The bug was invisible during normal operation and only manifested after severe congestion events. Despite weeks of investigation using qlog instrumentation and visualization, the final fix was remarkably simple, demonstrating that sometimes the most elusive bugs have elegant solutions. The fix has been contributed to the cloudflare/quiche repository, improving reliability for QUIC connections using CUBIC congestion control.
36 comments
• The article's writing style feels AI-assisted, with random bold words and an unnatural buildup. Including a recruitment plug just days after laying off 20% of the company, with only one engineering intern role currently open, raises questions about judgment and whether the PR team was also affected.
• Cloudflare's decision to rewrite QUIC in Rust for userspace makes sense, but maintaining an in-house implementation requires vigilance to avoid missing critical bug fixes from the kernel, as these implementations typically receive less scrutiny than kernel code.
• The specific CUBIC bug involving a sudden congestion window jump after quiescence was originally discovered in Google's QUIC library in 2015 and later reported to the TCP kernel team, highlighting how congestion control algorithms are prone to subtle logic bugs with dramatic real-world consequences.
• Using battle-tested implementations of congestion control algorithms is valuable because they've been tested across diverse real-world Internet traffic, catching failure modes that in-house code might miss.
• The article fails to define "CCA" (Congestion Control Algorithm) despite having a dedicated section explaining it, which is a significant oversight for readers unfamiliar with the term.
• CUBIC's recovery in datacenters with large pipes is slow: it takes nearly two seconds to reach the bandwidth-delay product after a loss, and it repeatedly shoots itself in the foot upon hitting the ceiling. BBR, by contrast, uses a model-based approach with headroom to achieve higher throughput without reacting as aggressively to loss.
• The article's structure and subtitles feel very AI-generated, with the second half being particularly obvious, and the writing is engineered to make you feel rather than to structure content meaningfully, unlike more natural technical writeups.
• Despite the LLM pass making the article worse with unnecessary hand-holding, the underlying engineering work is solid, with a well-designed test that revealed the CUBIC bug when the graph didn't match expectations, demonstrating genuine engineering rigor.
• The article's title could more accurately reflect that the issue stemmed from copying Linux kernel code without fully understanding it and missing subsequent fixes, and it lacks takeaways on preventing such issues, which is a missed opportunity given the preventable nature of the bug.
• There's a noticeable shift in Cloudflare's blog post quality over the years, with recent posts exposing weirdness rather than highlighting engineering excellence, raising questions about changes in hiring practices or engineering culture.
The discussion reveals skepticism about the article's authenticity and quality, with multiple commenters identifying AI-assisted writing patterns and poor editorial decisions. The technical content about the CUBIC bug is appreciated, but the presentation is criticized for lacking clarity and natural structure. There's concern about Cloudflare's recent blog posts and engineering culture, with the consensus suggesting a decline in quality despite the underlying technical work being sound. The conversation also highlights the importance of using well-tested congestion control implementations and the value of rigorous testing to uncover subtle algorithm bugs.
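For the comment about CUBIC's slow recovery after loss, a rough sketch of the underlying formula may help. It uses the window function and the constants C = 0.4 and beta = 0.7 from RFC 9438, and it deliberately ignores refinements such as the Reno-friendly region and fast convergence. The helper names are hypothetical, and the resulting time depends on the window size before the loss, so it is not meant to reproduce the commenter's figure.

```python
C = 0.4     # CUBIC scaling constant (RFC 9438)
BETA = 0.7  # multiplicative decrease factor (RFC 9438)


def seconds_to_regain_w_max(w_max_segments):
    """K: time for the cubic curve to climb back to W_max after one loss,
    assuming the window was cut to BETA * W_max."""
    return (w_max_segments * (1 - BETA) / C) ** (1 / 3)


def cubic_window(t_s, w_max_segments):
    """W_cubic(t) = C * (t - K)^3 + W_max, in segments, t_s seconds into the epoch."""
    k = seconds_to_regain_w_max(w_max_segments)
    return C * (t_s - k) ** 3 + w_max_segments


k = seconds_to_regain_w_max(100)
print(round(k, 2), round(cubic_window(0, 100), 2), round(cubic_window(k, 100), 2))
```

K does not depend on the RTT in this formula, only on the window size before the loss, which is why recovery takes longer as the pipe gets larger.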