Hacker News Chinese Digest

Article Abstract

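On November 18, 2025, a change to database permissions caused a feature file on Cloudflare's network to grow abnormally large, triggering a global service outage. The failure was not caused by a cyberattack: the enlarged feature file exceeded a limit in the software that routes traffic. The incident lasted roughly six hours, during which users visiting customer websites saw error pages.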

Article Summary

Cloudflare Service Outage of November 18, 2025: Incident Report

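Incident Overview
At 11:20 UTC on November 18, 2025, Cloudflare's network began failing to deliver core traffic, and users visiting customer websites were shown error pages. The outage lasted about 6 hours, with full recovery at 17:06.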

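Root Cause
1. Database permission change: an update to the permission system of the ClickHouse database caused duplicate entries in the feature configuration file used by the Bot Management system, doubling the file's size.
2. System limit triggered: the core proxy software enforces a preset limit on the feature file (200 features). The abnormal file exceeded that limit, and the software crashed.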
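The interaction between duplicate entries and the 200-feature limit can be sketched in a few lines of Rust. This is a minimal illustration, not Cloudflare's code: the function names, the `feature_N` naming, and the figure of 150 unique features are all assumptions chosen so that doubling crosses the limit stated above.

```rust
use std::collections::HashSet;

/// Limit on the number of features, as stated in the summary above.
const MAX_FEATURES: usize = 200;

/// Hypothetical input: `unique` feature names, each emitted `copies` times,
/// mimicking a query that starts returning duplicate rows.
fn sample_rows(unique: usize, copies: usize) -> Vec<String> {
    let mut rows = Vec::with_capacity(unique * copies);
    for _ in 0..copies {
        for i in 0..unique {
            rows.push(format!("feature_{i}"));
        }
    }
    rows
}

/// Returns true when the number of rows is over the limit.
fn exceeds_limit(rows: &[String]) -> bool {
    rows.len() > MAX_FEATURES
}

/// Drops repeated feature names while keeping first-seen order.
fn dedup_features(rows: &[String]) -> Vec<String> {
    let mut seen = HashSet::new();
    let mut unique = Vec::new();
    for name in rows {
        // `insert` returns false when the name was already present.
        if seen.insert(name.as_str()) {
            unique.push(name.clone());
        }
    }
    unique
}
```

With these illustrative numbers, `sample_rows(150, 2)` yields 300 rows and trips `exceeds_limit`, while the deduplicated list of 150 stays under it. Deduplicating at the consumer is only one possible defence; it would not help if the upstream data were wrong in some other way.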

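Scope of Impact
| Service / Product | Impact |
|-------------------|--------|
| Core CDN and security services | Returned HTTP 5xx status codes; end users saw error pages |
| Turnstile | Failed to load, which broke dashboard login |
| Workers KV | Requests to the front-end gateway failed; the 5xx error rate rose significantly |
| Dashboard | Most users could not log in because Turnstile was unavailable |
| Email Security | Temporary loss of access to IP reputation data reduced spam-detection accuracy |
| Access | Widespread authentication failures; existing sessions were unaffected |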

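Key Timeline (UTC)
- 11:05 Database permission change deployed
- 11:28 First errors seen on customer traffic
- 13:05 Emergency bypass put in place for Workers KV and Access
- 14:30 Known-good feature file rolled back in; core traffic recovered
- 17:06 All systems fully back to normal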

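Technical Details
1. Failure propagation: the feature file was regenerated every 5 minutes. Depending on whether the query ran on a database node that had received the update, the output alternated between a good and a bad version, so the system recovered intermittently.
2. Misleading signal: the independently hosted status page went down at the same time (later confirmed to be a coincidence), and the incident was initially mistaken for a hyper-scale DDoS attack.
3. Memory preallocation: the Bot Management module preallocates a fixed amount of memory for performance, but it did not handle the over-limit case, so the process crashed.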
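The intermittent recovery described in item 1 can be modelled with a small simulation. This is a sketch under stated assumptions: the summary does not say how the generation job picks a database node, so round-robin selection is assumed here, and the `Run` type and `simulate` function are hypothetical names.

```rust
/// One scheduled run of the feature-file generation job.
#[derive(Debug, PartialEq)]
struct Run {
    /// Minutes since the start of the simulation.
    minute: u32,
    /// True when the run hit a node that already had the permission change.
    bad_file: bool,
}

/// Simulates the job running every 5 minutes.
/// `node_updated[i]` says whether node `i` has received the permission change.
/// Assumption: nodes are picked round-robin.
fn simulate(node_updated: &[bool], runs: usize) -> Vec<Run> {
    if node_updated.is_empty() {
        return Vec::new();
    }
    (0..runs)
        .map(|i| Run {
            minute: (i as u32) * 5,
            bad_file: node_updated[i % node_updated.len()],
        })
        .collect()
}
```

With one updated node out of three, `simulate(&[true, false, false], 4)` produces a bad file at minutes 0 and 15 and good files in between, which matches the on-and-off pattern described above. In this model, once every node is updated, every run produces a bad file.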

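Remediation
- Harden input validation for configuration files
- Deploy global kill switches for features
- Reduce the resources consumed by error reporting
- Review failure modes across all core proxy modules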
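The first two items above, input validation and a global kill switch, might combine along the following lines. This is a hedged sketch rather than anything from the report: `BOT_MODULE_ENABLED`, `MAX_FILE_BYTES`, `Decision`, and `evaluate_config` are hypothetical names, and the 64 KiB size cap is an arbitrary assumption.

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Hypothetical global switch that operators could flip to disable the module.
static BOT_MODULE_ENABLED: AtomicBool = AtomicBool::new(true);

/// Assumed upper bound on the size of a configuration file.
const MAX_FILE_BYTES: usize = 64 * 1024;

/// What the proxy should do with an incoming configuration file.
#[derive(Debug, PartialEq)]
enum Decision {
    Apply,
    Reject(&'static str),
    ModuleDisabled,
}

/// Flips the kill switch.
fn set_module_enabled(enabled: bool) {
    BOT_MODULE_ENABLED.store(enabled, Ordering::Relaxed);
}

/// Validates a file before it is applied, treating it like untrusted input.
fn evaluate_config(file: &[u8]) -> Decision {
    // The kill switch is checked first so a bad file cannot get in the way.
    if !BOT_MODULE_ENABLED.load(Ordering::Relaxed) {
        return Decision::ModuleDisabled;
    }
    if file.is_empty() {
        return Decision::Reject("empty file");
    }
    if file.len() > MAX_FILE_BYTES {
        return Decision::Reject("file too large");
    }
    Decision::Apply
}
```

The design choice worth noting is that a rejected file becomes an ordinary value the caller can act on, for example by keeping the previous configuration, instead of a crash.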

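Official Apology
Cloudflare acknowledged that this was its worst network-wide outage since 2019, apologized for the disruption it caused to Internet services, and committed to architectural improvements to prevent a recurrence.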

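(Note: diagrams, code snippets, references to past incidents, and other supporting material from the original article have been condensed; core facts and key figures are retained.)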

Comment Summary

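The following summarizes the comments, focusing on the main points and supporting arguments while keeping differing views in balance: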

1. Analysis of the Cause

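Problems caused by the database permission change: the change made a query return duplicate results, which pushed the configuration file over its limit and crashed the system.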
- "a change to one of our database systems' permissions which caused the database to output multiple entries into a 'feature file'"
- "The software had a limit on the size of the feature file that was below its doubled size. That caused the software to fail"
Insufficient testing: the change was deployed straight to production without adequate validation in a test environment.
- "it seems intuitive to me that doing this same migration in staging would have identified this exact error"
- "they didn’t test this on a non-production system mimicing production"

2. Implementation Issues

Use of .unwrap() in Rust code: calling .unwrap() on a critical path caused the thread to panic, with no error handling in place.
- "thread fl2workerthread panicked: called Result::unwrap() on an Err value"
- "This is the multi-million dollar .unwrap() story"
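The alternative these commenters have in mind, returning an error from the loader and handling it at the call site, could look roughly like this. It is an illustrative sketch, not the code from the incident: `FeatureTable`, `LoadError`, `refresh`, and `FEATURE_LIMIT` are hypothetical names, and only the 200-feature limit comes from the article summary.

```rust
/// Limit on the number of features, as stated in the article summary.
const FEATURE_LIMIT: usize = 200;

#[derive(Debug, PartialEq)]
enum LoadError {
    TooManyFeatures { got: usize, limit: usize },
}

/// Feature table with storage preallocated up to the limit.
struct FeatureTable {
    names: Vec<String>,
}

impl FeatureTable {
    fn new() -> Self {
        FeatureTable { names: Vec::with_capacity(FEATURE_LIMIT) }
    }

    /// Replaces the table contents, or leaves them untouched on error.
    fn load(&mut self, incoming: &[&str]) -> Result<(), LoadError> {
        if incoming.len() > FEATURE_LIMIT {
            return Err(LoadError::TooManyFeatures {
                got: incoming.len(),
                limit: FEATURE_LIMIT,
            });
        }
        self.names.clear();
        self.names.extend(incoming.iter().map(|s| s.to_string()));
        Ok(())
    }
}

/// Handles the error explicitly instead of calling `.unwrap()`.
/// Returns true when the new file was applied.
fn refresh(table: &mut FeatureTable, incoming: &[&str]) -> bool {
    match table.load(incoming) {
        Ok(()) => true,
        Err(err) => {
            // Keep serving with the previous table and report the problem.
            eprintln!("rejected feature file, keeping previous table: {err:?}");
            false
        }
    }
}
```

Writing `table.load(incoming).unwrap()` in place of the `match` would turn an oversized file into a panic, which is the failure mode the quoted panic message shows.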
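Configuration files and deployment strategy: with no phased rollout or rollback mechanism, a change to a configuration file affected the entire system at once.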
- "The remediation section doesn’t give me any sense that phased deployment, acceptance testing, and rapid rollback are part of the planned remediation strategy"
- "they didn’t have a circuit breaker to stop the deployment or roll-back when a newly deployed app started crashing"
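The phased deployment and circuit breaker that these comments ask for can be sketched as a loop over rollout waves that stops at the first unhealthy one. Everything here is an assumption for illustration: the wave percentages, the error-rate threshold, and the `staged_rollout` and `RolloutResult` names are hypothetical, and the health check is passed in as a closure so the sketch stays self-contained.

```rust
#[derive(Debug, PartialEq)]
enum RolloutResult {
    /// Every wave stayed under the error threshold.
    Completed,
    /// The rollout was halted and reverted at this wave.
    RolledBack { at_percent: u8 },
}

/// Pushes a change to a growing share of the fleet, checking health per wave.
/// `error_rate_at(percent)` reports the observed error rate after that wave.
fn staged_rollout<F>(waves: &[u8], max_error_rate: f64, mut error_rate_at: F) -> RolloutResult
where
    F: FnMut(u8) -> f64,
{
    for &percent in waves {
        let observed = error_rate_at(percent);
        if observed > max_error_rate {
            // Circuit breaker: stop before the change reaches later waves.
            return RolloutResult::RolledBack { at_percent: percent };
        }
    }
    RolloutResult::Completed
}
```

For example, with waves of 1, 10, 50, and 100 percent and a health check that reports a 90% error rate, the rollout stops at the 1% wave and never reaches the rest of the fleet.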

3. Impact and Recovery

Scope of impact: a very large number of failed requests were generated in a short period, with widespread effect.
- "28M 500 errors/sec for several hours from a single provider"
- "26M/s 5xx error HTTP status codes over a span of roughly two hours"
Recovery time: the incident lasted several hours, and recovery was slow.
- "Core traffic was largely flowing as normal by 14:30 [...] As of 17:06 all systems at Cloudflare were functioning as normal"
- "Why did this take so long to resolve?"

4. Feedback for Cloudflare

Positive feedback: the incident report was transparent, detailed, and published quickly.
- "Great post-mortem. Very clear"
- "Kudos for releasing a post mortem in less than 24 hours after the outage"
Suggested improvements: strengthen testing, error handling, and deployment strategy.
- "Absent from this list are canary deployments and incremental or wave-based deployment of configuration files"
- "They should have lints to identify and ideally deny panic inducing code"
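One way to act on the lint suggestion is with Clippy's restriction lints `unwrap_used`, `expect_used`, and `panic`, which are off by default and can be set to deny. The snippet below is a small sketch; `parse_feature_count` is a hypothetical function, and the lints only take effect when the code is checked with `cargo clippy`, since plain `rustc` accepts the attribute without enforcing it.

```rust
use std::num::ParseIntError;

/// Under `cargo clippy`, any `.unwrap()`, `.expect()`, or `panic!` inside
/// this function becomes a hard error because of the attribute below.
#[deny(clippy::unwrap_used, clippy::expect_used, clippy::panic)]
fn parse_feature_count(raw: &str) -> Result<usize, ParseIntError> {
    // Writing `raw.trim().parse::<usize>().unwrap()` here would be rejected
    // by the lint; returning the Result pushes the decision to the caller.
    raw.trim().parse::<usize>()
}
```

The same lints can be applied crate-wide instead of per function, at the cost of having to justify every intentional panic.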

5. Other Views

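Debate over language choice: some commenters argued that Rust was not the root of the problem and that the fault lay in how the code was written.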
- "This wasn’t a Rust problem, no language would have saved them from this kind of issue"
- "The compiler would have warned that this was a possible issue"
Industry lessons: the lessons of similar incidents (such as CrowdStrike) have not been fully absorbed.
- "This is something the industry was supposed to learn from the CrowdStrike incident last year"

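Summary: commenters broadly agreed that the incident resulted from several factors acting together, including the database permission change, insufficient testing, missing error handling in the code, and the deployment strategy. Cloudflare's transparent report was praised, but many improvements were also suggested.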

Cloudflare outage on November 18, 2025 post mortem