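Article Abstract
AWS's DynamoDB database service suffered a failure in the us-east-1 region, causing service disruptions. Users reported and discussed the problem in Reddit's AWS forum.
Article Summary
AWS DynamoDB Outage in the us-east-1 Region
On October 20, 2025, AWS's DynamoDB service suffered a large-scale outage in the us-east-1 region, leaving many applications that depend on it unable to function. The failure first appeared as a DNS resolution problem for the DynamoDB API endpoint and then spread to other AWS services, including IAM, SQS, Lambda, and Kinesis.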
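Key details:
1. Symptoms: users could not resolve the DNS record for dynamodb.us-east-1.amazonaws.com, which caused the service disruption.
2. Scope of impact:
- Global services such as IAM, AWS Organizations, and AWS Account Management were affected.
- Applications that depend on us-east-1 (such as Prime Video and Reddit) failed.
- Some users could not log in to the AWS console or open support tickets.
3. AWS response:
- The initial investigation pointed to a DNS resolution problem.
- Engineers applied mitigations and gradually restored service over the following hours.
4. User reactions:
- Many developers and operations staff were woken by alerts and had to respond to the outage urgently.
- Users joked that AWS's "degraded" status should be renamed "dumpster fire".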
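The DNS symptom described above can be checked from a client with a few lines of Python. This is a minimal sketch, not a description of how AWS or any affected team diagnosed the incident: `resolve_endpoint` is a hypothetical helper name, and the injectable `resolver` parameter is an assumption added so the function can be exercised without network access.

```python
import socket
from typing import Callable, List


def resolve_endpoint(hostname: str, resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the sorted unique IP addresses for hostname, or [] if DNS resolution fails."""
    try:
        # Port 443 because the DynamoDB API endpoint is reached over HTTPS.
        results = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Resolution failed: the kind of error clients saw for the regional endpoint.
        return []
    # Each result is (family, type, proto, canonname, sockaddr); sockaddr[0] is the address.
    return sorted({info[4][0] for info in results})
```

Calling `resolve_endpoint("dynamodb.us-east-1.amazonaws.com")` returns a list of addresses when the record resolves and an empty list when it does not, which gives a monitoring script a signal that does not depend on a vendor status page.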
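Follow-up: About two hours after the outage began, AWS confirmed the root cause and started restoring service, though some services may have been slow to return to normal because of backlogged requests.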
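Lessons learned:
- Avoid over-reliance on a single region (such as us-east-1); design a multi-region architecture to improve fault tolerance.
- For business-critical systems, consider local redundancy or a fallback plan.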
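The multi-region lesson can be illustrated with a small client-side failover loop. This is only a sketch under stated assumptions: it presumes the data is already replicated to every listed region (DynamoDB global tables are one way to get that), and `call_with_region_failover`, `AllRegionsFailed`, and `read_order` are hypothetical names invented for this example, not part of any AWS SDK.

```python
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when the operation failed in every configured region."""

    def __init__(self, errors: Dict[str, Exception]):
        super().__init__(f"all regions failed: {sorted(errors)}")
        self.errors = errors


def call_with_region_failover(operation: Callable[[str], T], regions: Sequence[str]) -> T:
    """Try operation(region) for each region in order and return the first success."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return operation(region)
        except (ConnectionError, TimeoutError) as exc:
            # Only connectivity-style failures trigger failover; other errors propagate.
            errors[region] = exc
    raise AllRegionsFailed(errors)


def read_order(region: str) -> str:
    """Hypothetical stand-in for a regional read; simulates us-east-1 being unreachable."""
    if region == "us-east-1":
        raise ConnectionError("endpoint did not resolve")
    return f"order-123 served from {region}"
```

With the simulated outage, `call_with_region_failover(read_order, ["us-east-1", "us-west-2"])` returns `"order-123 served from us-west-2"`. Failover for writes is harder than this read-only sketch suggests, because replication lag and conflict handling have to be designed for explicitly.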
The outage once again highlights the risks of centralization in cloud computing and the importance of infrastructure redundancy.
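Comment Summary
The comments are summarized below:
Main points and arguments
The AWS us-east-1 outage had a wide impact
- Many well-known services were affected, including Coinbase, Jira/Confluence, Slack, Vercel, npm, Postman, Heroku, Intercom, Twilio, and Ring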
- "At their scale and importance they should be multi-region if not multi-cloud" (nodesocket)
- "This is why distributed systems is an extremely important discipline" (colesantiago)
Doubts about the reliability of us-east-1
- Users pointed out that us-east-1 fails frequently
- "I feel like every time I hear about an AWS outage it's in us-east-1" (__alexs)
- "Considering the history of east-1 it is fascinating that it still causes so many single point of failure incidents" (AtNightWeCode)
Unreliable status pages
- Several services' status pages showed everything as normal while the services were actually unavailable
- "NPM says they're up...but I am seeing a lot of packages not updating" (mittermayr)
- "Useless service status pages are incredibly annoying" (seanieb)
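Discussion of distributed systems
- Commenters criticized companies that stress distributed systems knowledge in interviews yet have single points of failure in their own system designs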
- "All the sites that ask for distributed systems in their interview...wouldn't even pass their own interview" (colesantiago)
- "It's a reminder to never rely on something as flaky as the internet" (roschdal)
Impact on daily life
- Everyday services such as Alexa and MyFitnessPal were affected
- "Our Alexa's stopped responding and my girl couldn't log in to myfitness pal" (philipp-gayret)
- "The Internet felt oddly 'ill'" (jug)
Signs of recovery
- Some users observed services starting to recover
- "Some of our services are scaling up on east-1...issue might be resolving" (hipratham)
- "website is back up, hooray" (SeanAnderson)
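Balance of viewpoints
- Critical views: aimed mainly at the reliability of AWS us-east-1 and at the architecture of the affected services
- Technical discussion: on the importance of distributed systems and the need for multi-region deployment
- Humor: for example, "They are amazing at LeetCode though" (dude250711)
- Signs of recovery: some users reported services coming back
Key quotes
On the scope of impact: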
- "Affecting Coinbase as well...At their scale they should be multi-region" (nodesocket)
- "Various AI services (e.g. Perplexity) are down as well" (ctbellmar)
On us-east-1 reliability:
- "Every time I hear about an AWS outage it's in us-east-1" (__alexs)
- "Considering the history of east-1...still causes so many single point of failure incidents" (AtNightWeCode)
On status pages:
- "NPM says they're up but packages not updating" (mittermayr)
- "Ring status says everything is working" (seanieb)
On distributed systems:
- "Wouldn't even pass their own interview" (colesantiago)
- "Reminder to never rely on something as flaky as the internet" (roschdal)