Hacker News Chinese Digest

Major AWS Outage Happening

Article Abstract

AWS's DynamoDB database service failed in the us-east-1 region, causing service outages. Users reported and discussed the problem in the AWS community on Reddit.

Article Summary

AWS DynamoDB outage in the us-east-1 region

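On October 20, 2025, AWS's DynamoDB service suffered a large-scale failure in the us-east-1 region, leaving many applications that depend on it unable to operate normally. The failure first appeared as a DNS resolution problem for the DynamoDB API endpoint and then spread to other AWS services, including IAM, SQS, Lambda, and Kinesis.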

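Key details:

  1. Symptoms

    • Users could not resolve the DNS record for dynamodb.us-east-1.amazonaws.com, which caused the service outage.
  2. Scope of impact

    • Global services such as IAM, AWS Organizations, and AWS Account Management were affected.
    • Applications that depend on us-east-1 (such as Prime Video and Reddit) failed.
    • Some users could not log in to the AWS console or create support tickets.
  3. AWS response

    • The initial investigation pointed to a DNS resolution problem.
    • Engineers applied mitigations and gradually restored service over the following hours.
  4. User reactions

    • Many developers and operations engineers were woken by alerts and had to handle the incident urgently.
    • Users joked that AWS's "degraded" status should be renamed "dumpster fire".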
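The DNS symptom in item 1 can be checked from any client machine. The following is a minimal Python sketch, not a description of AWS's own tooling: `resolve_addresses` and its `resolver` parameter are hypothetical names introduced here for illustration, and only `socket.getaddrinfo` comes from the standard library.

```python
import socket
from typing import Callable, List

# Hostname taken from the summary above.
DYNAMODB_ENDPOINT = "dynamodb.us-east-1.amazonaws.com"


def resolve_addresses(
    hostname: str,
    resolver: Callable[..., list] = socket.getaddrinfo,
) -> List[str]:
    """Return the sorted unique IP addresses for hostname, or [] if resolution fails.

    `resolver` is injectable (an assumption of this sketch) so the function
    can be exercised without network access.
    """
    try:
        results = resolver(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Name resolution failed: the symptom described in the summary.
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({entry[4][0] for entry in results})
```

Calling `resolve_addresses(DYNAMODB_ENDPOINT)` would be expected to return an empty list whenever the record cannot be resolved, which separates a DNS failure from an endpoint that resolves but does not answer.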

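Follow-up: About two hours after the outage began, AWS confirmed the root cause and started restoring service, though some services may have been slow to return to normal because of backlogged requests.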

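Lessons learned:

  • Avoid over-reliance on a single region (such as us-east-1); design a multi-region architecture to improve fault tolerance.
  • For business-critical systems, consider local redundancy or a fallback plan.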
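The multi-region lesson can be illustrated at the client level. This is a minimal sketch under stated assumptions: `call_with_region_fallback`, `AllRegionsFailed`, and `read_profile` are hypothetical names, the regional call is a stand-in rather than a real AWS SDK call, and the sketch assumes the data is already replicated to the secondary region (for DynamoDB, global tables are one way to do that).

```python
from typing import Callable, Optional, Sequence, TypeVar

T = TypeVar("T")


class AllRegionsFailed(Exception):
    """Raised when the operation fails in every configured region."""


def call_with_region_fallback(
    operation: Callable[[str], T],
    regions: Sequence[str],
) -> T:
    """Try operation(region) for each region in order; return the first success."""
    last_error: Optional[Exception] = None
    for region in regions:
        try:
            return operation(region)
        except Exception as error:  # sketch only: real code should catch narrower errors
            last_error = error
    raise AllRegionsFailed(f"no region succeeded: {list(regions)}") from last_error


def read_profile(region: str) -> str:
    """Hypothetical regional call that simulates us-east-1 being unreachable."""
    if region == "us-east-1":
        raise ConnectionError("endpoint unreachable")
    return f"profile served from {region}"


if __name__ == "__main__":
    print(call_with_region_fallback(read_profile, ["us-east-1", "us-west-2"]))
    # prints "profile served from us-west-2"
```

Ordered fallback keeps the primary region on the hot path and pays the extra latency only during a failure. It does not help when a dependency is itself a global service homed in one region, as the summary notes for IAM.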

The outage once again highlights the risks of centralized cloud services and the importance of infrastructure redundancy.

Comment Summary

The comments are summarized below:

Main Points and Arguments

  1. The AWS us-east-1 outage had a wide impact

    • Many well-known services were affected, including Coinbase, Jira/Confluence, Slack, Vercel, npm, Postman, Heroku, Intercom, Twilio, and Ring
    • "At their scale and importance they should be multi-region if not multi-cloud" (nodesocket)
    • "This is why distributed systems is an extremely important discipline" (colesantiago)
  2. Doubts about the reliability of us-east-1

    • Users pointed out that us-east-1 fails frequently
    • "I feel like every time I hear about an AWS outage it's in us-east-1" (__alexs)
    • "Considering the history of east-1 it is fascinating that it still causes so many single point of failure incidents" (AtNightWeCode)
  3. Unreliable status pages

    • Status pages for several services showed everything as normal while the services were in fact unavailable
    • "NPM says they're up...but I am seeing a lot of packages not updating" (mittermayr)
    • "Useless service status pages are incredibly annoying" (seanieb)
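  4. Discussion of distributed systems

    • Commenters criticized companies that stress distributed-systems knowledge in interviews while their own systems have single points of failure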
    • "All the sites that ask for distributed systems in their interview...wouldn't even pass their own interview" (colesantiago)
    • "It's a reminder to never rely on something as flaky as the internet" (roschdal)
  5. The outage affected everyday life

    • Everyday services such as Alexa and MyFitnessPal were affected
    • "Our Alexa's stopped responding and my girl couldn't log in to myfitness pal" (philipp-gayret)
    • "The Internet felt oddly 'ill'" (jug)
  6. Signs of recovery

    • Some users observed services starting to recover
    • "Some of our services are scaling up on east-1...issue might be resolving" (hipratham)
    • "website is back up, hooray" (SeanAnderson)
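Point 3 above describes status pages that reported "up" while real requests failed. One common response is to run an independent end-to-end probe and compare it with the published status. The sketch below is illustrative only: `HealthReport`, `compare_status`, `read_status_page`, and `run_probe` are hypothetical names, and the two callables stand in for whatever status feed and real request a team would actually use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class HealthReport:
    reported_up: bool  # what the vendor's status page claims
    observed_up: bool  # what our own end-to-end probe saw

    @property
    def disagrees(self) -> bool:
        """True when the status page and the probe tell different stories."""
        return self.reported_up != self.observed_up


def compare_status(
    read_status_page: Callable[[], bool],
    run_probe: Callable[[], None],
) -> HealthReport:
    """Run a real request through run_probe and compare it with the status page."""
    try:
        run_probe()  # expected to raise on failure, e.g. a package fetch that times out
        observed_up = True
    except Exception:
        observed_up = False
    return HealthReport(reported_up=read_status_page(), observed_up=observed_up)
```

A report where `disagrees` is true is the situation the commenters describe, and it is usually a better alerting signal than the status page alone.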

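Balance of Viewpoints

  • Criticism: aimed mainly at the reliability of AWS us-east-1 and at the architecture of the affected services
  • Technical discussion: the importance of distributed systems and the need for multi-region deployment
  • Humor: for example, "They are amazing at LeetCode though" (dude250711)
  • Signs of recovery: some users reported services starting to come back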

Key Quotes

  1. On the scope of impact:

    • "Affecting Coinbase as well...At their scale they should be multi-region" (nodesocket)
    • "Various AI services (e.g. Perplexity) are down as well" (ctbellmar)
  2. On the reliability of us-east-1:

    • "Every time I hear about an AWS outage it's in us-east-1" (__alexs)
    • "Considering the history of east-1...still causes so many single point of failure incidents" (AtNightWeCode)
  3. On status pages:

    • "NPM says they're up but packages not updating" (mittermayr)
    • "Ring status says everything is working" (seanieb)
  4. On distributed systems:

    • "Wouldn't even pass their own interview" (colesantiago)
    • "Reminder to never rely on something as flaky as the internet" (roschdal)