Hacker News Chinese Digest


Agents that run while I sleep

Original article | HN discussion | 2026-03-11 05:10:52

Article Summary

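The author built agents that write code automatically and keep working unattended, but found it hard to verify that their output is correct. As the volume of AI-generated code grows, the team's code-review load has risen sharply, and the usual remedies, such as adding more reviewers or having the AI test its own work, help only a little. The core problem is how to build trust in an automated system whose output can no longer be fully reviewed.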

Article Overview

Title: I'm building AI coding assistants that run overnight

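Core content:
1. The author developed an AI coding assistant named Gastown that writes code automatically while they sleep, but verifying the quality of that code is a problem:
- It opens 40-50 PRs a week automatically (far more than the 10 a person produces by hand)
- The team spends a large share of its time on code review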

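Limitations of existing solutions:
- Adding more human reviewers does not scale
- Having the AI test its own work falls into a "self-verification trap" (tests written by the same AI repeat the same mistakes)
- Conventional tests cannot catch a misunderstood requirement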
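Proposed solution:
- Adopt the mindset of test-driven development (TDD)
- Write verifiable acceptance criteria (ACs) before any code is generated
- An example login feature comes with 4 concrete ACs:
  - Redirect and cookie set on a successful login
  - Exact wording of the error message
  - Validation of empty fields
  - A limit on failed attempts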
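The four login ACs can be written as executable checks before any implementation exists. The sketch below is a minimal, hypothetical illustration in plain Python rather than a browser test: the `AuthService` class, the user store, the error strings, the `/dashboard` redirect, the `session` cookie name, and the 5-attempt limit are all assumptions made for the example, not details from the article.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical values: the article names the four ACs but not these specifics.
MAX_FAILED_ATTEMPTS = 5
BAD_CREDENTIALS_MSG = "Invalid email or password"
EMPTY_FIELD_MSG = "Email and password are required"
LOCKED_MSG = "Too many failed attempts, try again later"
USERS = {"alice@example.com": "correct-horse"}


@dataclass
class Response:
    status: int
    redirect: Optional[str] = None
    cookies: Dict[str, str] = field(default_factory=dict)
    error: Optional[str] = None


class AuthService:
    """Hypothetical implementation; in the workflow, this is what the agent writes."""

    def __init__(self, users: Dict[str, str]):
        self.users = users
        self.failed: Dict[str, int] = {}

    def login(self, email: str, password: str) -> Response:
        if not email or not password:
            return Response(400, error=EMPTY_FIELD_MSG)
        if self.failed.get(email, 0) >= MAX_FAILED_ATTEMPTS:
            return Response(429, error=LOCKED_MSG)
        if self.users.get(email) != password:
            self.failed[email] = self.failed.get(email, 0) + 1
            return Response(401, error=BAD_CREDENTIALS_MSG)
        self.failed.pop(email, None)
        return Response(302, redirect="/dashboard", cookies={"session": "opaque-token"})


# One check per AC; each builds a fresh service so the checks are independent.
def ac1_success_redirects_and_sets_cookie() -> None:
    r = AuthService(USERS).login("alice@example.com", "correct-horse")
    assert r.status == 302 and r.redirect == "/dashboard"
    assert "session" in r.cookies


def ac2_exact_error_message() -> None:
    r = AuthService(USERS).login("alice@example.com", "wrong")
    assert r.error == BAD_CREDENTIALS_MSG  # exact wording, not just "some error"


def ac3_empty_fields_rejected() -> None:
    svc = AuthService(USERS)
    for email, pw in [("", "x"), ("alice@example.com", ""), ("", "")]:
        assert svc.login(email, pw).error == EMPTY_FIELD_MSG


def ac4_lockout_after_failures() -> None:
    svc = AuthService(USERS)
    for _ in range(MAX_FAILED_ATTEMPTS):
        assert svc.login("alice@example.com", "wrong").status == 401
    # Once the limit is reached, even the correct password is refused.
    assert svc.login("alice@example.com", "correct-horse").status == 429


if __name__ == "__main__":
    for check in (ac1_success_redirects_and_sets_cookie, ac2_exact_error_message,
                  ac3_empty_fields_rejected, ac4_lockout_after_failures):
        check()
        print("PASS", check.__name__)
```

The point of the design is that each check pins down one observable behavior (a status, a string, a cookie), so "done" is decided by the checks and not by the agent's own description of its work.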
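Technical implementation:
- Built a Claude-based verification toolchain (open-sourced on GitHub)
- A four-stage verification pipeline:
  - Pre-flight (environment checks)
  - Planning (generating the test plan)
  - Parallel browser testing (with Playwright)
  - Final verdict (producing a verification report)
- Can be installed as a plugin and integrated into CI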
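The four stages can be sketched as a small pipeline. This is not the tool's actual code or interface: the article's tool drives real browsers with Playwright, while this sketch substitutes plain Python callables and a thread pool so it runs anywhere. The function names, the `BASE_URL` requirement, and the report shape are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Check = Callable[[], bool]  # stand-in for one browser-driven AC check


@dataclass
class CheckResult:
    ac_id: str
    passed: bool
    detail: str = ""


def preflight(env: Dict[str, str], required: Tuple[str, ...] = ("BASE_URL",)) -> None:
    """Stage 1 (pre-flight): stop early if the environment cannot be tested."""
    missing = [key for key in required if not env.get(key)]
    if missing:
        raise RuntimeError(f"pre-flight failed, missing: {missing}")


def plan(criteria: Dict[str, Check]) -> List[Tuple[str, Check]]:
    """Stage 2 (planning): one independent check per acceptance criterion."""
    return sorted(criteria.items())


def run_one(ac_id: str, check: Check) -> CheckResult:
    try:
        return CheckResult(ac_id, bool(check()))
    except Exception as exc:  # a crashing check is a failure, never a pass
        return CheckResult(ac_id, False, repr(exc))


def run_parallel(checks: List[Tuple[str, Check]], workers: int = 4) -> List[CheckResult]:
    """Stage 3 (parallel testing): threads stand in for parallel browser sessions."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda item: run_one(*item), checks))


def final_verdict(results: List[CheckResult]) -> Dict[str, object]:
    """Stage 4 (verdict): one overall PASS/FAIL plus the criteria that failed."""
    failed = [r.ac_id for r in results if not r.passed]
    return {"verdict": "FAIL" if failed else "PASS", "failed": failed, "total": len(results)}


def verify(env: Dict[str, str], criteria: Dict[str, Check]) -> Dict[str, object]:
    preflight(env)
    return final_verdict(run_parallel(plan(criteria)))


if __name__ == "__main__":
    criteria = {
        "AC1-redirect": lambda: True,
        "AC2-error-text": lambda: False,
        "AC3-empty-fields": lambda: 1 / 0,  # simulated crash
    }
    print(verify({"BASE_URL": "http://localhost:3000"}, criteria))
    # → {'verdict': 'FAIL', 'failed': ['AC2-error-text', 'AC3-empty-fields'], 'total': 3}
```

Treating an exception as a failure matters here: an unattended pipeline should never report PASS for a check that did not actually run to completion.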

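Key insights:
- You have to define what "done" means up front before you can trust AI output
- Writing acceptance criteria is harder than writing prompts, but more necessary
- The approach reliably catches implementation-level errors such as UI rendering and API behavior (though it cannot fix a misunderstood requirement)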


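Comment Summary

Below is a summary of the discussion:

1. Limitations of AI-generated tests

Point: AI-generated tests may only confirm what the code already does, not what the requirements call for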
- "A lot of time it just confirms that the code does what the code does" (RealityVoid)
- "this doesn't catch spec misunderstandings. If your spec was wrong to begin with, the checks will pass" (vidimitrov)
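A tiny, hypothetical example of the difference both commenters describe: a test written by reading the code records its current behavior and passes by construction, while a test written from the spec exposes the bug. The `apply_discount` function and its bug are invented for illustration.

```python
def apply_discount(price_cents: int, percent: int) -> int:
    """Hypothetical function; the spec says: take `percent` percent off the price."""
    # Bug: subtracts `percent` cents instead of `percent` percent.
    return price_cents - percent


def code_derived_test() -> bool:
    # Written by reading the implementation: it records current behavior,
    # so it passes by construction and cannot catch the bug.
    return apply_discount(1000, 10) == 990


def spec_derived_test() -> bool:
    # Written from the spec alone: 10% off 1000 cents is 900 cents.
    return apply_discount(1000, 10) == 900


print(code_derived_test(), spec_derived_test())  # True False
```

vidimitrov's caveat still applies: the spec-derived test is only as good as the spec. If the spec itself said "subtract 10 cents," both tests would pass.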

2. Test isolation and verification methods

Point: Keep test writing isolated from code generation to make verification more reliable
- "Have another AI write the tests...Have it audit the tests" (fragmede)
- "treat hand-written scenarios as a holdout set that the generating agent literally never sees" (foundatron)
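foundatron's holdout idea can be sketched as a deterministic split: a hash decides which hand-written scenarios are held out, and only the visible ones are ever put into the generating agent's prompt. The 30% fraction, the prompt format, and all function names are assumptions for illustration.

```python
import hashlib
from typing import List, Tuple


def is_holdout(scenario: str, fraction: float = 0.3) -> bool:
    """Deterministic assignment: a scenario always lands in the same bucket."""
    digest = hashlib.sha256(scenario.encode("utf-8")).digest()
    # Keep 53 bits so the float division is exact and the result is strictly < 1.
    position = (int.from_bytes(digest[:8], "big") >> 11) / 2**53
    return position < fraction


def split(scenarios: List[str], fraction: float = 0.3) -> Tuple[List[str], List[str]]:
    visible = [s for s in scenarios if not is_holdout(s, fraction)]
    holdout = [s for s in scenarios if is_holdout(s, fraction)]
    return visible, holdout


def build_agent_prompt(task: str, visible: List[str]) -> str:
    # Only visible scenarios ever enter the generating agent's context.
    lines = "\n".join(f"- {s}" for s in visible)
    return f"{task}\n\nScenarios to satisfy:\n{lines}"


if __name__ == "__main__":
    scenarios = [
        "login with valid credentials redirects to dashboard",
        "empty password shows required-field error",
        "five failed logins lock the account",
        "error text matches copy deck exactly",
        "session cookie is HttpOnly",
    ]
    visible, holdout = split(scenarios)
    prompt = build_agent_prompt("Implement the login page", visible)
    assert not any(h in prompt for h in holdout)
    print(f"{len(visible)} visible, {len(holdout)} held out for evaluation")
```

Hashing rather than random sampling keeps the split stable across runs, so a scenario cannot drift into the agent's context on a later iteration.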

3. Cross-checking with multiple AIs

Point: Having different AI models check each other's work can improve quality
- "You can have Gemini write the tests and Claude write the code" (lateforwork)
- "one agent writes code...one agent writes tests...a QA agent runs the tests" (seanmcdirmid)

4. Code review challenges

Point: The sheer volume of AI-generated code makes review difficult
- "It's like 20k of line changes over 30-40 commits...no proper solution to this problem yet" (afro88)
- "The pipeline does diff truncation...caps at 100KB" (foundatron)
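A diff cap like the one foundatron mentions might look like the sketch below. How their pipeline actually truncates is not described; this version assumes 1 KB = 1024 bytes, cuts on a line boundary so no diff line is split, and appends a marker saying how much was dropped. The function name and marker text are hypothetical.

```python
MAX_DIFF_BYTES = 100 * 1024  # "100KB"; treating a KB as 1024 bytes is an assumption


def truncate_diff(diff: str, limit: int = MAX_DIFF_BYTES) -> str:
    """Cap a diff at `limit` UTF-8 bytes (the marker is not counted against it)."""
    data = diff.encode("utf-8")
    if len(data) <= limit:
        return diff
    cut = data[:limit]
    # Back up to the last complete line so a hunk line is never cut in half.
    newline = cut.rfind(b"\n")
    if newline != -1:
        cut = cut[: newline + 1]
    # errors="ignore" drops a trailing partial multi-byte character, if any.
    kept = cut.decode("utf-8", errors="ignore")
    dropped = len(data) - len(kept.encode("utf-8"))
    return kept + f"... [diff truncated: {dropped} bytes omitted]\n"
```

The explicit marker matters for the reviewing model: without it, a truncated diff looks like a complete one, and the review would silently cover only part of the change.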

5. Discussion of TDD

Point: Clarifying how TDD is meant to be practiced
- "TDD is not 'write the tests, then write the code.' It's 'write the tests while writing the code'" (jdlshore)
- "red/green/refactor is a reasonable way through this problem" (dzuc)

6. Concerns about software quality

Point: AI-written code may erode standards for software quality
- "we are heading to a world in which we simply give up on the idea of correct code" (throwyawayyyy)
- "people will get used to software being unreliable" (throwyawayyyy)

7. Suggested solutions

Point: Commenters proposed a range of technical solutions
- "cucumber/gherkin - very old school regex-to-code plain english kind of system" (jaggederest)
- "Ouroboros...uses Socratic questioning to score specification ambiguity" (foundatron)
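The "regex-to-code plain English" idea behind Cucumber/Gherkin can be illustrated with a hand-rolled step registry: each plain-English step line is matched against a regex and dispatched to a function. This is not Cucumber's API or any Python BDD library's API; the `step` decorator, `run_scenario`, and the login steps are all hypothetical.

```python
import re
from typing import Callable, Dict, List, Pattern, Tuple

STEPS: List[Tuple[Pattern[str], Callable[..., None]]] = []


def step(pattern: str):
    """Register a function for plain-English lines that match `pattern`."""
    def register(fn: Callable[..., None]) -> Callable[..., None]:
        STEPS.append((re.compile(pattern), fn))
        return fn
    return register


@step(r'^a user "([^"]+)" with password "([^"]+)"$')
def given_user(ctx: Dict, email: str, password: str) -> None:
    ctx.setdefault("users", {})[email] = password


@step(r'^they log in with password "([^"]+)"$')
def when_login(ctx: Dict, password: str) -> None:
    email = next(iter(ctx["users"]))
    ctx["ok"] = ctx["users"][email] == password


@step(r"^the login (succeeds|fails)$")
def then_result(ctx: Dict, outcome: str) -> None:
    assert ctx["ok"] == (outcome == "succeeds"), f"expected login to {outcome}"


def run_scenario(text: str) -> Dict:
    ctx: Dict = {}
    for raw in text.strip().splitlines():
        keyword, _, body = raw.strip().partition(" ")
        if keyword not in ("Given", "When", "Then", "And"):
            continue  # skip "Scenario:" titles and other non-step lines
        for pattern, fn in STEPS:
            match = pattern.match(body)
            if match:
                fn(ctx, *match.groups())
                break
        else:
            raise LookupError(f"no step definition for: {raw.strip()}")
    return ctx


scenario = """
Scenario: wrong password is rejected
  Given a user "alice@example.com" with password "s3cret"
  When they log in with password "guess"
  Then the login fails
"""
run_scenario(scenario)
print("scenario passed")
```

Raising on an unmatched step is deliberate: a spec line with no executable meaning should fail loudly instead of being skipped.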