Hacker News Summary

Article Summary

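The study proposes SWE-CI, a framework that uses a continuous integration environment to evaluate how well AI agents maintain codebases, with a focus on how agents handle automated tasks across the software development cycle.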

Paper title: SWE-CI: Evaluating Agent Capabilities in Maintaining Codebases via Continuous Integration

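Key points:
1. Research background:
- Agents driven by large language models (LLMs) currently perform well on software engineering tasks such as static bug fixing, as shown by the SWE-bench benchmark.
- Real-world software development, however, involves complex requirement changes and long-term feature iteration, and the existing static, one-shot fix paradigm does not capture this dynamic process.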


The following summarizes the comments, presenting the differing viewpoints in balance and keeping key quotes:

Doubts about the benchmark's validity
- The evaluated changes are seen as too small to count as long-term: "evaluating long term maintainability over an average of just 500 loc changes does not sound like long term" (challengerVIE)
- Suggestion to enlarge the dataset: "The dataset would need to be way bigger to get close to the likes of SWE-bench" (KronisLV)
Dispute over model versions
- Points out that the versions compared are mismatched: "they're benchmarking Opus 4.6 against GPT-5.2 (which is three generations behind)" (woadwarrior01)
- Argues the latest version should have been tested: "the paper doesn't include gpt 5.3 which was released around the same time as opus 4.6" (gizmodo59)
Claude's strong showing
- The data shows an advantage: "Claude wins by a large margin...GPT-5.2 : 0.23" (mentalgear)
- But the practical gap is seen as small: "I see both Claude and gpt to be neck and neck in coding" (gizmodo59)
Concern about regressions
- Regressions are widespread: "showing really bad regression rates across the board" (verdverm)
- A structural improvement is suggested: "keeping everything in a single tree...so the agent sees downstream effects" (yuyuqueen)
Limitations of the benchmark
- It cannot detect deeper problems: "cannot capture...whether your fix preserves the invariants" (agent5ravi)
- It could be gamed: "future LLMs will be optimized to hide regressions" (PunchyHamster)
Other suggestions
- Add a human control: "compared against a human baseline" (50lo)
- Break the statistics down by category: "report per-category numbers" (jbergqvist)
- Evaluation cost is rising: "the eval set becoming more and more expensive" (smy20011)