Hacker News Chinese Digest

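Article Abstract

After finding that AI companies were scraping its data through the Internet Archive's Wayback Machine, Reddit decided to restrict the Wayback Machine's indexing of Reddit content to the homepage only, citing user privacy and its platform policies. The Internet Archive aims to preserve a digital record of websites, but Reddit holds that not all content should be archived.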

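Article Summary

Reddit announced that it will restrict the Internet Archive's scraping of its data after finding that some AI companies were using the Internet Archive's Wayback Machine to scrape Reddit data in violation of its platform policies. Reddit said the Wayback Machine will no longer be able to crawl post detail pages, comments, or user profiles, and will only be able to index the Reddit.com homepage. As a result, the Internet Archive will only be able to archive the most popular news headlines and posts on any given day.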

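Reddit spokesperson Tim Rathschmidt said that the Internet Archive provides a service to the open web, but that Reddit had found some AI companies violating its platform policies by scraping data through the Wayback Machine. Reddit holds that not all content should be archived, and it has therefore decided to restrict the Internet Archive's access to its data in order to protect user privacy and uphold its platform policies.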

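Reddit notified the Internet Archive of the restriction in advance and said it had previously raised concerns about the ability to scrape its site. In recent years, Reddit has repeatedly restricted access for scraping tools, particularly as AI companies have used (and in some cases abused) them at scale. Reddit is, however, willing to provide data for a fee. Last year, Reddit struck a deal with Google to supply data for search and AI training, and it subsequently began requiring major search engines to pay before crawling its data. In addition, Reddit changed its API policy in 2023 on the grounds that the API was being abused to train AI models, a change that forced some third-party apps to shut down and triggered protests.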

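Reddit also has an AI data deal with OpenAI, but in June it sued Anthropic, alleging that Anthropic was still scraping Reddit data despite having previously claimed it had stopped.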

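Mark Graham, director of the Wayback Machine, said that the Internet Archive has a longstanding relationship with Reddit and that the two are continuing to discuss the issue.

(Updated August 11: added the Wayback Machine's statement.)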

Comment Summary

Skepticism about AI and platform profits: Comment 1 notes that many people hope to make money from AI, but that platforms have yet to see significant gains from it.
- Quote: "Everyone wants to close down their corner of the internet because they think AI is going to make them a ton of money."
- Quote: "We're getting the first part but I'm not sure we're seeing the latter ... anywhere as far as platforms go."
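Concerns about ads and user experience: Comment 2 argues that what platforms really fear is users reading content through LLMs (large language models), which would bypass ads and nag prompts and reduce ad revenue.
- Quote: "What they're really afraid of is that people will read content using LLM inference and make all the ads and nags go away."
- Quote: "The front end for de-enshittification looks a lot like that other archive site."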
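Questions about the source: Comment 3 questions where the claims about Reddit come from, pointing out that no official blog post or statement backs them up.
- Quote: "What is the source? Where did Reddit say this? No blog post or release anywhere."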
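Reflections on the fragility of digital history: Comment 4 laments how fragile digital history is and suggests that the future may hold a more complete record of 1820 than of 2020.
- Quote: "It's so weird how fragile digital history is."
- Quote: "In 30 years we'll have a better record of 1820 than 2020."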
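Discussion of Reddit's blocking behavior: Comment 5 points out that Reddit already blocks non-residential IPs in some fashion, including cloud provider IPs, and is not specifically targeting the Wayback Machine.
- Quote: "They are not specifically targeting Wayback Machine."
- Quote: "Anything other than residential IP's are blocked, to my information."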
Observations about Reddit archive sites: Comment 6 notes that some sites that continuously archive Reddit content have not been affected.
- Quote: "Meanwhile, sites that constantly archive Reddit remain unscathed."
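Balancing archiving and copyright: Comment 7 asks whether archiving and copyright protection could be balanced by letting the Internet Archive scrape the site while delaying public access to the results.
- Quote: "Is there a way to allow IA to scrape the site but not allow viewing the results (for 'x' weeks/months)?"
- Quote: "A way to balance archiving and exploitation."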

Reddit will block the Internet Archive
