Hacker News Chinese Digest

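Article Abstract

The author set out to block every crawler from his blog by editing robots.txt, and in doing so unintentionally broke link previews for his posts on LinkedIn and saw their reach decline. Using the LinkedIn Post Inspector, he found the root cause: the robots.txt change prevented LinkedIn's crawler from fetching the page content, which degraded how the posts were displayed.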

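Article Summary

The article "I was wrong about robots.txt" describes how the author misunderstood the robots.txt file while managing a personal blog, and what that misunderstanding cost. The main points are summarized below: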

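Initial decision and the problem that followed
The author first decided to block all crawlers from the site to keep the data from being misused. The decision had an unexpected side effect: article links he shared on LinkedIn no longer showed a preview image, and the posts' reach gradually fell.
Troubleshooting
Using the LinkedIn Post Inspector, the author traced the problem to robots.txt blocking LinkedIn's crawler from fetching the page. The error message stated that, because of the robots.txt restriction, LinkedIn could not re-scrape the page to generate a preview.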
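The role of the Open Graph protocol
The author then read up on the Open Graph protocol, which uses metadata in a web page (such as og:title, og:type, og:image, and og:url) so that the page can appear as a rich object on social media. This metadata is what link previews are built from.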
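To make the metadata concrete, here is a minimal Python sketch of how a preview generator might read Open Graph tags from a page. It is an illustration only, not LinkedIn's actual implementation: the `OpenGraphParser` class, the `extract_open_graph` helper, and the sample HTML (including its example.com URLs) are assumptions made for this example.

```python
from html.parser import HTMLParser


class OpenGraphParser(HTMLParser):
    """Collect <meta property="og:..." content="..."> tags from an HTML page."""

    def __init__(self):
        super().__init__()
        self.tags = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        # Keep only Open Graph properties that carry a content value.
        if prop.startswith("og:") and attributes.get("content") is not None:
            self.tags[prop] = attributes["content"]


def extract_open_graph(html):
    """Return a dict mapping Open Graph property names to their content."""
    parser = OpenGraphParser()
    parser.feed(html)
    return parser.tags


# Hypothetical page head; the title and URLs are placeholders.
SAMPLE_HTML = """
<html><head>
<meta property="og:title" content="I was wrong about robots.txt">
<meta property="og:type" content="article">
<meta property="og:image" content="https://example.com/preview.png">
<meta property="og:url" content="https://example.com/posts/robots-txt">
</head><body></body></html>
"""

if __name__ == "__main__":
    for name, value in extract_open_graph(SAMPLE_HTML).items():
        print(f"{name}: {value}")
```

A crawler that is denied access by robots.txt never gets to this parsing step, which is why the preview disappears even though the tags are present in the page.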
The fix
To resolve the problem, the author updated robots.txt to allow the LinkedInBot crawler to access the site. The new configuration is:
```
User-agent: LinkedInBot
Allow: /

User-agent: *
Disallow: /
```
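One way to check a change like this before deploying it is Python's standard-library `urllib.robotparser`. The sketch below is a minimal check, not the author's method. The `BLOCK_ALL` text is an assumed reconstruction of the original block-everything configuration (the summary does not quote it), and the URL and the `SomeOtherBot` name are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Assumed original configuration: every crawler is disallowed everywhere.
BLOCK_ALL = """\
User-agent: *
Disallow: /
"""

# The updated configuration described above.
ALLOW_LINKEDIN = """\
User-agent: LinkedInBot
Allow: /

User-agent: *
Disallow: /
"""


def can_fetch(robots_txt, user_agent, url):
    """Return True if a compliant crawler with this user agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/posts/robots-txt"  # placeholder URL
    for label, rules in (("block all", BLOCK_ALL), ("allow LinkedInBot", ALLOW_LINKEDIN)):
        for agent in ("LinkedInBot", "SomeOtherBot"):
            print(f"{label:18} {agent:13} -> {can_fetch(rules, agent, url)}")
```

This models only crawlers that choose to honor robots.txt, and a real crawler's matching rules may differ in detail from Python's parser, so a tool such as the LinkedIn Post Inspector remains the authoritative check.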
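Lessons learned
The author reflects that blocking all crawlers outright can hurt how content is displayed elsewhere. He did not fully test the impact of the change when he made it, which is how the problem slipped through. The experience taught him more about the Open Graph protocol and about tools such as the LinkedIn Post Inspector.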
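Conclusion
The author stresses that before making any change, you need to understand how the area you are touching actually works. A decision that looks simple can have unexpected consequences. As he puts it, sometimes you need to break a few eggs to make an omelette.

The article uses the author's first-hand experience to show how much robots.txt matters in running a website, and what role the Open Graph protocol plays in how content appears on social media.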

Comment Summary

Main points:

Limits of robots.txt:
- Point: robots.txt does nothing against malicious crawlers and offers little control over modern web traffic.
- Evidence:
  - "Robots.txt is largely irrelevant now they don't represent most of the traffic problem." (PaulKeeble)
  - "It never did anything for the 'bad' crawlers that would hammer your site!" (Falkon1313)
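Well-behaved versus malicious crawlers:
- Point: Not every crawler is malicious, and over-blocking can break legitimate crawler functions.
- Evidence: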
  - "But don’t just assume that every bot is malicious by default." (dumbfounder)
  - "Even if bot writers WANT to be good, it’s much harder than it should be." (yodon)
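Historical role of robots.txt:
- Point: robots.txt was originally used to avoid duplicate-content penalties from search engines, not to control malicious crawlers.
- Evidence: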
  - "robots.txt main purpose back in the day was curtailing penalties in the search engines." (Falkon1313)
  - "It was basically a way of saying 'Hey search engines, these are the canonical URLs.'" (Falkon1313)
Social media crawlers:
- Point: Social media crawlers (LinkedIn's, for example) have strict requirements and can be a burden on websites.
- Evidence:
  - "LinkedIn is by far the worst offender in post previews." (ceautery)
  - "How much effort should I spend helping a social media site figure out how to render a preview?" (ceautery)
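Alternatives to robots.txt:
- Point: A finer-grained control mechanism is needed, one that distinguishes crawlers by purpose rather than by identity.
- Evidence: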
  - "What we need is an ability to block 'AI training' but allow 'search indexing, opengraph, archival.'" (franga2000)
  - "The problem with robots.txt is the reliance on identity rather purpose of the bots." (franga2000)
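Because robots.txt can only name user agents, the closest approximation today is to maintain a purpose-to-crawler mapping by hand and generate the file from it. The Python sketch below illustrates that workaround under stated assumptions: the `BOTS_BY_PURPOSE` grouping and the `build_robots_txt` helper are hypothetical, and the assignment of each user-agent token to a purpose is an illustrative guess, not an official registry.

```python
# Hand-maintained mapping from purpose to user-agent tokens.
# The grouping is an assumption made for this sketch.
BOTS_BY_PURPOSE = {
    "search indexing": ["Googlebot", "bingbot"],
    "opengraph": ["LinkedInBot"],
    "archival": ["ia_archiver"],
    "AI training": ["GPTBot"],
}


def build_robots_txt(allowed_purposes):
    """Allow crawlers whose purpose is listed; disallow everything else."""
    lines = []
    for purpose, agents in BOTS_BY_PURPOSE.items():
        if purpose not in allowed_purposes:
            continue
        lines.append(f"# {purpose}")
        for agent in agents:
            lines.append(f"User-agent: {agent}")
        lines.append("Allow: /")
        lines.append("")
    # Catch-all group: any crawler not named above is asked to stay out.
    lines.append("User-agent: *")
    lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(build_robots_txt({"search indexing", "opengraph", "archival"}))
```

The sketch also shows the weakness the commenter describes: the mapping has to be kept up to date manually, and it only affects crawlers that identify themselves honestly and choose to comply.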
Effect of robots.txt on SEO:
- Point: Blocking search engine crawlers can hurt a site's search ranking.
- Evidence:
  - "If you care about Google SEO traffic you maybe want to let them on your site." (jarofgreen)
  - "It’s technically true that you can rank in Google if you block them in robots.txt but it’s going to take a lot more work." (jarofgreen)
Misuse and misunderstanding of robots.txt:
- Point: robots.txt should not be misused, and it should not be treated as a security tool.
- Evidence:
  - "If you don’t want people to crawl your content, don’t put it online." (jjcob)
  - "robots.txt is a neon sign to guide guests to where the beers are, but you’ll still have to secure your gold chains." (acosmism)

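Summary:

Commenters broadly agree that robots.txt has a limited role on the modern web and, in particular, does nothing against malicious crawlers. At the same time, over-blocking can break legitimate functions such as search indexing and social media previews. A finer-grained mechanism is needed to distinguish crawlers by purpose rather than relying on the identity-based control that robots.txt offers. Finally, robots.txt should not be treated as a security tool, and blocking search crawlers with it can hurt SEO.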

I was wrong about robots.txt
