Tell NYT, Atlantic, USA Today to keep Wayback Machine
435 points
• 6 days ago
Major news outlets, including the New York Times, The Atlantic, and USA Today, have recently stopped allowing the Internet Archive's Wayback Machine to preserve their content, a move that advocates argue threatens the long-term accessibility and integrity of journalistic work. Since early 2026, these organizations have actively blocked the archiving tool, citing concerns that generative AI companies could scrape their paywalled material to train models. Petition organizers from the activist group Fight for the Future counter that these AI concerns are "wholly hypothetical" and ignore the Wayback Machine's decades-long track record as a respectful, nonprofit public service that does not deliberately circumvent paywalls and operates with integrity. They argue that AI companies can already scrape content directly from publisher sites even without the Wayback Machine, whereas the Internet Archive voluntarily adheres to rules like robots.txt to avoid such behavior.
The campaign frames the blocking of the Wayback Machine as a direct threat to press freedom, emphasizing that journalism's power depends not only on publication but on preservation for future generations. Organizers highlight a bitter irony exemplified by USA Today, which publishes investigative reporting that itself relies on archived web content through the Wayback Machine while simultaneously preventing its own work from being similarly preserved. Over 100 journalists signed a letter celebrating the Internet Archive's role, which sparked significant public discussion, yet The Atlantic's CEO notably declined to commit to finding a solution that would restore archiving access. This stance comes amid growing global threats to journalism, including censorship, authoritarian pressure, and even violence against reporters, making independent preservation more critical than ever.
Positioning the Wayback Machine as a vital neutral third party, the petition argues that archived news is more resilient against powerful interests that might pressure outlets to alter or remove damaging stories. It portrays this safeguard as essential not just for the historical record but also for democratic accountability, because it keeps facts accessible despite political or commercial pressure. The campaign stresses that independent archiving serves the fundamental interest of any serious news organization committed to truth, yet organizers say collaboration with the Internet Archive on this basic preservation work has proven unnecessarily difficult.
The petition's core demand is straightforward. Leaders of major media outlets must publicly commit to working with the Internet Archive to restore and maintain the archiving of news content. Organizers stress that in an era of rising disinformation and direct threats to journalists, the way to combat these challenges is through more preservation and access, not less. They view the Wayback Machine's potential loss of news content as a mortal peril to the internet's most powerful archiving tool, urging news organizations to immediately reverse course and champion this ally in sustaining the integrity of public information.
119 comments
• Because archive.org respects robots.txt, publishers can block its crawler, yet people can still scrape publisher content at scale through the archive; this suggests that if archive.org blocked such scrapers, publishers might permit access.
• There's frustration that respecting robots.txt puts archive.org at a disadvantage while others, including large well-funded companies, profit by ignoring these directives, often without consequence.
• The robots.txt specification was intended to control automated scanning, not prevent individual users from manually requesting URLs through the Wayback Machine.
• Archive.org has explicitly stated they ignore robots.txt for broader access, a policy announced in 2017 and applied consistently since then.
• The core issue is LLM companies disregarding copyright laws regardless of whether they scrape from archive.org or original sources, making the archive's role in copyright infringement negligible.
• Some propose compromises like delayed access (30 days to 1 year) or escrow systems that balance publisher revenue with archival preservation.
• Paywall circumvention through archives can benefit publishers by converting into paying subscribers some readers who otherwise would not have engaged with the content.
• Archives serve crucial functions as research resources and protect truth by preserving corrections, deletions, and original versions of articles that might otherwise be lost.
• The decline of traditional media archives and the fight against preservation efforts threaten to create a "digital dark age" where non-profitable but important information disappears.
• Cryptographically verifiable internet archives using technologies like Bitcoin timestamps or distributed systems could provide tamper-proof preservation without relying on a single organization.
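The last suggestion above, tamper-evident archiving, can be illustrated with a simple hash chain. This is only a minimal sketch of the general idea, not how the Internet Archive or any commenter's proposal actually works: the record fields, the `GENESIS` value, and the function names are all assumptions made for illustration, and anchoring the final hash in an external timestamping service (such as a blockchain) is left out.

```python
import hashlib
import json

# Hypothetical starting value for the chain (an assumption of this sketch).
GENESIS = "0" * 64


def record_hash(prev_hash: str, url: str, captured_at: str, body: str) -> str:
    """Hash one snapshot together with the hash of the record before it."""
    payload = json.dumps(
        {
            "prev": prev_hash,
            "url": url,
            "captured_at": captured_at,
            "body_sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
        },
        sort_keys=True,  # stable key order so the hash is reproducible
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_chain(snapshots):
    """Turn (url, captured_at, body) tuples into a list of linked records."""
    chain = []
    prev = GENESIS
    for url, captured_at, body in snapshots:
        digest = record_hash(prev, url, captured_at, body)
        chain.append(
            {"url": url, "captured_at": captured_at, "body": body, "hash": digest}
        )
        prev = digest
    return chain


def verify_chain(chain) -> bool:
    """Recompute every hash; any edited record breaks the chain from there on."""
    prev = GENESIS
    for rec in chain:
        expected = record_hash(prev, rec["url"], rec["captured_at"], rec["body"])
        if expected != rec["hash"]:
            return False
        prev = rec["hash"]
    return True
```

In this sketch, anyone holding only the final hash can detect whether an earlier snapshot was silently altered, because changing any stored body changes every hash after it. Distributing copies of the chain, or publishing the final hash somewhere independent, is what would remove the reliance on a single organization.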
The discussion reveals a tension between copyright enforcement and information preservation, with participants generally agreeing that the real problem lies with AI companies ignoring copyright rather than with archive.org's archival practices. There's recognition that archives serve vital research and accountability functions, with several commenters noting that paywall circumvention can actually benefit publishers by converting readers into subscribers. The conversation touches on broader concerns about the sustainability of journalism, the ineffectiveness of robots.txt as a control mechanism, and the need for new models that balance publisher revenue with public access to information.
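To make the robots.txt point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The robots.txt content, the `ExampleArchiveBot` user agent, and the `news.example` URLs are hypothetical and do not reflect any publisher's actual file. The sketch shows that the check happens on the crawler's side: a crawler that respects the file stays away, while one that never runs the check is not stopped by anything in the file itself.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one archiving crawler entirely and
# keeps every other crawler out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleArchiveBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def polite_fetch_decision(user_agent: str, url: str) -> str:
    """A well-behaved crawler consults robots.txt before fetching.

    Nothing enforces this step; a crawler that skips it simply fetches.
    """
    if is_allowed(ROBOTS_TXT, user_agent, url):
        return "fetch"
    return "skip"
```

For example, `polite_fetch_decision("ExampleArchiveBot", "https://news.example/story")` yields a skip, while the same call with a different user agent yields a fetch. That asymmetry is what several commenters describe: the rule constrains only the crawlers that choose to honor it.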