Blocking Internet Archive Won't Stop AI, but Will Erase Web's Historical Record

(eff.org)

180 points | by pabs3 6 hours ago

10 comments

VladVladikoff 28 minutes ago
As a site operator who has been battling the influx of extremely aggressive AI crawlers, I’m now wondering if my tactics have accidentally blocked the Internet Archive. I am totally OK with them scraping my site; they would likely obey robots.txt. But these days even Facebook ignores it, and exceeds my stipulated crawl delay by distributing its traffic across many IPs. (I even have a special nginx rule just for Facebook.)
Blocking certain JA3 hashes has so far been the most effective countermeasure. However, I wish there were an nginx wrapper around hugin-net that could help me do TCP fingerprinting as well, as I do not know Rust and feel terrified of asking an LLM to make it. There is also a race condition issue with that approach: as it is passive fingerprinting, even the JA4 hashes won’t be available for the first connection, and the AI crawlers I’ve seen do one request per IP, so you don’t get a chance to block the second request (it never happens).
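For readers unfamiliar with JA3 blocking, here is a minimal sketch of the idea, not the commenter's actual setup. A JA3 fingerprint is the MD5 of five comma-separated fields from the TLS ClientHello (version, cipher suites, extensions, elliptic curves, point formats), with the values in each field joined by dashes and GREASE placeholders dropped. The function names and the blocklist are assumptions made for illustration; parsing the ClientHello itself is out of scope here.

```python
import hashlib

# GREASE placeholder values (0x0A0A, 0x1A1A, ..., 0xFAFA) are skipped by JA3
# so that randomized placeholders do not change the fingerprint.
GREASE = {0x0A0A + 0x1010 * i for i in range(16)}


def ja3_string(version, ciphers, extensions, curves, point_formats):
    """Build the JA3 input string from already-parsed ClientHello fields."""
    def join(values):
        return "-".join(str(v) for v in values if v not in GREASE)

    return ",".join([
        str(version),
        join(ciphers),
        join(extensions),
        join(curves),
        join(point_formats),
    ])


def ja3_hash(version, ciphers, extensions, curves, point_formats):
    """Return the JA3 fingerprint: the hex MD5 digest of the JA3 string."""
    text = ja3_string(version, ciphers, extensions, curves, point_formats)
    return hashlib.md5(text.encode("ascii")).hexdigest()


def is_blocked(fingerprint, blocklist):
    """Hypothetical policy check: deny a request whose fingerprint is listed."""
    return fingerprint in blocklist
```

The hash only identifies a TLS client configuration, so a crawler that randomizes or imitates a browser's ClientHello would not match the blocklist.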
- mycall 7 minutes ago
  Evasion techniques like JA3 randomization or impersonation can bypass detection.
- andrepd 4 minutes ago
  I wonder if it would be practical to have bot-blocking measures that can be bypassed with a signature from a set of whitelisted keys... In this case the server would be happy to allow Internet Archive crawlers.
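The allowlisted-key idea above could look roughly like the following. This is a simplified sketch under stated assumptions: it uses an HMAC with a per-crawler shared secret because that is available in the Python standard library, whereas a real deployment would more likely use asymmetric signatures so the server only stores public keys. The key ID, the secret, and the signed fields (method, path, date) are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical allowlist: key ID -> shared secret issued to a trusted crawler.
ALLOWED_KEYS = {"archive-crawler-1": b"example-secret-do-not-use"}


def sign(key, method, path, date):
    """Compute the signature a trusted crawler would attach to its request."""
    message = f"{method}\n{path}\n{date}".encode("utf-8")
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def is_allowlisted(key_id, signature, method, path, date):
    """Return True if the request carries a valid signature from a listed key."""
    key = ALLOWED_KEYS.get(key_id)
    if key is None:
        return False
    expected = sign(key, method, path, date)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode("utf-8"), signature.encode("utf-8"))
```

A request that passes `is_allowlisted` could skip the bot-blocking rules, while everything else goes through them as usual. The date is part of the signed message so a server could reject stale requests, but this sketch does not enforce freshness.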
tossandthrow 1 hour ago
I think media outlets think way too highly of their contribution to AI.
Had they never existed, it would likely not have made a dent in AI development, just as, had they been twice as productive, it would likely not have made a dent in the quality of LLMs either.
- Freak_NL 49 minutes ago
  How do you think those models get trained? You can only get so far with Wikipedia, Reddit, and non-fiction works like books and academic papers.
  - tossandthrow 32 minutes ago
    Have a look at this article: https://www.washingtonpost.com/technology/interactive/2023/a...
    The NY Times is 0.06% of Common Crawl.
    These news media outlets provide a drop in the ocean worth of information. Both qualitatively and quantitatively.
    The news / media industry is really just trying to hold on to their lifeboat before inevitably becoming entirely irrelevant.
    (I do find this sad, but it is the reality - I can already get considerably better journalism using LLMs than from actual journalists - both clickbait stuff and high quality stuff)
    - pimlottc 17 minutes ago
      That seems like a reductive way to consider it. What percent of music was created by Led Zeppelin? What percent of art was painted by Monet? What percent of films by Alfred Hitchcock? It may be a small percentage objectively but they are hugely influential.
  - RugnirViking 45 minutes ago
    How does the entire textual corpus of, say, the New York Times compare to all novels? Each article is a page of text, maybe two at most? There certainly are an awful lot of articles. But it's hard to imagine it is much more than a couple hundred novels. There must be thousands of novels released each year.
    - Freak_NL 26 minutes ago
      Like apples to oranges.
      LLMs are (apparently) massively used to get information about topics in the real world. Novels aren't going to be much help there. Journalism, particularly in written form, provides a fount of facts presented from different angles, as well as opinions, and it was all there free for the taking…
      Wikipedia provides the scantest summary of that, fora and social media give you banter, fake news, summaries of news, and a whole lot of shaky opinions, at best. Novels give you the foundations of language, but in terms of knowledge nothing much beyond what the novel is about.
      - olalonde 9 minutes ago
        LLMs can get up to date information from primary sources - no journalists required.
- phatfish 24 minutes ago
  Isn't the non-LLM generated text becoming more valuable for training as the web at large is flooded with slop?
  Preventing new human generated text from being used by AI firms (without consent) seems like a valid strategy.
stuaxo 28 minutes ago
The New York Times is awful. I want it to be archived so people can see that in the future.
gzread 2 hours ago
This is why archive.is was created. Should we stop trying to hunt down and punish its creator and support it as the extremely useful project that it is?
- 8cvor6j844qw_d6 14 minutes ago
  Agreed, and if archive.is goes down, archive.org becomes the de facto monopoly in web archival.
  That's a problem because archive.org honors removal requests from site owners. Buy an old domain and you can theoretically wipe its archived history clean.
- philistine 1 hour ago
  The creator can maintain anonymity. The creator does not deserve to continue being celebrated when they embarked on a DDOS campaign using the traffic of archive.is against a journalist trying to uncover their identity. By these actions, they have shown themselves to be capricious, vindictive, and willing to ensnare their users in their DDOS of others. Whoever they are, they’re terrible.
  - rdevilla 1 hour ago
    This is great. Journalists are impeding the preservation of the historical record by blocking archivist traffic while simultaneously manhunting those archivists who find ways around their authwalls.
    Soon the news and the historical facts will be unnecessary. You can simply receive your wisdom from the AIs, which, as nondeterministic systems, are free to change the facts at will.
    - Permit 39 minutes ago
      >This is great. Journalists are impeding the preservation of the historical record by blocking archivist traffic while simultaneously manhunting those archivists who find ways around their authwalls.
      You are deliberately misrepresenting the situation. The journalists who block archivist traffic are not in any way connected to the blogger who was attempting to investigate the creator of archive.is. You have portrayed them as related in an attempt to garner sympathy for the creator of archive.is.
      Here is an account of the facts: https://gyrovague.com/2026/02/01/archive-today-is-directing-...
  - gzread 1 hour ago
    Their life is in danger and one particular journalist is making it so
  - Obscurity4340 1 hour ago
    I had no idea that was the actual situation (journalist trying to hunt them down). Sorta changes the moral calculus, I'll allow it
  - choo-t 1 hour ago
    Well, if they deserve anonymity, they also deserve to be able to protect it, and they have very few tools against doxxing. The DDOS was one of them; corrupting the archived article was another, albeit dangerous for their own reputation as an archiver.
    The crux of the problem was the doxxing, not the defense against it.
    - ajam1507 1 hour ago
      You don’t think leveraging your site to DDOS someone is a problem?
      Do people not also deserve to be protected from being DDOSed? Do people also not deserve to not have their internet traffic be used to DDOS someone?
      - staticassertion 43 minutes ago
        I think this is a weak framing. Lots of things are moral or immoral under specific circumstances. We should protect people from being murdered. I think murder is usually wrong. But we also likely agree that there are circumstances in which killing someone can be justified. If we can find context for taking a life, I'm quite sure we can find context for a DoS.
        - ajam1507 27 minutes ago
          And what’s the context for using the internet traffic of your unsuspecting users to accomplish this?
          - choo-t 23 minutes ago
            Using the internet traffic of the people using your service to protect your anonymity, and thus protecting the service itself.
            - ajam1507 17 minutes ago
              So you shouldn’t have to inform your users that their traffic will be used in a cyberattack?
      - kpcyrd 1 hour ago
        You don't think non-consensually revealing somebody's identity is a problem?
        Resorting to DDoS is not pretty, but "why is my violent behavior met with violence" is a little oblivious and a reversal of victim and perpetrator roles.
        - ajam1507 40 minutes ago
          > You don't think non-consensually revealing somebody's identity is a problem?
          I do think it’s a problem. You are the only one excusing bad behavior here.
      - psychoslave 1 hour ago
        Not defending any party, it's a basic ethological expectation: a creature that tries to beat another should expect an aggressive response in return.
        Of course, never attacking anyone, and turning any aggression against oneself into an opportunity to acculturate the aggressor into someone with the same empathic behavior, is what a paragon of virtue would do. But paragons of virtue are not the median norm, by definition.
        - ajam1507 18 minutes ago
          > Not defending any party, it's a basic ethological expectation: a creature that tries to beat another should expect an aggressive response in return.
          Another basic ethological expectation is that the strong dominate the weak, but maybe we shouldn’t base our moral framework on how things are, but rather on how they should be.
      - choo-t 1 hour ago
        > You don’t think leveraging your site to DDOS someone is a problem?
        It is, but it's one of the only tools they have to keep the doxxing site from being reachable.
        > Do people not also deserve to be protected from being DDOSed?
        You mean the person doing the doxxing should be protected?
        >Do people also not deserve to not have their internet traffic be used to DDOS someone?
        Yes, it should have been opt-in. But unless you don't run JS, you kinda give the websites you visit the right to run arbitrary code anyway.
  - MSFT_Edging 1 hour ago
    If there's ever something a journalist would never ever do, it's destroy someone's life for a headline. Never ever. Totally impossible.
  - staticassertion 45 minutes ago
    They're terrible for not wanting to be dox'd?
user_7832 2 hours ago
> But in recent months The New York Times began blocking the Archive from crawling its website, using technical measures that go beyond the web’s traditional robots.txt rules. That risks cutting off a record that historians and journalists have relied on for decades. Other newspapers, including The Guardian, seem to be following suit.
I'm a bit surprised I never read about this till now, though while disappointing it is unfortunately not surprising.
> The Times says the move is driven by concerns about AI companies scraping news content. Publishers seek control over how their work is used, and several—including the Times—are now suing AI companies over whether training models on copyrighted material violates the law. There’s a strong case that such training is fair use.
I suspect part of it might be these corps not wanting people to skip a paywall (whether or not someone would pay even if they had no access is a different story). But this argument makes no sense for the Guardian.
- user_7832 2 hours ago
  I went to Guardian's website to cross check their motto (getting confused with WaPo's motto) and got served this (hilarious? sad?) banner. As if blocking cross website tracking is somehow bad.
  > Rejection hurts … You’ve chosen to reject third-party cookies while browsing our site. Not being able to use third party cookies means we make less from selling adverts to fund our journalism.
  > We believe that access to trustworthy, factual information is in the public good, which is why we keep our website open to all, without a paywall.
  > If you don’t want to receive personalised ads but would still like to help the Guardian produce great journalism 24/7, please support us today. It only takes a minute. Thank you.
  - duskdozer 1 hour ago
    >If you don’t want to receive *personalised ads*
    So ads, just not personalized. Remind me again why personalized ads are good for me if I have to pay to have non-personalized ads?
  - mocd 1 hour ago
    The Guardian’s ads asking for contributions have got progressively more desperate. I find their commitment to keeping their site paywall free admirable, but the current almost-begging (and selling off their Sunday paper) has got so intense that it feels like it’s only a matter of time until they introduce some kind of paid content.
xnx 3 hours ago
Does Internet Archive have a distributed residential IP crawler program? I would enthusiastically contribute to that.
There must be some mechanism to prevent tampering in such a setup.
- progval 2 hours ago
  The Internet Archive does not, but Archive Team does: https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
  - xnx 2 hours ago
    Yes! I'm running an instance right now.
- gzread 2 hours ago
  No, IA does everything above board and even honors invalid DMCA takedowns.
- Retr0id 2 hours ago
  > There must be some mechanism to prevent tampering in such a setup.
  Trivial as long as they terminate the TLS on their end, not yours. So you'd just be a residential proxy.
b1n 19 minutes ago
Archive now, make public after X amount of time. So, maybe both publisher and archiver are happy (or less sad).
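The "archive now, publish later" rule can be stated as a one-line check. The sketch below is only an illustration: the function name, the parameters, and the idea of a single fixed embargo per capture are assumptions, and the thread does not settle what a reasonable delay would be.

```python
from datetime import datetime, timedelta, timezone


def is_public(captured_at, embargo, now=None):
    """Return True once the embargo period since capture has fully elapsed."""
    if now is None:
        now = datetime.now(timezone.utc)
    return now >= captured_at + embargo
```

An archive would still capture the page immediately, store `captured_at` alongside it, and only serve the snapshot to the public when `is_public` returns True.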
Havoc 42 minutes ago
As someone perpetually online, it’s also making me rethink that a bit.
Unless you love walled gardens, doomscrolling and endless AI slop, it seems like the fun is over.
SlinkyOnStairs 2 hours ago
Devil's advocate: Anyone seeking to limit AI scraping doesn't have much choice but to also block archivists.
And it's genuinely not that weird for news organisations to want to stop AI scraping. This is just a repeat of their fight with social media embedding.
Sure. The back catalogue should be as close to public domain as possible, libraries keeping those records is incredibly important for research.
But with current news, that becomes complicated as taking the articles and not paying the subscription (or viewing their ads) directly takes away the revenue streams that newsrooms rely on to produce the news. Hence the "Newspaper trying to ban linking" mess, which was never about the links themselves but about social media sites embedding the headline and a snippet, which in turn made all the users stop clicking through and "paying" for the article.
Social media relies on those newsrooms (the same goes for most other kinds of websites, really) to provide a lot of their content. And AI relies on them for all of the training data (remember: "Synthetic data" does not appear ex nihilo) & to provide the news that the AI users request. We can't just let the newsrooms die. The newsroom itself hasn't been replaced; its revenue has been destroyed.
---
And so, the question of archives pops up. Because yes, you can with some difficulty block out the AI bots, even the social media bots. A paywall suffices.
But this kills archiving. Yet if you whitelist the archives in some way, the AI scrapers will just pull their data out of the archive instead and the newsrooms still die. (Which also makes the archiving moot)
A compromise solution might be for archives to accept/publish things on a delay, keep the AI companies from taking the current news without paying up, but still granting everyone access to stuff from decades ago.
There's just major disagreement about what a reasonable delay is. Most major news orgs and other such IP-holders are pretty upset about AI firms' "steal first, ask permission later" approach. Several AI firms setting the standard that training data is to be paid for doesn't help here either. In paying for training data they've created a significant market for archives, and a significant incentive not to make them publicly and freely accessible.
Why would The Times ever hand over their catalogue to the Internet Archive if Amazon will pay them a significant sum of money for it? The greater good of all humanity? Good luck getting that from a dying industry.
---
Tangent: Another annoying wrinkle in the financial incentives here is that not all archiving organisations are engaging in fair play, which yet further pushes people to obstruct their work.
To cite an HN-relevant example: Source code archivist "Software Heritage" has long engaged in holding a copy of all the source code they can get their hands on, regardless of its license. If it's ever been on GitHub, odds are they're distributing it. Even when licenses explicitly forbid that. (This is, of course, perfectly legal in the case of actual research and other fair use. But:)
They were notably involved in HuggingFace's "The Stack" project by sharing their archives ... and received money from HuggingFace. While the latter is nominally a donation, this is in effect a sale.
---
I find it quite displeasing that the EFF fails to identify the incentives at play here. Simply trying to nag everyone into "doing the thing for the greater good!" is loathsome and doesn't work. Unless we change this incentive structure, the outcome won't change.
- Obscurity4340 1 hour ago
  It would be better if there was some arrangement the papers could reach with the Archive where they just delay the release, or wait a week and then it's part of the archive. That way, news stuff gets paid for when it's hot and fresh, but then it gets archived and the record is preserved.