Knowledge Should Not Be Gated

(formaly.io)

49 points | by nezhar 6 hours ago

9 comments

  • drunken_thor 1 hour ago
    SDKs/libs, especially open source SDKs, were never about gated knowledge. They were about the providing company making it as easy as possible for you to integrate. You would not need to know the idiosyncrasies behind API retries, paging, rate limits, auth flow, and on and on. The third party developers needed a resource, they call a method and get it. Open source libraries especially are about pooling knowledge, not gating it. This is propaganda for pooling that knowledge inside a service you have to pay to use, and instead of developers all using and improving the same codebase together, they have to spend money to rewrite the same code repeatedly. This is AI companies further trying to undercut open source because it’s free.
  • dofm 41 minutes ago
    We are at the breathless-but-low-information-posts-about-plain-text-formats point in the cycle.
    • Herring 10 minutes ago
      This comment is another example of "Nobody ever gets credit for fixing problems that never happened."

      There's a massive push to add unnecessary complexity to everything out there, because complexity pays all our bills.

  • rightbyte 1 hour ago
    It seems beyond naive, rather malicious, to upload any useful private data to SaaS LLMs.

    Like, you are letting them data mine your business. Why are corporations not panicking over this?

    • m11a 1 hour ago
      Most corporations likely have zero data retention agreements with LLM providers, at least for API usage.

      (Sure, you could be sceptical on whether the LLM provider is upholding that, but I personally do trust them. The trust betrayal if ZDR wasn't actually ZDR would be too great and commercially damaging for them to lie.)

      • dataflow 1 hour ago
        > (Sure, you could be sceptical on whether the LLM provider is upholding that, but I personally do trust them. The trust betrayal if ZDR wasn't actually ZDR would be too great and commercially damaging for them to lie.)

        Is actual ZDR verbiage in contracts more specific and limited in scope than what we see advertised publicly ("...except where needed to comply with law or combat misuse" in Anthropic's case)? Because those seem pretty damn vague and large enough holes to drive trucks through.

        • lukewarm707 51 minutes ago
          to combat misuse, we must store and read all prompts and responses. ;)

          to comply with the law, we must send to the police our detections of illegal activity >:|

          a guy subpoenaed your chats, i guess we stored them (oops) so now it's illegal to destroy it...

    • api 18 minutes ago
      Imagine someone comes to you and says: "You must remove your door locks. Anyone can come into your house any time. You also need cameras across most of your house. But in exchange, magic elves will do all of your home chores: washing, dishes, folding laundry, cleaning, minor home repairs. All of this will be done for pennies on the dollar compared to any current option."

      How many people would take it?

      I know I'd actually be tempted. Con: total loss of privacy. Pro: it folds laundry, and I f'ing loathe laundry with the intensity of a billion suns.

      Every business has similar trade-offs they'd be tempted to take.

      • em-bee 0 minutes ago
        i believe that in the future technology will be so advanced that protection of privacy is impossible. the only way to counter that is education to respect people's privacy and very harsh punishments for violations.

        i also believe that we will live in a post scarcity world, which means profit is no longer interesting, so any business case for invading your privacy will go away and therefore it will only happen for personal interest.

        the key in any case will be education, because without it abuse will be rampant and progress will halt because everyone is going to be suspicious of everyone else.

    • prodigycorp 1 hour ago
      because corporations are using providers with ZDR in the contract. If OAI or any of the cloud providers violate this they're getting sued to oblivion.
      • dofm 1 hour ago
        The problem is that there is an enormous, nearly unignorable incentive to work around it. So they will.

        As the customer base becomes more and more corporate (which it will), they end up with disproportionately more customers whose experiences cannot be used to train the model to make it better for those customers.

        Either way, corporate customers cannot leech off the training from consumers handing over their personal data forever; there aren't enough specialists in that training set to improve the models with no loss of corporate trust.

        Betrayal of their trust is inevitable.

        • WarmWash 35 minutes ago
          Conspiracies are for the chronically online
          • dofm 33 minutes ago
            This is not a conspiracy theory. It's futurology, maybe, but pretty basic stuff at that.

            At some point, where does the training advantage for specialist LLMs come from, if not progressively encroaching on customer data for the benefit of equivalent customers?

      • coffeefirst 51 minutes ago
        These are the same people who performed the largest scale breach of copyright in history on the theory that they could get away with it.

        I’m not making any accusations, but we should not underestimate their tolerance for legal and financial risk.

        It may be a little paranoid to insist on self hosting based on that, but I’m not so sure that it’s crazy.

        • WarmWash 32 minutes ago
          It has been ruled that training on copyright is not a breach of copyright unless you subverted payment for it.

          Which they did do, but the scale is relatively minuscule compared to the full dataset.

        • dofm 38 minutes ago
          Trade secrecy is all anyone has left. It's not paranoid at all. You would hope that most serious companies have a tier of corporate knowledge protection that is somewhere between Coca-Cola/KFC herbs recipe secrecy and Stringer Bell's note-taking exhortation: "is you giving the LLM notes on our unique advantage?"
  • MelonUsk 1 hour ago
    Yep, knowledge should not be gated:

    Imagine Google search without any links or sources named

    This is the “modern” AI chatbot:

    It never mentions the training data it used, in fact has no idea what it used (often FB, Reddit and partisan websites)

    Update: I added a reply about the after-the-fact Googling chatbots do - it’s different

    • dmortin 1 hour ago
      Specifically in Google AI Overview I always see links to sites where the information is sourced from.

      Or at least some of the sites, if the same info is sourced from 100 pages then it only shows 2 or 3, maybe the ones with the biggest PageRanks.

      • MelonUsk 1 hour ago
        Yep, that’s true

        But those links are Googled after the model started to answer, they are not the links to the training data

        Imagine an artificial “librarian” that read all the books and spits hallucinated quotes for you

        But doesn’t let you enter the library, open a single book or even see the sources for those hallucinated quotes

        But instead Googles some sources based on hallucinations after generating them ;-)

        It’s better than nothing but you can Google them, too, while training data (the library) is completely hidden from you, even the public domain parts of it - zero attribution

        • dmortin 1 hour ago
          There should be at least some correlation. When building the model they give more weight to some pages (e.g. Wikipedia) which have bigger trust (pagerank?). And when they provide links in answers, those matches are listed first which have better pagerank for the query.

          So if it sources something in Wikipedia, it is more likely to provide Wikipedia as a trusted source for it.

          The problem is when an answer is hallucinated, false, it may provide a source for it which contains the invalid info.

          • MelonUsk 42 minutes ago
            Yep, a few non-profits work on direct training data attribution:

            OlmoTrace, Guide Labs with Clarity and a few more

            Labs train the model with attribution baked-in and they say the bigger the model - the more interpretable it becomes

            Pretty sure it’s the future

  • internet2000 1 hour ago
    Information wants to be free! I remember when that was the rallying cry of hackers. I miss those days.
    • sghiassy 42 minutes ago
      “Hack the Planet!” — Hackers
  • bonoboTP 21 minutes ago
    AI generated article.
  • 5701652400 1 hour ago
    now that any software/knowledge is copyable given sufficient cash and AIs, gating knowledge might be the only thing that protects your business. otherwise you do not have business.
  • jdw64 24 minutes ago
    Personally, I think the ability to distinguish between all the knowledge that's overflowing is becoming a characteristic of the current establishment. In reality, the number of sites where you can get good information is extremely limited. It feels like we're in an era where discernment matters more.

    Most of it is just misinformation, after all. People say knowledge shouldn't be restricted, but now we have the opposite problem. There's so much information that just skimming through it takes too much time. On top of that, as we shift from text to video, getting information has become even harder. Compared to text, YouTube videos feel like they have much lower information density. I've heard that the TikTok generation's text literacy is declining, but maybe that's actually a social adaptation to process as much data as possible from low-density sources.

    In that sense, the efficiency of RAG ultimately comes down to what kind of good knowledge you're feeding into the AI.

  • nephihaha 1 hour ago
    Sadly it has been during most of human history. I think the establishment resents the masses becoming over educated. The 1990s internet had a wealth of views and information on it. Now you can only access approved sources via search engines thanks to scaremongering, and have CloudFlare monitoring everything you do.