Data Compression Explained (2012)

(mattmahoney.net)

109 points | by mtdewcmu 3 days ago

7 comments

  • rurban 3 hours ago
    The leaderboards are from the pre-Fabrice Bellard days, btw. Neural network modeling helped find better patterns in text.

    Also, you could say the same for the related data search problem: how to prepare data so that it can be searched most efficiently. Smallest encoding vs fastest search. Databases are mostly very, very stupid compared to more data-specific tuned algorithms. Like a factor of 1000 slower and bigger.

  • dang 4 hours ago
    Related:

    Data Compression Explained (2011) - https://news.ycombinator.com/item?id=40631931 - June 2024 (1 comment)

    Data Compression Explained - https://news.ycombinator.com/item?id=5931493 - June 2013 (14 comments)

    Data Compression Explained by Matt Mahoney - https://news.ycombinator.com/item?id=1179242 - March 2010 (1 comment)

  • usernametaken29 2 hours ago
    Isn’t the idea of AI precisely to find universal compression from arbitrary input data, at least with LLMs?
    • eru 11 minutes ago
      No, LLMs only do this for language. They don't try to do this for arbitrary data.
    • briansm 59 minutes ago
      I think so, specifically lossy compression though.

      A modern version of the book would include an extra section in the 'Lossy compression' chapter - 'Text' (alongside Images/Video/Audio) that would discuss LLMs.

      • eru 10 minutes ago
        No, it's not for lossy compression only.

        An LLM can give you a probability distribution for the next token. You can pair that with arithmetic coding to get a lossless compression/decompression algorithm. See https://en.wikipedia.org/wiki/Arithmetic_coding
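
        A minimal sketch of that pairing, assuming a toy adaptive frequency count as a stand-in for the LLM (a real setup would ask the model for next-token probabilities instead). The names AdaptiveModel, encode and decode are made up for illustration, and the message length is assumed to be sent separately. It uses exact fractions rather than the fixed-precision tricks a practical coder would need:

```python
from fractions import Fraction
from math import ceil


class AdaptiveModel:
    """Toy predictor (stand-in for an LLM): symbol counts seen so far."""

    def __init__(self, alphabet):
        self.symbols = sorted(alphabet)
        self.counts = {s: 1 for s in self.symbols}  # start every symbol at 1

    def intervals(self):
        """Map each symbol to its [low, high) slice of the unit interval."""
        total = sum(self.counts.values())
        out, cum = {}, 0
        for s in self.symbols:
            out[s] = (Fraction(cum, total), Fraction(cum + self.counts[s], total))
            cum += self.counts[s]
        return out

    def update(self, symbol):
        self.counts[symbol] += 1


def encode(message, alphabet):
    """Narrow [low, low + width) once per symbol, then name a point inside it."""
    model = AdaptiveModel(alphabet)
    low, width = Fraction(0), Fraction(1)
    for sym in message:
        lo, hi = model.intervals()[sym]
        low, width = low + width * lo, width * (hi - lo)
        model.update(sym)
    # Shortest binary fraction k / 2**nbits that falls inside the final interval.
    nbits = 1
    while True:
        k = ceil(low * 2**nbits)
        if Fraction(k, 2**nbits) < low + width:
            return k, nbits
        nbits += 1


def decode(code, nbits, length, alphabet):
    """Replay the same model updates to recover the symbols exactly."""
    model = AdaptiveModel(alphabet)
    value = Fraction(code, 2**nbits)
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(length):
        target = (value - low) / width
        for sym, (lo, hi) in model.intervals().items():
            if lo <= target < hi:
                break
        out.append(sym)
        low, width = low + width * lo, width * (hi - lo)
        model.update(sym)
    return "".join(out)


message = "abracadabra abracadabra abracadabra"
alphabet = set(message)
code, nbits = encode(message, alphabet)
restored = decode(code, nbits, len(message), alphabet)
print(nbits, "bits vs", 8 * len(message), "bits uncompressed")
```

        The better the model predicts the next symbol, the wider that symbol's slice and the fewer bits the final number needs. The decoder only works because it runs exactly the same model as the encoder.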

  • brownpoints 36 minutes ago
    I say transformers are the best compression systems
  • wps 3 hours ago
    This is the guy who created Zpaq btw. Super interesting but niche backup/archive software.
  • NooneAtAll3 3 hours ago
    does anyone have any sources to read about ai-based compression?

    I remember hearing a lot about "compression is a lot about prediction", but I don't remember reading any practical result

    • Karliss 2 hours ago
      It can be and has been done, it's just not very practical. Having a dozen-GB language model just to squeeze out a few more percent on plaintext compression, which already compresses well and is tiny in comparison to images or video, is not worth it outside benchmarks. Even superior traditional compression algorithms are often not used due to insufficient software support. A multi-gigabyte decompressor as big as the rest of your OS installation is not practical to distribute or standardize. It would also take a lot of memory at runtime for decompression, thus overshadowing the efficiency gains in everyday use.

      Only if you have a huge archival scale of data might it be worth the gains. But for long-term archival, fragile formats which depend on huge arbitrary extra knowledge aren't a good idea. I am not quite sure if AI-based compression would make it more robust by allowing corruption to be fixed based on context, or make it worse by having a single bit flip produce completely opposite but still plausible-looking text. At least with traditional compression it's usually obvious when corruption causes gibberish.

      And then you have the problem of versioning: you need to have exactly the same version of the dozen-GB model for decompression as was used for compression. Just one of them is questionable; now imagine having to store a few dozen of them. Most computers have code for supporting at least half a dozen compression formats, and many of those are parametrized, allowing a single algorithm to handle multiple variations of the compression scheme, and then many apps bundle their own copies of compression libraries.
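
      One concrete form of "obvious" in a traditional format, as a rough sketch: the zlib container carries an Adler-32 checksum, so a flipped bit is expected to surface as an error rather than as quietly altered output. The sample text and the helper names flip_bit and try_decompress are assumptions for illustration:

```python
import zlib

text = b"Compression is a lot about prediction. " * 20
packed = zlib.compress(text)


def flip_bit(data, bit_index):
    """Return a copy of data with one bit inverted."""
    out = bytearray(data)
    out[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(out)


def try_decompress(data):
    """Return the decompressed bytes, or None if zlib rejects the stream."""
    try:
        return zlib.decompress(data)
    except zlib.error:
        return None


# Try every possible single-bit corruption of the compressed stream.
results = [try_decompress(flip_bit(packed, i)) for i in range(len(packed) * 8)]
detected = sum(1 for r in results if r is None)                    # zlib raised an error
silent = sum(1 for r in results if r is not None and r != text)    # wrong output, no error
harmless = sum(1 for r in results if r == text)                    # e.g. unused padding bits
print(f"detected={detected} silent={silent} harmless={harmless}")
```

      A model-based coder without such a checksum would have no comparable signal: a corrupted stream would still decode to fluent-looking text.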
      • eru 8 minutes ago
        I mostly agree, however:

        > But for long-term archival, fragile formats which depend on huge arbitrary extra knowledge aren't a good idea.

        This doesn't need to be a problem: you can and should layer an error correcting code on top.
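
        As a small sketch of the layering idea, assuming Hamming(7,4) purely for illustration (real archival tools tend to use stronger codes such as Reed-Solomon): each 4-bit nibble of the compressed bytes is stored as 7 bits, and any single flipped bit per 7-bit word can be corrected before decompression ever sees the data. The function names and the payload are hypothetical:

```python
def hamming74_encode(nibble):
    """Encode 4 data bits as a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = nibble
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(word):
    """Correct up to one flipped bit, then return the 4 data bits."""
    w = list(word)
    s1 = w[0] ^ w[2] ^ w[4] ^ w[6]  # parity over positions 1, 3, 5, 7
    s2 = w[1] ^ w[2] ^ w[5] ^ w[6]  # parity over positions 2, 3, 6, 7
    s3 = w[3] ^ w[4] ^ w[5] ^ w[6]  # parity over positions 4, 5, 6, 7
    pos = s1 + 2 * s2 + 4 * s3      # 1-based position of the error, 0 if none
    if pos:
        w[pos - 1] ^= 1
    return [w[2], w[4], w[5], w[6]]


def protect(data):
    """Expand bytes into a flat list of bits, 14 coded bits per byte."""
    bits = []
    for byte in data:
        for shift in (4, 0):  # high nibble first, then low nibble
            nib = (byte >> shift) & 0xF
            bits += hamming74_encode([(nib >> i) & 1 for i in (3, 2, 1, 0)])
    return bits


def recover(bits):
    """Decode the bit list back into bytes, fixing single-bit errors per word."""
    nibbles = []
    for i in range(0, len(bits), 7):
        d = hamming74_decode(bits[i:i + 7])
        nibbles.append(d[0] << 3 | d[1] << 2 | d[2] << 1 | d[3])
    return bytes(hi << 4 | lo for hi, lo in zip(nibbles[0::2], nibbles[1::2]))


payload = b"compressed bytes go here"  # placeholder for a compressed stream
coded = protect(payload)
damaged = list(coded)
damaged[10] ^= 1                       # simulate one bit flip in storage
restored = recover(damaged)
```

        The cost is 7 stored bits for every 4 data bits, which eats into the compression gains, so the strength of the code is a trade-off against the size you just saved.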

  • blobbers 3 hours ago
    Matt is a great guy to explain this kind of stuff. He's very helpful.