Scaling long-running autonomous coding

(simonwillison.net)

59 points | by srameshc 5 hours ago

5 comments

  • simonw 2 hours ago
    One of the big open questions for me right now concerns how library dependencies are used.

    Most of the big ones are things like skia, harfbuzz, wgpu - all totally reasonable IMO.

    The two that stand out for me as more notable are html5ever for parsing HTML and taffy for handling CSS grids and flexbox - that's vendored with an explanation of some minor changes here: https://github.com/wilsonzlin/fastrender/blob/19bf1036105d4e...

    Taffy is a solid library choice, but it's probably the most robust ammunition for anyone who wants to argue that this shouldn't count as a "from scratch" rendering engine.

    I don't think it detracts much if at all from FastRender as an example of what an army of coding agents can help a single engineer achieve in a few weeks of work.

    • shubhamjain 19 minutes ago
      Why attempt something that has an abundant number of libraries to pick and choose from? To me, however impressive it is, 'browser built from scratch' simply overstates it. Why not attempt something like a 3D game where it's hard to find open source code to use?
      • Banditoz 16 minutes ago
        Is something like a 3D game engine even hard to find source code for? There's gotta be lots of examples/implementations scattered around.
    • sealeck 2 hours ago
      I think the other question is how far away this is from a "working" browser. It isn't impossible to render a meaningful subset of HTML (especially when you use external libraries to handle a lot of this). The real difficulty is doing this (a) quickly, (b) correctly and (c) securely. All of those are very hard problems, and also quite tricky to verify.

      I think this kind of approach is interesting, but it's a bit sad that Cursor didn't discuss how they close the feedback loop: testing/verification. As generating code becomes cheaper, I think effort will shift to how we can more cheaply and reliably determine whether an arbitrary piece of code meets a desired specification. For example, did they use https://web-platform-tests.org/, fuzz testing (e.g. feed in random webpages and inform the LLM when the fuzzer finds crashes), etc.? I would imagine truly scaling long-running autonomous coding would put an emphasis on this.
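      To make the fuzzing idea concrete, here is a minimal sketch of such a loop in Python. It is not how Cursor or FastRender does it (the post doesn't say); `random_html`, `fuzz`, and `toy_render` are hypothetical names, and `toy_render` is only a stand-in for whatever entry point the real engine exposes. The crash list it returns is the kind of thing you could hand back to an agent as feedback.

```python
import random

TAGS = ["div", "span", "p", "table", "b"]

def random_html(rng, max_depth=4):
    """Build a small, possibly malformed HTML fragment."""
    if max_depth == 0 or rng.random() < 0.3:
        # Leaf: plain text, an entity, a stray bracket, or nothing.
        return rng.choice(["text", "&amp;", "<", ""])
    tag = rng.choice(TAGS)
    children = "".join(
        random_html(rng, max_depth - 1) for _ in range(rng.randint(0, 3))
    )
    # Occasionally omit the closing tag to exercise error recovery.
    close = f"</{tag}>" if rng.random() < 0.8 else ""
    return f"<{tag}>{children}{close}"

def fuzz(render, iterations=200, seed=0):
    """Run `render` on random inputs and collect (input, error) crash reports."""
    rng = random.Random(seed)  # seeded so every crash is reproducible
    crashes = []
    for _ in range(iterations):
        page = random_html(rng)
        try:
            render(page)
        except Exception as exc:
            crashes.append((page, repr(exc)))
    return crashes

def toy_render(page):
    """Hypothetical stand-in for the engine under test."""
    if page.count("<") != page.count(">"):
        raise ValueError("unbalanced angle brackets")
    return len(page)

if __name__ == "__main__":
    for page, error in fuzz(toy_render)[:5]:
        print(f"{error}: {page!r}")
```

      A real setup would also want timeouts, crash deduplication, and input minimisation before reporting anything to the model; this only shows the shape of the loop.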

      Of course Cursor may well have done this, but it wasn't super deeply discussed in their blog post.

      I really enjoy reading your blog and it would be super cool to see you look at approaches people have to ensuring that LLM-produced code is reliable/correct.

      • simonw 2 hours ago
        Yeah, I'm hoping they publish a lot more about this project! It deserves way more than the few sentences they've shared about it so far.
    • teaearlgraycold 6 minutes ago
      It looks like JS execution is outsourced to QuickJS?
    • janoelze 2 hours ago
      Any views on the nature of "maintainability" shifting now? If a fleet of agents demonstrated the ability to bootstrap a project like that, would that be enough indication to you that orchestration would be able to carry the code base forward? I've seen fully LLM'd codebases hit a certain critical weight where agents struggled to maintain coherent feature development and keep patterns aligned, and spiralled into quick fixes.
      • simonw 2 hours ago
        Almost no idea at all. Coding agents are messing with all 25+ years of my existing intuitions about what features cost to build and maintain.

        Features that I'd normally never have considered building because they weren't worth the added time and complexity are now just a few well-structured prompts away.

        But how much will it cost to maintain those features in the future? So far the answer appears to be a whole lot less than I would previously budget for, but I don't have any code more than a few months old that was built ~100% by coding agents, so it's way too early to judge how maintenance is going to work over a longer time period.

      • brianjeong 1 hour ago
        I think there's a somewhat valid perspective that the (N+1)th model can simply clean up the previous model's mess.

        Essentially a bet that the rate of model improvement is going to be faster than the rate of decay from bad coding.

        Now, this hurts me personally to see, as someone who actually enjoys having quality code, but I don't see why it doesn't have a decent chance of holding.

  • halfcat 1 hour ago
    So AI makes it cheaper to remix anything already-seen, or anything with a stable pattern, if you’re willing to throw enough resources at it.

    AI makes it cheap (eventually almost free) to traverse the already-discovered and reach the edge of uncharted territory. If we think of a sphere, where we start at the center, and the surface is the edge of uncharted territory, then AI lets you move instantly to the surface.

    If anything solved becomes cheap to re-instantiate, does R&D reach a point where it can’t ever pay off? Why would one pay for the long-researched thing when they can get it for free tomorrow? There will be some value in having it today, just like having knowledge about a stock today is more valuable than the same knowledge learned tomorrow. But does value itself go away for anything digital, and only remain for anything non-copyable?

    The volume of a sphere grows faster than the surface area. But if traversing the interior is instant and frictionless, what does that imply?
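    For reference, the geometry behind that claim, using the standard formulas for a sphere of radius r (the mapping of "volume" to known territory and "surface" to the frontier is the analogy above, not a measured fact):

```latex
V(r) = \tfrac{4}{3}\pi r^{3}, \qquad
S(r) = 4\pi r^{2}, \qquad
\frac{V(r)}{S(r)} = \frac{r}{3}
```

    So the ratio of interior to frontier grows linearly with r: doubling the radius quadruples the surface but multiplies the volume by eight.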

    • tornikeo 13 minutes ago
      > The volume of a sphere grows faster than the surface area. But if traversing the interior is instant and frictionless, what does that imply?

      It's nearly frictionless, not frictionless, because someone has to use the output (or at least verify it works). Also, why do you think the "shape" of the knowledge is spherical? I don't assume to know the shape, but whatever it is, it has to be a fractal-like, branching, repeating pattern.

    • ramraj07 19 minutes ago
      The fundamental idea that modern LLMs can only ever remix, even if it's technically true (doubt), in my opinion only says to me that all knowledge is only ever a remix, perhaps even mathematically so. Anyone who still keeps implying these are statistical parrots or whatever is just going to regret these decisions in the future.
      • heavyset_go 3 minutes ago
        Yeah, Yann LeCun is just some luddite lol
  • tinyhouse 2 hours ago
    Well, software is measured over time. The devil is always in the details.
    • aronowb14 22 minutes ago
      Yeah curious what would happen if they asked for an additional big feature on top of the original spec
  • anilgulecha 2 hours ago
    That's a wild idea: a browser from scratch! And Ladybird has been moving at a snail's pace for a long time.

    I think good abstraction design and a good test suite will make or break the success of future coding projects.

  • vivzkestrel 2 hours ago
    I am waiting for that guy or team that uses LLMs to write the best version of Windows in existence, something that surpasses even what Microsoft has done over the years. Honestly, looking at the current state of Windows 11, it really feels like it shouldn't even be that hard to make something more user-friendly.
    • kimixa 1 hour ago
      Considering Microsoft's significant (and vocal) investment in LLMs, I fear the current state of Windows 11 is related to a team trying to do exactly that.
      • g947o 36 minutes ago
        I noticed that a dialog that has worked correctly for the past 10+ years is now using a new and apparently broken layout. Elements don't even align properly.

        It's hard to imagine a human developer missing something so obvious.