Show HN: Bonsai 1.7B ternary model at 442T/s on M4 Max

(agents2agents.ai)

13 points | by hhuytho 13 hours ago

2 comments

  • dsecurity49 12 hours ago
    That performance jump is incredible. Curious to know if the evolution search found any specific optimizations that were counter-intuitive to how we normally write Metal kernels?
    • hhuytho 11 hours ago
      Yes, a few interesting observations:

      - Against the conventional fusion wisdom of "fuse early, fuse aggressively", the search does the opposite for Q. It fuses K's RMSNorm into the K-cache write (one normalization pass over the whole K matrix), but defers Q's RMSNorm to the attention kernel's prologue.

      - The result_output of the Q2_0 kernel was rewritten to process 2 output rows per SIMD lane instead of 1, with nsg=8. This runs counter to the common Metal advice of maximizing occupancy to keep simdgroups busy. The advantage is that each y vector gets reused across two accumulators, halving DRAM bandwidth for the y operand.
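      Roughly, the K/Q normalization split looks like this (a scalar CPU sketch, not the real Metal kernels; `write_k_cache` and `attend_one` are made-up names for illustration):

      ```cpp
      #include <algorithm>
      #include <cmath>
      #include <vector>

      // Standard RMSNorm: scale x so its root-mean-square is ~1.
      static void rmsnorm(float* x, int d, float eps = 1e-6f) {
          float ss = 0.0f;
          for (int i = 0; i < d; ++i) ss += x[i] * x[i];
          float scale = 1.0f / std::sqrt(ss / d + eps);
          for (int i = 0; i < d; ++i) x[i] *= scale;
      }

      // K path: normalize once, at cache-write time, so every later
      // attention step reads an already-normalized K row from the cache.
      void write_k_cache(std::vector<float>& k_cache, const float* k_row,
                         int d, int pos) {
          std::vector<float> tmp(k_row, k_row + d);
          rmsnorm(tmp.data(), d);
          std::copy(tmp.begin(), tmp.end(), k_cache.begin() + pos * d);
      }

      // Q path: normalization deferred to the attention kernel's prologue,
      // right before the Q.K dot products, instead of being fused upstream.
      float attend_one(const std::vector<float>& k_cache, float* q,
                       int d, int pos) {
          rmsnorm(q, d);  // deferred Q RMSNorm
          float dot = 0.0f;
          for (int i = 0; i < d; ++i) dot += q[i] * k_cache[pos * d + i];
          return dot;
      }
      ```

      K is written once and read many times, so its norm amortizes at write time; Q is produced fresh each step anyway, so normalizing it in the prologue costs nothing extra.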

      We didn't suggest either of these. The agent had the upstream code, a benchmark, and a correctness check.
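      For the Q2_0 change, the bandwidth argument is easiest to see in a scalar sketch (the real kernel is a Metal simdgroup matvec over quantized blocks; the row pairing and names here are illustrative only):

      ```cpp
      #include <vector>

      // Each "lane" owns a pair of output rows. One load of y[c] feeds TWO
      // accumulators, so y is read from memory once per row pair instead of
      // once per row -- halving DRAM traffic for the y operand.
      // Assumes an even number of rows, for brevity.
      std::vector<float> matvec_2rows(const std::vector<float>& W,  // rows x cols, row-major
                                      const std::vector<float>& y,
                                      int rows, int cols) {
          std::vector<float> out(rows, 0.0f);
          for (int r = 0; r + 1 < rows; r += 2) {
              const float* w0 = &W[(r + 0) * cols];
              const float* w1 = &W[(r + 1) * cols];
              float acc0 = 0.0f, acc1 = 0.0f;
              for (int c = 0; c < cols; ++c) {
                  float yc = y[c];     // single load of y[c] ...
                  acc0 += w0[c] * yc;  // ... reused for row r
                  acc1 += w1[c] * yc;  // ... and for row r+1
              }
              out[r] = acc0;
              out[r + 1] = acc1;
          }
          return out;
      }
      ```

      The trade-off is more registers per lane (two accumulators, two weight pointers), which is why it cuts against the usual "maximize occupancy" advice.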

  • rpdaiml 9 hours ago
    Nice work, that throughput is wild.