6 comments

  • stared 1 hour ago
Thank you for sharing the benchmark. However, the results are selective.

Why no Opus 4.7? Why is Gemini 3.1 Pro missing?

If there is some other criterion (e.g. models within a certain time window or budget), great - just make it explicit.

When I see "Top 5 at a glance" and it's missing key frontier models, I am (at best) confused.

    • khurdula 9 minutes ago
Yeah, we selected models that are most commonly integrated into developer workflows and used for structured output. Typically those models tend to be in the low-to-mid cost range, with no or low reasoning.

The benchmark setup was kept consistent across all models, and typically Opus and 3.1 Pro would be overkill and expensive even with reasoning off.

Good point tho, will add this to the blog too :)

Also, the benchmark is open source, so anyone can run a model on it and create a PR; the leaderboard is dynamic and will automatically add it in.

    • Flux159 52 minutes ago
      Agree that the choices are strange. Sonnet 4.6 was tested, but no Opus 4.6.

      Gemini 3.1 and GLM 5 came out around the same time as Sonnet 4.6 (~Feb 2026) so it's strange that they are missing, but Gemini 2.5 Flash, Gemini 3 Flash, and GLM 4.7 are there.

  • zihotki 31 minutes ago
I wonder if this benchmark brings any value. Models are already quite capable and reach high scores on it.
    • khurdula 6 minutes ago
Check out "The JSON-pass vs Value-Accuracy gap" section in the blog. That was an eye-opener.

While most models were great at producing schema-valid JSON, they were pretty bad at producing accurate values.

In the graph you'll see almost a 20-30% drop between the JSON schema pass rate and the value accuracy.
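
      Roughly, the two metrics differ like this - a minimal sketch, assuming ground-truth records and the Python `jsonschema` package (names are illustrative, not the actual harness):

      ```python
      # Illustrative scoring of JSON-pass vs value accuracy;
      # not the actual benchmark harness.
      import json
      from jsonschema import ValidationError, validate

      def score(outputs: list[str], schema: dict, truths: list[dict]):
          """Return (json_pass_rate, value_accuracy) over model outputs."""
          json_pass = value_acc = 0
          for raw, truth in zip(outputs, truths):
              try:
                  obj = json.loads(raw)
                  validate(obj, schema)  # parses and is schema-valid -> JSON pass
              except (json.JSONDecodeError, ValidationError):
                  continue
              json_pass += 1
              # Value accuracy: every field must match the ground truth exactly.
              if all(obj.get(k) == v for k, v in truth.items()):
                  value_acc += 1
          n = len(outputs) or 1
          return json_pass / n, value_acc / n
      ```

      Schema-valid output can still fail the second check; that's where the 20-30% drop shows up.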

  • broyojo 13 minutes ago
    hmm why can't structured decoding be used?
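
    For context, "structured decoding" here means constraining generation so the output always conforms to a schema. A minimal sketch using OpenAI's structured-outputs option (model name and schema are illustrative):

    ```python
    # Illustrative schema-constrained output via OpenAI structured outputs;
    # any constrained-decoding stack would work similarly.
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model choice
        messages=[{"role": "user", "content": "Extract the invoice total."}],
        response_format={
            "type": "json_schema",
            "json_schema": {
                "name": "invoice",
                "strict": True,
                "schema": {
                    "type": "object",
                    "properties": {"total": {"type": "number"}},
                    "required": ["total"],
                    "additionalProperties": False,
                },
            },
        },
    )
    print(resp.choices[0].message.content)  # guaranteed schema-valid JSON
    ```

    Constrained decoding guarantees well-formed, schema-valid JSON, but not that the values are right - which is exactly the gap discussed above.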
  • dalberto 27 minutes ago
    A benchmark without Opus 4.6/4.7 feels incomplete.
  • iLoveOncall 22 minutes ago
This is just a hallucination benchmark on a subset of outputs; not sure there's value over general hallucination benchmarks?

    > Our goal is to be the best general model for deterministic tasks

I'm sorry, but this simply doesn't make sense. If you want deterministic output, don't use an LLM.
