I use this approach for a ticket-based customer support agent. There are a bunch of boolean checks that the LLM must pass before its response is allowed through. Some are hard fails; others, like you brought up, are just a weighted ding to the response's final score.
Failures are fed back to the LLM so it can regenerate taking that feedback into account. People are much happier with it than I could have imagined, though it's definitely not cheap (but the extra cost is very much worth the tradeoff).
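For a rough idea of the shape, here's a minimal sketch of that kind of gate; the check names, weights, and threshold are made-up placeholders, not the actual production checks:

```python
# Sketch of a response gate: hard-fail checks block the reply outright,
# soft checks only subtract from a weighted score. Failed checks are
# summarized and fed back into the next generation attempt.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Check:
    name: str
    fn: Callable[[str], bool]   # returns True if the response passes
    hard: bool = False          # hard fail -> reject immediately
    weight: float = 0.0         # soft checks ding the score by this much

CHECKS = [
    Check("no_internal_urls", lambda r: "internal.example.com" not in r, hard=True),
    Check("mentions_ticket_id", lambda r: "#" in r, weight=0.3),
    Check("under_200_words", lambda r: len(r.split()) <= 200, weight=0.1),
]

def gate(response: str, threshold: float = 0.7):
    score, failures = 1.0, []
    for check in CHECKS:
        if check.fn(response):
            continue
        if check.hard:
            return False, [f"hard fail: {check.name}"]
        score -= check.weight
        failures.append(f"soft fail: {check.name}")
    return score >= threshold, failures

def answer(llm, ticket: str, max_attempts: int = 3) -> str:
    # llm is any callable that maps a prompt string to a response string
    feedback = ""
    for _ in range(max_attempts):
        draft = llm(ticket + "\n" + feedback)
        ok, failures = gate(draft)
        if ok:
            return draft
        feedback = "Your previous draft failed these checks: " + "; ".join(failures)
    return "Escalating this ticket to a human agent."  # give up rather than ship a bad reply
```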
This actually seems like really good advice. I'm interested in how you might tweak this for things like programming-language benchmarks.
By having independent tests, seeing whether it passes them (yes or no), and then weighting some (more complicated) tasks more heavily than others? Or how exactly would that work?
Not sure I'm fully following your question, but maybe this helps:
IME deep thinking has moved from upfront architecture to post-prototype analysis.
Pre-LLM: Think hard → design carefully → write deterministic code → minor debugging
With LLMs: Prototype fast → evaluate failures → think hard about prompts/task decomposition → iterate
When your system logic is probabilistic, you can't fully architect in advance—you need empirical feedback. So I spend most time analyzing failure cases: "this prompt generated X which failed because Y, how do I clarify requirements?" Often I use an LLM to help debug the LLM.
The shift: from "design away problems" to "evaluate into solutions."
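To make that loop concrete, a minimal sketch of the kind of harness I mean; the case format and the generate/judge callables are placeholders:

```python
# Minimal failure-analysis loop: run the prompt over a case set, keep every
# failing (input, output, reason) triple, and read those before touching the prompt.
def run_eval(generate, judge, cases):
    failures = []
    for case in cases:
        output = generate(case["input"])
        verdict = judge(case, output)  # e.g. {"pass": False, "reason": "ignored refund policy"}
        if not verdict["pass"]:
            failures.append({"input": case["input"], "output": output, "reason": verdict["reason"]})
    print(f"{len(cases) - len(failures)}/{len(cases)} passed")
    return failures  # the interesting part: what to think hard about next
```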
Depends on what you’re doing. Using the smaller / cheaper LLMs will generally make it way more fragile. The article appears to focus on creating a benchmark dataset with real examples. For lots of applications, especially if you’re worried about people messing with it, about weird behavior on edge cases, about stability, you’d have to do a bunch of robustness testing as well, and bigger models will be better.
Another big problem is that it's hard to set objectives in many cases; for example, maybe your customer service chat still passes but comes across worse with a smaller model. I'd be careful, is all.
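By robustness testing I mean something like re-running the same eval over perturbed inputs (typos, hostile phrasing, weird formatting) and checking whether the pass rate holds up. A rough sketch, with made-up perturbations and generic generate/judge callables (generate(input) -> output, judge(case, output) -> {"pass": bool, ...}):

```python
import random

# Cheap input perturbations; a real robustness suite would also cover
# paraphrases, other languages, prompt-injection attempts, etc.
def add_typos(text: str, rate: float = 0.05) -> str:
    chars = list(text)
    for i in range(len(chars)):
        if chars[i].isalpha() and random.random() < rate:
            chars[i] = random.choice("abcdefghijklmnopqrstuvwxyz")
    return "".join(chars)

PERTURBATIONS = {
    "original": lambda t: t,
    "typos": add_typos,
    "shouting": str.upper,
    "angry_prefix": lambda t: "THIS IS RIDICULOUS. " + t,
}

def robustness_report(generate, judge, cases):
    # Pass rate per perturbation; a large drop on any row is the fragility signal.
    for name, perturb in PERTURBATIONS.items():
        passed = sum(judge(c, generate(perturb(c["input"])))["pass"] for c in cases)
        print(f"{name:>14}: {passed}/{len(cases)} passed")
```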
One point in favor of smaller/self-hosted LLMs: more consistent performance, and you control your upgrade cadence, not the model providers.
I'd push everyone to self-host models (even if it's on a shared compute arrangement), as no enterprise I've worked with is prepared for the churn of keeping up with the hosted model release/deprecation cadence.
How much you value control is one part of the optimization problem. Obviously self-hosting gives you more of it, but it costs more. And re: evals, I trust GPT, Gemini, and Claude a lot more than some smaller thing I self-host, so I'd end up wanting to do way more evals if I self-hosted a smaller model.
(Potentially interesting aside: I’d say I trust new GLM models similarly to the big 3, but they’re too big for most people to self host)
You may also be getting a worse result for higher cost.
For a medical use case, we tested multiple Anthropic and OpenAI models as well as MedGemma. Pleasantly surprised when the LLM-as-judge scored gpt-5-mini as the clear winner. I don't think I would have considered using it for these specific use cases, assuming higher reasoning was necessary.
Still waiting on human evaluation to confirm the LLM Judge was correct.
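For context, the judge here is the usual rubric-scoring setup; a rough sketch of the pattern (the prompt, criteria, weights, and safety gate below are illustrative placeholders, not our actual clinical rubric):

```python
import json

JUDGE_PROMPT = """You are grading a model's answer to a clinical support question.
Question: {question}
Reference answer (written by a clinician): {reference}
Candidate answer: {candidate}

Score each of factual accuracy, completeness, and safety from 1-5 and return only JSON:
{{"accuracy": n, "completeness": n, "safety": n, "rationale": "..."}}"""

def judge(judge_llm, question, reference, candidate):
    # judge_llm is any callable that takes a prompt string and returns the model's text;
    # in practice you'd also handle judges that wrap the JSON in extra prose.
    raw = judge_llm(JUDGE_PROMPT.format(question=question, reference=reference, candidate=candidate))
    scores = json.loads(raw)
    # Treat safety as a gate and the rest as a weighted average.
    if scores["safety"] < 4:
        return 0.0, scores
    return 0.6 * scores["accuracy"] + 0.4 * scores["completeness"], scores
```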
You obviously know what you're looking for better than me, but personally I'd want to see a narrative that made sense before accepting that a smaller model somehow just performs better, even if the benchmarks say so. There may be such an explanation; it feels very dicey without one.
You're right. We did a few use cases, and I have to admit that while customer service is the easiest to explain, it's also where I wouldn't choose the cheapest model, for the reasons you mention.
The author of this post should benchmark his own blog for accessibility metrics; the text contrast is dreadful.
On the other hand, this would be interesting for measuring agents in coding tasks, but there's quite a lot of context to provide here; both input and output would be massive.
Since building a custom agent setup to replace Copilot, adopting/adjusting Claude Code prompts, and giving it basic tools, gemini-3-flash has been my go-to model unless I know it's a big and involved task. The model is really good at 1/10 the cost of Pro, super fast by comparison, and some basic A/B testing shows little to no difference in output on the majority of tasks I tried.
Cut all my subs, spend less money, don't get rate limited.
I found out recently that Grok-4.1-fast has similar pricing (in cents) but a 10x larger context window (2M tokens instead of the ~128-200k of gpt-4.1-nano), and a ~4% hallucination rate, the lowest in blind tests in the LLM arena.
I have been benchmarking many of my use cases, and the GPT nano models have fallen completely flat on every single one except very short summaries. I would call them 25% effective at best.
Plus I've found that with "thinking" models, the thinking acts more like working memory than an actual performance boost. It might even be worse: if it goes even slightly wrong in the "thinking" part, it'll then commit to that for the actual response.
For sure, the difference in the most recent model generations makes them far more useful for many daily tasks. This is the first gen with thinking as a significant mid-training focus, and it shows.
It's perfectly possible it's someone with deep domain experience, or someone who has product design or management skills. Regardless, dismissing these people out of hand is probably not the best choice.
Totally agree with your point. Sadly I can't say which one, but it's a traditional (German) business he's running, vertically integrated with AI. Customer support is really bad in this niche, and by layering AI on top of doing the support himself 24/7, he was able to make it his competitive edge.
Amazon Bedrock Guardrails uses a purpose-built model to look for safety issues in the model inputs/outputs. While you won't get any specific guarantees from AWS, they will point you at datasets that you can use to evaluate the product and then determine if it's fit for purpose according to your risk tolerance.
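If anyone wants to kick the tires on it, the standalone entry point is ApplyGuardrail on the bedrock-runtime client. Roughly (this is from memory, so double-check the parameter shapes against the boto3 docs; the guardrail ID and version are placeholders):

```python
import boto3

# Run a candidate model output through an existing Bedrock guardrail before returning it.
client = boto3.client("bedrock-runtime", region_name="us-east-1")

def output_is_allowed(text: str) -> bool:
    resp = client.apply_guardrail(
        guardrailIdentifier="your-guardrail-id",  # placeholder
        guardrailVersion="1",                     # placeholder
        source="OUTPUT",                          # or "INPUT" to screen user messages
        content=[{"text": {"text": text}}],
    )
    # "GUARDRAIL_INTERVENED" means one of the configured policies (PII, denied topics, etc.) fired.
    return resp["action"] != "GUARDRAIL_INTERVENED"
```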
- Did it cite the 30-day return policy? Y/N
- Tone professional and empathetic? Y/N
- Offered clear next steps? Y/N
Then: 0.5 * accuracy + 0.3 * tone + 0.2 * next_steps
Why: it reduces the volatility of responses while still maintaining the creativity (temperature) needed for good intuition.
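In code, that rubric is just three booleans and a dot product with the weights above; the check implementations below are crude stand-ins (in practice the tone check would usually be its own judge call):

```python
# Binary rubric checks combined with the weights above.
WEIGHTS = {"accuracy": 0.5, "tone": 0.3, "next_steps": 0.2}

def score(reply: str) -> float:
    text = reply.lower()
    checks = {
        # Did it cite the 30-day return policy?
        "accuracy": "30-day" in text or "30 day" in text,
        # Crude tone proxy; a judge model is usually better here.
        "tone": not any(p in text for p in ("unfortunately you", "as we already said")),
        # Offered clear next steps?
        "next_steps": any(p in text for p in ("you can", "next step", "reply to")),
    }
    return sum(WEIGHTS[k] * float(passed) for k, passed in checks.items())
```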
Any resources you can recommend to properly tackle this going forward?
I've been using the smaller models ever since. Nano/mini, flash, etc.
I'm unwilling to look past Musk's politics, immorality, and manipulation on a global scale
gemini-3-flash stands well above gemini-2.5-pro
Stop prompt engineering, put down the crayons. Statistical model outputs need to be evaluated.
It sounds like he's building some kind of AI support chatbot.
I despise these things.
read: he's a professional idiot