11 comments

  • 2ndorderthought 58 minutes ago
    Interesting to see the positive sentiment around Kimi 2.6, Qwen 3.6, and DeepSeek relative to the negative. I hope the trend of people appreciating open models continues. They aren't household names yet, but it's a higher percentage than I thought it would be, especially on HN where we are all talking about businesses.

    I am upset because now Anthropic, OpenAI, Meta, etc. will continue their smear campaigns here. But I am also happy because it will make HN less useful when they do.

    Everything is a give and take, I guess. Excited to see where the equilibrium sits.

    • SilverElfin 6 minutes ago
      Is it just “smear campaigns”? Don’t get me wrong - I don’t want big tech or big AI monopolies and appreciate the open weight models. But it’s also true that Chinese companies are basically stealing through distillation and also that they censor to align to CCP rules. They’re problematic in a different way.

      What I want is more fully open models where everything is shared. Data, training algorithms, weights. That way we can figure out if we should trust it.

  • jdw64 3 hours ago
    Interpreting these metrics is quite interesting.

    One thing for sure is that while Claude is currently taking the #1 spot in mentions, it carries a lot of negative sentiment due to API pricing policies and frequent server downtime. On the other hand, the runner-up, GPT-5.5, actually seems to have more positive feedback.

    Personally, my experience with Codex wasn't as good as with Claude Code (Codex freezes on Windows more often than you'd expect), so this is a bit surprising. That said, GPT, while more defensive, is definitely better in terms of sheer code-writing capability. However, GPT has quite a few issues with text corruption when generating in Korean or Chinese, something English-speaking users probably don't notice. In terms of model capabilities, when given the same agent.md (CLAUDE.md) file, I think GPT is better at writing code, while Claude is better at writing text during code reviews.

    Looking at the bottom right, Qwen and DeepSeek are open-source, so they are largely mentioned in the context of guarding against vendor lock-in, which drives positive sentiment. Considering that Hacker News occasionally shows negative sentiment toward China, the fact that they are viewed this positively—unlike US models—shows that being open-source is a massive advantage in itself.

    Anyway, one thing for sure is that Gemini is pretty much unusable.

    • 2ndorderthought 51 minutes ago
      I like your analysis, but I think the open models are genuinely well received, and not only because they guard against vendor lock-in or are open source.

      They are cheaper! All signals point to them staying cheaper because they are built more sustainably. Also, some of the latest entries can run on 1 GPU, literally available at your desktop, where there can be no service interruptions. Not even network latency. People are one- and few-shotting little games for zero dollars because they bought a GPU to play video games this year. To me that's unbeatable value. Once the tooling catches up and a few more models are released, it could change everything.

    • dgacmu 37 minutes ago
      I had a surprisingly positive experience with Gemini optimizing some mathy MPS code. It did far better than Claude.

      Of course, when I tried it on something else it rewrote every line in the file for no good reason, applied changes directly when I told it just to plan, etc.

      So maybe it has one strength.

    • awesome_dude 57 minutes ago
      > Anyway, one thing for sure is that Gemini is pretty much unusable

      Ha! I find that Gemini is quite useful, if only because I'm forced to use it on my personal projects: it's the only one that offers unlimited interaction for "free".

      It has its limitations, yes, but so does Claude (which I am leaning on too heavily at work at the moment)

  • gobdovan 56 minutes ago
    Before harnesses, I'd fix the methodology/claims. A saner methodology would be to look at comments that compare two models, say 'gpt5.5>opus4.7', and infer the context ('ctx:frontend', for example). With your current methodology, a comment like 'opus 4.6 was very smart, opus 4.7 is a disappointing upgrade to 4.6' would make normal aspect-based sentiment analysis conclude that 4.6 is smarter than 4.7. But considering you have <300 mentions total, you'd probably be better off scraping some other websites as well. I'd also drop the SotA claim entirely and downgrade the mentions to measuring something like visibility rather than performance.
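
    A rough sketch of the aggregation half of that idea, assuming the pairwise comparisons and context tags have already been extracted from comments into (winner, loser, context) tuples (the model names and the extraction step itself are made up here, purely for illustration):

      from collections import defaultdict

      # Hypothetical extracted comparisons: (winner, loser, context)
      comparisons = [
          ("gpt5.5", "opus4.7", "ctx:frontend"),
          ("opus4.6", "opus4.7", "ctx:general"),
          ("gpt5.5", "gemini3", "ctx:frontend"),
      ]

      def win_rates(pairs, context=None):
          """Per-model win rate over all comparisons it appears in,
          optionally restricted to a single context tag."""
          wins, total = defaultdict(int), defaultdict(int)
          for winner, loser, ctx in pairs:
              if context and ctx != context:
                  continue
              wins[winner] += 1
              total[winner] += 1
              total[loser] += 1
          return {m: wins[m] / total[m] for m in total}

      print(win_rates(comparisons))                  # overall
      print(win_rates(comparisons, "ctx:frontend"))  # frontend only

    This still only measures which model people *say* wins, so it fits the visibility framing rather than a SotA claim.
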
    • yunusabd 23 minutes ago
      That's fair, my immediate concern would be that there would be very few comments comparing any two models, so the data would be very anecdotal.

      The context would be really nice to have, but reading the comments myself, it often just isn't very clear what exactly users are building or which programming language they are using.

      I think analyzing more comments is promising. If you get enough data, you can generalize across use cases and get more meaningful ratings. The obvious lever is including more posts, although it might hit diminishing returns. I'll play around with it.

      For the context, I want to try giving Gemini a "scratch pad", where it can note down strengths and weaknesses per model that it finds in the comments. Something like "some users say that model x is good for writing tests". Then on each run, I let it update the scratch pad and publish the results as more of a qualitative analysis.
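
      Roughly what I have in mind for the pad itself, as a minimal sketch (the file name, note text, and update step are placeholders; the actual observations would come from the Gemini run):

        import json
        from pathlib import Path

        PAD = Path("scratch_pad.json")  # placeholder location

        def load_pad():
            return json.loads(PAD.read_text()) if PAD.exists() else {}

        def add_note(pad, model, note):
            # e.g. "some users say model x is good for writing tests"
            pad.setdefault(model, []).append(note)
            return pad

        def save_pad(pad):
            PAD.write_text(json.dumps(pad, indent=2))

        # Each run: load the pad, let the model add notes from the new
        # comments, then persist the pad for the next run.
        pad = add_note(load_pad(), "model-x", "good for writing tests (several users)")
        save_pad(pad)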

      For the wording, I'd like to keep a certain amount of clickbait, sorry ;)

  • chillfox 30 minutes ago
    Surely "Claude Opus 4.7" and "Claude Opus Latest" should be the same, right?
  • idivett 54 minutes ago
    Thanks for doing the hard work. I've bookmarked this, hoping it'll come in handy when new models are released. If you're taking feature requests, I have a few:

    - Show combined measurements per model maker, like all Claude models vs. OpenAI, DeepSeek, and so on.
    - Another toggle to remove the neutral section?
  • Jabbles 3 hours ago
    Please fix your graph so the names of the models are readable
    • marcuskaz 3 hours ago
      Also, the stacked graph only lets you quickly see total mentions; it's really hard to compare negative or positive sentiment across models at a glance.
      • yunusabd 2 hours ago
        Yep, a toggle to scale all columns to the same height could solve this. I'll look into it when I do the custom graph.

        Edit: Done

        • marcuskaz 1 hour ago
          Much better, nice update!
    • yunusabd 1 hour ago
      Thanks for the comment, should be fixed now.
    • smeej 2 hours ago
      Came here to offer this feedback. If I can't see the name of the model, nothing else in the chart really matters to me. I even tried going to the Google Sheet.

      It's way too important a piece of information not to have it visible.

      • yunusabd 1 hour ago
        Thanks, I replaced it with a custom graph, should be easier to read now.
  • Hari2028 24 minutes ago
    How noisy is the sentiment classification? Feels like that could skew results a lot
  • brooksc 2 hours ago
    It'd be interesting to also graph this over time to see how sentiment changes from when a model is released to today.
  • pbgcp2026 1 hour ago
    So, it's a webpage with 3 paragraphs and a simple chart. It has:

    1) terrible color scheme - fine, I switch to reader mode
    2) shitloads of JS - fine, NoScript works, but then the page breaks
    3) fancy "design" with a simple graph but unreadable X-axis labels - fine, I can use screen zoom for that... only to see 3x "Claude O..." LOL, are we playing a guessing game?
    4) ... "LxxxLxxx - Learn languages with YouTube!"
  • yakkomajuri 3 hours ago
    "Prompts an LLM" -> which LLM?

    I saw you're using Gemini for the sentiment rating (which I guess you picked because it's not often mentioned and thus "neutral"? lol)

    But would be interesting to get more details overall

    • yunusabd 2 hours ago
      It's actually ChatGPT at the moment for the first filtering step, for no other reason than having a code snippet ready that I could point Cursor at (I know, so 2025). The Gemini call is using batch processing, so it's handled differently.
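
      The overall shape, as a minimal sketch with both LLM calls stubbed out (the keyword check and hard-coded labels just stand in for the real ChatGPT filtering prompt and the Gemini batch job):

        def load_hn_comments():
            # placeholder for pulling the actual HN comments
            return ["Claude was great for refactoring", "nice weather today"]

        def filter_relevant(comments):
            # Stage 1 stub: the real step asks ChatGPT whether a comment
            # discusses a model at all; a keyword check stands in here.
            keywords = ("claude", "gpt", "gemini", "qwen", "deepseek", "kimi")
            return [c for c in comments if any(k in c.lower() for k in keywords)]

        def rate_sentiment(comments):
            # Stage 2 stub: the real step submits a Gemini batch job and
            # returns per-comment (model, sentiment) labels.
            return [{"comment": c, "model": None, "sentiment": "neutral"} for c in comments]

        print(rate_sentiment(filter_relevant(load_hn_comments())))
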
  • ranger_danger 3 hours ago
    Just FYI, this article seems to define "state of the art" as "popular", as measured by "total mentions and user sentiment", without any bearing on the technical abilities or actual usage of the model.
    • yunusabd 2 hours ago
      Calling it sota might be a bit provocative, but what actually is the "state of the art"? We have benchmarks, but those are getting increasingly gamed and don't necessarily reflect the actual performance of a model, see Opus 4.7. So I think it's useful to have real world data from actual users as an additional data point.
    • mellosouls 2 hours ago
      That's pretty much exactly what the title says.

      The technical abilities and usage are derived from the commenters' usage reflections.