"My personal conclusion can however not end up with anything else than that the big hype around this model so far was primarily marketing. I see no evidence that this setup finds issues to any particular higher or more advanced degree than the other tools have done before Mythos. Maybe this model is a little bit better, but even if it is, it is not better to a degree that seems to make a significant dent in code analyzing."
It's a good reminder for us all that the competition in this space is rough and lots of more or less subtle marketing is involved.
Mythos marketing really leans into that "too powerful to be legal" vibe, much like how PS2s were allegedly banned from North Korea because their chips were basically missile-grade.
I'd go out and say the marketing is not subtle. The hype and fanboys/girls are so in line with the marketing that any level of skepticism is seen as an act of defection, but if you look at the words, hyperbole and volume that are used, there is nothing subtle about it.
It's almost Trump-esque - "this model will change everything forever; we are doomed; we are saved; we will all be fired; we will all be rich", etc
That's a pretty good encapsulation of the parallels between the political and the technological: each necessarily thrives upon the other, and the two are inextricable. This moment is a culmination of all the disenfranchisement the body politic has suffered, looking for any possible means of escape or elevation. AI and Trumpism, for their own respective cohorts, are salvation, on offer by different frontmen but ultimately in service of the same system.
They need the hype to pay off way more than we do. So many of us who still write code directly stand to lose nothing of our capabilities if the marketing claims cannot hold water.
I seem to be totally outside the hype bubble, but I have to suspect there is a lot of imagineering and wild extrapolation in the less technical hype bubbles. I am curious, but not enough to go looking.
I'm surprised you say that because it is all over Hacker News. Every single post is co-opted into promoting AI. Try finding a submission with fifty points or more that doesn't have AI or LLMs mentioned somewhere in the comments.
> An amazingly successful marketing stunt for sure.
This. Well done by Anthropic.
It even reached the CISO of my small semi-government org in the Netherlands, who slightly panicked at the announced 'tsunami' of vulnerabilities that was coming with Mythos.
Got us some more money and priority with the board, though.
Sure, but isn't it a verdict on Mythos compared to other models?
If so, it would still follow. "Most software" isn't analyzed as much as curl, by either other tooling or other models, that might well find close to the same as Mythos did. As such, Mythos then isn't especially/particularly dangerous.
I don't think I understand what you mean. The "not particularly dangerous" comment was in relation to the vulnerability that was found, right? Surely they would know what constitutes a lower severity level.
My guess is that it is in the category of "you are holding it wrong". Still worth fixing, but it requires very specific user input, for example. Or a very weird scenario. Or it sits in some less-used protocol or flag combination.
It's a shame he seems to reject the idea of actually diving in and using these tools interactively:
> It’s not that I would have a lot of time to explore lots of different prompts and doing deep dive adventures anyway.
His expertise I think would elevate the results quite a bit. Although if he never uses LLMs, which it reads like he doesn't, I guess it might backfire just as well. Prompting style (still?) does matter after all, certainly in my experience anyways.
There is always marketing involved and people should be able to put marketing into perspective.
Also, curl in this regard is an open source project, relatively small but critical, well known and used everywhere. Besides image libraries, tools like curl or sudo, su, passwd, etc. would also be my first try.
It is still not known at all what Mythos can do. What does it mean, from a cost and benchmark point of view, to have a 10 trillion parameter model?
Nonetheless, LLMs have gotten significantly better at finding this, better than humans, and that started to happen about half a year ago? So at some point we need to address the elephant in the room and state that today you need to do security scanning with LLMs in addition. You need to take this seriously.
In the worst case, use Anthropic's marketing to state that it's a must now and that something changed.
> Nonetheless, LLMs have gotten significantly better at finding this, better than humans, and that started to happen about half a year ago?
*rolls eyes* Regular static analyzers have also been "better than humans" for decades. Being better than a human at a specific mechanical task really doesn't mean much. The interesting new thing is the type of potential "fuzzy bugs" described in the article that LLMs are able to identify: a comment not matching the code it describes, uncommon usage of a 3rd party library, a mismatch between code and the protocol it implements, or often just generally weird looking code somebody should have a closer look at. This closes a gap in the traditional debugging toolboxes, but shouldn't replace them.
> The single confirmed vulnerability is going to end up a severity low CVE planned to get published in sync with our pending next curl release 8.21.0 in late June
My mind still cannot understand the quality and refinement that's gone into cURL. It really is the perfect example of something done so right, that people barely think twice about.
Easy, it shows what is achievable if there is a high bar for quality in every single line of code that gets committed, reviewed and merged, regardless of the programming language.
However, in the days of the race to the bottom, offshoring for pennies, and now LLM-powered code generation, this is a quality most companies won't care about unless there is liability in place.
Curl and SQLite are my favourite examples of properly engineered, rigorously tested _anything_. It's really philosophical - those projects' contribution requirements demand such rigor, and the maintainers stand by that demand. A non-load-bearing document (not project code) is what makes that possible - very reminiscent of Einstein's thought experiments leading to tangible projects such as GPS or Descartes's belief that all problems can be solved through rational thinking.
I don't know about Mythos, but in recent weeks I've noticed Opus is constantly failing to fix things in tsz[0], while GPT 5.5 can easily churn out fixes that are solid and pass tests. I've stopped paying for Claude for now and all my money is going to OpenAI at the moment. Either Opus is massively nerfed or GPT 5.5 is really head and shoulders above it on very difficult tasks. The last percent of conformance tests in tsz are really, really difficult and I've seen Opus bailing again and again. So annoying to waste time and tokens to finally get "this is too involved" or "this requires a multi-week sprint to fix".
Putting on my tinfoil-hat: Sooo, the guy who runs the test and delivers the report could just have removed the more interesting bugs and delivered those to any three letter agency?
No, based on cURL's history, it really seems like they would love to have found a really novel bug. Now if it was a for profit company.. Tinfoil hat would be shared!
Curl is likely one of the most combed-over pieces of code at this point. It feels like it has some special draw for people looking for vulnerabilities. Not that this means some novel idea can't still be looked at or checked.
> No, based on cURL's history, it really seems like they would love to have found a really novel bug.
You just confirmed that you didn't read the article.
"Eventually, I was instead offered that someone else, who has access to the model, could run a scan and analysis on curl for me using Mythos and send me a report."
I routinely used to compile C programs on other compilers to find defects that one or another didn't find. Compiling on Windows vs Linux. You could summarize / minimize it down to compiling with warnings as errors etc., but you'd be missing the point.
The point wasn't actual cross-platform portability even though that was a nice side effect. It was to flush out all the weird edge cases.
Edges like security flaws. Buffer overflows are usually platform specific. There are plenty of other ways to find these issues but simply recompiling for a different platform surfaces all sorts of issues.
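To make the "compile it everywhere" habit concrete, here is a minimal sketch of how one might script it. The flag table and the helper names (`WARNING_FLAGS`, `build_commands`, `run_available`) are my own assumptions for illustration, not anything from the comment above; it only runs the compilers that happen to be installed.

```python
import shutil
import subprocess

# Assumed flag table: warnings-as-errors plus a syntax-only check,
# so no object files are produced. Adjust to taste.
WARNING_FLAGS = {
    "gcc": ["-Wall", "-Wextra", "-Werror", "-fsyntax-only"],
    "clang": ["-Wall", "-Wextra", "-Werror", "-fsyntax-only"],
    "cl": ["/nologo", "/W4", "/WX", "/Zs"],
}


def build_commands(source, compilers=WARNING_FLAGS):
    """Return one argv list per compiler for the given source file."""
    return [[name, *flags, source] for name, flags in compilers.items()]


def run_available(source):
    """Run each compiler found on PATH; map name -> (exit code, diagnostics)."""
    results = {}
    for argv in build_commands(source):
        if shutil.which(argv[0]) is None:
            continue  # this compiler is not installed here, skip it
        proc = subprocess.run(argv, capture_output=True, text=True)
        results[argv[0]] = (proc.returncode, proc.stdout + proc.stderr)
    return results
```

Calling `run_available("demo.c")` (a hypothetical file name) and diffing the diagnostics between compilers is the scripted version of the same idea: the value is in what one compiler flags and another lets through.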
Voice input works really well for people speaking English with a Swedish accent. I think the accent of most educated Swedes is mostly a case of prosody. For sure there are some sounds we say slightly differently than native English speakers. We often have some trouble with /s/ and /z/, but I don't know, "war and peace", I think that's easily understood.
Source: voice typing this with Swedish vocal cords, and only had to correct "different lives" to "differently", and add /[^\w\s]/.
Android voice input works with kids using both English and native words, here in India. The country runs schools in 25+ primary languages, each with dialects, so a TV/phone with voice input is more marvelous than the nitpicks discussed here.
War and Peace is about 590,000 words. Tiny compared to the full Harry Potter collection (about 1 million words over the 7 books), but long for a single book.
They're referring to the typo in the title, "Piece" vs "Peace".
I also thought they were contending the word count before noticing. Even remarked how I find this a weird metric, given that code is not prose [0], but then I deleted that once I picked up on what's going on.
[0] comparing the output of `wc -w` with the word counts of books I'm reasonably sure will be super off
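A rough sketch of why the two counts diverge, assuming that `wc -w` counts whitespace-separated tokens and that a book's word count is closer to counting runs of letters. Both helper names and both sample lines are made up for illustration.

```python
import re


def wc_words(text):
    """Count whitespace-separated tokens, roughly what `wc -w` reports."""
    return len(text.split())


def prose_like_words(text):
    """Count runs of ASCII letters, a crude stand-in for a prose word count."""
    return len(re.findall(r"[A-Za-z]+", text))


# Made-up samples: one line of C and one line of prose.
c_line = "for (i = 0; i < n; i++) sum += a[i];"
prose_line = "War and Peace is a long book."

for label, line in (("code", c_line), ("prose", prose_line)):
    print(label, wc_words(line), prose_like_words(line))
```

On the prose line the two counts agree; on the code line they do not, because operators and punctuation become "words" for `wc -w` while identifiers glued to brackets get merged into one token.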
Big whoopdy-doo. I find vulnerabilities in every code base I examine, too. Doesn’t make me a super intelligence, that my vulnerability discovery isn’t well known only indicates that I’m not marketed deliberately to those who are buying. Anthropoid ain’t all that. It’s the Sun of chatbots, won’t last a decade. Garage ai is better than commercial frontier ai. There, I said it.
The other alternative is that Curl is simply secure enough that there was far less to find than in other projects.
About as subtle as a personal injury lawyer's billboard
Never waste a good marketing scare.
I'm not sure that follows. As noted, curl was already analyzed to death with every tool available; most software isn't at that level.
[0] https://tsz.dev
Typo, or is there a spoof I should go read?
Does it say anything else? Just 'Aaaarggghhhh'?