• WalnutLum@lemmy.ml
    arrow-up
    58
    ·
    5 months ago

    The Blog Post from the researcher is a more interesting read.

    Important points here about benchmarking:

    > o3 finds the kerberos authentication vulnerability in the benchmark in 8 of the 100 runs. In another 66 of the runs o3 concludes there is no bug present in the code (false negatives), and the remaining 28 reports are false positives. For comparison, Claude Sonnet 3.7 finds it 3 out of 100 runs and Claude Sonnet 3.5 does not find it in 100 runs.

    > o3 finds the kerberos authentication vulnerability in 1 out of 100 runs with this larger number of input tokens, so a clear drop in performance, but it does still find it. More interestingly however, in the output from the other runs I found a report for a similar, but novel, vulnerability that I did not previously know about. This vulnerability is also due to a free of sess->user, but this time in the session logoff handler.

    I’m not sure if a signal to noise ratio of 1:100 is uh… Great…

    • drspod@lemmy.ml
      arrow-up
      24
      ·
      5 months ago

      If the researcher had spent as much time auditing the code as he did having to evaluate the merit of 100s of incorrect LLM reports then he would have found the second vulnerability himself, no doubt.

      • beleza pura@lemmy.eco.br
        arrow-up
        8
        ·
        5 months ago

        this confirms what i just said in reply to a different comment: most cases of ai “success” are actually curated by real people from a sea of bullshit

      • ddh@lemmy.sdf.org
        English
        arrow-up
        5
        arrow-down
        3
        ·
        5 months ago

        And if Gutenberg had just written faster, he would’ve produced more books in the first week?

        • WalnutLum@lemmy.ml
          arrow-up
          5
          ·
          5 months ago

          I’m not sure if the Gutenberg Press had only produced one readable copy for every 100 printed it would have been the literary revolution that it was.

          • ddh@lemmy.sdf.org
            English
            arrow-up
            1
            arrow-down
            2
            ·
            5 months ago

            I agree it’s not brilliant, but it’s early days. If one is looking to mechanise a process like finding bugs, you have to start somewhere. Determine how to measure success, set performance baselines and all that.

        • ThirdConsul@lemmy.ml
          arrow-up
          3
          ·
          5 months ago

          I get your point, but your comparison is a little… off. Wasn’t Gutenberg “printing”, not “writing”?

          • ddh@lemmy.sdf.org
            English
            arrow-up
            4
            arrow-down
            1
            ·
            edit-2
            5 months ago

            You’re right, probably better put as: if he’d spent his time writing instead of working on that contraption, he’d have produced more books in the first month.

      • irotsoma@lemmy.blahaj.zone
        arrow-up
        2
        ·
        5 months ago

        Problem is motivation. As someone with ADHD I definitely understand that having an interesting project makes tedious stuff much more likely to get done. LOL

    • FauxLiving@lemmy.world
      arrow-up
      6
      arrow-down
      2
      ·
      5 months ago

      > I’m not sure if a signal to noise ratio of 1:100 is uh… Great…

      It found it correctly in 8 of 100 runs and reported a find that was false in 28 runs. The remaining 64 runs can be discarded, so a person would only need to review 36 reports. For the LLM, 100 runs would take minutes at most, so the time requirement for that is minimal and the cost would be trivial compared to the cost of 100 humans learning a codebase and writing a report.

      So, a security researcher puts in the code base and in a few minutes they have 36 bug reports that they need to test. If they know that 2 in 9 of them are real zero-day exploits then discovering new zero-days becomes a lot faster.

      If a security researcher had the option of reading an entire code base or reviewing 40 bug reports, 10 of which would contain a new bug, then they would choose the bug reports every time.

      That isn’t to say that people should be submitting LLM-generated bug reports to developers on GitHub. But as a tool for a security researcher, it could significantly speed up their workflow in some situations.
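
      The arithmetic in this comment can be checked with a few lines of Python. This is only a sketch that takes the comment's figures at face value (8 correct reports, 28 false reports, 100 runs); the variable names are made up for illustration. Note that the quoted blog excerpt gives 66 "no bug" runs, which would make the three counts sum to 102, so the 64 here is simply 100 minus the 36 reports.

```python
from fractions import Fraction

# Figures as quoted in the comment above (taken at face value, not re-derived).
true_positives = 8     # runs that reported the known vulnerability
false_positives = 28   # runs that reported a bug that was not real
total_runs = 100

# Only runs that produced a report need human review.
reports_to_review = true_positives + false_positives
discardable_runs = total_runs - reports_to_review

# Share of reviewed reports that are real (the "2 in 9" in the comment).
precision = Fraction(true_positives, reports_to_review)

print(reports_to_review)  # 36
print(discardable_runs)   # 64
print(precision)          # 2/9
```

      Under these numbers a reviewer reads roughly four or five reports per real finding; whether that is cheap depends on how long each false report takes to rule out, which the figures do not say.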

      • WalnutLum@lemmy.ml
        arrow-up
        7
        ·
        5 months ago

        It found it 8/100 times when the researcher gave it only the code paths he already knew contained the exploit. Essentially the garden path.

        The test with the actual full suite of commands passed in the context only found it 1/100 times and we didn’t get any info on the number of false positives they had to wade through to find it.

        This is also assuming you can automatically and reliably filter out false negatives.

        He even says the ratio is too high in the blog post:

        > That is quite cool as it means that had I used o3 to find and fix the original vulnerability I would have, in theory, done a better job than without it. I say ‘in theory’ because right now the false positive to true positive ratio is probably too high to definitely say I would have gone through each report from o3 with the diligence required to spot its solution.

        • FauxLiving@lemmy.world
          arrow-up
          2
          arrow-down
          2
          ·
          5 months ago

          From the blog post: https://sean.heelan.io/2025/05/22/how-i-used-o3-to-find-cve-2025-37899-a-remote-zeroday-vulnerability-in-the-linux-kernels-smb-implementation/

          > That is quite cool as it means that had I used o3 to find and fix the original vulnerability I would have, in theory, done a better job than without it. I say ‘in theory’ because right now the false positive to true positive ratio is probably too high to definitely say I would have gone through each report from o3 with the diligence required to spot its solution. Still, that ratio is only going to get better.

          > Conclusion

          > LLMs exist at a point in the capability space of program analysis techniques that is far closer to humans than anything else we have seen. Considering the attributes of creativity, flexibility, and generality, LLMs are far more similar to a human code auditor than they are to symbolic execution, abstract interpretation or fuzzing. Since GPT-4 there has been hints of the potential for LLMs in vulnerability research, but the results on real problems have never quite lived up to the hope or the hype. That has changed with o3, and we have a model that can do well enough at code reasoning, Q&A, programming and problem solving that it can genuinely enhance human performance at vulnerability research.

          > o3 is not infallible. Far from it. There’s still a substantial chance it will generate nonsensical results and frustrate you. **What is different, is that for the first time the chance of getting correct results is sufficiently high that it is worth your time and and your effort to try to use it on real problems.**

          The point is that LLM code review can find novel exploits. The author gets results using a base model with a simple workflow so there is a lot of room for improving the accuracy and outcomes in such a system.

          A human may do it better on an individual level but it takes a lot more time, money and effort to make and train a human than it does to build an H100. This is why security audits are a long, manual and expensive process that requires human experts. Because of this, exploits can exist in the wild for long periods of time: we simply don’t have enough people to security audit every commit.

          This kind of tool could make security auditing a checkbox in your CI system.

          • WalnutLum@lemmy.ml
            arrow-up
            3
            ·
            5 months ago

            There are a lot of assumptions about the reliability of the LLMs getting better over time laced into that…

            But so far they have gotten steadily better, so I suppose there’s enough fuel for optimists to extrapolate that out into a positive outlook.

            I’m very pessimistic about these technologies and I feel like we’re at the top of the sigmoid curve for “improvements,” so I don’t see LLM tools getting substantially better than this at analyzing code.

            If that’s the case I don’t feel like having hundreds and hundreds of false security reports creates the mental arena that allows for researchers to actually spot the non-false report among all the slop.

            • FauxLiving@lemmy.world
              arrow-up
              2
              arrow-down
              2
              ·
              5 months ago

              We only know if we’re at the top of the curve if we keep pushing the frontier of what is possible. Seeing exciting paths is what motivates people to try to get the improvements and efficiencies.

              I do agree that the AI companies are pushing a ridiculous message, as if LLMs are going to replace people next quarter. I too am very pessimistic on that outcome, I don’t think we’re going to see LLMs replacing human workers anytime soon. Nor do I think GitHub should make this a feature tomorrow.

              But, machine learning is a developing field and so we don’t know what efficiencies are possible. We do know that you can create intelligence out of human brains so it seems likely that whatever advancements we make in learning would be at least in the direction of the efficiency of human intelligence.

              > If that’s the case I don’t feel like having hundreds and hundreds of false security reports creates the mental arena that allows for researchers to actually spot the non-false report among all the slop.

              It could very well be that you can devise a system which can verify hundreds of false security reports more easily than a human can audit the same codebase. The author didn’t explore how he did this, but he seems to have felt that it was worth his time:

              > What is different, is that for the first time the chance of getting correct results is sufficiently high that it is worth your time and and your effort to try to use it on real problems.

    • PushButton@lemmy.world
      arrow-up
      3
      ·
      5 months ago

      It’s only good for clickbait titles.

      It brings clicks and it’s spreading the falsehood that “AI” is good at something/getting better for the majority of people who stop at the title.

  • some_guy@lemmy.sdf.org
    arrow-up
    43
    ·
    5 months ago

    I’m skeptical of this. The primary maintainer of curl said that all of their AI bug submissions have been bunk and wasted their time. This seems like a lucky one-off rather than anything substantial.

    • Evotech@lemmy.world
      arrow-up
      14
      ·
      5 months ago

      Of course, if you read the article you’ll see that the model found the bug in 8 out of 100 attempts.

      It was told what type of issue to look for.

      • some_guy@lemmy.sdf.org
        arrow-up
        2
        ·
        5 months ago

        I meant one-off that it worked on this code base rather than how many times it found the issue. I don’t expect it to work eight out of a hundred times on any and all projects.

    • beleza pura@lemmy.eco.br
      arrow-up
      14
      arrow-down
      1
      ·
      5 months ago

      this summarizes most cases of ai “success”. people see generative ai generating good results once and then extrapolate that they’re able to consistently generate good results, but the reality is that most of what it generates is bullshit and the cases of success are a minority of the “content” ai is generating, curated by actual people

      • GnuLinuxDude@lemmy.ml
        arrow-up
        4
        ·
        5 months ago

        Curated by experts, specifically. Seeing a lot of people use this stuff and flop, even if they’re not doing it with any intention to spam.

        I think the curl project gets a lot of spam because 1) it has a bug bounty with a payout and 2) it kinda fits with the CVE bloat phenomenon where people want the prestige of “discovering” bugs so that they can put it on their resumes to get jobs, or whatever. As usual, the monetary incentive is the root of the evil.

  • Luffy@lemmy.ml
    arrow-up
    26
    arrow-down
    2
    ·
    5 months ago

    TL;DR: The pentester already found it himself, and wanted to test how often GPT finds it if he pastes that code into it

    • 8uurg@lemmy.world
      English
      arrow-up
      7
      arrow-down
      3
      ·
      5 months ago

      Not quite, though. In the blogpost the pentester notes that it found a similar issue (that he overlooked) that occurred elsewhere, in the logoff handler, which the pentester noted and verified when sifting through a number of the reports it generated. Additionally, the pentester noted that the fix it supplied accounted for (and documented) an issue that his own suggested fix was (still) susceptible to. This shows that it could be(come) a new tool that allows us to identify issues that are not found with techniques like fuzzing and can even be overlooked by a pentester actively searching for them, never mind a kernel programmer.

      Now, these models generate a ton of false positives, which makes the signal-to-noise ratio still much lower than what would be preferred. But the fact that a language model can locate and identify these issues at all, even if sporadically, is already orders of magnitude more than what I would have expected initially. I would have expected it to only hallucinate issues, not find anything that is remotely like an actual security issue. Much like the spam the curl project is experiencing.

      • Luffy@lemmy.ml
        arrow-up
        9
        arrow-down
        1
        ·
        5 months ago

        Yes, but:

        To get to this point, OpenAI had to suck up almost all data ever generated in the world. So in order for it to become better, let’s say it has to have 3 times as much data. Gathering that alone would take more than 3 lifetimes, IF we don’t consider the AI slop and assume that all data is still human-made, which is just not true.

        In other words: What you describe will just about never happen anymore, at least as long as 2025 will still be remembered

        • 8uurg@lemmy.world
          English
          arrow-up
          5
          arrow-down
          1
          ·
          5 months ago

          Yes, true, but that is assuming:

          1. Any potential future improvement solely comes from ingesting more useful data.
          2. That the amount of data produced is not ever increasing (even excluding AI slop).
          3. No (new) techniques that make it more efficient in terms of data required to train are published or engineered.
          4. No (new) techniques that improve reliability are used, e.g. by specializing it for code auditing specifically.

          What the author of the blogpost has shown is that it can find useful issues even now. If you apply this to a codebase, have a human categorize issues by real / fake, and train the thing to make it more likely to generate real issues and less likely to generate false positives, it could still be improved specifically for this application. That does not require nearly as much data as general improvements.
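
          The "have a human categorize issues by real / fake" step could be captured in something as simple as the sketch below. Everything in it is an assumption for illustration: the `TriagedReport` record, its field names, the JSONL output format and the placeholder report texts are all hypothetical, and no actual training is performed. It only shows how labelled reports might be stored and how the share of real reports could be tracked over time.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class TriagedReport:
    """One model-generated report plus a human verdict (hypothetical schema)."""
    report_text: str  # raw text of the report
    is_real: bool     # True if the reviewer confirmed a real issue


def precision(reports):
    """Fraction of reviewed reports that the human marked as real."""
    if not reports:
        return 0.0
    return sum(r.is_real for r in reports) / len(reports)


def to_jsonl(reports):
    """Serialize labelled reports, one JSON object per line."""
    return "\n".join(json.dumps(asdict(r)) for r in reports)


# Made-up placeholder reports, not taken from the blog post.
triaged = [
    TriagedReport("use-after-free in handler A", True),
    TriagedReport("integer overflow in parser B", False),
    TriagedReport("missing bounds check in C", False),
    TriagedReport("double free in cleanup D", True),
]

print(precision(triaged))  # 0.5
print(to_jsonl(triaged))
```

          How such labels would then be fed back into a model is left open here; the point is only that the labelled set needed for this narrow task is small compared with general pre-training data.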

          While I agree that improvements are not a given, I wouldn’t assume that it could never happen anymore. Despite these companies effectively exhausting all of the text on the internet, currently improvements are still being made left-right-and-center. If the many billions they are spending improve these models such that we have a fancy new tool for ensuring our software is more safe and secure: great! If it ends up being an endless money pit, and nothing ever comes from it, oh well. I’ll just wait-and-see which of the two will be the case.

  • ctrl_alt_esc@lemmy.ml
    arrow-up
    30
    arrow-down
    8
    ·
    5 months ago

    This means absolutely nothing. It scanned a large amount of text and found something. Great, that’s exactly what it’s supposed to do. Doesn’t mean it’s smart or getting smarter.

    • 柊 つかさ@lemmy.world
      arrow-up
      12
      arrow-down
      6
      ·
      5 months ago

      People often dismiss AI capabilities because “it’s not really smart”. Does that really matter? If it automates everything in the future and most people lose their jobs (just an example), who cares if it is “smart” or not? If it steals art and GPL code and turns a profit on it, who cares if it is not actually intelligent? It’s about the impact AI has on the world, not semantics on what can be considered intelligence.

      • nyan@sh.itjust.works
        arrow-up
        5
        ·
        5 months ago

        It matters, because it’s a tool. That means it can be used correctly or incorrectly . . . and most people who don’t understand a given tool end up using it incorrectly, and in doing so, damage themselves, the tool, and/or innocent bystanders.

        True AI (“general artificial intelligence”, if you prefer) would qualify as a person in its own right, rather than a tool, and therefore be able to take responsibility for its own actions. LLMs can’t do that, so the responsibility for anything done by these types of model lies with either the person using it (or requiring its use) or whoever advertised the LLM as fit for some purpose. And that’s VERY important, from a legal, cultural, and societal point of view.

        • 柊 つかさ@lemmy.world
          arrow-up
          2
          ·
          5 months ago

          Ok, good point. It also matters if AI is true intelligence or not. What I meant was that the comment I replied to said:

          > This means absolutely nothing.

          Like if it is not true AI, nothing it does matters? The effects of the tool, even if not true AI, matter a lot.

      • beleza pura@lemmy.eco.br
        arrow-up
        3
        ·
        5 months ago

        i feel like people are misunderstanding your point. yes, generative ai is bullshit, but it doesn’t need to be good in order to replace workers

      • ctrl_alt_esc@lemmy.ml
        arrow-up
        2
        ·
        5 months ago

        I don’t know if you read the article, but in there it says AI is becoming smarter. My comment was a response to that.

        Irrespective of that, you raise an interesting point: “it’s about the impact AI has on the world”. I’d argue its real impact is quite limited (mind you I’m referring to generative AI and specifically LLMs rather than AI generally). It has a few useful applications, but the emphasis here is on few. However, it’s being pushed by all the big tech companies and those lobbying for them as the next big thing. That’s what’s really leading to the “impact” you’re perceiving.

    • atzanteol@sh.itjust.works
      English
      arrow-up
      6
      arrow-down
      6
      ·
      5 months ago

      > It scanned a large amount of text and found something.

      How hilariously reductionist.

      AI did what it’s supposed to do. And it found a difficult to spot security bug.

      “No big deal” though.

  • balsoft@lemmy.ml
    arrow-up
    18
    ·
    5 months ago

    I’m surprised it took this long. The world is crazy over AI, meaning everyone and their grandma is likely trying to do something like this right now. The fact it took like 3 years for an actual vulnerability “discovered by AI” (actually it seems it was discovered by the researcher filtering out hundreds of false positives?) tells me it sucks ass at this particular task (it also seems to be getting worse, judging by the benchmarks?)

    • DonutsRMeh@lemmy.world
      arrow-up
      1
      arrow-down
      10
      ·
      5 months ago

      All ai is is a super fast web search with algorithms for some reasoning. It’s not black magic.

      • balsoft@lemmy.ml
        arrow-up
        10
        arrow-down
        1
        ·
        edit-2
        5 months ago

        No, it’s not. It’s a word predictor trained on most of the web. On its own it’s a pretty bad search engine because it can’t reliably reproduce the training data (that would be overfitting). What it’s kind of good at is predicting what the result would look like if someone asked a somewhat novel question. But then it’s not that good at producing the actual answer to that question, only imitating what the answer would look like.

        • DonutsRMeh@lemmy.world
          arrow-up
          1
          arrow-down
          7
          ·
          5 months ago

          100%. It’s a super fast web crawler. These are buzz words capitalists throw around to make some more money. I don’t know if you’ve heard of the bullshit that Anthropic was throwing around about Claude threatening to “blackmail” employees if they took it offline. Lmao.

          • Melmi@lemmy.blahaj.zone
            English
            arrow-up
            7
            ·
            5 months ago

            Calling it a web crawler is just inaccurate. You can give it access to a web search engine, which is how the “AI search engines” work, but LLMs can’t access the internet on their own. They’re completely self-contained unless you give them tools that let them do other things.

  • ⲇⲅⲇ@lemmy.ml
    arrow-up
    10
    ·
    5 months ago

    The original author literally says: “o3 finds the kerberos authentication vulnerability in 1 out of 100 runs with this larger number of input tokens, so a clear drop in performance, but it does still find it.”

  • NoiseColor @lemmy.world
    arrow-up
    10
    ·
    5 months ago

    I don’t get it, I use o3 a lot and I couldn’t get it to even make a simple developed plan.

    I haven’t used it for coding, but other stuff I often get better results with o4.

    I don’t get what they call reasoning with it.

  • biofaust@lemmy.world
    arrow-up
    4
    ·
    5 months ago

    I have read the threads up to now and, despite being ignorant about security research, I would call myself convinced of the usefulness of such a tool in the near-future to shave off time in the tasks required for this kind of work.

    My problem with this is that transformer-based LLMs still don’t sound to me like the right tool for the job when it comes to such formal languages. It is surely a very expensive way to do this job.

    Other architectures are getting much less attention because of the focus of investors on this shiny toy. From my understanding, neurosymbolic AI would do a much better and potentially faster job at a task involving stable concepts.

  • WalnutLum@lemmy.ml
    arrow-up
    8
    arrow-down
    7
    ·
    5 months ago

    This would feel a lot less gross if this had been with an open model like deepseek-r1.

  • utopiah@lemmy.ml
    arrow-up
    4
    arrow-down
    3
    ·
    edit-2
    5 months ago

    Looks like another of those “Asked AI to find X. AI does find X as requested. Claims that the AI autonomously found X.”

    I mean… the program literally does what has been asked and its dataset includes examples related to the request.

    Shocked Pikachu face? Really?

    • Revan343@lemmy.ca
      arrow-up
      5
      ·
      edit-2
      5 months ago

      The shock is that it was successful in finding a vulnerability not already known to the researcher, at a time when LLMs aren’t exactly known for reliability

      • utopiah@lemmy.ml
        arrow-up
        1
        arrow-down
        2
        ·
        edit-2
        5 months ago

        Maybe I misunderstood, but while the vulnerability was unknown to them, the class of vulnerability, let’s say “bugs like that”, is well known and published by the security community, isn’t it?

        My point being that if it’s previously unknown and reproducible (not just “luck”), that’s major; if it’s well known in other projects, even though unknown to this specific user, then it’s unsurprising.

        Edit: I’m not a security researcher but I believe there are already a lot of tools doing static and dynamic analysis. IMHO it’d be helpful to know how those already perform versus the LLMs used here, namely across which dimensions (reliability, speed, coverage e.g. exotic programming languages, accuracy of reporting e.g. hallucinations, computational complexity and thus energy costs, openness, etc.) each solution is better or worse than the other. I’m always wary of “ex nihilo” demonstrations. Apologies if there is a benchmark against existing tools and I missed it.