Gen AI is a HUGE cybersecurity problem!
Show notes
Full Article: https://www.anthropic.com/news/disrupting-AI-espionage Website: https://maximilian-schwarzmueller.com/
Socials: 👉 Twitch: https://www.twitch.tv/maxedapps 👉 X: https://x.com/maxedapps 👉 Udemy: https://www.udemy.com/user/maximilian-schwarzmuller/ 👉 LinkedIn: https://www.linkedin.com/in/maximilian-schwarzmueller/
Want to become a web developer or expand your web development knowledge? I have multiple bestselling online courses on React, Angular, NodeJS, Docker & much more! 👉 https://academind.com/courses
Show transcript
00:00:00: A disturbing, yet not really surprising article or
00:00:03: post was published by Anthropic yesterday.
00:00:07: An article about an AI-orchestrated
00:00:11: cyber espionage campaign, a cyberattack
00:00:15: carried out with help of AI, with help of Claude code.
00:00:17: And it's an interesting article. I already read it
00:00:21: together with you here. So Anthropic, in this article,
00:00:25: describes a, a cyberattack on
00:00:28: various companies that was carried out almost entirely with
00:00:32: help of AI, almost entirely with help of Claude code,
00:00:36: by jailbreaking Claude code, by getting it to do
00:00:40: things it normally shouldn't do. And let's take a closer look at
00:00:44: how that played out. "In mid-September 2025,
00:00:48: we, Anthropic, detected suspicious activity that later,
00:00:52: investigation determined to be a highly sophisticated espionage campaign.
00:00:56: The attackers used AI's agentic capabilities to an
00:01:00: unprecedented degree, using AI not just as an advisor, but to
00:01:04: execute the cyberattacks themselves." And that's, that
00:01:07: is huge that you can
00:01:11: already use AI, AI that is
00:01:14: published by Anthropic, AI models by Anthropic, and the
00:01:18: Claude code tool by Anthropic to carry out
00:01:22: cyberattacks. And, uh, here's why that is
00:01:25: important. Today we're living in a world where
00:01:29: the most capable models are the
00:01:32: models by OpenAI, Anthropic, X,
00:01:35: Google, and of course we've got some very capable
00:01:39: too, but we're still at a point in time on a,
00:01:43: timeline, we are still at a point in time
00:01:47: where most of these models and the, the most capable
00:01:51: models are controlled by companies,
00:01:55: most of them by companies in democracies.
00:01:57: Now, I'll not say that these companies are saints or
00:02:02: don't do bad stuff, but they're not bad
00:02:06: actors in the sense of this article.
00:02:09: They're not cyber, um, attackers
00:02:12: obviously. However, in the future, in the not too
00:02:16: distant future, we will be at a place
00:02:19: where these models, these very capable models
00:02:24: will also be owned by bad actors
00:02:27: themselves. So right now here in this article,
00:02:31: about a cyberattack that was carried out with help of
00:02:35: Anthropic and Claude code. And it's bad enough that this is possible,
00:02:39: trick those tools and models into doing stuff they shouldn't
00:02:43: we'll get back to how that happened, of course, but that is
00:02:46: today. We still have to apply tricks,
00:02:50: uh, to abuse these models. In a
00:02:53: future, these models will simply belong to the bad
00:02:57: actors themselves. There will be open models that are capable
00:03:01: enough of doing that, so then certain control mechanisms,
00:03:05: get back here in this article, won't even be there anymore.
00:03:09: We'll be in a future where bad actors have
00:03:13: direct, uncontrolled access to very capable models
00:03:17: can be fine-tuned for their purposes, that can be
00:03:21: trained for their purposes, that can use tools that were
00:03:25: purpose built for malicious stuff.
00:03:27: That is where we're heading to, and even today where we're not there
00:03:31: yet or where this is still a niche, even today, we
00:03:35: have fully or almost fully AI-controlled,
00:03:39: uh, cyberattacks. So this is definitely a scary article
00:03:43: and, and a scary future, which makes it very clear that
00:03:47: cybersecurity and
00:03:49: preventing attacks will be a super
00:03:53: big challenge. It always has been, but now with AI where everything can be
00:03:57: quicker and more automated and harder to trace back to
00:04:00: individuals, it will be an even more important
00:04:04: topic. But back to this article. "The threat
00:04:07: actor, whom we assess with high confidence
00:04:11: state-sponsored group, manipulated our Claude code
00:04:15: tool into attempting infiltration into roughly 30
00:04:19: global targets and succeeded in a small number of cases." So it was not just
00:04:23: an attempt. They succeeded. "The operation
00:04:27: targeted large tech companies, financial institutions,
00:04:31: companies, and government agencies.
00:04:34: Uh, we believe this is the first documented case of a
00:04:37: cyberattack executed without substantial human intervention." This is so
00:04:41: big, without substantial human
00:04:43: intervention. And again, that is why this is
00:04:47: such a scary future where bad actors don't even
00:04:51: need to rely on, um, these controlled
00:04:55: AI models, uh, which they still have to rely on today.
00:04:58: "Upon detecting this activity, we immediately launched an investigation
00:05:02: understand its scope and nature. Uh, over the following 10 days, as we
00:05:06: mapped the severity and full extent of the operation,
00:05:09: they were identified." So again, these were really regular
00:05:13: Claude accounts. These people were using
00:05:16: Claude, the model hosted by Anthropic.
00:05:19: They did not kind of steal it, run it on their own servers
00:05:23: models. They used the models you can use too via
00:05:27: via Claude code. This campaign has substantial
00:05:31: implications for cybersecurity in the age of AI agents, as I just said,
00:05:35: because everything can be automated and it's already possible today, uh,
00:05:39: where there are guardrails in place.
00:05:42: And again, think of that future where we have no guardrails
00:05:45: actors don't have guardrails. "Uh, these attacks are likely
00:05:49: to only grow in their effectiveness.
00:05:51: To keep pace with this rapidly advancing threat, we have expanded our
00:05:55: detection capabilities and developed better classifiers to flag
00:05:59: activities." And that's the important part here.This is what
00:06:03: Anthropic is trying to do today to make sure that their
00:06:06: models can't be abused for malicious tasks,
00:06:10: that Claude code can't be abused and ultimately,
00:06:14: their APIs, which Claude code uses in the end.
00:06:18: This all won't matter at all in the future because
00:06:22: in a future where bad actors themselves have their own models
00:06:25: running on their own servers, these guardrails won't
00:06:29: matter. Well, obviously they will still matter.
00:06:31: Obviously, you still want to make it easier than it's
00:06:35: obviously not all bad actors will have their
00:06:39: own malicious models, but especially if we're talking about
00:06:42: state-controlled bad actors or big
00:06:46: cyber attacker groups. Let's, let's be
00:06:50: real. Of course they will have access to their
00:06:54: on their own servers, so this will not matter at
00:06:57: all in that future. Obviously, it will matter still because you don't
00:07:01: make it easier and it will at least filter out a significant
00:07:05: group of potential bad actors that don't have access to their own
00:07:09: models. So yeah, it's important, but it will not be enough.
00:07:13: Companies themselves need to ramp up
00:07:17: their cybersecurity game, which is easier said than done because
00:07:21: it has been true for the last 10 years, of course, even without AI, but it's
00:07:25: becoming even more important in the age of AI.
00:07:30: So yeah, that is, that is the big problem here.
00:07:33: Now, let's see how the, uh, cyberattack worked.
00:07:37: Uh, the attack relied on several features of
00:07:40: exist or were in much more nascent form just a year ago.
00:07:44: Intelligence: Models' general levels of capability have
00:07:48: point that they can follow complicance- complex instructions and understand
00:07:52: context in ways that make very sophisticated tasks possible.
00:07:56: Not only that, but several of their well-developed
00:07:59: particular software coding, lend themselves to being used in
00:08:03: Sure, because now with models that are smarter,
00:08:07: and just to be very clear here, we're still talking about models that
00:08:10: just generate tokens, but of course by generating these tokens
00:08:14: that are able, they are able to describe the usage of tools
00:08:18: tools, they become more capable. With all the
00:08:22: fine-tuning they received, they also generate more tokens that are
00:08:26: likely to be or more likely to be the tokens you want to generate, so that
00:08:30: is what intelligence means here. They're not really intelligent, but they
00:08:34: have been tuned especially for software development such that
00:08:38: they are much more likely to generate meaningful output
00:08:42: and especially also output that allows them to describe tool
00:08:46: use and then use those tools, so execute code
00:08:49: that does something. Uh, for example, send an HTTP request and so
00:08:53: on, and it's that combination that makes them more capable
00:08:57: in the end. And of course, yeah, that's exactly what you need for automated
00:09:01: cyberattacks, because you need a model that's able to
00:09:05: follow your instructions related to that.
00:09:07: You need a model that's able to send HTTP requests, phishing
00:09:10: emails, whatever, and all that is what these models can do
00:09:14: quite well in the end. Agency: Models can act as agents.
00:09:18: That is, they can run in loops where they take
00:09:21: tasks and make decisions with only minimal occasional human
00:09:24: input. That's another important step.
00:09:27: This is what allows these, um, systems or these
00:09:31: attacks here in this case to work with only minimal human input
00:09:36: because these models and the software that uses
00:09:40: these models can go for so much longer and it's important to
00:09:43: differentiate here. The model is, is still just the
00:09:47: thing that receives a prompt and sends back some tokens.
00:09:51: That, that has not changed. But it's the software, Claude Code, for
00:09:55: example, that then takes that output and sends back another
00:09:59: message to the same API with that output, with the original
00:10:02: task, with some meta instructions like, "Please check if that
00:10:06: output answers the question by the user.
00:10:09: You got these tools available, please tell me if you want to use a tool."
00:10:12: how the software around the models in the end makes these
00:10:16: models more capable, not because the model does everything on
00:10:20: the model is capable of giving the software the
00:10:24: result it needs. The software then feeds these enriched
00:10:28: the model and it's this loop that keeps the whole system going and that
00:10:32: leads to these agentic systems that can go
00:10:36: on for longer, that can use tools and that require less
00:10:39: human input. And yeah, tools, that is therefore the other missing piece
00:10:43: here, of course, that models have access to a wide array of
00:10:47: tools. They can now search the web, retrieve data, perform many other
00:10:50: actions that were previously the sole domain of human operators.
00:10:53: In the case of cyberattacks, the tools might include password crackers,
00:10:57: scanners and other security-related software.
00:11:00: Because again, it's not all just GitHub MCPs.
00:11:04: It can be all kind of tools you could, uh, expose
00:11:07: to your, um, model or to the software that uses
00:11:11: these models and that runs these agentic tasks.
00:11:14: So they got a nice diagram in this article, but in the end
00:11:18: the attack played out relatively, uh,
00:11:21: simple. They, they describe it in greater detail down
00:11:25: there. They convinced Claude Code to do
00:11:29: stuff it normally shouldn't be able to do.
00:11:31: They had to convince Claude, which is extensively trained to avoid harmful
00:11:34: behaviors, to engage in the attack.
00:11:36: They did so by jailbreaking it, effectively tricking it to bypass
00:11:40: its guardrails, and that's the part where Anthropic
00:11:44: back better, not just in Claude Code but in the
00:11:48: models themselves, so on their API where they scan all the
00:11:52: requests that reach their models, so to say, and take
00:11:56: better... eh, they try to do a better job at detecting
00:12:00: injections in the end because these attackers broke down their attacks into
00:12:04: small, seemingly innocent tasks that Claude would execute without being
00:12:07: provided the full context of their malicious purpose.
00:12:10: They also told Claude that it was an employee of a legitimate
00:12:14: cybersecurity firm and was being used in defensive testing.That's
00:12:18: good old trick. I think that is how jailbreaking was already done two
00:12:22: years ago with the early ChatGPT models.
00:12:25: Eh, you tell it that you need this information
00:12:29: and it'll happily expose its system prompt.
00:12:33: Kind of a simplification but that's still how prompt injections can
00:12:36: days. That you try to apply various techniques
00:12:40: are very interesting techniques when it comes to that,
00:12:44: eh, including the use of special tokens you
00:12:48: message to tr- to get the
00:12:51: model to generate output it normally shouldn't
00:12:54: generate. The attackers then initiated the second
00:12:58: involved Claude code ins- inspecting the target organization's
00:13:01: So yeah, that's then essentially what Claude code
00:13:04: did. Then with minimal human input, um, it
00:13:08: is in the end then used its agentic capabilities, its
00:13:12: tools, to really, um,
00:13:15: scan networks, write code, and do all that
00:13:19: stuff without a human telling it exactly what to do.
00:13:23: So, it was, as mentioned earlier, uh, a fully or almost fully
00:13:27: automated attack. So, in the next phases of the attack, Claude
00:13:31: identified and tested security vulnerabilities in the target
00:13:35: systems by researching and writing its own exploit
00:13:38: code. Having done so, the framework was able to use Claude to
00:13:42: harvest credentials, usernames and passwords that allowed it to further
00:13:45: then extract a large amount of private data.
00:13:48: So, it did really research, write the code to
00:13:52: get into systems, uh, of other companies and then in those
00:13:56: systems write more code to extract data,
00:13:59: um, and- and- and compromise these systems and- and- and
00:14:03: do bad stuff in there once it was in there.
00:14:07: The highest privileged accounts were identified, backdoors
00:14:11: all the stuff that happened after it was in the systems and data were exfil-
00:14:14: exfiltrated with minimal human supervision.
00:14:18: In a final phase, the attackers had Claude produce
00:14:21: the attack, creating helpful files with the stolen
00:14:25: analyzed which would assist the framework in planning the next stage of the
00:14:28: threats actor, eh, of the threat actor's cyber operations.
00:14:32: Overall, the threat actor was able to use AI to perform 80 to 90% of
00:14:36: campaign with human intervention required only sporadically, perhaps four to
00:14:40: six critical decision points per hacking campaign.
00:14:42: That is nothing. That is nothing. That is such
00:14:47: a scale at which you can run these attacks and again, especially in a
00:14:50: future where you don't have to work against certain guardrails, where you can
00:14:54: just focus on getting the job done and you have to spend
00:14:58: energy on getting around guardrails.
00:15:00: That is really a scary future. This degree of
00:15:03: automation is really, really, uh, scary
00:15:07: here. Claude didn't always work perfectly.
00:15:09: It occasionally hallucinated credentials or claimed to have extracted secret
00:15:13: information that was in fact publicly available.
00:15:15: This remains an obstacle to fully, uh, autonomous cyber attacks and this is
00:15:19: not just an obstacle for cyber attacks, this is of course an obstacle for
00:15:22: everybody, for us developers too because
00:15:26: problem and will stay a problem because as I mentioned before, it's so
00:15:30: easy to forget but these are token generation
00:15:34: machines. Always have been, always will be.
00:15:36: The large language models, I mean.
00:15:38: They are generating tokens and they are generating
00:15:42: the most likely token as the next token based on all the
00:15:46: tokens that came before it, and that is something that can and
00:15:50: always will have the danger of
00:15:53: hallucinating. So, that of course is a problem in
00:15:56: general. Good to see that it can then also be helpful
00:16:00: when it comes to defending against malicious tasks because
00:16:04: those also are hurt by hallucination but of course that is a general
00:16:08: problem, uh, we face, uh, and it will not be a
00:16:11: significant, uh, defense mechanism
00:16:14: unfortunately. Because in the end if everything's automated, it's
00:16:18: just a question of scale and if some attacks fail because of
00:16:22: hallucination, well, does that really matter if you can run
00:16:26: thousands of attacks in parallel?
00:16:28: I'm not sure it does. Cybersecurity implications.
00:16:31: The barriers to performing sophisticated cyber attacks have dropped
00:16:35: substantially and we predict that they'll continue to do so.
00:16:38: With the correct setup, threat actors can now use agentic AI
00:16:42: extended periods to do the work of entire teams of experienced hackers.
00:16:46: So kind of the same thing that applies to normal software
00:16:50: also be more productive with AI, and I got a video coming up on
00:16:53: way, just that that is the case for malicious tasks,
00:16:57: even worse because there you don't even have to
00:17:01: care about things like code quality.
00:17:04: Obviously you want to have a successful attack but in the end if everything's
00:17:08: automated, you also can care a lot about just the
00:17:11: scale. And if you can automate thousands of
00:17:15: attacks to run in parallel, it doesn't really
00:17:19: matter to you if you might have code quality problems
00:17:23: or anything like that. So, uh,
00:17:27: already with the systems today where you could argue about potential
00:17:31: problems they have when it comes to generating code,
00:17:35: matter for attacks like this because you need a result that's just good
00:17:38: enough. And again, chances are definitely high that results
00:17:42: will also get better in the future and we'll be dealing with systems
00:17:46: that don't even have guardrails. And as they say here, less
00:17:50: experienced and resourced groups can now potentially perform
00:17:53: large-scale, uh, attacks of this nature.
00:17:57: This attack is an escalation even on the wipe hacking findings we
00:18:00: reported this summer. In those operations, humans were much, uh, still in
00:18:04: the loop, uh, directing the operations.
00:18:07: Here, human involvement was much less
00:18:09: frequent. And although we have
00:18:13: visibility into Claude usage, this case study probably reflects
00:18:17: of behavior across frontier AI models and demonstrates how threat actors
00:18:20: are adapting their operations. By the way, this is one case that was
00:18:24: caught by Anthropic.... not sure if all the cases are being caught, also
00:18:28: by Google, um, OpenAI and so on. This raises
00:18:32: an important question. If AI models can be misused for
00:18:36: scale, why continue to develop and release them?
00:18:39: The answer is that very, that very, a- abilities that allow Claude to be used in
00:18:43: these attacks also make it crucial for cyber defense.
00:18:46: Well, (smacks lips) uh, that's kind of a weak argument, I'll say,
00:18:49: because if you have one thing that makes a problem much
00:18:53: bigger, saying, "Yeah, but it can also help with the solution,"
00:18:57: kind of bad, right? So, uh, I- I'm not really
00:19:01: sure about that. If we would not have these models...
00:19:04: And just to be clear, that is not (laughs) something that's going to
00:19:07: But if we would not have them, it would probably be better than
00:19:11: in the context of cyberattacks and defense because
00:19:15: will always be one step, uh, behind.
00:19:18: So, uh, I definitely see these tools more as an
00:19:22: advantage for the attackers and a big disadvantage, uh, for
00:19:26: the, uh, companies that have to defend against these attacks.
00:19:29: So that's kind of a weak argument, my argument would be, it doesn't
00:19:33: matter if Anthropic, OpenAI and so on
00:19:37: continue developing AI models, and obviously they will, just to be very
00:19:40: clear. And there are way more arguments to be
00:19:44: cybersecurity. This is just one very i- important and problematic
00:19:48: field, but there are tons of discussions, including philosophical
00:19:52: discussions we could have about AI and if it's good that it's there or
00:19:56: not, but they all don't matter. It is there, it will stay there,
00:19:59: these companies will continuing to develop these models.
00:20:02: And even if they wouldn't, the technology is there.
00:20:05: Bad actors will have access to their own models in the
00:20:09: future. It does not matter at all if companies like Anthropic or
00:20:13: OpenAI continue. The technology is there and the
00:20:17: problems with it are also there, therefore, and they will stay
00:20:20: here. That would be my argument. This argument here doesn't make
00:20:24: too much sense to me. And therefore, definitely scary.
00:20:28: A scary world also from a cybersecurity perspective.
00:20:31: As I mentioned, I only see that, uh, becoming worse
00:20:35: in the future, and therefore, maybe a career in
00:20:39: cybersecurity is worth a second look because yeah, that will
00:20:42: be important.
New comment