Gen AI is a HUGE cybersecurity problem!

Show notes

Full Article: https://www.anthropic.com/news/disrupting-AI-espionage Website: https://maximilian-schwarzmueller.com/

Socials: 👉 Twitch: https://www.twitch.tv/maxedapps 👉 X: https://x.com/maxedapps 👉 Udemy: https://www.udemy.com/user/maximilian-schwarzmuller/ 👉 LinkedIn: https://www.linkedin.com/in/maximilian-schwarzmueller/

Want to become a web developer or expand your web development knowledge? I have multiple bestselling online courses on React, Angular, NodeJS, Docker & much more! 👉 https://academind.com/courses

Show transcript

00:00:00: A disturbing, yet not really surprising article or

00:00:03: post was published by Anthropic yesterday.

00:00:07: An article about an AI-orchestrated

00:00:11: cyber espionage campaign, a cyberattack

00:00:15: carried out with help of AI, with help of Claude code.

00:00:17: And it's an interesting article. I already read it

00:00:21: together with you here. So Anthropic, in this article,

00:00:25: describes a, a cyberattack on

00:00:28: various companies that was carried out almost entirely with

00:00:32: help of AI, almost entirely with help of Claude code,

00:00:36: by jailbreaking Claude code, by getting it to do

00:00:40: things it normally shouldn't do. And let's take a closer look at

00:00:44: how that played out. "In mid-September 2025,

00:00:48: we, Anthropic, detected suspicious activity that later,

00:00:52: investigation determined to be a highly sophisticated espionage campaign.

00:00:56: The attackers used AI's agentic capabilities to an

00:01:00: unprecedented degree, using AI not just as an advisor, but to

00:01:04: execute the cyberattacks themselves." And that's, that

00:01:07: is huge that you can

00:01:11: already use AI, AI that is

00:01:14: published by Anthropic, AI models by Anthropic, and the

00:01:18: Claude code tool by Anthropic to carry out

00:01:22: cyberattacks. And, uh, here's why that is

00:01:25: important. Today we're living in a world where

00:01:29: the most capable models are the

00:01:32: models by OpenAI, Anthropic, X,

00:01:35: Google, and of course we've got some very capable

00:01:39: too, but we're still at a point in time on a,

00:01:43: timeline, we are still at a point in time

00:01:47: where most of these models and the, the most capable

00:01:51: models are controlled by companies,

00:01:55: most of them by companies in democracies.

00:01:57: Now, I'll not say that these companies are saints or

00:02:02: don't do bad stuff, but they're not bad

00:02:06: actors in the sense of this article.

00:02:09: They're not cyber, um, attackers

00:02:12: obviously. However, in the future, in the not too

00:02:16: distant future, we will be at a place

00:02:19: where these models, these very capable models

00:02:24: will also be owned by bad actors

00:02:27: themselves. So right now here in this article,

00:02:31: about a cyberattack that was carried out with help of

00:02:35: Anthropic and Claude code. And it's bad enough that this is possible,

00:02:39: trick those tools and models into doing stuff they shouldn't

00:02:43: we'll get back to how that happened, of course, but that is

00:02:46: today. We still have to apply tricks,

00:02:50: uh, to abuse these models. In a

00:02:53: future, these models will simply belong to the bad

00:02:57: actors themselves. There will be open models that are capable

00:03:01: enough of doing that, so then certain control mechanisms,

00:03:05: get back here in this article, won't even be there anymore.

00:03:09: We'll be in a future where bad actors have

00:03:13: direct, uncontrolled access to very capable models

00:03:17: can be fine-tuned for their purposes, that can be

00:03:21: trained for their purposes, that can use tools that were

00:03:25: purpose built for malicious stuff.

00:03:27: That is where we're heading to, and even today where we're not there

00:03:31: yet or where this is still a niche, even today, we

00:03:35: have fully or almost fully AI-controlled,

00:03:39: uh, cyberattacks. So this is definitely a scary article

00:03:43: and, and a scary future, which makes it very clear that

00:03:47: cybersecurity and

00:03:49: preventing attacks will be a super

00:03:53: big challenge. It always has been, but now with AI where everything can be

00:03:57: quicker and more automated and harder to trace back to

00:04:00: individuals, it will be an even more important

00:04:04: topic. But back to this article. "The threat

00:04:07: actor, whom we assess with high confidence

00:04:11: state-sponsored group, manipulated our Claude code

00:04:15: tool into attempting infiltration into roughly 30

00:04:19: global targets and succeeded in a small number of cases." So it was not just

00:04:23: an attempt. They succeeded. "The operation

00:04:27: targeted large tech companies, financial institutions,

00:04:31: companies, and government agencies.

00:04:34: Uh, we believe this is the first documented case of a

00:04:37: cyberattack executed without substantial human intervention." This is so

00:04:41: big, without substantial human

00:04:43: intervention. And again, that is why this is

00:04:47: such a scary future where bad actors don't even

00:04:51: need to rely on, um, these controlled

00:04:55: AI models, uh, which they still have to rely on today.

00:04:58: "Upon detecting this activity, we immediately launched an investigation

00:05:02: understand its scope and nature. Uh, over the following 10 days, as we

00:05:06: mapped the severity and full extent of the operation,

00:05:09: they were identified." So again, these were really regular

00:05:13: Claude accounts. These people were using

00:05:16: Claude, the model hosted by Anthropic.

00:05:19: They did not kind of steal it, run it on their own servers

00:05:23: models. They used the models you can use too via

00:05:27: via Claude code. This campaign has substantial

00:05:31: implications for cybersecurity in the age of AI agents, as I just said,

00:05:35: because everything can be automated and it's already possible today, uh,

00:05:39: where there are guardrails in place.

00:05:42: And again, think of that future where we have no guardrails

00:05:45: actors don't have guardrails. "Uh, these attacks are likely

00:05:49: to only grow in their effectiveness.

00:05:51: To keep pace with this rapidly advancing threat, we have expanded our

00:05:55: detection capabilities and developed better classifiers to flag

00:05:59: activities." And that's the important part here.This is what

00:06:03: Anthropic is trying to do today to make sure that their

00:06:06: models can't be abused for malicious tasks,

00:06:10: that Claude code can't be abused and ultimately,

00:06:14: their APIs, which Claude code uses in the end.

00:06:18: This all won't matter at all in the future because

00:06:22: in a future where bad actors themselves have their own models

00:06:25: running on their own servers, these guardrails won't

00:06:29: matter. Well, obviously they will still matter.

00:06:31: Obviously, you still want to make it easier than it's

00:06:35: obviously not all bad actors will have their

00:06:39: own malicious models, but especially if we're talking about

00:06:42: state-controlled bad actors or big

00:06:46: cyber attacker groups. Let's, let's be

00:06:50: real. Of course they will have access to their

00:06:54: on their own servers, so this will not matter at

00:06:57: all in that future. Obviously, it will matter still because you don't

00:07:01: make it easier and it will at least filter out a significant

00:07:05: group of potential bad actors that don't have access to their own

00:07:09: models. So yeah, it's important, but it will not be enough.

00:07:13: Companies themselves need to ramp up

00:07:17: their cybersecurity game, which is easier said than done because

00:07:21: it has been true for the last 10 years, of course, even without AI, but it's

00:07:25: becoming even more important in the age of AI.

00:07:30: So yeah, that is, that is the big problem here.

00:07:33: Now, let's see how the, uh, cyberattack worked.

00:07:37: Uh, the attack relied on several features of

00:07:40: exist or were in much more nascent form just a year ago.

00:07:44: Intelligence: Models' general levels of capability have

00:07:48: point that they can follow complicance- complex instructions and understand

00:07:52: context in ways that make very sophisticated tasks possible.

00:07:56: Not only that, but several of their well-developed

00:07:59: particular software coding, lend themselves to being used in

00:08:03: Sure, because now with models that are smarter,

00:08:07: and just to be very clear here, we're still talking about models that

00:08:10: just generate tokens, but of course by generating these tokens

00:08:14: that are able, they are able to describe the usage of tools

00:08:18: tools, they become more capable. With all the

00:08:22: fine-tuning they received, they also generate more tokens that are

00:08:26: likely to be or more likely to be the tokens you want to generate, so that

00:08:30: is what intelligence means here. They're not really intelligent, but they

00:08:34: have been tuned especially for software development such that

00:08:38: they are much more likely to generate meaningful output

00:08:42: and especially also output that allows them to describe tool

00:08:46: use and then use those tools, so execute code

00:08:49: that does something. Uh, for example, send an HTTP request and so

00:08:53: on, and it's that combination that makes them more capable

00:08:57: in the end. And of course, yeah, that's exactly what you need for automated

00:09:01: cyberattacks, because you need a model that's able to

00:09:05: follow your instructions related to that.

00:09:07: You need a model that's able to send HTTP requests, phishing

00:09:10: emails, whatever, and all that is what these models can do

00:09:14: quite well in the end. Agency: Models can act as agents.

00:09:18: That is, they can run in loops where they take

00:09:21: tasks and make decisions with only minimal occasional human

00:09:24: input. That's another important step.

00:09:27: This is what allows these, um, systems or these

00:09:31: attacks here in this case to work with only minimal human input

00:09:36: because these models and the software that uses

00:09:40: these models can go for so much longer and it's important to

00:09:43: differentiate here. The model is, is still just the

00:09:47: thing that receives a prompt and sends back some tokens.

00:09:51: That, that has not changed. But it's the software, Claude Code, for

00:09:55: example, that then takes that output and sends back another

00:09:59: message to the same API with that output, with the original

00:10:02: task, with some meta instructions like, "Please check if that

00:10:06: output answers the question by the user.

00:10:09: You got these tools available, please tell me if you want to use a tool."

00:10:12: how the software around the models in the end makes these

00:10:16: models more capable, not because the model does everything on

00:10:20: the model is capable of giving the software the

00:10:24: result it needs. The software then feeds these enriched

00:10:28: the model and it's this loop that keeps the whole system going and that

00:10:32: leads to these agentic systems that can go

00:10:36: on for longer, that can use tools and that require less

00:10:39: human input. And yeah, tools, that is therefore the other missing piece

00:10:43: here, of course, that models have access to a wide array of

00:10:47: tools. They can now search the web, retrieve data, perform many other

00:10:50: actions that were previously the sole domain of human operators.

00:10:53: In the case of cyberattacks, the tools might include password crackers,

00:10:57: scanners and other security-related software.

00:11:00: Because again, it's not all just GitHub MCPs.

00:11:04: It can be all kind of tools you could, uh, expose

00:11:07: to your, um, model or to the software that uses

00:11:11: these models and that runs these agentic tasks.

00:11:14: So they got a nice diagram in this article, but in the end

00:11:18: the attack played out relatively, uh,

00:11:21: simple. They, they describe it in greater detail down

00:11:25: there. They convinced Claude Code to do

00:11:29: stuff it normally shouldn't be able to do.

00:11:31: They had to convince Claude, which is extensively trained to avoid harmful

00:11:34: behaviors, to engage in the attack.

00:11:36: They did so by jailbreaking it, effectively tricking it to bypass

00:11:40: its guardrails, and that's the part where Anthropic

00:11:44: back better, not just in Claude Code but in the

00:11:48: models themselves, so on their API where they scan all the

00:11:52: requests that reach their models, so to say, and take

00:11:56: better... eh, they try to do a better job at detecting

00:12:00: injections in the end because these attackers broke down their attacks into

00:12:04: small, seemingly innocent tasks that Claude would execute without being

00:12:07: provided the full context of their malicious purpose.

00:12:10: They also told Claude that it was an employee of a legitimate

00:12:14: cybersecurity firm and was being used in defensive testing.That's

00:12:18: good old trick. I think that is how jailbreaking was already done two

00:12:22: years ago with the early ChatGPT models.

00:12:25: Eh, you tell it that you need this information

00:12:29: and it'll happily expose its system prompt.

00:12:33: Kind of a simplification but that's still how prompt injections can

00:12:36: days. That you try to apply various techniques

00:12:40: are very interesting techniques when it comes to that,

00:12:44: eh, including the use of special tokens you

00:12:48: message to tr- to get the

00:12:51: model to generate output it normally shouldn't

00:12:54: generate. The attackers then initiated the second

00:12:58: involved Claude code ins- inspecting the target organization's

00:13:01: So yeah, that's then essentially what Claude code

00:13:04: did. Then with minimal human input, um, it

00:13:08: is in the end then used its agentic capabilities, its

00:13:12: tools, to really, um,

00:13:15: scan networks, write code, and do all that

00:13:19: stuff without a human telling it exactly what to do.

00:13:23: So, it was, as mentioned earlier, uh, a fully or almost fully

00:13:27: automated attack. So, in the next phases of the attack, Claude

00:13:31: identified and tested security vulnerabilities in the target

00:13:35: systems by researching and writing its own exploit

00:13:38: code. Having done so, the framework was able to use Claude to

00:13:42: harvest credentials, usernames and passwords that allowed it to further

00:13:45: then extract a large amount of private data.

00:13:48: So, it did really research, write the code to

00:13:52: get into systems, uh, of other companies and then in those

00:13:56: systems write more code to extract data,

00:13:59: um, and- and- and compromise these systems and- and- and

00:14:03: do bad stuff in there once it was in there.

00:14:07: The highest privileged accounts were identified, backdoors

00:14:11: all the stuff that happened after it was in the systems and data were exfil-

00:14:14: exfiltrated with minimal human supervision.

00:14:18: In a final phase, the attackers had Claude produce

00:14:21: the attack, creating helpful files with the stolen

00:14:25: analyzed which would assist the framework in planning the next stage of the

00:14:28: threats actor, eh, of the threat actor's cyber operations.

00:14:32: Overall, the threat actor was able to use AI to perform 80 to 90% of

00:14:36: campaign with human intervention required only sporadically, perhaps four to

00:14:40: six critical decision points per hacking campaign.

00:14:42: That is nothing. That is nothing. That is such

00:14:47: a scale at which you can run these attacks and again, especially in a

00:14:50: future where you don't have to work against certain guardrails, where you can

00:14:54: just focus on getting the job done and you have to spend

00:14:58: energy on getting around guardrails.

00:15:00: That is really a scary future. This degree of

00:15:03: automation is really, really, uh, scary

00:15:07: here. Claude didn't always work perfectly.

00:15:09: It occasionally hallucinated credentials or claimed to have extracted secret

00:15:13: information that was in fact publicly available.

00:15:15: This remains an obstacle to fully, uh, autonomous cyber attacks and this is

00:15:19: not just an obstacle for cyber attacks, this is of course an obstacle for

00:15:22: everybody, for us developers too because

00:15:26: problem and will stay a problem because as I mentioned before, it's so

00:15:30: easy to forget but these are token generation

00:15:34: machines. Always have been, always will be.

00:15:36: The large language models, I mean.

00:15:38: They are generating tokens and they are generating

00:15:42: the most likely token as the next token based on all the

00:15:46: tokens that came before it, and that is something that can and

00:15:50: always will have the danger of

00:15:53: hallucinating. So, that of course is a problem in

00:15:56: general. Good to see that it can then also be helpful

00:16:00: when it comes to defending against malicious tasks because

00:16:04: those also are hurt by hallucination but of course that is a general

00:16:08: problem, uh, we face, uh, and it will not be a

00:16:11: significant, uh, defense mechanism

00:16:14: unfortunately. Because in the end if everything's automated, it's

00:16:18: just a question of scale and if some attacks fail because of

00:16:22: hallucination, well, does that really matter if you can run

00:16:26: thousands of attacks in parallel?

00:16:28: I'm not sure it does. Cybersecurity implications.

00:16:31: The barriers to performing sophisticated cyber attacks have dropped

00:16:35: substantially and we predict that they'll continue to do so.

00:16:38: With the correct setup, threat actors can now use agentic AI

00:16:42: extended periods to do the work of entire teams of experienced hackers.

00:16:46: So kind of the same thing that applies to normal software

00:16:50: also be more productive with AI, and I got a video coming up on

00:16:53: way, just that that is the case for malicious tasks,

00:16:57: even worse because there you don't even have to

00:17:01: care about things like code quality.

00:17:04: Obviously you want to have a successful attack but in the end if everything's

00:17:08: automated, you also can care a lot about just the

00:17:11: scale. And if you can automate thousands of

00:17:15: attacks to run in parallel, it doesn't really

00:17:19: matter to you if you might have code quality problems

00:17:23: or anything like that. So, uh,

00:17:27: already with the systems today where you could argue about potential

00:17:31: problems they have when it comes to generating code,

00:17:35: matter for attacks like this because you need a result that's just good

00:17:38: enough. And again, chances are definitely high that results

00:17:42: will also get better in the future and we'll be dealing with systems

00:17:46: that don't even have guardrails. And as they say here, less

00:17:50: experienced and resourced groups can now potentially perform

00:17:53: large-scale, uh, attacks of this nature.

00:17:57: This attack is an escalation even on the wipe hacking findings we

00:18:00: reported this summer. In those operations, humans were much, uh, still in

00:18:04: the loop, uh, directing the operations.

00:18:07: Here, human involvement was much less

00:18:09: frequent. And although we have

00:18:13: visibility into Claude usage, this case study probably reflects

00:18:17: of behavior across frontier AI models and demonstrates how threat actors

00:18:20: are adapting their operations. By the way, this is one case that was

00:18:24: caught by Anthropic.... not sure if all the cases are being caught, also

00:18:28: by Google, um, OpenAI and so on. This raises

00:18:32: an important question. If AI models can be misused for

00:18:36: scale, why continue to develop and release them?

00:18:39: The answer is that very, that very, a- abilities that allow Claude to be used in

00:18:43: these attacks also make it crucial for cyber defense.

00:18:46: Well, (smacks lips) uh, that's kind of a weak argument, I'll say,

00:18:49: because if you have one thing that makes a problem much

00:18:53: bigger, saying, "Yeah, but it can also help with the solution,"

00:18:57: kind of bad, right? So, uh, I- I'm not really

00:19:01: sure about that. If we would not have these models...

00:19:04: And just to be clear, that is not (laughs) something that's going to

00:19:07: But if we would not have them, it would probably be better than

00:19:11: in the context of cyberattacks and defense because

00:19:15: will always be one step, uh, behind.

00:19:18: So, uh, I definitely see these tools more as an

00:19:22: advantage for the attackers and a big disadvantage, uh, for

00:19:26: the, uh, companies that have to defend against these attacks.

00:19:29: So that's kind of a weak argument, my argument would be, it doesn't

00:19:33: matter if Anthropic, OpenAI and so on

00:19:37: continue developing AI models, and obviously they will, just to be very

00:19:40: clear. And there are way more arguments to be

00:19:44: cybersecurity. This is just one very i- important and problematic

00:19:48: field, but there are tons of discussions, including philosophical

00:19:52: discussions we could have about AI and if it's good that it's there or

00:19:56: not, but they all don't matter. It is there, it will stay there,

00:19:59: these companies will continuing to develop these models.

00:20:02: And even if they wouldn't, the technology is there.

00:20:05: Bad actors will have access to their own models in the

00:20:09: future. It does not matter at all if companies like Anthropic or

00:20:13: OpenAI continue. The technology is there and the

00:20:17: problems with it are also there, therefore, and they will stay

00:20:20: here. That would be my argument. This argument here doesn't make

00:20:24: too much sense to me. And therefore, definitely scary.

00:20:28: A scary world also from a cybersecurity perspective.

00:20:31: As I mentioned, I only see that, uh, becoming worse

00:20:35: in the future, and therefore, maybe a career in

00:20:39: cybersecurity is worth a second look because yeah, that will

00:20:42: be important.

Show notes

Show transcript

New comment