Gen AI is a HUGE cybersecurity problem!

Show notes

Full Article: https://www.anthropic.com/news/disrupting-AI-espionage Website: https://maximilian-schwarzmueller.com/

Socials: 👉 Twitch: https://www.twitch.tv/maxedapps 👉 X: https://x.com/maxedapps 👉 Udemy: https://www.udemy.com/user/maximilian-schwarzmuller/ 👉 LinkedIn: https://www.linkedin.com/in/maximilian-schwarzmueller/

Want to become a web developer or expand your web development knowledge? I have multiple bestselling online courses on React, Angular, NodeJS, Docker & much more! 👉 https://academind.com/courses

Show transcript

00:00:00: A disturbing, yet not really surprising article or

00:00:03: post was published by Anthropic yesterday.

00:00:07: An article about an AI-orchestrated

00:00:11: cyber espionage campaign, a cyberattack

00:00:15: carried out with help of AI, with help of Claude code.

00:00:17: And it's an interesting article. I already read it, and now I want to go through it

00:00:21: together with you here. So Anthropic, in this article,

00:00:25: describes a, a cyberattack on

00:00:28: various companies that was carried out almost entirely with

00:00:32: help of AI, almost entirely with help of Claude code,

00:00:36: by jailbreaking Claude code, by getting it to do

00:00:40: things it normally shouldn't do. And let's take a closer look at

00:00:44: how that played out. "In mid-September 2025,

00:00:48: we, Anthropic, detected suspicious activity that later

00:00:52: investigation determined to be a highly sophisticated espionage campaign.

00:00:56: The attackers used AI's agentic capabilities to an

00:01:00: unprecedented degree, using AI not just as an advisor, but to

00:01:04: execute the cyberattacks themselves." And that's, that

00:01:07: is huge that you can

00:01:11: already use AI, AI that is

00:01:14: published by Anthropic, AI models by Anthropic, and the

00:01:18: Claude code tool by Anthropic to carry out

00:01:22: cyberattacks. And, uh, here's why that is

00:01:25: important. Today we're living in a world where

00:01:29: the most capable models are the

00:01:32: models by OpenAI, Anthropic, X,

00:01:35: Google, and of course we've got some very capable open models

00:01:39: too, but we're still at a point in time on a,

00:01:43: timeline, we are still at a point in time

00:01:47: where most of these models and the, the most capable

00:01:51: models are controlled by companies,

00:01:55: most of them by companies in democracies.

00:01:57: Now, I'll not say that these companies are saints or

00:02:02: don't do bad stuff, but they're not bad

00:02:06: actors in the sense of this article.

00:02:09: They're not cyber, um, attackers

00:02:12: obviously. However, in the future, in the not too

00:02:16: distant future, we will be at a place

00:02:19: where these models, these very capable models

00:02:24: will also be owned by bad actors

00:02:27: themselves. So right now, here in this article, we read

00:02:31: about a cyberattack that was carried out with help of

00:02:35: Anthropic's models and Claude code. And it's bad enough that this is possible,

00:02:39: that you can trick those tools and models into doing stuff they shouldn't do, and

00:02:43: we'll get back to how that happened, of course, but that is

00:02:46: today. We still have to apply tricks,

00:02:50: uh, to abuse these models. In a

00:02:53: future, these models will simply belong to the bad

00:02:57: actors themselves. There will be open models that are capable

00:03:01: enough of doing that, so then certain control mechanisms, which we'll

00:03:05: get back to here in this article, won't even be there anymore.

00:03:09: We'll be in a future where bad actors have

00:03:13: direct, uncontrolled access to very capable models that

00:03:17: can be fine-tuned for their purposes, that can be

00:03:21: trained for their purposes, that can use tools that were

00:03:25: purpose built for malicious stuff.

00:03:27: That is where we're heading to, and even today where we're not there

00:03:31: yet or where this is still a niche, even today, we

00:03:35: have fully or almost fully AI-controlled,

00:03:39: uh, cyberattacks. So this is definitely a scary article

00:03:43: and, and a scary future, which makes it very clear that

00:03:47: cybersecurity and

00:03:49: preventing attacks will be a super

00:03:53: big challenge. It always has been, but now with AI where everything can be

00:03:57: quicker and more automated and harder to trace back to

00:04:00: individuals, it will be an even more important

00:04:04: topic. But back to this article. "The threat

00:04:07: actor, whom we assess with high confidence was a Chinese

00:04:11: state-sponsored group, manipulated our Claude code

00:04:15: tool into attempting infiltration into roughly 30

00:04:19: global targets and succeeded in a small number of cases." So it was not just

00:04:23: an attempt. They succeeded. "The operation

00:04:27: targeted large tech companies, financial institutions,

00:04:31: chemical manufacturing companies, and government agencies.

00:04:34: Uh, we believe this is the first documented case of a

00:04:37: cyberattack executed without substantial human intervention." This is so

00:04:41: big, without substantial human

00:04:43: intervention. And again, that is why this is

00:04:47: such a scary future where bad actors don't even

00:04:51: need to rely on, um, these controlled

00:04:55: AI models, uh, which they still have to rely on today.

00:04:58: "Upon detecting this activity, we immediately launched an investigation

00:05:02: understand its scope and nature. Uh, over the following 10 days, as we

00:05:06: mapped the severity and full extent of the operation, we banned accounts as

00:05:09: they were identified." So again, these were really regular

00:05:13: Claude accounts. These people were using

00:05:16: Claude, the model hosted by Anthropic.

00:05:19: They did not kind of steal it or run it on their own servers with their own

00:05:23: models. They used the models you can use too,

00:05:27: via Claude code. This campaign has substantial

00:05:31: implications for cybersecurity in the age of AI agents, as I just said,

00:05:35: because everything can be automated and it's already possible today, uh,

00:05:39: where there are guardrails in place.

00:05:42: And again, think of that future where we have no guardrails, where bad

00:05:45: actors don't have guardrails. "Uh, these attacks are likely

00:05:49: to only grow in their effectiveness.

00:05:51: To keep pace with this rapidly advancing threat, we have expanded our

00:05:55: detection capabilities and developed better classifiers to flag

00:05:59: activities." And that's the important part here.This is what

00:06:03: Anthropic is trying to do today to make sure that their

00:06:06: models can't be abused for malicious tasks,

00:06:10: that Claude code can't be abused and ultimately,

00:06:14: their APIs, which Claude code uses in the end.

00:06:18: This all won't matter at all in the future because

00:06:22: in a future where bad actors themselves have their own models

00:06:25: running on their own servers, these guardrails won't

00:06:29: matter. Well, obviously they will still matter.

00:06:31: Obviously, you still don't want to make it easier than it has to be, and

00:06:35: obviously not all bad actors will have their

00:06:39: own malicious models, but especially if we're talking about

00:06:42: state-controlled bad actors or big

00:06:46: cyber attacker groups. Let's, let's be

00:06:50: real. Of course they will have access to their own models running

00:06:54: on their own servers, so this will not matter at

00:06:57: all in that future. Obviously, it will matter still because you don't want to

00:07:01: make it easier and it will at least filter out a significant

00:07:05: group of potential bad actors that don't have access to their own

00:07:09: models. So yeah, it's important, but it will not be enough.

00:07:13: Companies themselves need to ramp up

00:07:17: their cybersecurity game, which is easier said than done. That

00:07:21: it has been true for the last 10 years, of course, even without AI, but it's

00:07:25: becoming even more important in the age of AI.

00:07:30: So yeah, that is, that is the big problem here.

00:07:33: Now, let's see how the, uh, cyberattack worked.

00:07:37: Uh, the attack relied on several features of AI models that did not

00:07:40: exist or were in much more nascent form just a year ago.

00:07:44: Intelligence: Models' general levels of capability have increased to the

00:07:48: point that they can follow complex instructions and understand

00:07:52: context in ways that make very sophisticated tasks possible.

00:07:56: Not only that, but several of their well-developed skills, in

00:07:59: particular software coding, lend themselves to being used in cyberattacks.

00:08:03: Sure, because now with models that are smarter,

00:08:07: and just to be very clear here, we're still talking about models that

00:08:10: just generate tokens, but of course by generating these tokens

00:08:14: they are able to describe the usage of tools, and with

00:08:18: tools, they become more capable. With all the

00:08:22: fine-tuning they received, they also generate more tokens that are

00:08:26: likely to be or more likely to be the tokens you want to generate, so that

00:08:30: is what intelligence means here. They're not really intelligent, but they

00:08:34: have been tuned especially for software development such that

00:08:38: they are much more likely to generate meaningful output

00:08:42: and especially also output that allows them to describe tool

00:08:46: use and then use those tools, so execute code

00:08:49: that does something. Uh, for example, send an HTTP request and so

00:08:53: on, and it's that combination that makes them more capable

00:08:57: in the end. And of course, yeah, that's exactly what you need for automated

00:09:01: cyberattacks, because you need a model that's able to

00:09:05: follow your instructions related to that.

00:09:07: You need a model that's able to send HTTP requests, phishing

00:09:10: emails, whatever, and all that is what these models can do

00:09:14: quite well in the end. Agency: Models can act as agents.

00:09:18: That is, they can run in loops where they take

00:09:21: tasks and make decisions with only minimal occasional human

00:09:24: input. That's another important step.

00:09:27: This is what allows these, um, systems or these

00:09:31: attacks here in this case to work with only minimal human input

00:09:36: because these models and the software that uses

00:09:40: these models can go for so much longer and it's important to

00:09:43: differentiate here. The model is, is still just the

00:09:47: thing that receives a prompt and sends back some tokens.

00:09:51: That, that has not changed. But it's the software, Claude Code, for

00:09:55: example, that then takes that output and sends back another

00:09:59: message to the same API with that output, with the original

00:10:02: task, with some meta instructions like, "Please check if that

00:10:06: output answers the question by the user.

00:10:09: You got these tools available, please tell me if you want to use a tool."

00:10:12: That is how the software around the models in the end makes these

00:10:16: models more capable, not because the model does everything on

00:10:20: its own, but because the model is capable of giving the software the

00:10:24: result it needs. The software then feeds these enriched results back into

00:10:28: the model and it's this loop that keeps the whole system going and that

00:10:32: leads to these agentic systems that can go

00:10:36: on for longer, that can use tools and that require less

00:10:39: human input. And yeah, tools, that is therefore the other missing piece

00:10:43: here, of course, that models have access to a wide array of

00:10:47: tools. They can now search the web, retrieve data, perform many other

00:10:50: actions that were previously the sole domain of human operators.

00:10:53: In the case of cyberattacks, the tools might include password crackers,

00:10:57: scanners and other security-related software.

00:11:00: Because again, it's not all just GitHub MCPs.

00:11:04: It can be all kinds of tools you could, uh, expose

00:11:07: to your, um, model or to the software that uses

00:11:11: these models and that runs these agentic tasks.
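
(To make that loop concrete, here is a minimal sketch of how such an agentic system is typically wired together. The names call_model and run_tool are hypothetical stand-ins, not Anthropic's actual API; the point is only that the surrounding software, not the model, drives the loop: it sends the task plus a tool list, executes whatever tool call the model describes, and feeds the result back in for the next turn.)

```python
import json

def call_model(messages, tools):
    """Hypothetical stand-in for a chat-completion API call (not a real SDK).

    Returns either {"type": "tool_call", "name": str, "arguments": dict}
    or {"type": "final", "text": str}, depending on the tokens the model generates.
    """
    raise NotImplementedError("replace with a real model call")

def run_tool(name, arguments):
    """Hypothetical stand-in for executing a tool (HTTP request, scanner, ...)."""
    raise NotImplementedError("replace with real, permission-checked tool code")

def agent(task, tools, max_steps=20):
    # The software sends the original task plus meta instructions
    # ("these tools are available, tell me if you want to use one") each turn.
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = call_model(messages, tools)
        if reply["type"] == "final":
            return reply["text"]  # the model considers the task done
        # The model described a tool call: the software executes it and
        # feeds the enriched result back into the next prompt.
        result = run_tool(reply["name"], reply["arguments"])
        messages.append({"role": "assistant", "content": json.dumps(reply)})
        messages.append({"role": "user", "content": f"Tool result: {result}"})
    return "stopped: step limit reached"
```

The max_steps cap and the stubbed functions are just placeholders; real agent frameworks add system prompts, permission checks, and structured tool schemas on top of exactly this pattern.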

00:11:14: So they got a nice diagram in this article, but in the end

00:11:18: the attack played out relatively, uh,

00:11:21: simply. They, they describe it in greater detail down

00:11:25: there. They convinced Claude Code to do

00:11:29: stuff it normally shouldn't be able to do.

00:11:31: They had to convince Claude, which is extensively trained to avoid harmful

00:11:34: behaviors, to engage in the attack.

00:11:36: They did so by jailbreaking it, effectively tricking it to bypass

00:11:40: its guardrails, and that's the part where Anthropic wants to fight

00:11:44: back better, not just in Claude Code but in the

00:11:48: models themselves, so on their API where they scan all the

00:11:52: requests that reach their models, so to say, and take

00:11:56: better... eh, they try to do a better job at detecting

00:12:00: injections in the end because these attackers broke down their attacks into

00:12:04: small, seemingly innocent tasks that Claude would execute without being

00:12:07: provided the full context of their malicious purpose.

00:12:10: They also told Claude that it was an employee of a legitimate

00:12:14: cybersecurity firm and was being used in defensive testing. That's the

00:12:18: good old trick. I think that is how jailbreaking was already done two

00:12:22: years ago with the early ChatGPT models.

00:12:25: Eh, you tell it that you need this information

00:12:29: and it'll happily expose its system prompt.

00:12:33: Kind of a simplification but that's still how prompt injections can work these

00:12:36: days. You try to apply various techniques, and there

00:12:40: are very interesting techniques when it comes to that,

00:12:44: eh, including the use of special tokens you include in your

00:12:48: message to tr- to get the

00:12:51: model to generate output it normally shouldn't

00:12:54: generate. The attackers then initiated the second phase of the attack, which

00:12:58: involved Claude code ins- inspecting the target organization's systems and infrastructure.

00:13:01: So yeah, that's then essentially what Claude code

00:13:04: did. Then with minimal human input, um, it

00:13:08: in the end then used its agentic capabilities, its

00:13:12: tools, to really, um,

00:13:15: scan networks, write code, and do all that

00:13:19: stuff without a human telling it exactly what to do.

00:13:23: So, it was, as mentioned earlier, uh, a fully or almost fully

00:13:27: automated attack. So, in the next phases of the attack, Claude

00:13:31: identified and tested security vulnerabilities in the target

00:13:35: systems by researching and writing its own exploit

00:13:38: code. Having done so, the framework was able to use Claude to

00:13:42: harvest credentials, usernames and passwords that allowed it further

00:13:45: access, and then extract a large amount of private data.

00:13:48: So, it really did research and write the code to

00:13:52: get into systems, uh, of other companies and then in those

00:13:56: systems write more code to extract data,

00:13:59: um, and- and- and compromise these systems and- and- and

00:14:03: do bad stuff in there once it was in there.

00:14:07: The highest privileged accounts were identified, backdoors were created,

00:14:11: all the stuff that happened after it was in the systems, and data were exfil-

00:14:14: exfiltrated with minimal human supervision.

00:14:18: In a final phase, the attackers had Claude produce documentation of

00:14:21: the attack, creating helpful files with the stolen credentials and the systems

00:14:25: analyzed, which would assist the framework in planning the next stage of the

00:14:28: threats actor, eh, of the threat actor's cyber operations.

00:14:32: Overall, the threat actor was able to use AI to perform 80 to 90% of

00:14:36: the campaign with human intervention required only sporadically, perhaps four to

00:14:40: six critical decision points per hacking campaign.

00:14:42: That is nothing. That is nothing. That is such

00:14:47: a scale at which you can run these attacks and again, especially in a

00:14:50: future where you don't have to work against certain guardrails, where you can

00:14:54: just focus on getting the job done and you don't have to spend

00:14:58: energy on getting around guardrails.

00:15:00: That is really a scary future. This degree of

00:15:03: automation is really, really, uh, scary

00:15:07: here. Claude didn't always work perfectly.

00:15:09: It occasionally hallucinated credentials or claimed to have extracted secret

00:15:13: information that was in fact publicly available.

00:15:15: This remains an obstacle to fully, uh, autonomous cyber attacks and this is

00:15:19: not just an obstacle for cyber attacks, this is of course an obstacle for

00:15:22: everybody, for us developers too, because hallucination is a

00:15:26: problem and will stay a problem because as I mentioned before, it's so

00:15:30: easy to forget but these are token generation

00:15:34: machines. Always have been, always will be.

00:15:36: The large language models, I mean.

00:15:38: They are generating tokens and they are generating

00:15:42: the most likely token as the next token based on all the

00:15:46: tokens that came before it, and that is something that can and

00:15:50: always will have the danger of

00:15:53: hallucinating. So, that of course is a problem in

00:15:56: general. Good to see that it can then also be helpful

00:16:00: when it comes to defending against malicious tasks because

00:16:04: those also are hurt by hallucination but of course that is a general

00:16:08: problem, uh, we face, uh, and it will not be a

00:16:11: significant, uh, defense mechanism

00:16:14: unfortunately. Because in the end if everything's automated, it's

00:16:18: just a question of scale and if some attacks fail because of

00:16:22: hallucination, well, does that really matter if you can run

00:16:26: thousands of attacks in parallel?

00:16:28: I'm not sure it does. Cybersecurity implications.

00:16:31: The barriers to performing sophisticated cyber attacks have dropped

00:16:35: substantially and we predict that they'll continue to do so.

00:16:38: With the correct setup, threat actors can now use agentic AI systems for

00:16:42: extended periods to do the work of entire teams of experienced hackers.

00:16:46: So kind of the same thing that applies to normal software development, where you can

00:16:50: also be more productive with AI, and I got a video coming up on that, by the

00:16:53: way, just that this is also the case for malicious tasks,

00:16:57: even worse because there you don't even have to

00:17:01: care about things like code quality.

00:17:04: Obviously you want to have a successful attack but in the end if everything's

00:17:08: automated, you also can care a lot about just the

00:17:11: scale. And if you can automate thousands of

00:17:15: attacks to run in parallel, it doesn't really

00:17:19: matter to you if you might have code quality problems

00:17:23: or anything like that. So, uh,

00:17:27: already with the systems today where you could argue about potential

00:17:31: problems they have when it comes to generating code, those problems don't really

00:17:35: matter for attacks like this because you need a result that's just good

00:17:38: enough. And again, chances are definitely high that results

00:17:42: will also get better in the future and we'll be dealing with systems

00:17:46: that don't even have guardrails. And as they say here, less

00:17:50: experienced and resourced groups can now potentially perform

00:17:53: large-scale, uh, attacks of this nature.

00:17:57: This attack is an escalation even on the "vibe hacking" findings we

00:18:00: reported this summer. In those operations, humans were very much, uh, still in

00:18:04: the loop, uh, directing the operations.

00:18:07: Here, human involvement was much less

00:18:09: frequent. And although we only have

00:18:13: visibility into Claude usage, this case study probably reflects consistent patterns

00:18:17: of behavior across frontier AI models and demonstrates how threat actors

00:18:20: are adapting their operations. By the way, this is one case that was

00:18:24: caught by Anthropic... I'm not sure if all the cases are being caught, also

00:18:28: by Google, um, OpenAI and so on. This raises

00:18:32: an important question. If AI models can be misused for cyberattacks at this

00:18:36: scale, why continue to develop and release them?

00:18:39: The answer is that the very, the very abilities that allow Claude to be used in

00:18:43: these attacks also make it crucial for cyber defense.

00:18:46: Well, (smacks lips) uh, that's kind of a weak argument, I'll say,

00:18:49: because if you have one thing that makes a problem much

00:18:53: bigger, saying, "Yeah, but it can also help with the solution," is

00:18:57: kind of bad, right? So, uh, I- I'm not really

00:19:01: sure about that. If we would not have these models...

00:19:04: And just to be clear, that is not (laughs) something that's going to happen.

00:19:07: But if we would not have them, it would probably be better, at least

00:19:11: in the context of cyberattacks and defense, because the defenders

00:19:15: will always be one step, uh, behind.

00:19:18: So, uh, I definitely see these tools more as an

00:19:22: advantage for the attackers and a big disadvantage, uh, for

00:19:26: the, uh, companies that have to defend against these attacks.

00:19:29: So that's kind of a weak argument, my argument would be, it doesn't

00:19:33: matter if Anthropic, OpenAI and so on

00:19:37: continue developing AI models, and obviously they will, just to be very

00:19:40: clear. And there are way more arguments to be made here than just

00:19:44: cybersecurity. This is just one very i- important and problematic

00:19:48: field, but there are tons of discussions, including philosophical

00:19:52: discussions we could have about AI and if it's good that it's there or

00:19:56: not, but they all don't matter. It is there, it will stay there,

00:19:59: these companies will continue to develop these models.

00:20:02: And even if they wouldn't, the technology is there.

00:20:05: Bad actors will have access to their own models in the

00:20:09: future. It does not matter at all if companies like Anthropic or

00:20:13: OpenAI continue. The technology is there and the

00:20:17: problems with it are also there, and they will stay

00:20:20: here. That would be my argument. This argument here doesn't make

00:20:24: too much sense to me. And therefore, definitely scary.

00:20:28: A scary world also from a cybersecurity perspective.

00:20:31: As I mentioned, I only see that, uh, becoming worse

00:20:35: in the future, and therefore, maybe a career in

00:20:39: cybersecurity is worth a second look because yeah, that will

00:20:42: be important.
