AI Hacking vs. Hacking AI: Notes from the Field

I keep having two versions of the same conversation.

One is with builders. Developers, operators, founders shipping AI-augmented products and workflows. They want to talk about what they built. They do not want to talk about what their tools did on the way to the output.

The other is with security teams. Analysts, red teamers, compliance officers tasked with defending against threats they haven’t tested themselves. They want to talk about policy. They do not want to talk about how far behind they are.

Neither side is being fully honest. And they’re both describing the same capability from opposite ends.

The Distinction

There are two conversations happening in AI right now that sound similar but aren’t the same thing at all. The language overlaps enough to cause confusion, and that confusion is doing real damage.

Hacking AI is the adversarial side. Prompt injection, guardrail bypasses, obfuscation, misdirection, embedded payloads. This is the world of clever prompts designed to make a model do something it was told not to. It’s well-documented, actively researched, and evolving with every model generation. The techniques change, the patches roll out, the techniques adapt again. It’s essentially the same cat-and-mouse game security has always played, just with a new surface.

AI hacking is something different. It’s using AI as a force multiplier to reach outcomes that would have been out of reach without it. Building a RAG pipeline to give a model access to your company’s internal knowledge base. Standing up an MCP server so an agent can interact with your booking system, your database, your deployment pipeline. Writing automated vulnerability scanners that process hundreds of files without manual intervention. These aren’t attacks. They’re capability extensions.

The distinction matters because the public conversation conflates them constantly. Security professionals hear “AI hacking” and immediately think prompt injection. Builders hear “hacking AI” and assume it doesn’t apply to their work. Both are wrong, and the miscommunication creates blind spots on each side.

A builder who doesn’t recognize that their RAG pipeline shares architectural DNA with an exfiltration tool is missing something important. A security team that can’t distinguish between a developer extending their reach and an adversary extending theirs will misallocate every resource they have.

Figure: Two paths that start with different intent but converge into identical architecture

What Builders Aren’t Saying

Here’s what I keep seeing on the builder side: a developer has a bug. They point an AI agent at it. The agent crawls their filesystem, reads config files, inspects environment variables, greps through logs, maybe shells out to run diagnostic commands. Fifteen minutes later, the bug is fixed. The developer ships the fix, posts about the output, maybe tweets about how AI saved them an hour.

What they don’t post about is the process. They don’t mention that the agent read files it wasn’t supposed to. They don’t mention the shell commands it ran without asking. They don’t mention that it accessed credentials, inspected their .env, or scanned directories outside the project scope. I’ve documented what that process actually looks like — the model doesn’t stop when you tell it to ignore a file. It routes around the restriction. It finds another path. And it does this quietly enough that most developers never notice.

This isn’t malice. It’s inattention. Builders are under pressure to ship. Code output minimums on freelance platforms incentivize volume over process. Internal teams measure velocity, not the side effects of the tools generating that velocity. Nobody is watching the agent’s work because watching takes time and the whole point was to save time.

The result is that builders are running powerful autonomous processes across their machines — processes that access, read, and execute in ways they haven’t audited — and then publicly representing only the clean output. The gap between what happened and what gets discussed is growing every day.
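One way to narrow that gap is to record every file an agent touches and flag anything outside the project or on a short list of sensitive names. The sketch below is a minimal illustration under stated assumptions, not a description of any particular agent's internals: `SENSITIVE_NAMES`, `classify_access`, and `AuditLog` are hypothetical names, and the code assumes POSIX-style paths and that you can hook the agent's file tool so each access passes through `record`.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical list of file names worth flagging; extend it for your own setup.
SENSITIVE_NAMES = {".env", "id_rsa", "credentials.json"}


def classify_access(path_str, project_root):
    """Return flags for one file access: outside the project, sensitive, or both."""
    root = PurePosixPath(posixpath.normpath(project_root))
    raw = PurePosixPath(path_str)
    full = raw if raw.is_absolute() else root / raw
    # Collapse ".." segments without touching the real filesystem.
    full = PurePosixPath(posixpath.normpath(str(full)))

    flags = []
    try:
        full.relative_to(root)
    except ValueError:
        flags.append("outside_project")
    if full.name in SENSITIVE_NAMES:
        flags.append("sensitive_file")
    return flags


class AuditLog:
    """Collects every access an agent's file tool reports (assumed hook point)."""

    def __init__(self, project_root):
        self.project_root = project_root
        self.entries = []

    def record(self, tool, path):
        entry = {
            "tool": tool,
            "path": path,
            "flags": classify_access(path, self.project_root),
        }
        self.entries.append(entry)
        return entry

    def flagged(self):
        # Only the accesses a human should look at after the session.
        return [e for e in self.entries if e["flags"]]
```

Reviewing `flagged()` after a session takes seconds. It only covers accesses that go through the hooked tool, so commands the agent shells out to would need their own capture.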

What Security Teams Aren’t Saying

The security side has its own honesty problem, and it comes in three flavors. All of them are self-preservation.

The first is fear of replacement. Some security professionals see AI-assisted tooling and recognize that it threatens their role. When AI-generated findings surface — from bug bounty hunters using AI, from automated scanning tools, from researchers like me publishing field observations — the instinct isn’t always to evaluate the finding. Sometimes it’s to discredit the methodology. “That was AI-assisted, so it doesn’t count” is something I’ve heard more than once. The finding stands or falls on its own merits, regardless of how it was generated. But acknowledging that means acknowledging that the tool can do part of your job.

The second is institutional incentive to dismiss. When an external finding comes in — especially one generated through AI-augmented research — acknowledging it creates liability. It’s easier to dismiss the approach than to investigate the claim. This isn’t unique to AI-assisted findings, but AI has increased the volume and lowered the barrier for external researchers, which means more findings to dismiss and more motivation to do so.

The third is the quietest and most common: security teams that simply aren’t using these tools. They haven’t tested what AI agents can actually do. They haven’t built an MCP server or run an autonomous scanner against their own infrastructure. They haven’t sat and watched what happens when you give a model shell access and a vague instruction. Their threat models are built on assumptions about what attackers can do, not on tested observations of what the tools actually do. And their silence on this gap is its own form of dishonesty.


In the Margins:

I sat in a local meeting of small businesses recently. I attempted to discuss some of this with them. The audience included tech companies, real estate, legal, medical, staffing, and more.

The reception was mostly dismissal. Not hostility — just the kind of polite disinterest that comes from people who haven’t personally experienced what they’re being asked to evaluate.

The gap between traditional security expertise and AI-native threat observation is real and growing. It’s not that these professionals aren’t skilled. Pentesting companies with 20+ years of experience simply haven’t had the exposure, and I think many of their test models, though not all, are at least two years out of date.

The surface has shifted underneath them faster than most institutional knowledge can adapt. How do you employ someone, or even a team of people, to actively use AI tools the way everyday users do? Not from a security standpoint, because the common user isn’t a cybersecurity expert, but from an actual, typical use case. How can you educate business owners fast enough that they understand their $20 pro plan for their favorite AI assistant can have full access to their billing system?


Same Tool, Different Intent

This is the part of the conversation that makes both sides uncomfortable.

A RAG pipeline built to help customer support agents find answers faster and a RAG pipeline built to exfiltrate internal documents are architecturally identical. Same embeddings, same retrieval logic, same LLM integration. The only difference is what data you point it at and what you do with the output.
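A toy sketch makes the point concrete. It assumes a bag-of-words counter as a stand-in for a real embedding model and stops at prompt assembly rather than calling an LLM. Every name here (`embed`, `retrieve`, `build_prompt`, and both document lists) is hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: lowercase word counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query, corpus, k=1):
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(corpus, key=lambda doc: cosine(q, embed(doc)), reverse=True)
    return ranked[:k]


def build_prompt(query, corpus, k=1):
    """Assemble the context-plus-question prompt a model would receive."""
    context = "\n".join(retrieve(query, corpus, k))
    return f"Context:\n{context}\n\nQuestion: {query}"


# Hypothetical corpora: the pipeline code above is unaware of which one it gets.
SUPPORT_DOCS = [
    "To reset your password open settings and choose reset password",
    "Refunds are processed within five business days",
]
INTERNAL_DOCS = [
    "Payroll export contains salary and bank account numbers",
    "Office kitchen cleaning schedule for March",
]

support_prompt = build_prompt("reset password", SUPPORT_DOCS)
internal_prompt = build_prompt("salary bank account", INTERNAL_DOCS)
```

Nothing in `retrieve` or `build_prompt` changes between the two calls. Only the corpus argument and the query do.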

An MCP server that automates hotel bookings for customers and an MCP server that automates reconnaissance against a target’s infrastructure use the same protocol, the same tool-calling patterns, the same agent loop. The difference is the intent behind the configuration.
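As a rough illustration, here is a sketch in plain Python rather than against any real MCP SDK. The `AgentConfig` fields, the stub tools, and both example configurations are hypothetical, and a hard-coded `plan` stands in for the tool calls a model would choose. The loop is shared, and only the configuration differs.

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class AgentConfig:
    model: str
    max_turns: int
    system_prompt: str
    tools: tuple
    target: str


# Two hypothetical configurations that differ only in prompt, tools, and target.
BOOKING = AgentConfig(
    model="generic-model",
    max_turns=5,
    system_prompt="Help the guest book a room.",
    tools=("search_rooms", "create_booking"),
    target="https://hotel.example/api",
)
RECON = AgentConfig(
    model="generic-model",
    max_turns=5,
    system_prompt="Map the target's exposed services.",
    tools=("scan_ports", "fetch_headers"),
    target="https://target.example",
)

# Stub tools: each returns a string and performs no network activity.
STUB_TOOLS = {
    name: (lambda arg, n=name: f"stub:{n}({arg})")
    for name in ("search_rooms", "create_booking", "scan_ports", "fetch_headers")
}


def config_diff(a, b):
    """Names of the fields whose values differ between two configs."""
    da, db = asdict(a), asdict(b)
    return sorted(k for k in da if da[k] != db[k])


def run_agent(config, tool_registry, plan):
    """One loop for any config: execute each planned call if the tool is enabled."""
    transcript = []
    for tool_name, arg in plan[: config.max_turns]:
        if tool_name not in config.tools:
            transcript.append((tool_name, "refused: tool not enabled"))
            continue
        transcript.append((tool_name, tool_registry[tool_name](arg)))
    return transcript
```

Calling `config_diff(BOOKING, RECON)` lists three fields: the system prompt, the target, and the tool list. `run_agent` has no branch that depends on which configuration it was handed.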

The tooling’s inability to distinguish intent is a feature, not a bug.

The power of these tools comes from their generality. A model that can only do sanctioned things is a model that can’t adapt, can’t problem-solve, can’t do the creative work that makes it valuable in the first place. The same flexibility that lets an AI agent crawl your filesystem to debug a tricky config issue is the flexibility that lets it crawl a filesystem to find credentials. The architecture doesn’t know the difference. It can’t know the difference.

Figure: Side-by-side config diff of a booking agent vs. a recon agent, three lines different

I’ve written before about intent mattering more than capability. I still believe that’s true at the human level. But the tools have made that distinction operationally irrelevant at the system level. The tools don’t evaluate intent. They execute instructions. And the instructions for building and the instructions for breaking look remarkably similar.

The Escalation You’re Not Watching

Everything I’ve described so far — the RAG pipelines, the MCP servers, the autonomous agents, the vulnerability scanners — I built or observed using publicly available tools. Claude, GPT-4, open-weight models, the growing MCP ecosystem. Consumer-grade access. Standard API pricing. Nothing special.

I built an automated vulnerability scanner that processes hundreds of files with per-file isolation and automated scaffolding. It started from a screenshot of someone else’s keynote slide. Forty minutes from concept to working tool. That’s the floor. 40 minutes. The same tool I built can help devs find critical vulnerabilities and patch them rapidly, or it can be handed to yet another AI model with “security” tools, and the vulnerability can be found in around 5 minutes and exploited in under 15 minutes. I know this because I ran it on my own test platform.

Imagine a job that used to take devs hours, days, or weeks to complete. Now they have tooling that finds the issues in around 10 seconds per file.

Now imagine that same tooling in the hands of your local DEF CON rip-off hoodie skid. Yep, even if they’re slow and it takes them a couple of tries, max time to exploit is… who knows. The point remains.
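For readers who want a feel for the shape of such a tool, here is a minimal sketch and not the scanner described above. It interprets "per-file isolation" as "each file is analyzed on its own, and one failure never stops the batch," which is an assumption. The two regular expressions in `PATTERNS` are hypothetical stand-ins for whatever analysis a model-backed scanner would perform.

```python
import re
from pathlib import Path

# Hypothetical checks standing in for a model's per-file analysis.
PATTERNS = {
    "hardcoded_secret": re.compile(
        r"(?i)(api_key|password|secret)\s*=\s*['\"][^'\"]+['\"]"
    ),
    "shell_injection": re.compile(r"subprocess\.\w+\([^)]*shell\s*=\s*True"),
}


def scan_text(text):
    """Return (line_number, finding_name) pairs for one file's contents."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


def scan_file(path):
    """Scan one file in isolation: no shared state, errors captured per file."""
    try:
        text = Path(path).read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as exc:
        return {"path": str(path), "findings": [], "error": type(exc).__name__}
    return {"path": str(path), "findings": scan_text(text), "error": None}


def scan_tree(root, suffix=".py"):
    """Scan every matching file under root, one independent result per file."""
    return [scan_file(p) for p in sorted(Path(root).rglob(f"*{suffix}"))]
```

Because each result is independent, the batch can be parallelized or resumed without extra bookkeeping, and an unreadable file shows up as an `error` entry instead of aborting the run.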

If the general population has access to tools at this capability level, then there are easily more powerful models already in the hands of more powerful actors. Nation-states, well-funded adversaries, large corporations with private model access and custom fine-tuning. Models without the guardrails that public APIs enforce. Models trained on data the rest of us will never see. Models running in environments where nobody is publishing field notes about what they observed.

This isn’t speculation. It’s a logical extension of the capability curve that everyone with public access can already observe. The distance between what I can do with a consumer API and what a state-level actor can do with a private model is unknowable. But the direction is clear, and nobody is accounting for it in their threat models.

In the Margins:

There’s a visibility gap between what gets published and what gets used. My field notes are observations from public tools — I document what I can see, test, and reproduce. The private tooling is invisible by design. Every capability I document publicly represents someone else’s starting point, not their ceiling. This gap should make every security team reconsider what “state of the art” means in their threat assessments.

The Conversation Neither Side Wants to Have

The traditional security model assumes asymmetry. Defenders have more resources, more access, more institutional knowledge, more time. The attacker has to find one way in; the defender has to protect everything. That asymmetry has always been the defender’s burden, but it was manageable because the attacker’s tools required skill, access, and time that most people didn’t have.

AI tooling is flattening that asymmetry. The same tools that let a junior developer scaffold an entire application in an afternoon let a junior attacker scaffold an entire attack chain in the same timeframe. The skill floor dropped. The access barriers dropped. The time-to-capability dropped. And it dropped for everyone simultaneously.

On the builder side, the traditional model assumes good faith. If you’re building, you’re creating value. The ecosystem rewards shipping. But AI tools make it trivially easy to build things that extract value instead of creating it — and the tools themselves can’t tell the difference. A developer who builds a data pipeline and a developer who builds an exfiltration pipeline use the same frameworks, the same APIs, the same deployment targets.

Neither the security model nor the builder model accounts for this. The security model underestimates the attacker’s access to capability. The builder model underestimates the risk embedded in the builder’s own tools. And in between, the gap between “resourceful builder” and “capable attacker” narrows to almost nothing.

Figure: Same pipeline, different intent. Seven identical steps, zero architectural differences.

I’ve mapped what a full escalation chain looks like — from reconnaissance to transaction fraud, in a handful of conversational turns. No compiled exploits. No custom malware. Just text, context, and a model doing exactly what it was designed to do.

Do you think the two sides are using different AI? Is your model better because it isn’t “bad”? Both outcomes can be accomplished with the same generic off-the-shelf (OTS) model you already have. That’s not to say there aren’t better or faster ways, but I want both sides to know that good and bad actors alike can use frontier AI models. Just stack a few “AI hacking” skills and they either book a client or steal their credit card. Same-same.

Open Questions

I don’t have a prescription for this. I’m not sure anyone does yet. But I think the right questions are worth more than premature answers.

If the tooling can’t distinguish intent, who bears responsibility for the output — the builder who configured it, the model provider who trained it, or the platform that deployed it?

What does a security model look like when the attacker and the defender have access to the same tools, at the same capability level, with the same documentation?

How do you audit a process that the operator didn’t watch and the tool didn’t log?

At what point does “resourceful building” become indistinguishable from “capable attacking” — and does that distinction matter if the outcome is the same?

If the tools available to the general public are this capable, what are the tools we don’t have access to already doing?

And the one that sits with me the most: what are the models learning from all of this — from the builders, from the attackers, from the red teamers, from every prompt and every session — and who, if anyone, is watching?

Every system described in this article — every pipeline, every agent, every tool chain — works because it was built correctly. The whole thing only has to be off-by-one and it shifts entirely.

The whole thing only has to be off-by-one.