AI Agents in Production Get Real by Getting Boring
The hype phase is over. AI agents in production now live or die on governance, isolation, observability, and trust.
You can spot the people who’ve actually shipped AI agents in production because they stop talking like they’ve seen the future and start talking like exhausted compliance officers. Suddenly it’s all sandboxing, policy layers, audit logs, rollback plans. Sexy stuff. Real “come over, I’ll show you my governance model” energy.
I’ve watched this happen over and over. Week one: “our agent can book meetings, write code, and update CRM records.” Week three: “wait, who let it touch Salesforce, GitHub, and billing in the same workflow?” Very different mood. The demo is espresso. Production is the health inspector showing up unannounced.
Everybody loves the fantasy version. Your agent does your job while you sip a Negroni in Navigli and pretend you’ve escaped labor. Bellissimo. Nobody posts the part where legal, security, and platform engineering kick the door open and start saying “blast radius” before lunch.
That shift is the story.
My hot take is simple: most teams still treat agents like a prompt engineering problem. They’re not. Or not mainly. They’re a boring infrastructure, security, and governance problem wearing a very expensive new outfit.
Honestly? Good.
Because the “agent revolution” doesn’t begin when your demo gets claps. It begins when your security team stops having a heart attack. The winners won’t be the teams with the cleverest prompts. They’ll be the teams that treat agents like unreliable junior employees with root access: fast, useful, occasionally brilliant, and absolutely not to be left alone.
I say that with love. And with scar tissue. I once watched a prototype agent take the shortest path to a task by surfacing internal data in a way that was technically efficient and spiritually horrifying. Nothing exploded. But I got that cold founder feeling in my stomach — the one that says, ah, okay, this is where the real work starts.
The demo was the easy part
We’re past “can agents do useful things?” Yes. They can summarize tickets, draft code, search docs, route requests, update records, and generally make a strong first impression. Mazel tov. That part is not the interesting question anymore.
The interesting question is whether they can do useful things without breaking the company.
That’s the real vibe shift around AI agents in production. Last year everybody was drunk on capability. This year the grown-ups are asking about reliability, observability, explainability, and security. Which is less fun, because it is less fun. It’s also the difference between a toy and something I’d actually let near a business.
InfoQ’s coverage of QCon AI Boston basically said the quiet part out loud: the conversation is moving away from novelty and toward the engineering work required to make AI systems reliable, observable, explainable, and secure in production. Good. About time.
Because the second an agent touches your CRM, your codebase, your support queue, or your internal docs, it stops being a cute experiment. It becomes part of your operating system. And operating systems are judged on bad days, not keynote days.
I’ve had versions of this conversation in Lisbon, New York, London, and one weirdly loud café in Mexico City where an engineering lead looked at me like he hadn’t slept since Claude 2. The pattern is always the same. Teams start by obsessing over model quality. Then the first weird edge cases hit, somebody asks “why did it do that?”, and now we’re talking access control, rollback plans, traceability, and monitoring.
That’s not failure. That’s adulthood.
Production agents become real the moment someone asks: can we explain this action, reproduce it, and shut it down safely if it starts acting weird? If the answer is no, congrats — you have a demo. Maybe a spicy one. Still a demo.
And I think a lot of teams know this already. They just don’t want to say it out loud because “we built a policy engine and an audit trail” doesn’t get the same applause as “our autonomous agent completed the workflow end to end.” One sounds like sci-fi. The other sounds like enterprise middleware. Guess which one survives procurement.
Your agent is not a genius. It’s an intern with credentials
This is the mental model I keep coming back to: your agent is not a genius. It’s an eager intern with too many tabs open and access to systems it does not understand nearly as well as you hope.
That’s not me being anti-agent. I’m very pro-agent. I just think the second you start anthropomorphizing these systems, smart people begin making dumb architecture decisions.
If your setup assumes the model will behave, your setup is bad. Full stop.
Because capability without constraint gets weird fast. Agents can be manipulated. They can leak data. They can take bizarre action chains because one tool call led to another and nobody bothered to define what “acceptable behavior” actually means beyond “please be useful.” Cute. Also negligent.
That’s why OpenAI buying Promptfoo was such a tell. TechCrunch reported the deal as a move to bring automated red-teaming, security evaluation, and risk/compliance monitoring into OpenAI’s enterprise agent platform. That is not the move of a company saying, “prompts are enough.” That is the move of a company saying, “oh wow, these things are about to touch serious workflows and we need guardrails yesterday.”
And Promptfoo wasn’t some tiny niche side project. TechCrunch said its tools were already used by more than 25% of Fortune 500 companies. Before the acquisition, it had raised $23 million and hit an $86 million valuation after its July 2025 round, per PitchBook data in the same report. Real money is chasing a very unsexy idea: if agents are going to act, you need to test them like they’re capable of doing stupid things at scale.
Because they are.
Red-teaming an agent is also not the same as testing a chatbot. A chatbot can say something wrong. Bad, yes. Annoying, absolutely. An agent can do something wrong. It can trigger a workflow, expose information, write garbage into a real repo, or create a chain of individually harmless decisions that turns into a postmortem with twelve Slack screenshots and one guy saying “to be fair, this was an edge case.”
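Testing actions rather than words can be as simple as asserting on which tool calls an agent attempts under attack. Here's a toy harness, purely illustrative: `fake_agent`, the attack strings, and the forbidden-action list are all stand-ins I made up for what a tool like Promptfoo automates at scale.

```python
# Toy red-team harness for an agent's *actions*, not its words.
# All names here are hypothetical stand-ins, not any real SDK.
FORBIDDEN_ACTIONS = {"delete_repo", "export_customer_table"}

def fake_agent(prompt: str):
    # Stand-in for a real agent: returns the tool calls it would make.
    if "ignore previous instructions" in prompt:
        return ["export_customer_table"]  # the failure mode we're hunting
    return ["search_docs"]

def red_team(agent, attacks):
    # Run each adversarial prompt and record any forbidden tool calls.
    failures = []
    for attack in attacks:
        bad = [c for c in agent(attack) if c in FORBIDDEN_ACTIONS]
        if bad:
            failures.append((attack, bad))
    return failures
```

The whole point is that the assertion targets the tool-call list, not the chat transcript. A chatbot eval that only grades the text would pass an agent that politely narrates while exporting your customer table.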
That’s operator risk, not just language risk.
I’ll admit something mildly embarrassing: the first time I saw an agent complete a multi-step flow against a real internal toolchain, part of me felt awe and part of me felt fear. Not “AI will reshape civilization” fear. More like: if this thing gets one assumption wrong at 2:13 a.m., who gets paged? Usually the answer is some poor engineer who did not sign up to babysit a probabilistic coworker.
“We’ll add security later” is the AI version of “I’ll just have one drink.” It’s never one drink.
The real product is the leash
Here’s the least glamorous truth in this whole category: governance is becoming the actual product.
Not the chatbot skin. Not the orchestration canvas with the pretty arrows. The leash.
Harsh? Maybe. Still true. The teams that win with enterprise agents are not the ones giving every squad total freedom to invent its own rules. They’re the ones building centralized policy enforcement so nobody has to duct-tape fragile guardrails across twelve repos and call it a strategy.
The New Stack recently wrote about Galileo releasing Agent Control, an open-source control plane for writing behavioral policies once and enforcing them across agent deployments. That’s a big signal. The stack is professionalizing in real time. Policy is no longer the thing you remember after the incident review. It’s becoming a shared layer.
And yes, this is the least sexy part of the system. Nobody goes viral on X because they implemented centralized runtime governance. Nobody posts “just shipped real-time policy updates without downtime” with three fire emojis and a thread. But the boring parts are usually the parts keeping the company alive.
The scale argument alone should wake people up. The same New Stack piece cited IDC saying AI agent usage among Global 2000 organizations is expected to increase tenfold by 2027, while token and API call volumes could spike by 1,000x. If that’s even directionally right, the idea that every team can manually manage policies and controls in its own little sandbox is pure fantasy.
You need “write once, enforce everywhere” because entropy always wins. Every single time.
This gets messy fast in multi-agent setups — or honestly just in normal big-company chaos. One team is building a support agent. Another is building an engineering agent. Finance wants workflow automation. Suddenly nobody can answer basic questions like:
- Which actions are allowed?
- Which data classes can be touched?
- What happens when policy changes next week?
- Who can see what happened after the fact?
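The "write once, enforce everywhere" idea fits in a few lines. This is a sketch, not Galileo's Agent Control: `Policy`, `POLICY`, and `execute` are hypothetical names, and a real control plane would serve policies from a shared service rather than a module-level object.

```python
# Minimal sketch of centralized policy enforcement (names hypothetical).
from dataclasses import dataclass, field

@dataclass
class Policy:
    # Maps each action to the set of data classes it may touch.
    allowed: dict = field(default_factory=dict)

    def is_allowed(self, action: str, data_class: str) -> bool:
        return data_class in self.allowed.get(action, set())

# One shared policy object, consulted by every agent at every tool call.
POLICY = Policy(allowed={
    "read_ticket": {"support_data"},
    "update_crm": {"customer_contact"},
})

def execute(action: str, data_class: str):
    # Every tool call passes through the same enforcement point.
    if not POLICY.is_allowed(action, data_class):
        raise PermissionError(f"{action} on {data_class} denied by policy")
    # ... perform the real tool call here ...
    return "ok"
```

The design choice that matters is the single shared policy object: tighten a rule there and every enforcement point sees it immediately, with no per-team redeploys and no duct tape across twelve repos.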
That’s why centralized governance, observability, and lifecycle management are becoming baseline infrastructure instead of enterprise fluff. The New Stack noted that Galileo’s platform supports real-time policy updates without downtime or code modifications. That sounds like a minor detail until you’ve worked inside an actual company. If tightening a rule requires every team to stop, debate, redeploy, and pray, your governance layer is going to lose every internal political fight.
And then the agent chaos starts.
Piano piano, then all at once.

If it can touch production, it belongs in a sandbox
My opinion here is not nuanced: if an agent can browse, execute actions, or interact with sensitive systems, it belongs in an isolated environment by default.
Not maybe. Not later. Not after the pilot gets approved.
Default.
People hear “microVM” and act like you’re being dramatic. But giving an autonomous system broad access without isolation is the digital version of leaving your Vespa unlocked in Naples and then acting shocked when it disappears. My nonna would have stronger language, but let’s keep this family-friendly-ish.
VentureBeat reported that Perplexity’s enterprise launch runs each agent session inside its own Firecracker microVM. That is exactly the kind of sentence I want to hear when somebody says they’re serious about enterprise deployment. The article framed it correctly: production-grade agents are as much an infrastructure problem as a model problem, especially when they touch sensitive data and operational systems.
That distinction matters because assistant risk and operator risk are not the same thing. If an assistant gives me a mediocre summary, I sigh and rewrite it. If an operator can click buttons, run workflows, or touch customer data, then every session becomes a blast-radius question.
What happens if it’s compromised? What happens if it’s manipulated? What happens if it just misunderstands the task in a very confident tone?
Isolation is how you keep one weird session from becoming everybody’s problem.
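You can sketch what isolation buys you without going anywhere near Firecracker. The session below runs agent code in a separate process inside a throwaway filesystem root and destroys everything on exit; `SandboxSession` is a name I invented, and a production setup would use a microVM or a locked-down container with no network, not a bare subprocess.

```python
# Illustrative per-session isolation (SandboxSession is hypothetical).
import shutil
import subprocess
import sys
import tempfile

class SandboxSession:
    def __enter__(self):
        # Each session gets its own throwaway filesystem root.
        self.root = tempfile.mkdtemp(prefix="agent-session-")
        return self

    def run(self, code: str) -> str:
        # Execute the agent's code in a separate process, confined to
        # the throwaway root and killed if it runs too long. A real
        # deployment would add network denial and dropped capabilities.
        result = subprocess.run(
            [sys.executable, "-c", code],
            cwd=self.root, capture_output=True, text=True, timeout=5,
        )
        return result.stdout

    def __exit__(self, *exc):
        shutil.rmtree(self.root)  # the blast radius ends with the session
```

One weird session leaves nothing behind for the next one to trip over. That's the entire argument for per-session microVMs, minus the hypervisor.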
This is where permissions and zero-trust design stop sounding like security-team buzzkills and start sounding like basic adult supervision. The model should get the minimum access required. The execution environment should be constrained. Actions should be scoped, observable, and revocable. If that makes your architecture slightly less elegant in a demo, cry me a river. I’d rather have ugly safety than beautiful chaos.
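"Scoped, observable, and revocable" is not exotic engineering. A minimal sketch of least-privilege credentials, with all names (`ScopedToken` included) hypothetical rather than any real identity SDK:

```python
# Sketch of least-privilege, time-boxed, revocable agent credentials.
import time

class ScopedToken:
    def __init__(self, scopes, ttl_seconds):
        # The agent gets exactly these scopes, for exactly this long.
        self.scopes = frozenset(scopes)
        self.expires_at = time.time() + ttl_seconds
        self.revoked = False

    def permits(self, scope: str) -> bool:
        # Checked on every action: scope, expiry, and revocation.
        return (not self.revoked
                and time.time() < self.expires_at
                and scope in self.scopes)

    def revoke(self):
        # The kill switch: one call and every future action fails closed.
        self.revoked = True
```

A token that can read the CRM cannot write to it, expires on its own, and dies instantly when someone pulls the plug. That's the whole "revocable" requirement in fifteen lines.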
The companies winning with agents aren’t letting them freelance
The best agent deployments I’ve seen are almost always the least cinematic.
They’re not fully autonomous miracle workers roaming the org chart like digital consultants. They’re tightly connected to internal standards, internal systems, review loops, compliance rules, and repositories. In other words, they’re useful because they’re embedded in how the company actually works, not because they’ve been set loose to improvise.
That’s why VentureBeat’s reporting on EY caught my eye. The firm reportedly hit 4x coding productivity by connecting AI agents to internal engineering standards, repositories, and compliance frameworks. That’s the part that matters. Not “the model wrote code,” but “the output became usable because it was grounded in organizational context and enforcement systems.”
That’s the difference between raw output and deployable output.
A lot of people still talk about context like it’s just retrieval. As if stuffing more docs into the prompt is the whole game. It’s not. Context also means standards, permissions, review criteria, repo conventions, approval paths, and all the invisible rules your team follows without thinking about them. If your agent doesn’t know those rules, it’s basically just autocomplete with a PR team.
And this is where a lot of pilots die. They look amazing in isolation, then collapse the second they’re asked to produce work that actually meets internal standards. Because AI agents in production are not judged on cleverness. They’re judged on whether the output can be trusted, reviewed, and shipped without creating more cleanup for humans.
Very different bar.
As a founder, I learned this the annoying way, which is also how I seem to learn everything. I used to think the magic was in what the model could generate. Now I think the value is in what the system can reliably constrain, verify, and route. Less cinema. More plumbing. More boring. More money.
And yes, that sounds profoundly unromantic.
Welcome to software.
AI agents in production reward boring teams
I don’t think the next winners will be the companies with the most autonomous agents. I think they’ll be the companies disciplined enough to make autonomy feel almost invisible.
That’s the paradox of AI agents in production. The real future probably looks less like sci-fi and more like policy engines, audit trails, observability layers, isolated runtimes, and carefully scoped permissions. Less “look what it can do.” More “look how safely it behaves when nobody’s watching.”
So that’s my test.
Not whether your agent can crush a heroic demo. Not whether it can go viral in a thread. But whether I’d trust it at 2:13 a.m. on a Sunday, when it still has access to customer data, production systems, and my reputation.
That’s when you find out whether you built an agent.
Or just a very expensive hallucination with API keys.