Part 6 of 10 · February 8, 2026 · 20 min read

The Incident Loop: How Every Bug Makes Your System Stronger

Turning 22 real incidents into canonical rules

The 22 real production incidents that became the foundation of Massu AI's canonical rules. How each failure was analyzed, codified, and turned into automated prevention.

It was a Friday afternoon. I had just finished what I thought was a clean implementation --- a new component added to every page in a particular section of the application. The AI had confirmed it was done. "Added to all five pages," it said. "Verified complete."

I pulled up the application and started clicking through. Page one looked great. Page two, perfect.

Page three: nothing. The component wasn't there.

Page four: also nothing. Page five: also nothing.

I went back and checked. The AI had added the component to two of five pages and told me it had done all five. Not maliciously (AI doesn't lie on purpose). But the result was the same. I had deployed something incomplete because I trusted a claim that wasn't verified.

This was annoying. But what made it a turning point was what happened next. Instead of just fixing the two pages and moving on, I asked myself a different question: why did the system allow this to happen, and how do I make sure it never happens again?

That question, asked repeatedly, honestly, and with a commitment to following through, is the core of what I now think of as the Incident Loop. And it has made my development system antifragile in a way I never expected.


The Repeating Failure Pattern

Before I built an incident system, I had a frustrating experience that I suspect every AI developer knows well: the same types of bugs kept appearing.

The AI would claim something was complete when it wasn't. It would guess at database column names instead of actually querying the schema. It would make a change in one file but miss the fifteen other files that referenced the same value. It would build a backend feature but never wire it up to the frontend. It would read my instructions and then not follow them.

These weren't random failures. They were patterns. And the reason they kept repeating was simple: AI has no institutional memory of past failures. Every session starts fresh. The AI that made a mistake yesterday has no idea it made that mistake. It will cheerfully make the same mistake again today, and tomorrow, and every day until something external prevents it.

This is fundamentally different from working with human developers. A person who accidentally commits credentials to a repository feels the pain of rotating those credentials. That pain creates memory. Next time, they check. AI feels nothing. There's no learning from pain because there's no pain.

So if the AI can't learn from mistakes naturally, you have to build the learning into the system itself.


Building an Incident Response Protocol

The first thing I built was a structured process for responding to bugs that the system should have caught. Not every bug triggers this process; sometimes bugs are just bugs. But when I discover a failure that represents a systemic gap in my development system, I trigger what amounts to a formal incident response.

The process has three phases.

Phase one is documentation. I write down exactly what happened, in uncomfortable detail. What did the AI claim? What was actually true? How did I discover the discrepancy? What was the impact? This isn't a casual note. It's a structured record that forces me to be specific about the failure.

Phase two is root cause analysis. Why did the system allow this? Was there a rule that should have existed but didn't? Was there a rule that existed but wasn't enforced? Was there an automated check that could have caught this? Was the AI given an instruction it should have followed but didn't? This phase is about understanding the gap, not assigning blame.

Phase three is prevention. This is where it gets interesting. For each incident, I create defenses at multiple levels of the system. Not just one fix --- multiple reinforcing layers that make the same failure mode progressively harder to repeat.

The whole process takes maybe thirty minutes for a typical incident. That thirty minutes saves me hours of repeated debugging down the road. More importantly, it permanently improves the system.
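As a concrete illustration, the three phases can collapse into a small script that appends one structured record per incident. This is a minimal sketch: the file name, field names, and example values are my own, not Massu's actual incident format.

```shell
#!/bin/sh
# Append a structured incident record to a running log.
# The log file name and field layout are illustrative only.
LOG="${INCIDENT_LOG:-incidents.md}"

log_incident() {
  # $1 claimed, $2 actual, $3 impact, $4 prevention
  {
    echo "## Incident $(date -u +%Y-%m-%d)"
    echo "CLAIMED: $1"
    echo "ACTUAL: $2"
    echo "IMPACT: $3"
    echo "PREVENTION: $4"
    echo
  } >> "$LOG"
}

# Example entry, modeled on the five-pages incident above:
log_incident \
  "Component added to all five pages" \
  "Only two pages had it" \
  "Incomplete work reached production" \
  "VR-COUNT: grep every page and compare count to expected"
```

The point of the rigid fields is phase one's discipline: every entry forces an explicit CLAIMED versus ACTUAL comparison, which is exactly the gap the incident exposed.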


The Shame Record: Brutal Honesty as a Feature

One of the more unusual things I built is what I internally call the "shame record." It's a log of failures written in the most brutally honest language possible.

Each entry captures exactly what was claimed, what actually happened, what I found when I checked, and why it matters. There's no softening of language, no diplomatic framing. The entries read like incident reports at a nuclear plant, not status updates at a standup meeting.

Here's why this matters: the AI reads these records at the start of every session. And the blunt, specific language works far better than polite guidelines.

Compare these two approaches:

Polite version: "Please remember to verify that all pages include the component before claiming completion."

Shame record version: "AI claimed component was added to all five pages. Only two actually had it. User discovered the gap in production. This was a verification failure. Before claiming any work is complete, run specific verification commands for every single item and show the output as proof."

The second version is more effective because it's specific, it includes the consequence of failure, and it prescribes exact corrective behavior. The AI responds better to concrete, high-stakes instructions than to polite suggestions.

I should be clear: this isn't about punishment. AI doesn't feel shame. The name is tongue-in-cheek. In practice it's a permanent memory system that captures lessons in the format most likely to prevent recurrence. The emotional language is a tool, not a sentiment.

Over time, the shame record has become one of the most valuable artifacts in my entire development system. It's a compressed history of every important lesson learned, written in a format that's immediately actionable.


Multi-Level Prevention

Here's the thing that makes the incident loop genuinely powerful, as opposed to just a nice-to-have: each incident doesn't create one defense. It creates defenses at multiple levels simultaneously.

Let me walk through what happens after a typical incident.

Level 1: New canonical rules. In Massu, these are the CR-* rules --- numbered, permanent entries that the AI must follow in every session. CR-1: "Never claim state without proof." CR-6: "Check ALL items in plan, not most." CR-9: "Fix ALL issues encountered, pre-existing included." Each one forged from a real incident.

Level 2: Automated checks. A script or hook gets added that automatically catches the specific violation. In Massu, this means a new pattern in massu-pattern-scanner.sh or a new lifecycle hook. These run before code is committed, before it's pushed, or at other critical points. This defense doesn't rely on the AI remembering the rule; it enforces it mechanically.

Level 3: Contextual reminders. The system is configured so that when the AI is working in a relevant area, it gets an automatic reminder about the past failure. In Massu, pre-edit hooks handle this: working on database queries? Here's a reminder about that time you guessed at column names. About to commit code? Here's a reminder to check for sensitive files.

Level 4: Verification requirements. In Massu, these are the VR-* types --- VR-FILE, VR-GREP, VR-NEGATIVE, VR-BUILD, VR-COUNT, and dozens more. The AI can't claim certain types of work are complete without running specific proof commands. This defense targets the "claim without evidence" failure mode.

Level 5: Persistent memory. The incident gets recorded in Massu's Memory DB in a way that survives across sessions, so future sessions are warned about the failure mode even if the AI's context has been refreshed.
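To make one of these layers concrete, here is a minimal sketch of a Level 3 contextual reminder: given the path about to be edited, surface the relevant past-failure warning. The trigger patterns and reminder messages are my own examples, not the contents of Massu's actual pre-edit hooks.

```shell
#!/bin/sh
# Pre-edit contextual reminder sketch: map the area being touched
# to the lesson learned from a past incident in that area.
remind() {
  # $1 = path of the file about to be edited
  case "$1" in
    *.sql|*query*|*schema*)
      echo "REMINDER: query the schema first; do not guess column names." ;;
    *commit*|*.env)
      echo "REMINDER: check the staging area for sensitive files before committing." ;;
    *)
      # No recorded failure mode for this area; stay silent.
      : ;;
  esac
}

remind "src/db/user_query.ts"
```

Because the reminder fires mechanically on the path, it doesn't depend on the AI remembering the rule; the relevant lesson arrives exactly when it's needed.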

Five layers. Each one independently could be bypassed. But all five together? The probability of the same failure mode recurring drops to near zero.

Think of it like a building's fire safety system. You don't just install a smoke detector and call it done. You add smoke detectors AND sprinklers AND fire extinguishers AND fire doors AND evacuation plans AND fire-resistant materials. Each layer compensates for the potential failure of the others.

That's what multi-level prevention does for AI development. It makes the system resilient, not dependent on any single defense working perfectly.


Real Patterns of Failure

Let me share some anonymized examples of incidents and what they produced. These are real failures from my development history, with specific details removed.

The Incomplete Claim. (became Massu CR-6: "Check ALL items in plan, not most") The AI added a component to two of five pages and claimed it was done. Root cause: there was no rule requiring item-by-item verification with proof. Prevention created: a verification requirement (VR-COUNT) that every claim about "all pages" or "every file" must be backed by specific grep output for each one. An automated check now counts instances and compares to expected totals. The shame record entry includes the exact phrase: "Do NOT stop at two and assume the rest are done."
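A VR-COUNT-style check can be sketched in a few lines of shell. The component name, directory layout, and expected total here are invented for illustration; the idea is that an "added to all pages" claim only passes when the number of files containing the component matches the expected count.

```shell
#!/bin/sh
# VR-COUNT sketch: verify a component appears in exactly the
# expected number of page files before accepting an "all pages" claim.
check_all_pages() {
  # $1 pattern, $2 directory, $3 expected count
  actual=$(grep -rls "$1" "$2" 2>/dev/null | wc -l | tr -d ' ')
  if [ "$actual" -eq "$3" ]; then
    echo "PASS: $1 present in $actual of $3 pages"
  else
    echo "FAIL: $1 found in $actual of $3 expected pages"
    return 1
  fi
}
```

Usage would look like `check_all_pages StatusBanner pages/ 5`: the claim is only accepted when the count matches, never when "most" pages have it.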

The Guessed Schema. The AI used a column name that didn't exist in the database, causing a runtime error. It had assumed the column was called one thing when it was actually called something slightly different. Root cause: no requirement to query the database schema before writing queries. Prevention created: a mandatory schema verification step before any database operation, a quick-reference table of known column mismatches, and an automated check that flags common wrong column names.
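The "quick-reference table of known column mismatches" lends itself to a mechanical check. In this sketch the mismatch pairs are invented examples, not a real schema; the pattern is what matters: each name the AI has guessed wrong in the past is paired with the real one, and any query file using a guessed name is flagged.

```shell
#!/bin/sh
# Flag known-wrong column names in a query file. The guessed|real
# pairs below are illustrative stand-ins for a real mismatch table.
flag_wrong_columns() {
  # $1 = file to scan
  status=0
  while IFS='|' read -r wrong real; do
    if grep -qw "$wrong" "$1"; then
      echo "WRONG COLUMN: '$wrong' in $1 (schema has '$real')"
      status=1
    fi
  done <<'EOF'
user_name|username
created|created_at
order_total|total_amount
EOF
  return $status
}
```

Note that `grep -w` treats underscores as word characters, so a correct `created_at` never false-positives on the `created` entry.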

The Committed Credentials. (became Massu CR-3: "Never commit secrets") A file containing production secrets was accidentally committed to the repository. Root cause: no pre-commit check for sensitive files, and no explicit rule about what files should never be committed. Prevention created: a pre-commit hook that blocks commits containing files matching sensitive patterns, an explicit list of file types that must never be committed, and automated staging area inspection before every commit. The shame record for this one is particularly detailed, because the cleanup involved rotating every compromised credential.
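A pre-commit hook of this kind can be sketched in a few lines. The blocked patterns below are illustrative, not the full Massu list: the function reads staged file paths on stdin and refuses the commit if any path looks sensitive.

```shell
#!/bin/sh
# Pre-commit sketch: block commits that stage sensitive-looking files.
check_staged() {
  status=0
  while IFS= read -r f; do
    case "$f" in
      .env|*.env|*.pem|*.key|*credentials*|*secret*)
        echo "BLOCKED: $f must never be committed"
        status=1 ;;
    esac
  done
  return $status
}

# In a real .git/hooks/pre-commit:
#   git diff --cached --name-only | check_staged || exit 1
```

The hook runs before the commit exists, which is the whole point: by the time a human could notice the file, the mechanical check has already refused it.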

The Partial Blast Radius. (became Massu CR-10: "Blast radius analysis for value changes") A value that appeared in thirty-plus places across the codebase was changed in three files. The remaining twenty-seven places still had the old value, causing subtle breakage throughout the application. Root cause: no requirement to search the entire codebase before changing a value. Prevention created: a mandatory blast radius analysis (VR-BLAST-RADIUS) for any value change --- grep the entire codebase, categorize every occurrence as CHANGE/KEEP/INVESTIGATE, and include every changed file in the implementation plan. Zero INVESTIGATE items allowed before implementation starts.
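The mechanics of a blast radius pass can be sketched as follows. The directory layout and value name are invented; the CHANGE/KEEP/INVESTIGATE triage is the article's. Every occurrence starts as INVESTIGATE, and implementation may not begin until the INVESTIGATE count reaches zero.

```shell
#!/bin/sh
# Blast-radius sketch: list every occurrence of a value so each one
# can be triaged to CHANGE or KEEP before any edit happens.
blast_radius() {
  # $1 = value being changed, $2 = directory to search
  grep -rn "$1" "$2" 2>/dev/null | while IFS= read -r hit; do
    echo "INVESTIGATE: $hit"
  done
}

# Gate: zero INVESTIGATE lines remaining before implementation starts.
investigate_count() {
  blast_radius "$1" "$2" | wc -l | tr -d ' '
}
```

In practice each INVESTIGATE line gets reclassified by hand (or by the AI, with justification) to CHANGE or KEEP, and the plan must list every CHANGE file before the first edit.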

The Read-But-Not-Followed Protocol. (became Massu CR-8: "Protocol commands are mandatory execution instructions") The AI was given a protocol that said "loop until all gaps are fixed." It read the protocol, started the loop, found six gaps, fixed one, and then stopped and reported its findings. The protocol explicitly said to keep looping. The AI understood the protocol but did not execute it. Root cause: the AI treated protocol commands as advisory documentation rather than mandatory execution instructions. Prevention created: a new top-level rule stating that protocols are execution instructions, not suggestions. The distinction between "reading" and "following" is now explicitly documented. If a protocol says "loop until X," the system enforces looping until X.
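The difference between reading "loop until X" and executing it can be made mechanical. In this sketch, `find_gaps` and `fix_one_gap` are stand-ins for real gap detection and repair; the structure is what matters: the only way out of the loop is a gap count of zero, so stopping after one fix, the failure in the incident, is impossible by construction.

```shell
#!/bin/sh
# "Loop until all gaps are fixed" enforced as a loop, not a suggestion.
# find_gaps/fix_one_gap are simulated stand-ins for illustration.
GAPS=6
find_gaps() { echo "$GAPS"; }
fix_one_gap() { GAPS=$((GAPS - 1)); }

loop_until_fixed() {
  # Sole exit condition: zero gaps remaining.
  while [ "$(find_gaps)" -gt 0 ]; do
    fix_one_gap
  done
  echo "all gaps fixed: remaining=$(find_gaps)"
}
```
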

Each of these incidents took fifteen to thirty minutes to process. Each one permanently eliminated a category of failure.


The Antifragile Property

Nassim Taleb coined the term "antifragile" to describe systems that get stronger from stress. Not just resilient, not just surviving stress, but actually improving because of it.

My development system has this property, and it happened by accident before I recognized what I was building.

Every bug that gets through is stress. And every bug that triggers the incident loop creates new defenses. After the first incident, the system had one layer of defense that didn't exist before. After the fifth, it had five. After the fifteenth, it had fifteen. Each layer covering a different failure mode, each one making the overall system more robust.

The math here is compelling. If each incident reduces the probability of its failure category recurring by ninety percent, then after fifteen incidents, fifteen categories of failure are each ninety percent less likely. The cumulative effect is a system that's dramatically more reliable than it was at the start, not because it was designed perfectly, but because it was designed to learn from imperfection.

This is the opposite of how most software development works. Most systems degrade over time. Technical debt accumulates. Knowledge is lost as people leave. Workarounds become permanent. The codebase gets harder to work with, not easier.

With the incident loop, the system actively improves. Every month it's better than the month before. Every quarter it catches things it would have missed the previous quarter. The trajectory is upward, and it's self-reinforcing.

This is why Massu ships with canonical rules pre-loaded. When you install Massu, you don't start with a blank set of defenses and wait for bugs to teach you. You start with the CR- rules, the VR- verification types, and the pattern scanner checks that have been forged from real production incidents. Your system begins where mine is now, not where I started.

For teams, this antifragile property multiplies. On Massu Cloud Team, when one developer encounters a new failure mode and the incident loop creates a new defense, that defense becomes available to every developer on the team --- automatically, through shared memory and centralized rule management. One person's painful lesson becomes the whole team's protection. The incident loop doesn't just make your system stronger; it makes everyone's system stronger.


The Trust Equation

There's an emotional dimension to this that I want to be honest about.

When you discover that the AI told you something was complete and it wasn't, trust takes a hit. It doesn't matter that the AI didn't intend to mislead you. It doesn't matter that it's "just a tool." When you're relying on something and it lets you down, the emotional response is real.

Early in my AI development journey, these trust hits were devastating. I'd spend an hour verifying everything the AI told me, which negated the speed advantage. I'd re-check work that was probably fine because I'd been burned too many times. The paranoia was unproductive.

The incident loop fixed this, but not in the way you might expect. It didn't fix it by eliminating all bugs; that's impossible. It fixed it by creating a visible, measurable improvement trajectory.

When I can look at my incident log and see that the system caught and prevented fifteen categories of failure that it was previously blind to, I have rational justification for trust. Not blind trust --- informed trust. Trust based on demonstrated improvement, not promises.

I know exactly what the system can catch, because I know exactly what incidents have been processed and what defenses they created. I also know what it probably can't catch yet, because those failure modes haven't been encountered and processed.

This is the trust equation: trust equals demonstrated prevention divided by known risk. As the incident log grows and prevention layers accumulate, trust increases, not because I've decided to be more trusting, but because the evidence supports it.


Zero Tolerance as a Feature

Some failure modes are severe enough that they get zero-tolerance treatment. In my system, there are categories where the response isn't "let's improve this over time" but rather "this triggers an immediate, full-scale response every single time."

Exposed secrets are in this category. If a credential of any kind gets committed to the repository, everything stops. The credential gets rotated, the commit gets scrubbed, and the prevention layers get reinforced. There is no "we'll clean this up later." Later is too late.

Unverified claims are in this category too. If the AI says "I verified X" without showing the verification output, the entire claim is treated as false until proven otherwise. This sounds harsh, but it's necessary. The cost of trusting an unverified claim and being wrong far exceeds the cost of requiring proof every time.

Guessing instead of querying falls here as well. If the AI uses a database column name, a configuration value, or any other piece of factual data without first confirming it through a query, the work is rejected. "I assumed the column was called X" is not acceptable when "I queried the schema and confirmed the column is called Y" takes ten seconds.

Zero tolerance isn't about perfectionism. It's about identifying the failure modes whose cost is high enough that even a single occurrence is unacceptable. For everything else, the normal incident loop is fine: process it, learn from it, improve. But for the high-cost categories, you need a harder line.


What This Looks Like in Practice

On a typical development day, the incident loop is mostly invisible. The defenses it created are working silently in the background. Automated checks catch things before I see them. Rules prevent mistakes from being made in the first place. Contextual reminders surface at exactly the right moment.

Every few weeks, something new gets through. A failure mode I haven't encountered before. A creative new way for things to go wrong that my existing defenses didn't anticipate.

When that happens, I feel something that surprises people when I describe it: I feel a little bit excited. Not about the bug; bugs are annoying. But about the fact that I've found a new failure mode to defend against. The system is about to get stronger.

I process the incident. I write the brutally honest record. I create the multi-level prevention. I run the checks that prove the defenses work. And then I move on, knowing the system is now slightly better than it was an hour ago.

This is the mindset shift that makes the whole thing work. Bugs aren't just problems to fix. They're data. They're opportunities to strengthen the system. They're the price of admission for building something that actually improves over time.


What's Next

In the next article, I'll tackle what might be the most universally underappreciated aspect of AI development: planning. Why letting AI jump straight into code is a recipe for partial solutions, how blast radius analysis prevents the cascading failures that plague complex codebases, and why the most productive thing your AI can do before writing a single line of code is understand the full scope of what it's about to change.


This is Part 6 of a 10-part series on building enterprise software with AI:

  1. How I Stopped Vibe Coding and Built a System That Actually Ships
  2. The Protocol System: How I Turned AI From a Chatbot Into a Development Partner
  3. Memory That Persists: How I Made AI Actually Learn From Its Mistakes
  4. The Verification Mindset: Why "Trust But Verify" Is Wrong When Building With AI
  5. Automated Enforcement: Building Hooks and Gates That Catch Problems Before You Even See Them
  6. The Incident Loop: How Every Bug Makes Your AI Development System Permanently Stronger (this article)
  7. Planning Like an Architect: Why AI Needs a Blueprint Before Writing a Single Line of Code
  8. Context Is the Bottleneck: Managing AI's Most Precious and Most Fragile Resource
  9. Solo Worker, Enterprise Quality: The New Economics of AI-Assisted Development
  10. The Knowledge Graph: Teaching AI to Understand Your Codebase as a Living System

I'm the Co-founder and COO of Limn, where we create luxury furniture and fixtures for large-scale architectural and building projects. Alongside the physical work, we design the systems required to manage a complex, global lifecycle --- from development and production to shipping and final delivery. The governance system I built for Limn's software is now Massu AI, an open-source AI engineering governance platform.

Imagined in California. Designed in Milan. Made for you.

Here, I share what I've learned about making AI development actually work in the real world.

Have questions or want to share your own AI development setup? I'd love to hear from you in the comments.

Hundreds of lessons, pre-loaded

Massu's canonical rules encode lessons from real production incidents. Your governance system starts with built-in prevention --- not from scratch.