By the time you read this, I will have written hundreds of rules governing how AI writes code in my system. Rules about database patterns, security requirements, UI consistency, naming conventions, serialization quirks, and dozens of other things that have bitten me in production.
And I need to tell you something uncomfortable: the AI ignores them constantly.
Not maliciously. Not even consistently. It reads the rules, acknowledges them, follows them for a while, and then, somewhere around the third hour of a complex session, starts violating them like they never existed. A rule I wrote in blood after a production incident, a rule the AI literally quoted back to me twenty minutes ago, just... forgotten.
This is the gap between rules and compliance. And closing that gap is what this article is about.
Rules are necessary. Without them, you get chaos. But rules alone are suggestions. What turns suggestions into guarantees is enforcement: automated systems that catch violations at the moment they happen, without requiring a human to be watching.
The Hook Lifecycle: Where Enforcement Lives
Modern AI coding tools expose lifecycle events --- specific moments in the development workflow where you can inject custom logic. Think of them as tripwires strung across the AI's path. Every time it does something, it has to pass through one of these checkpoints.
The lifecycle events I care about fall into a few categories:
Before actions. The AI is about to edit a file, run a command, or make a change. You can inspect what it's about to do and either allow it, block it, or inject context before it proceeds.
After actions. The AI just edited a file or ran a command. You can inspect what changed and flag problems immediately.
Session boundaries. A session is starting, ending, or the AI is about to stop working. You can load context, persist state, or perform final checks.
Context events. The AI's working memory is getting compressed or summarized. You can ensure critical information survives the compression.
Each of these is an enforcement opportunity. Miss one, and you've left a gap in your defenses. Wire them all up, and you've built something that's genuinely hard to sneak bad code past, even for an AI that isn't trying to sneak anything.
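To make the four categories concrete, here is a minimal sketch of a lifecycle dispatcher. This is my own illustration, not Massu's implementation; the event names and the allow/block handler contract are assumptions.

```python
from collections import defaultdict

# Hypothetical lifecycle event names -- real tools will use their own.
EVENTS = {"pre_action", "post_action", "session_start", "session_end", "pre_compaction"}

class HookRegistry:
    """Minimal dispatcher: each lifecycle event fans out to its registered hooks."""
    def __init__(self):
        self._hooks = defaultdict(list)

    def on(self, event):
        if event not in EVENTS:
            raise ValueError(f"unknown lifecycle event: {event}")
        def register(fn):
            self._hooks[event].append(fn)
            return fn
        return register

    def fire(self, event, payload):
        # Collect every hook's verdict; a single "block" vetoes the action.
        results = [hook(payload) for hook in self._hooks[event]]
        return "block" if "block" in results else "allow"

hooks = HookRegistry()

@hooks.on("pre_action")
def no_env_files(payload):
    # Illustrative tripwire: refuse to touch environment files.
    return "block" if payload.get("path", "").endswith(".env") else "allow"
```

The veto semantics are the important design choice: any one checkpoint can stop the action, so adding a new defense never weakens an existing one.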
Massu AI ships with eleven hooks pre-configured across all four categories: session initialization hooks that load context and surface warnings, pre-edit hooks that inject domain-specific reminders, post-edit hooks that scan for known anti-patterns, pre-commit hooks that run the pattern scanner, and context-compression hooks that trigger recovery protocols. All eleven are ready to use out of the box --- no configuration required for the defaults, with full customization available through massu.config.yaml. A conventions config system makes the entire enforcement infrastructure portable across projects --- controlling the .claude directory name, session state paths, knowledge categories, and exclude patterns, so teams can adopt the system without restructuring their existing codebase.
Preventive Hooks: Stopping Problems Before They Start
The most valuable hooks fire before the AI does something. Prevention is always cheaper than detection.
Here's a concrete example. My system has a database client with specific conventions. You use one accessor pattern, not another. There's a table that looks like it should be called one thing but is actually called something else. There are serialization requirements for certain data types that, if ignored, cause silent 500 errors in production.
When the AI is about to edit a file in the database layer, a pre-action hook fires. It doesn't block the edit; that would be too aggressive. Instead, it injects a reminder: "You're editing a database-related file. Remember: use the correct client accessor. Remember: this table is called X, not Y. Remember: BigInt values need explicit conversion on return."
This sounds trivial. It's not. The difference between the AI having these reminders fresh in its working memory versus buried somewhere in a rules document it read forty-five minutes ago is the difference between correct code and a production bug.
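A reminder-injection hook can be as simple as a glob-to-messages map. This is a hedged sketch under assumed paths and table names; the patterns and reminder text below are illustrative, not Massu's real rules.

```python
import fnmatch

# Hypothetical reminder map: glob pattern -> domain-specific reminders.
REMINDERS = {
    "src/db/*": [
        "Use the canonical client accessor, not a raw connection.",
        "The sessions table is named 'user_sessions', not 'sessions'.",
        "Convert BigInt values explicitly before returning JSON.",
    ],
    "src/api/*": [
        "Every public mutation needs the auth wrapper.",
    ],
}

def reminders_for(path: str) -> list[str]:
    """Return every reminder whose glob matches the file about to be edited."""
    out = []
    for pattern, messages in REMINDERS.items():
        if fnmatch.fnmatch(path, pattern):
            out.extend(messages)
    return out
```

The hook concatenates the matching reminders and injects them as context before the edit proceeds; files outside the sensitive areas produce no output at all, which keeps the mechanism silent until it matters.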
Another preventive hook watches for security violations. Before any command runs, the hook inspects it. Is it trying to commit a file that matches a sensitive pattern (environment files, key files, credential files)? Block it. Hard stop. Don't even let it proceed. This particular hook exists because of a real incident where secrets were committed to a repository, requiring hours of credential rotation. That will never happen again, not because I'm more careful, but because the system won't physically allow it.
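The blocking variant might look like this: inspect the command before it runs and hard-stop anything that stages a sensitive file. The deny-list patterns here are assumptions you would tune to your own repository.

```python
import re
import shlex

# Illustrative deny-list of sensitive-file patterns -- adjust for your repo.
SENSITIVE = [re.compile(p) for p in (r"\.env(\.|$)", r"\.pem$", r"credentials", r"_key\b")]

def check_command(command: str) -> str:
    """Pre-command hook: block any git add/commit that touches a sensitive file."""
    tokens = shlex.split(command)
    if tokens[:1] == ["git"] and tokens[1:2] and tokens[1] in {"add", "commit"}:
        for token in tokens[2:]:
            if any(p.search(token) for p in SENSITIVE):
                return f"block: {token} matches a sensitive-file pattern"
    return "allow"
```

Note that the verdict is binary and non-negotiable: a "block" here is the "don't even let it proceed" behavior described above, not a warning the agent can talk its way past.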
Preventive hooks also handle context injection at session start. When the AI begins a new session, hooks automatically load the current session state, surface past failures relevant to the current task, and warn about unresolved issues from previous sessions. The AI doesn't start from a blank slate; it starts with the institutional memory of every session that came before.
This matters more than people realize. Without session-start hooks, every new session is an amnesiac starting fresh. With them, it's more like an experienced colleague picking up where they left off.
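A session-start hook in this spirit just reads persisted state and surfaces anything unresolved. The file names (`state.json`, `warnings.json`) are assumptions for the sketch; the real state paths would come from configuration.

```python
import json
from pathlib import Path

def load_session_context(state_dir: Path) -> dict:
    """Session-start hook sketch: load prior state and unresolved warnings.

    Assumed layout: state_dir holds 'state.json' (last session's task context)
    and 'warnings.json' (a list of unresolved issues to surface immediately).
    """
    state_file = state_dir / "state.json"
    warnings_file = state_dir / "warnings.json"
    state = json.loads(state_file.read_text()) if state_file.exists() else {}
    warnings = json.loads(warnings_file.read_text()) if warnings_file.exists() else []
    return {"state": state, "unresolved_warnings": warnings}
```

Missing files degrade gracefully to an empty context, so a brand-new project gets the same hook with nothing to report.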
Detective Hooks: Catching What Slipped Through
Prevention can't catch everything. Some problems only become visible after the code is written. That's where post-action hooks come in.
After every file edit, my system runs a series of checks against the changed content. These aren't full test suites; those come later. These are fast, targeted pattern matches looking for known anti-patterns.
Did the AI just use the wrong function call in a router file? Flag it immediately. Did it use an include statement that the database client silently ignores? Flag it. Did it add a public mutation endpoint without authentication? Flag it. Did it import a heavy dependency in a file that runs on the edge runtime? Flag it.
Each of these checks corresponds to a real bug that made it to production at some point. The post-action hook is the scar tissue from that wound. It fires within seconds of the edit, while the AI still has full context about what it was trying to do. The correction is immediate and cheap.
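Because these checks are plain pattern matches over the changed content, a post-edit hook can be tiny and fast. The three patterns below are illustrative stand-ins for the violation classes just described, not Massu's actual check list.

```python
import re

# Each check is (compiled pattern, flag message). All illustrative.
POST_EDIT_CHECKS = [
    (re.compile(r"publicProcedure\.mutation"), "public mutation without auth wrapper?"),
    (re.compile(r"^\s*import .*heavy_sdk", re.M), "heavy dependency in an edge-runtime file"),
    (re.compile(r"\binclude\s*:"), "'include' is silently ignored by this client"),
]

def check_edit(changed_text: str) -> list[str]:
    """Fast post-edit scan: return a flag message for every matching anti-pattern."""
    return [msg for pattern, msg in POST_EDIT_CHECKS if pattern.search(changed_text)]
```

Running a handful of compiled regexes over one file's diff takes milliseconds, which is what makes firing this on every single edit viable.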
The key design principle is speed. Post-action hooks need to be fast enough that they don't disrupt the AI's workflow. A hook that takes thirty seconds to run will be perceived as friction and worked around. A hook that takes two seconds and delivers a precise, actionable message becomes a natural part of the development flow.
I keep post-action hooks focused on high-confidence violations: things that are always wrong, not things that might be wrong. If a check has a high false positive rate, it gets demoted to a warning or moved to a different layer of defense. More on that later.
Pattern Scanning: The Regex Safety Net
Beyond individual hooks, I run a comprehensive pattern scanner before code is committed. This is a dedicated script that checks the entire codebase, or at least the changed portions of it, against dozens of known violation patterns.
The scanner is conceptually simple: a collection of regex patterns, each associated with a severity level and an explanation. Some patterns are blocking; if they match, the commit is rejected. Others are warnings that get logged but don't stop progress.
Here's what the scanner catches, in broad strokes:
Structural violations. Using the wrong client, importing forbidden modules, using patterns that have been explicitly deprecated.
Security violations. Public mutations without authentication, exposed credentials, prototype pollution vectors.
Consistency violations. Using inline implementations of utilities that have canonical versions, mixing naming conventions, using deprecated CSS patterns.
Silent failure patterns. Code that won't crash but will silently do the wrong thing. These are the worst kind of bugs because they look like they're working until someone notices the data is wrong. Things like relation queries that get silently ignored, or BigInt values that get serialized incorrectly.
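Structurally, a scanner like this is just a list of (pattern, severity, message) triples and a loop. This sketch uses invented patterns for each category above; the severity split between blocking and warning mirrors the behavior described, not the script's real contents.

```python
import re
from dataclasses import dataclass

@dataclass
class Check:
    pattern: re.Pattern
    severity: str  # "blocking" | "warning" | "info"
    message: str

# One illustrative pattern per category described above.
CHECKS = [
    Check(re.compile(r"from legacy_client import"), "blocking", "structural: deprecated client"),
    Check(re.compile(r"AWS_SECRET|PRIVATE KEY"), "blocking", "security: exposed credential"),
    Check(re.compile(r"color:\s*#[0-9a-fA-F]{3,6}"), "warning", "consistency: hardcoded color, use the token"),
]

def scan(text: str) -> tuple[bool, list[str]]:
    """Return (commit_allowed, findings). Any blocking match rejects the commit."""
    findings = [f"{c.severity}: {c.message}" for c in CHECKS if c.pattern.search(text)]
    allowed = not any(f.startswith("blocking") for f in findings)
    return allowed, findings
```

Rejecting the commit on any blocking match, while letting warnings through with a log entry, is what separates the two severity tiers in practice.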
In Massu, this is massu-pattern-scanner.sh --- a script that runs in about two seconds and catches things that would take a human reviewer minutes to spot, if they spotted them at all. It's not intelligent. It doesn't understand code semantics. It just matches patterns. But the patterns encode months of hard-won knowledge about what goes wrong in real codebases, and that makes them remarkably effective. Massu's scanner ships with eight built-in check categories, and you can add your own patterns as your codebase develops its own failure modes.
One underappreciated benefit of pattern scanning: it creates a living inventory of your codebase's failure modes. When I look at the scanner configuration, I see a history of everything that's ever gone wrong, organized by category and severity. It's documentation and enforcement in the same artifact.
Auto-Review: Don't Let AI Grade Its Own Homework
This is the concept I'm most proud of, and the one that catches the most problems.
When the AI finishes a piece of work and claims it's complete, the system doesn't just take its word for it. In Massu, running /massu-review spawns a separate AI agent --- a fresh instance with no shared context from the implementation session --- and asks it to review the changes across seven dimensions: patterns, security, architecture, website compliance, AI-specific concerns, performance, and accessibility.
This is critical: the reviewing agent didn't write the code. It has no ego investment in the implementation. It has no accumulated context that might cause it to overlook things the way the implementation agent does. It comes in cold, reads the changes, and applies a critical eye.
There's an important refinement here: not all review tasks require the same level of AI reasoning. I've started using different model tiers for different types of review agents. Adversarial agents that make judgment calls (security reviewers, architecture reviewers, plan auditors) use the most capable models available, because the cost of a missed security vulnerability or a flawed architectural decision far exceeds the cost of a more expensive model. Meanwhile, agents that do comparison and search work (pattern compliance checkers, output quality scorers, schema validators) use more efficient models, because their task is well-defined and doesn't require the same depth of reasoning. This model tier approach keeps review costs manageable without compromising quality where it matters most.
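The tiering itself can be a one-line lookup. The role names and tier labels here are hypothetical; the only real decision is the default, which errs toward the capable tier.

```python
# Hypothetical tier mapping: judgment-heavy reviewers get the strongest model,
# well-defined comparison tasks get a cheaper one. All names are illustrative.
MODEL_TIERS = {
    "security_reviewer": "top-tier",
    "architecture_reviewer": "top-tier",
    "plan_auditor": "top-tier",
    "pattern_checker": "efficient",
    "schema_validator": "efficient",
    "output_scorer": "efficient",
}

def model_for(agent_role: str) -> str:
    # Default unknown roles to the capable tier: overspending on a review
    # is cheaper than a missed vulnerability.
    return MODEL_TIERS.get(agent_role, "top-tier")
```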
The reviewing agent has a specific mandate: find problems. It checks that the code follows established patterns. It verifies that claimed completions are actually complete. It looks for edge cases the implementation agent might have missed. It checks that UI changes are actually wired up and rendered, not just created as orphaned components.
Why does this work so well? Because of a fundamental quirk of AI behavior: the implementation agent is in "builder" mode. It's focused on making things work, which creates a confirmation bias toward seeing things as working. The reviewing agent is in "critic" mode. It's focused on finding problems, which creates the opposite bias. You want both perspectives, and you can't get them from the same agent in the same session.
The auto-review catches roughly one significant issue per major feature implementation. Sometimes it's a missing error state. Sometimes it's a component that was created but never imported. Sometimes it's a subtle deviation from the patterns document. These are things I would not catch in manual review because I'm not as close to the patterns as the automated reviewer is.
The fresh-eyes principle applies to humans too, of course. Code review exists in traditional development for exactly this reason. The difference is that I don't have a team of human reviewers. The AI can be its own reviewer, as long as you enforce the separation between the agent that builds and the agent that reviews.
Defense in Depth: No Single Layer Is Enough
If you've been paying attention, you'll notice I haven't described a single enforcement mechanism. I've described five: rules, pre-action hooks, post-action hooks, pattern scanning, and auto-review. This is intentional.
No single layer catches everything. Rules are comprehensive but get forgotten during long sessions. Pre-action hooks are timely but limited to the specific triggers you've configured. Post-action hooks catch known patterns but miss novel violations. Pattern scanning is thorough but regex-based, so it can't understand intent. Auto-review is intelligent but adds overhead and occasionally misses things the pattern scanner would catch.
Layer them together, and the holes in each layer are covered by the others.
I think of it as a series of progressively finer filters --- all of which Massu implements out of the box:
- Rules (coarsest) --- massu.config.yaml and canonical rules establish what "correct" looks like
- Pre-action hooks --- remind and prevent before changes happen (3 of Massu's 11 hooks)
- Post-action hooks --- detect violations immediately after changes (3 of Massu's 11 hooks)
- Pattern scanner --- massu-pattern-scanner.sh runs comprehensive structural checks before commit
- Auto-review --- /massu-review spawns an independent agent for 7-dimension semantic review
- Manual verification (finest) --- human judgment on the final product

A violation that slips past the rules might get caught by a hook. One that slips past the hooks might get caught by the scanner. One that slips past the scanner might get caught by auto-review. And one that slips past everything automated still has to pass manual verification.
In practice, I rarely catch problems in manual verification anymore. Not because I've become less rigorous, but because the automated layers have become so effective that there's not much left for me to find.
Managing False Positives: The Noise Problem
Automated enforcement has a dark side: noise. If your hooks fire too aggressively, they become background noise that gets ignored. If your pattern scanner has too many false positives, people (and AI agents) learn to dismiss its warnings.
This is a real problem, and I've had to iterate on it significantly. Early versions of my enforcement system were too aggressive. Every file edit triggered a wall of reminders. The pattern scanner flagged code that was technically correct but matched a violation pattern. The AI would spend more time responding to false alarms than writing code.
The fix came from tiering. I split violations into three categories:
Blocking: things that are always wrong, no exceptions. Using the wrong database client. Committing secret files. Public mutations without auth. These stop everything until they're fixed.
Warning: things that are usually wrong but sometimes intentional. These get flagged but don't stop progress. The AI is expected to acknowledge them and either fix the issue or explain why it's correct in this case.
Informational: things worth noting but not worth interrupting flow for. These get logged for the next review pass but don't surface during active development.
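The three tiers translate directly into three different behaviors at enforcement time. A sketch of the dispatch, with semantics matching the categories above (the exit/print/log mechanics are my own simplification):

```python
import logging

def handle_finding(severity: str, message: str) -> bool:
    """Dispatch one finding by tier. Returns True if work may continue."""
    if severity == "blocking":
        # Always wrong, no exceptions: stop everything until it's fixed.
        raise SystemExit(f"BLOCKED: {message}")
    if severity == "warning":
        # Usually wrong: surface it, expect acknowledgment or justification.
        print(f"WARNING (fix or explain): {message}")
        return True
    # Informational: log for the next review pass, don't interrupt flow.
    logging.info("note for next review pass: %s", message)
    return True
```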
Getting the tier assignments right is an ongoing process. A new check usually starts as blocking until I understand its false positive rate. If it fires too often on correct code, it gets demoted to warning. If it's mostly noise, it drops to informational or gets removed entirely.
I also added kill switches for rapid iteration. When I'm in the middle of a complex refactor and I know the intermediate states will violate patterns that the final state won't, I can temporarily suppress specific categories of checks. The key word is "temporarily": kill switches auto-expire, so I can't accidentally leave them on.
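An auto-expiring kill switch needs nothing more than an expiry timestamp per suppressed category. This is a sketch of the idea, not Massu's mechanism:

```python
import time

class KillSwitch:
    """Temporarily suppress a check category; suppression auto-expires."""
    def __init__(self):
        self._suppressed: dict[str, float] = {}  # category -> expiry timestamp

    def suppress(self, category: str, minutes: float) -> None:
        self._suppressed[category] = time.time() + minutes * 60

    def is_active(self, category: str) -> bool:
        expiry = self._suppressed.get(category)
        if expiry is None:
            return False
        if time.time() >= expiry:
            del self._suppressed[category]  # expired: enforcement resumes
            return False
        return True
```

Because expiry is checked lazily on every lookup, there is no background timer to forget about: the moment the window passes, the next check fires normally.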
The lesson is that enforcement only works if people trust it. An enforcement system that cries wolf gets disabled. An enforcement system that's precise and helpful becomes something you rely on.
What I Actually See Day-to-Day
Let me paint a picture of what this looks like in practice, because it might sound more complex than it feels.
I start a new session. The session-start hook fires automatically, loading my current task context and surfacing two warnings: a database migration from yesterday hasn't been deployed to the staging environment, and the last session ended with uncommitted changes in a specific file. I wasn't aware of either issue. Without the hook, I would have started working on something new and discovered these problems later, at a worse time.
I ask the AI to implement a new feature. It starts editing files. Pre-action hooks fire silently in the background, injecting reminders when it touches sensitive areas of the codebase. I don't see these unless something gets blocked. The AI writes code that follows the patterns, partly because the rules are loaded and partly because the hooks are reinforcing them at the right moments.
The AI finishes and says the feature is complete. Post-action hooks have already flagged one issue: a new router function is missing the authentication wrapper. The AI fixes it immediately. Then the pattern scanner runs and catches a subtle issue, an import that would cause problems in the edge runtime. Fixed.
Finally, the auto-reviewer spawns, reads through all the changes, and flags two things: an error state that's missing a retry button, and a component that's exported but never imported in any page file. Both are fixed. The feature ships clean.
Total time added by enforcement: maybe five minutes. Time saved by not debugging these issues in production: immeasurable.
The Bigger Picture
Automated enforcement changed my relationship with AI-assisted development. Before hooks and gates, I was the quality system. Every bug was my fault for not catching it. Every pattern violation was my failure to remind the AI. I was the weakest link in my own development process, because I'm human, and humans miss things.
After enforcement, quality is systemic. The system catches things I would miss. It enforces consistency I couldn't maintain manually. It remembers every mistake ever made and ensures that mistake's specific failure mode can never recur.
This isn't about distrusting AI. It's about building a system where trust is verified automatically, where compliance is structural rather than behavioral, and where the cost of a mistake is a two-second hook firing rather than a production incident.
If you're building with AI and you don't have automated enforcement, you're relying on the AI's good intentions and your own vigilance. Both will fail you eventually. Build the gates. Wire up the hooks. Let the machines enforce the standards so you can focus on building the product.
Or, install them. Massu ships with all six layers pre-configured and calibrated through hundreds of real development sessions. The hooks have been tuned to minimize false positives. The pattern scanner has been refined against real codebases. The tiering between blocking, warning, and informational has been tested in production. You don't need to spend months building this infrastructure from scratch --- you can start with a system that's already been through the iteration I described in this article.
What's Next
In the next article, I'll talk about the incident loop: how every bug that makes it through all these defenses becomes a permanent improvement to the system. Enforcement catches known problems. The incident loop is how unknown problems become known ones.
This is Part 5 of a 10-part series on building enterprise software with AI:
- How I Stopped Vibe Coding and Built a System That Actually Ships
- The Protocol System: How I Turned AI From a Chatbot Into a Development Partner
- Memory That Persists: How I Made AI Actually Learn From Its Mistakes
- The Verification Mindset: Why "Trust But Verify" Is Wrong When Building With AI
- Automated Enforcement: Building Hooks and Gates That Catch Problems Before You Even See Them (this article)
- The Incident Loop: How Every Bug Makes Your AI Development System Permanently Stronger
- Planning Like an Architect: Why AI Needs a Blueprint Before Writing a Single Line of Code
- Context Is the Bottleneck: Managing AI's Most Precious and Most Fragile Resource
- Solo Worker, Enterprise Quality: The New Economics of AI-Assisted Development
- The Knowledge Graph: Teaching AI to Understand Your Codebase as a Living System
I'm the Co-founder and COO of Limn, where we create luxury furniture and fixtures for large-scale architectural and building projects. Alongside the physical work, we design the systems required to manage a complex, global lifecycle --- from development and production to shipping and final delivery. The governance system I built for Limn's software is now Massu AI, an open-source AI engineering governance platform.
Imagined in California. Designed in Milan. Made for you.
Here, I share what I've learned about making AI development actually work in the real world.
Have questions or want to share your own AI development setup? I'd love to hear from you in the comments.