Agent Governance — how to manage a swarm of 50 AI agents without losing control

Mar 18

Series: CDF 1.3.2 in practice — 6 articles on the methodology of sovereign AI implementation

This is the fourth article in the series (G2). In previous issues: Cognitive SLA (S1) — reasoning quality metrics; From pilot to production (G1) — eliminating Pilot Purgatory; Sovereignty Level Assessment (S2) — the right level of sovereignty. The GENESIS series focuses on agent governance, scaling, and operations. CDF 1.3.2 is a proprietary methodology developed by allclouds.pl, based on ISO/IEC 42001:2023 and the EU AI Act.

Launching a single AI agent is easy. Launching five requires coordination. But when an organization operates dozens of agents — with different roles, levels of autonomy, budgets, and dependencies — a completely different problem arises. Not a technological one, but a management one.

CDF 1.3.2 calls this state "agent sprawl": the uncontrolled proliferation of agents without a central registry, cost limits, or contingency procedures. Agent sprawl is not a matter for the future. It is a problem that is already appearing in organizations — when the first agent success encourages the rapid launch of more agents, without considering who is responsible for them and what will happen if something goes wrong.

Why autonomy without governance is an operational risk

AI agents differ from classic software in one fundamental way: they make decisions. They do not perform a deterministic sequence of steps, but assess the situation, choose an action, and execute it — sometimes on their own, sometimes by delegating tasks further.

This makes traditional software management tools—infrastructure monitoring, CI/CD, ticketing—necessary but insufficient. They do not answer questions such as: Who can assign tasks to this agent? What is its monthly inference cost limit? What happens when two agents take conflicting actions? Who has the authority to immediately disable an agent in an emergency?

Without answers to these questions, the organization loses control not over the infrastructure, but over what happens on that infrastructure. And the more agents there are, the faster the risk of conflict, cost escalation, and cascading errors grows.

Agent Registry — nine fields that change everything

The foundation of Agent Governance in CDF is the Agent Registry — a central, mandatory Phase 2 registry of all agents in the system. This is not an optional "best practice." It is a required architecture artifact without which the implementation cannot proceed to the next phase.

Each agent in the registry must have nine fields defined:

Field	What it describes
Agent ID	Unique technical identifier
Roles	Business role in the ecosystem — e.g., Analyst, Reviewer, Orchestrator
Autonomy Level	Autonomy level from L0 to L4
Owner	Natural person operationally responsible for the agent
Delegated By	Who — a person or a superior agent — can assign tasks
Token Budget	Monthly hard limit on inference costs
Kill-Switch Authority	Persons authorized to shut the agent down in an emergency (minimum two)
Dependencies	List of agents and systems dependent on the output of this agent
Last Audit	Date of the last certification of correctness and security

There are a few things to note. The owner is a specific individual, not a team or a role. The token budget is a hard limit, not an approximate value. Kill-switch authority requires at least two authorized persons. And the Dependencies field forces explicit mapping of who is affected by the shutdown or failure of a given agent.
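
The nine fields and the constraints noted above can be sketched as a small data structure. This is a minimal illustration, not CDF's own schema: the class names `AgentRecord` and `AgentRegistry`, the snake_case field names, and the choice to reject invalid records at construction time are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date
from enum import IntEnum
from typing import Dict, Tuple


class AutonomyLevel(IntEnum):
    """Autonomy levels L0 to L4, as listed in the registry table."""
    L0 = 0
    L1 = 1
    L2 = 2
    L3 = 3
    L4 = 4


@dataclass(frozen=True)
class AgentRecord:
    """One registry entry with the nine fields (field names are hypothetical)."""
    agent_id: str
    role: str
    autonomy_level: AutonomyLevel
    owner: str                              # a named individual, not a team or role
    delegated_by: Tuple[str, ...]           # people or superior agents allowed to assign tasks
    token_budget: int                       # monthly hard limit, in tokens
    kill_switch_authority: Tuple[str, ...]  # at least two persons
    dependencies: Tuple[str, ...]           # agents/systems that depend on this agent's output
    last_audit: date

    def __post_init__(self):
        if not self.owner.strip():
            raise ValueError("owner must be a named individual")
        if self.token_budget <= 0:
            raise ValueError("token_budget must be a positive hard limit")
        if len(set(self.kill_switch_authority)) < 2:
            raise ValueError("kill_switch_authority needs at least two persons")


class AgentRegistry:
    """Central in-memory registry keyed by agent id."""

    def __init__(self):
        self._records: Dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        if record.agent_id in self._records:
            raise ValueError(f"duplicate agent id: {record.agent_id}")
        self._records[record.agent_id] = record

    def get(self, agent_id: str) -> AgentRecord:
        return self._records[agent_id]

    def dependents_of(self, agent_id: str) -> Tuple[str, ...]:
        """Who is affected if this agent is shut down or fails."""
        return self._records[agent_id].dependencies
```

With this shape, an agent with a single kill-switch holder or an empty owner cannot be registered at all, which mirrors the idea that an organization cannot "just launch" another agent.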

This means that an organization cannot "just launch" another agent. It must know who is responsible for it, how much it may cost, who will disable it in an emergency, and what will happen to the agents that depend on it.

Three patterns of interaction between agents

When there is more than one agent, the question of how they work together arises. CDF 1.3.2 defines three interaction patterns — Agent Interaction Protocols — each with a different risk profile and purpose.

Pattern	How it works	Acceptable level	Risk profile
Hierarchical	The supervisor delegates tasks to workers. Single point of escalation.	L1–L2, critical processes	Lowest risk — default pattern
Collaborative	Agents share a workspace and arrive at a solution together.	L2–L3, auxiliary processes	Higher risk of decision conflict
Pipeline	Sequential transfer of results in the agent chain.	Deterministic, linear processes	Low — fixed order of steps

Whichever pattern is chosen, CDF does not leave it unsecured. Every machine-to-machine interaction is logged in the Immutable Audit Trail with a timestamp, initiator, cost in tokens, and final result. This means that even in the most complex multi-agent environment, it is possible to reconstruct who commissioned what, how much it cost, and what the result was.
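
One common way to make such a log tamper-evident is to chain entries by hash, so that changing any earlier record invalidates everything after it. The sketch below assumes that approach; the article does not say how CDF implements immutability, and the class name `AuditTrail`, the field names, and the use of SHA-256 are choices made for the illustration.

```python
import hashlib
import json
from datetime import datetime, timezone


class AuditTrail:
    """Append-only log; each entry stores the hash of the previous entry."""

    GENESIS = "0" * 64  # placeholder hash for the first entry

    def __init__(self):
        self._entries = []

    @staticmethod
    def _digest(body):
        payload = json.dumps(body, sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()

    def append(self, initiator, cost_tokens, result, timestamp=None):
        """Record one machine-to-machine interaction."""
        body = {
            "timestamp": timestamp or datetime.now(timezone.utc).isoformat(),
            "initiator": initiator,
            "cost_tokens": cost_tokens,
            "result": result,
            "prev_hash": self._entries[-1]["hash"] if self._entries else self.GENESIS,
        }
        entry = dict(body, hash=self._digest(body))
        self._entries.append(entry)
        return entry

    def verify(self):
        """Return True only if no entry was altered or reordered."""
        prev = self.GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "hash"}
            if entry["prev_hash"] != prev or self._digest(body) != entry["hash"]:
                return False
            prev = entry["hash"]
        return True
```

A hash chain only detects tampering; preventing it still depends on where the log is stored and who can write to it.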

Token Budget Governance — cost control at the agent level

One of the most practical elements of Agent Governance in CDF is Token Budget Governance. Each agent is assigned a monthly hard limit on inference costs expressed in tokens.

Why is this so important? Because in a multi-agent environment, costs can escalate in ways that are difficult to predict. An agent that delegates tasks to other agents generates not only its own costs, but also cascading costs — each query to a subordinate agent means more tokens. Without hard limits, a single misconfigured workflow can generate a month's worth of planned costs in a matter of hours.

Token Budget is not a cost report after the fact. It is a preventive mechanism: an agent that exhausts its budget cannot continue without a conscious decision to increase the limit. This forces planning and accountability, rather than reactively looking for someone to blame after receiving an invoice.
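
The difference between a hard limit and an after-the-fact report can be shown in a few lines. This is a sketch under stated assumptions: `TokenBudget`, `BudgetExceeded`, and the rule that a limit increase must name an approver are hypothetical and only illustrate the "conscious decision" described above.

```python
class BudgetExceeded(Exception):
    """Raised when a call would push an agent past its monthly hard limit."""


class TokenBudget:
    def __init__(self, monthly_limit):
        if monthly_limit <= 0:
            raise ValueError("monthly_limit must be positive")
        self.monthly_limit = monthly_limit
        self.used = 0

    @property
    def remaining(self):
        return self.monthly_limit - self.used

    def charge(self, tokens):
        """Check the limit BEFORE the spend is accepted, not afterwards."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if self.used + tokens > self.monthly_limit:
            raise BudgetExceeded(
                f"requested {tokens} tokens, only {self.remaining} remaining"
            )
        self.used += tokens

    def raise_limit(self, new_limit, approved_by):
        """An increase is an explicit, attributable decision."""
        if not approved_by:
            raise ValueError("a limit increase must name who approved it")
        if new_limit <= self.monthly_limit:
            raise ValueError("new_limit must exceed the current limit")
        self.monthly_limit = new_limit
```

The refused call leaves `used` unchanged, so the agent stays stopped until someone raises the limit on purpose.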

The Token Budget of individual agents is a component of the Production Cost Model, which CDF requires already in the planning phase. We write more about how CDF structures the costs of transitioning from pilot to production in the article "From pilot to production in 90 days."

Three emergency procedures

CDF 1.3.2 defines three levels of emergency intervention in agent systems. Each corresponds to a different threat scenario.

Single Agent Kill-Switch: immediate shutdown of one specific agent, implemented in hardware or software. Dependent nodes are notified and the system automatically switches to manual verification. Used when one agent is experiencing problems but the rest of the ecosystem is functioning properly.

Swarm Kill-Switch: stops the entire swarm of agents operating in a given process. It cascades the suspension of side operations and triggers business continuity procedures. Used when the problem affects not a single agent, but an entire cooperating group.

Cognitive Circuit Breaker: automatically cuts off an agent from external systems after three consecutive failed cognitive quality verification tests. This mechanism works without human intervention. An agent that fails the quality test three times in a row loses access to the systems before it can propagate erroneous decisions further.

The three levels cover different needs: surgical intervention at the level of a single agent, stopping an entire multi-agent process, and automatic protection against a cascade of errors without waiting for a human response.
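
The Cognitive Circuit Breaker rule (three consecutive failed quality checks) is simple enough to sketch directly. The class below is illustrative only: the name `CognitiveCircuitBreaker`, its methods, and the assumption that a tripped breaker stays open until a named person resets it are not taken from CDF.

```python
class CognitiveCircuitBreaker:
    """Cuts an agent off from external systems after N consecutive failed checks."""

    def __init__(self, max_consecutive_failures=3):
        self.max_consecutive_failures = max_consecutive_failures
        self.consecutive_failures = 0
        self.is_open = False  # open = agent is cut off

    def record_check(self, passed):
        """Feed in the outcome of one cognitive quality verification test."""
        if self.is_open:
            return  # already tripped; further results do not close it
        if passed:
            self.consecutive_failures = 0  # only consecutive failures count
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.max_consecutive_failures:
            self.is_open = True

    def allow_external_call(self):
        return not self.is_open

    def reset(self, approved_by):
        """Assumed manual step: reopening access requires a named approver."""
        if not approved_by:
            raise ValueError("reset must name who approved it")
        self.consecutive_failures = 0
        self.is_open = False
```

Note that a passing check between failures resets the counter, so two failures, one pass, and two more failures do not trip the breaker.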

Emergency procedures are directly linked to Cognitive SLA escalation. The Red level (e.g., Hallucination Rate >5% in a critical process for 7 days) automatically triggers the Kill-Switch. We describe the full Yellow/Orange/Red escalation procedure in the article "Cognitive SLA — why 99.9% uptime is not enough."
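
As a rough illustration of the Red-level example, the check below reads "for 7 days" as seven consecutive daily measurements above the threshold. That reading is an assumption, as are the function name and the input format (one hallucination rate per day, as a fraction); the Cognitive SLA article is the place to look for the actual rule.

```python
def red_level_reached(daily_rates, threshold=0.05, window_days=7):
    """True if the last `window_days` daily rates are all above `threshold`.

    daily_rates: hallucination rate per day as a fraction (0.06 means 6%),
    oldest first. Fewer than `window_days` measurements never trigger Red.
    """
    if len(daily_rates) < window_days:
        return False
    return all(rate > threshold for rate in daily_rates[-window_days:])
```

A single day at or below the threshold inside the window prevents the trigger under this reading.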

Conflict Resolution — what if agents disagree

In a multi-agent environment, sooner or later a situation will arise in which two agents take contradictory actions or generate mutually exclusive recommendations. This is not an edge case — it is a natural effect of probabilistic systems operating in a complex environment.

CDF addresses this problem directly by including Conflict Resolution in Agent Interaction Protocols. The rules for resolving decision conflicts in swarms must be defined in advance, not improvised at the moment of crisis. In the Hierarchical model, conflicts are resolved by the Supervisor. In the Collaborative model, escalation to a human or higher level of orchestration is required.
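
"Defined in advance" can be made concrete as a lookup that fails loudly when no rule exists. The sketch below encodes only the two rules stated above; the names `RESOLUTION_RULES` and `route_conflict` and the string labels are hypothetical, and the Pipeline pattern is deliberately absent because the article does not state a rule for it.

```python
# Rules are declared up front, keyed by interaction pattern (labels are illustrative).
RESOLUTION_RULES = {
    "hierarchical": "supervisor",
    "collaborative": "escalate_to_human_or_orchestrator",
}


def route_conflict(pattern, agent_a, agent_b, resource):
    """Decide who resolves a conflict between two agents over one resource."""
    resolver = RESOLUTION_RULES.get(pattern)
    if resolver is None:
        # No improvising at the moment of crisis: a missing rule is an error.
        raise LookupError(f"no conflict rule defined for pattern {pattern!r}")
    return {
        "resolver": resolver,
        "parties": (agent_a, agent_b),
        "resource": resource,
    }
```

Calling `route_conflict("pipeline", ...)` raises `LookupError`, which surfaces the gap before two agents meet on the same resource.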

The lack of such rules is one of the most underestimated risks in multi-agent systems. Without them, two agents may simultaneously attempt to modify the same resource, order conflicting actions, or block each other — and the organization will only learn about it from the side effects.

Agent lifecycle — from design to retirement

Agent Governance in CDF is not just about managing agents in production. It is about managing their entire life cycle. CDF defines six phases: Design, Build, Test, Deploy, Monitor, Optimize/Retire.

The Retire option in the last phase is particularly important. The Agent Retirement Protocol ensures the safe withdrawal of an agent from the ecosystem with dependency checks and knowledge transfer. In practice, this means that you cannot simply shut down an agent without verifying who depends on it, what data it processes, and whether its operational knowledge has been passed on.
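
The three checks in that paragraph can be expressed as a pre-retirement gate. This is a sketch, not the Agent Retirement Protocol itself: the function name, the shape of the `dependencies` mapping, and the two boolean flags are assumptions made for the example.

```python
def retirement_blockers(agent_id, dependencies, data_reviewed, knowledge_transferred):
    """Return the reasons an agent cannot be retired yet (empty list = clear).

    dependencies: maps an agent id to the ids of agents/systems that
    depend on its output (the registry's Dependencies field).
    """
    blockers = []
    dependents = dependencies.get(agent_id, [])
    if dependents:
        blockers.append("active dependents: " + ", ".join(sorted(dependents)))
    if not data_reviewed:
        blockers.append("processed data not reviewed")
    if not knowledge_transferred:
        blockers.append("knowledge transfer not confirmed")
    return blockers
```

Returning the full list of blockers, rather than stopping at the first one, gives the owner a complete to-do list before shutdown.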

The Monitor and Optimize/Retire phases are part of Cognitive Operations — a continuous maintenance service described in the article "Cognitive Operations — what happens after implementation, when most AI providers have long since left the building."

Connection to Cognitive SLA

Agent Governance does not operate in isolation: it is directly linked to Cognitive SLA. The Agent Coordination metric measures the percentage of multi-agent tasks completed without escalation, with a target of ≥85%. The Red escalation procedure can trigger the Agent Kill-Switch. The Cognitive Circuit Breaker automatically responds to a decline in cognitive quality.
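
The Agent Coordination metric reduces to a single ratio. The snippet below is a minimal sketch of that arithmetic; the function names and the idea of counting tasks per reporting period are assumptions, while the 85% target comes from the text.

```python
TARGET_PERCENT = 85.0  # target stated for Agent Coordination


def agent_coordination(completed_without_escalation, total_multi_agent_tasks):
    """Percentage of multi-agent tasks completed without escalation."""
    if total_multi_agent_tasks <= 0:
        raise ValueError("total_multi_agent_tasks must be positive")
    if not 0 <= completed_without_escalation <= total_multi_agent_tasks:
        raise ValueError("completed count must be between 0 and the total")
    return 100.0 * completed_without_escalation / total_multi_agent_tasks


def meets_target(percent):
    return percent >= TARGET_PERCENT
```

For example, 170 of 200 tasks completed without escalation gives exactly 85%, which meets the target; 169 of 200 gives 84.5%, which does not.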

This means that governance and reasoning quality are part of the same management system in CDF. It is not enough to have well-managed agents if their responses are inaccurate. And it is not enough to have accurate responses if agents operate without controls, limits, and contingency procedures.

When an agent escalates a decision to a human, the question of the quality of that oversight arises. CDF solves this with the Human Competence Gate, a mechanism that verifies that the human approving the AI recommendation actually understands what they are approving. A full description of HCG can be found in the article "Human Competence Gate — how to turn fictional human oversight of AI into a real control mechanism."

What to ask when building a multi-agent system

If an organization is building or planning a multi-agent system, it is worth checking:

Is there a central registry of all agents with assigned owners?
How many of the nine mandatory Agent Registry fields does the current system have?
Does each agent have a hard limit on inference costs?
Who is authorized to shut down an agent in an emergency, and are there at least two such persons?
Are there defined rules for resolving conflicts between agents?
What are the procedures for withdrawing an agent from production—and do they include dependency checks?
Is every machine-to-machine interaction logged with cost, initiator, and result?

If the answers to most of these questions are "not yet" or "I don't know," then the multi-agent system is operating in a mode that CDF would describe as an architecture without governance. With five agents, this may work. With fifty, it becomes a source of operational, cost, and compliance risk.
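
The second question on the list, how many of the nine fields an existing system records, can be answered mechanically if agent metadata is available as key-value records. The sketch below assumes exactly that; the snake_case field names are hypothetical renderings of the nine registry fields, and a field counts as missing only when it is absent or `None`.

```python
REQUIRED_FIELDS = (
    "agent_id", "role", "autonomy_level", "owner", "delegated_by",
    "token_budget", "kill_switch_authority", "dependencies", "last_audit",
)


def missing_fields(record):
    """List the registry fields an existing agent record does not provide."""
    return [name for name in REQUIRED_FIELDS if record.get(name) is None]


def coverage(record):
    """How many of the nine fields are present."""
    return len(REQUIRED_FIELDS) - len(missing_fields(record))
```

An empty `dependencies` list still counts as present here, since "nothing depends on this agent" is a valid, explicit answer.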

Agent Governance is not a layer of bureaucracy imposed on technology. It is a mechanism that allows autonomy to be scaled without losing control, so that "we have 50 agents" signifies operational maturity rather than uncontrolled chaos.

Governance defines who is responsible for what. But how can you ensure that the person approving an AI recommendation really understands what they are approving? In the next article, we describe the Human Competence Gate — a mechanism that turns "clicking Yes" into a real verification of decision-making competence.

1 Comment

Marcin Kaźmirak

Apr 02

The capabilities described in this article are among the most important features of the entire CDF methodology. In practice, companies can quickly spin up a dozen or so agents, but nobody thinks about who is responsible for them or what happens when they start getting in each other's way. The approach based on a hard agent registry and token budgets is something that genuinely lets you get a grip on a growing agent infrastructure before it gets out of control. Without it, a single misconfigured workflow is enough to generate half a quarter's worth of costs or to release contradictory results that undermine the company's credibility.