Architectural Patterns

For Building and Governing Production-Grade Multi-Agent Systems

A deep dive into Dr. Ali Arsanjani’s talk at ODSC AI 2026 on architectural patterns for building and governing production-grade multi-agent systems.

Notes

Speaker and topic
- Dr. Ali Arsanjani of Google Cloud presents architectural patterns for building, governing, and scaling production-grade multi-agent AI systems.
- The talk focuses on agent maturity, orchestration, governance, self-improvement, security, cost control, and operational resilience.
Basic agent architecture
- Modern agents use a large language model as the reasoning core.
- Agents sense or retrieve information from structured data, unstructured data, digital business systems, and sometimes physical devices.
- Access to internal tools and resources is often handled through the Model Context Protocol (MCP).
- Agents receive broad goals rather than rigid workflows in the style of robotic process automation (RPA).
- They reason, use memory, plan steps, call tools, and may coordinate with other agents through agent-to-agent (A2A) protocols.
Agent maturity levels
- Early systems use static function calls tied to a single large language model.
- More advanced systems dynamically choose tools at runtime.
- Single-agent systems may use reasoning-action-observation-reflection loops for iterative self-correction.
- Multi-agent systems introduce specialized sub-agents coordinated by a root agent.
- Higher maturity systems use a meta-agent to enforce policies, resolve conflicts, adjust plans, and govern behavior.
- The most advanced systems are self-improving agent ecosystems with multi-agent learning and swarm-like feedback.
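
The reasoning-action-observation-reflection loop with runtime tool choice can be sketched in a few lines. This is a minimal illustration, not the speaker's implementation: the `TOOLS` registry, `choose_tool`, and `reflect` are hypothetical stand-ins for what an LLM and a real tool layer would do.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

# Hypothetical tools; a production agent would discover these at runtime (e.g. via MCP).
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup": lambda q: {"capital of France": "Paris"}.get(q, ""),
    "echo": lambda q: q,
}

@dataclass
class Trace:
    # Each step records (tool, observation, reflection verdict).
    steps: list[tuple[str, str, str]] = field(default_factory=list)

def choose_tool(goal: str, tried: set[str]) -> Optional[str]:
    """Stand-in for the LLM reasoning step: pick the first tool not yet tried."""
    return next((name for name in TOOLS if name not in tried), None)

def reflect(observation: str) -> bool:
    """Stand-in for reflection: accept any non-empty observation."""
    return bool(observation)

def run_agent(goal: str, max_iters: int = 3) -> tuple[Optional[str], Trace]:
    trace, tried = Trace(), set()
    for _ in range(max_iters):
        tool = choose_tool(goal, tried)      # reason
        if tool is None:
            break
        tried.add(tool)
        observation = TOOLS[tool](goal)      # act, then observe
        accepted = reflect(observation)      # reflect
        trace.steps.append((tool, observation, "accept" if accepted else "retry"))
        if accepted:
            return observation, trace
    return None, trace
```

The `max_iters` cap is the simplest form of the iteration limits discussed under cost management; without it a self-correcting loop has no natural stopping point.
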
Security and isolation
- Open-source multi-agent frameworks can reduce integration overhead, but should not be deployed naively.
- Production deployments need zero-trust infrastructure, sandboxing, identity controls, environment isolation, and defense across pre-action, in-action, and post-action phases.
- Agents should be isolated by design, for example with Docker sandboxes and restricted host file-system access.
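
One way to make the Docker isolation concrete is to build the `docker run` command from a fixed, restrictive template. The sketch below only assembles the argument list; the image name, mount path, and resource limits are illustrative assumptions, not values from the talk.

```python
def sandbox_command(image: str, workdir: str, agent_cmd: list[str]) -> list[str]:
    """Build a restrictive `docker run` invocation for one agent task."""
    return [
        "docker", "run", "--rm",
        "--network", "none",                      # no network unless explicitly granted
        "--read-only",                            # container root file system is read-only
        "--cap-drop", "ALL",                      # drop all Linux capabilities
        "--security-opt", "no-new-privileges",    # block privilege escalation
        "--memory", "512m",                       # assumed limit, tune per workload
        "--pids-limit", "128",                    # assumed limit, tune per workload
        "--user", "1000:1000",                    # run as a non-root user
        "-v", f"{workdir}:/workspace:ro",         # expose only the task directory, read-only
        image, *agent_cmd,
    ]
```

The list can be passed to `subprocess.run(cmd, check=True)`. Keeping the flags in one function means every agent launch goes through the same reviewed defaults.
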
Skills as modular agent behavior
- Instead of loading huge prompts with many system instructions, agents can use modular “skills” defined in markdown files.
- Skills allow agents to load only the procedural capabilities needed for a task.
- This reduces prompt bloat and supports more maintainable agent behavior.
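
A rough sketch of on-demand skill loading follows. It assumes one markdown file per skill and a naive name match against the task text; the file names and the matching rule are hypothetical, and a real system would more likely let the model choose skills from short descriptions.

```python
import tempfile
from pathlib import Path

def load_skills(skill_dir: Path, task: str) -> str:
    """Concatenate only the skill files whose name (e.g. refund_policy.md) appears in the task."""
    task_text = task.lower()
    selected = [
        path.read_text(encoding="utf-8")
        for path in sorted(skill_dir.glob("*.md"))
        if path.stem.replace("_", " ") in task_text
    ]
    return "\n\n".join(selected)

def demo() -> str:
    """Build two throwaway skill files and load only the relevant one."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "refund_policy.md").write_text("# Refund policy\n1. Verify the order.", encoding="utf-8")
        (root / "sql_reporting.md").write_text("# SQL reporting\n1. Use read replicas.", encoding="utf-8")
        return load_skills(root, "Apply the refund policy to order 42")
```

Only the refund skill reaches the prompt in `demo()`, which is the point: the context grows with the task, not with the size of the skill library.
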
Frameworks and deployment
- The speaker highlights Google’s Agent Development Toolkit 2.0 as a strong framework for graph-based workflows, shared memory, and multi-agent development.
- Other frameworks mentioned include LangGraph, LangChain, and CrewAI.
- Agents may be deployed on Gemini Enterprise Agent Engine, Cloud Run, or Google Kubernetes Engine.
- Enterprise-facing agents can be surfaced through Gemini Enterprise apps for broader organizational access.
Memory architecture
- Agents need both short-term and long-term memory.
- Session memory supports current interactions.
- A longer-term memory bank extracts salient topics from previous sessions and restores them later for personalization and continuity.
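
The split between session memory and a long-term memory bank might look like the sketch below. The salient-topic extraction is reduced to a keyword set for illustration; in practice that step would likely be done by a model, and the class and method names here are assumptions.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class MemoryBank:
    """Long-term store: counts salient topics seen across past sessions."""
    topics: Counter = field(default_factory=Counter)

    def consolidate(self, session: list[str], salient: set[str]) -> None:
        """Extract salient topics from a finished session (short-term memory)."""
        for turn in session:
            for word in turn.lower().split():
                word = word.strip(".,!?")
                if word in salient:
                    self.topics[word] += 1

    def recall(self, k: int = 3) -> list[str]:
        """Return the k most frequent topics to seed a new session."""
        return [topic for topic, _ in self.topics.most_common(k)]
```

The session itself is just the list of turns; once it ends, `consolidate` keeps the topics and the raw transcript can be discarded.
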
Hybrid planner-scorer architecture
- Self-improving systems benefit from separating generation from evaluation.
- A planner generates candidate solutions, optimized for creativity and breadth.
- A scorer evaluates those solutions using a quality rubric.
- The scoring rubric acts as a contract between agents and defines what “good” means for the domain.
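
A minimal sketch of the planner-scorer split, assuming the rubric is a set of named checks with weights. The planner is a stub that returns fixed variants, and the rubric criteria are invented for illustration.

```python
from typing import Callable

Rubric = dict[str, Callable[[str], float]]  # criterion name -> score in [0, 1]

# Hypothetical rubric: the shared contract for what "good" means in this domain.
RUBRIC: Rubric = {
    "cites_sources": lambda c: float("citations" in c),
    "has_summary": lambda c: float("summary" in c),
}
WEIGHTS = {"cites_sources": 2.0, "has_summary": 1.0}

def plan(task: str) -> list[str]:
    """Stand-in planner: a real one would sample diverse candidates from a model."""
    return [task, f"{task} with citations", f"{task} with citations and a summary"]

def score(candidate: str, rubric: Rubric, weights: dict[str, float]) -> float:
    """Weighted rubric score, normalized to [0, 1]."""
    total = sum(weights.values())
    return sum(weights[name] * check(candidate) for name, check in rubric.items()) / total

def best_candidate(task: str, rubric: Rubric, weights: dict[str, float]) -> str:
    return max(plan(task), key=lambda c: score(c, rubric, weights))
```

Because the planner never sees the rubric internals, either side can be replaced or retrained without touching the other.
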
Custom evaluation metrics
- Generic benchmarks are insufficient for domain-specific work such as legal contracts, loan servicing, or regulated workflows.
- Teams should define programmatic quality metrics tied to business, customer, and regulatory criteria.
- A golden dataset and scoring function should be developed with domain experts.
- These metrics function similarly to reward functions for continuous agent improvement.
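
A programmatic metric over a golden dataset can be as simple as the sketch below. The required and forbidden terms stand in for criteria that domain experts would define; the field names and the scoring rule are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class GoldenExample:
    prompt: str
    required_terms: tuple[str, ...]          # lowercase terms a good answer must mention
    forbidden_terms: tuple[str, ...] = ()    # lowercase terms that fail the answer outright

def quality(output: str, example: GoldenExample) -> float:
    """Score one output in [0, 1]; any forbidden term zeroes the score."""
    text = output.lower()
    if any(term in text for term in example.forbidden_terms):
        return 0.0
    hits = sum(term in text for term in example.required_terms)
    return hits / len(example.required_terms)

def evaluate(agent: Callable[[str], str], golden: list[GoldenExample]) -> float:
    """Mean quality across the golden dataset; usable as a reward-like signal."""
    return sum(quality(agent(ex.prompt), ex) for ex in golden) / len(golden)
```
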
Preference-controlled synthetic data
- When human-labeled examples are scarce, teams can generate preference pairs.
- One output is produced under conditions likely to yield a good answer, another under weaker conditions.
- These outputs can be labeled as chosen versus rejected and used to train scorers or preference models.
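
The pair construction can be sketched as follows. Here "stronger conditions" is interpreted as richer context, which is one assumption among several possible (a stronger model or a better prompt would also qualify); `stub_generate` is a hypothetical placeholder for a model call.

```python
from typing import Callable, Optional

def make_pair(prompt: str, generate: Callable[[str, str], str],
              strong: str, weak: str) -> Optional[dict]:
    """Generate one chosen/rejected pair; skip it if both outputs are identical."""
    chosen = generate(prompt, strong)
    rejected = generate(prompt, weak)
    if chosen == rejected:
        return None  # no preference signal in this pair
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

def stub_generate(prompt: str, context: str) -> str:
    """Placeholder for a model call: output quality depends on the context given."""
    if context:
        return f"{prompt} -> answer grounded in: {context}"
    return f"{prompt} -> unsupported guess"
```

The `prompt`/`chosen`/`rejected` record layout is the shape commonly used for preference datasets, so pairs built this way can feed a scorer or a preference-tuning step directly.
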
Advanced tuning
- The talk mentions supervised fine-tuning (SFT), parameter-efficient fine-tuning (PEFT), Low-Rank Adaptation (LoRA), and Direct Preference Optimization (DPO).
- Preference-based tuning helps align model behavior with desired outputs rather than merely imitating examples.
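
To show what preference-based tuning optimizes, here is the standard DPO loss for a single chosen/rejected pair, written in plain Python. The inputs are sequence log-probabilities under the policy being tuned and under a frozen reference model; the default `beta` is an illustrative value, not one from the talk.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one pair: -log(sigmoid(beta * (chosen margin - rejected margin)))."""
    chosen_margin = logp_chosen - ref_logp_chosen        # how much the policy raised the chosen answer
    rejected_margin = logp_rejected - ref_logp_rejected  # how much it raised the rejected answer
    z = beta * (chosen_margin - rejected_margin)
    return math.log1p(math.exp(-z))                      # equals -log(sigmoid(z))
```

The loss falls below `log(2)` only when the policy favors the chosen output more than the reference does, relative to the rejected one. That is the sense in which the model learns a preference and not just an imitation target. A training implementation would use a numerically stable log-sigmoid over batches.
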
Co-evolved agent training
- A static scorer may become obsolete as the planner improves.
- The planner and scorer should improve together.
- Planner outputs train a better scorer, and scorer feedback trains a better planner.
- This creates a virtuous cycle that can compound system performance.
Adversarial testing and red teaming
- A dedicated red-team agent can probe the main system for jailbreaks, biases, edge cases, and failures.
- This should be proactive rather than only reactive.
- For critical tasks, adversarial testing may run on every task execution.
- For lower-risk tasks, it can run periodically through scheduled sampling.
- The resulting adversarial examples should feed back into preference datasets and co-evolution pipelines.
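
The split between always-on and sampled red teaming reduces to a small gating function. The risk labels and the 5% default sampling rate below are assumptions chosen for illustration.

```python
import random

def should_red_team(risk: str, sample_rate: float = 0.05, rng=random) -> bool:
    """Critical tasks are always probed; other tasks are probed at a sampled rate."""
    if risk == "critical":
        return True
    return rng.random() < sample_rate
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible in tests and audits.
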
Tokenomics and cost management
- Self-improving systems can create runaway token costs if not controlled.
- A system-level monitor should track token use across agents.
- Agents should have iteration limits, budgets, and automatic pause or scale-down mechanisms.
- Cost control must be balanced against measurable return on investment.
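
A system-level token monitor with budgets and iteration limits could be sketched like this. The 80% scale-down threshold and the three action names are hypothetical choices for the example.

```python
from dataclasses import dataclass, field

@dataclass
class TokenMonitor:
    budget: int                 # total tokens allowed across all agents
    max_iterations: int         # per-agent iteration cap
    used: dict[str, int] = field(default_factory=dict)
    iterations: dict[str, int] = field(default_factory=dict)

    def record(self, agent: str, tokens: int) -> str:
        """Log one agent step and return the action the orchestrator should take."""
        self.used[agent] = self.used.get(agent, 0) + tokens
        self.iterations[agent] = self.iterations.get(agent, 0) + 1
        total = sum(self.used.values())
        if total >= self.budget or self.iterations[agent] >= self.max_iterations:
            return "pause"
        if total >= 0.8 * self.budget:   # assumed soft threshold
            return "scale_down"
        return "continue"
```
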
Business value measurement
- Agent systems should be evaluated not only by token cost or technical metrics but by business outcomes.
- Relevant metrics include resolution rate, cost per incident, customer satisfaction, and first-call deflection.
- Dashboards should link operational agent behavior to business key performance indicators.
- Teams should start with one or two core metrics and expand gradually.
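
Starting with one or two core metrics can be very small in code. The sketch below derives resolution rate and cost per incident from per-incident records; the record fields and the token price are hypothetical.

```python
def kpis(incidents: list[dict], usd_per_1k_tokens: float) -> dict[str, float]:
    """Compute two starter business metrics from per-incident agent logs."""
    n = len(incidents)
    resolved = sum(1 for incident in incidents if incident["resolved"])
    cost = sum(incident["tokens"] for incident in incidents) / 1000 * usd_per_1k_tokens
    return {"resolution_rate": resolved / n, "cost_per_incident": cost / n}
```
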
Robustness and fault tolerance
- Production systems need resilience against service failures, crashes, timeouts, and unreliable agents.
- Five robustness patterns are highlighted, each paired with a metric to track:
  - Adaptive retry: measure successful retries after initial failures.
  - Watchdog timeout: track timeout violations per hour.
  - Auto-healing: log successful restarts after crashes.
  - Trust decay: track rolling failure rates per agent.
  - Fallback model: compare fallback-model accuracy against primary-model accuracy.
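
Three of these patterns (retry, trust decay, and fallback) combine naturally. The sketch below uses a bounded retry with exponential backoff in place of a truly adaptive policy, and tracks trust as a rolling failure rate; the window size, threshold, and function names are assumptions.

```python
import time
from collections import deque
from typing import Callable

class TrustTracker:
    """Rolling failure rate for one agent over the last `window` calls."""
    def __init__(self, window: int = 20, max_failure_rate: float = 0.5):
        self.outcomes: deque = deque(maxlen=window)
        self.max_failure_rate = max_failure_rate

    def record(self, success: bool) -> None:
        self.outcomes.append(success)

    @property
    def failure_rate(self) -> float:
        if not self.outcomes:
            return 0.0
        return self.outcomes.count(False) / len(self.outcomes)

    def trusted(self) -> bool:
        return self.failure_rate <= self.max_failure_rate

def call_with_retry(primary: Callable[[], str], fallback: Callable[[], str],
                    tracker: TrustTracker, max_attempts: int = 3,
                    base_delay: float = 0.0) -> tuple[str, str]:
    """Try the primary up to max_attempts times, then use the fallback."""
    for attempt in range(max_attempts):
        try:
            result = primary()
            tracker.record(True)
            return result, "primary"
        except Exception:
            tracker.record(False)
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff between attempts
    return fallback(), "fallback"
```

An orchestrator can consult `tracker.trusted()` before routing work to an agent, and the returned source label makes it possible to compare fallback and primary accuracy later.
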
Overall message
- Production-grade multi-agent systems require more than orchestration.
- They need governance, modular skills, secure deployment, memory, custom evaluation, adversarial testing, cost controls, business-value tracking, and fault-tolerance mechanisms.
- The central architectural shift is from isolated agents toward governed, measurable, self-improving agent ecosystems.

Citation

BibTeX citation:

@online{bochman2026,
  author = {Bochman, Oren},
  title = {Architectural {Patterns}},
  date = {2026-04-28},
  url = {https://orenbochman.github.io/posts/2026/04-28-ODSC-AI-2026-Day-1/talk11.html},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. 2026. “Architectural Patterns.” April 28. https://orenbochman.github.io/posts/2026/04-28-ODSC-AI-2026-Day-1/talk11.html.