Agentic AI for Autonomous Root-Cause Analysis

in Large-Scale Enterprise Systems

A deep dive into multi-agent AI systems for autonomous root-cause analysis in large-scale enterprise IT environments.

odsc

multi-agent systems

root-cause analysis

enterprise IT

observability

causal reasoning

diagnostic agents

retrieval agents

evaluation agents

analysis agents

investigation graph

Agentic AI for Autonomous Root-Cause Analysis in Large-Scale Enterprise Systems

Nik Kale
Cisco Systems

Notes

Topic: A presentation by Nik Kale on multi-agent artificial intelligence systems for autonomous root-cause diagnosis in enterprise IT environments.
Core problem:
- Modern observability tools are good at detecting that something happened.
- They are much weaker at explaining why it happened.
- In outages, teams often enter “war room” mode, where network, database, application, security, and infrastructure teams each defend their own domain.
- The main bottleneck is not missing data, but cross-domain causal reasoning.
Key argument:
- Alerts, metrics, logs, configuration diffs, and telemetry are usually available.
- The hard task is separating the true root cause from cascading symptoms.
- Traditional correlation and timestamp matching are insufficient because the same event appears differently across domains.
Limits of existing approaches:
- Schema consolidation: dumping everything into a data lake fails because schemas drift and correlation is not causation.
- War rooms: they can work, but expert reasoning is not preserved; once the call ends, the diagnostic memory disappears.
- Expert systems: rule-based playbooks work only for known cases and fail when the environment changes.
Proposed shift:
- Move from deductive expert systems to inductive multi-agent systems.
- Instead of following only pre-written rules, agents generate hypotheses, retrieve evidence, evaluate causality, reject weak explanations, and iterate toward a root cause.
Architecture described:
- Diagnostic agent: generates candidate hypotheses from a problem statement.
- Retrieval agent: gathers relevant evidence from logs, metrics, telemetry, configuration, databases, or live APIs.
- Evaluation agent: judges whether evidence supports or contradicts each hypothesis and identifies causal dependencies.
- Analysis agent: converges on the root cause, assigns confidence, and recommends remediation steps.
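
The four roles above can be sketched as plain functions passing a shared hypothesis record between them. This is a minimal illustration, not the speaker's implementation: the `Hypothesis` fields, the keyword-matching rule, the evidence-count confidence, and the sample log lines are all assumptions made for the example, and a real system would replace the hard-coded bodies with model calls and telemetry queries.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Hypothesis:
    text: str
    keywords: tuple  # hypothetical: terms that would count as evidence
    evidence: list = field(default_factory=list)
    supported: bool = False


def diagnostic_agent(problem: str) -> list:
    """Propose candidate hypotheses. Hard-coded here; a real agent would call a model."""
    return [
        Hypothesis("network path to the service is down", ("link down",)),
        Hypothesis("pod was evicted from its node", ("evicted",)),
    ]


def retrieval_agent(hypothesis: Hypothesis, sources: dict) -> list:
    """Gather every line, from any source, that mentions a hypothesis keyword."""
    return [
        f"{name}: {line}"
        for name, lines in sources.items()
        for line in lines
        if any(keyword in line.lower() for keyword in hypothesis.keywords)
    ]


def evaluation_agent(hypothesis: Hypothesis) -> None:
    """Mark the hypothesis supported if any evidence exists (a deliberately crude rule)."""
    hypothesis.supported = bool(hypothesis.evidence)


def analysis_agent(hypotheses: list) -> Optional[dict]:
    """Pick the supported hypothesis with the most evidence and attach a confidence."""
    supported = [h for h in hypotheses if h.supported]
    if not supported:
        return None
    best = max(supported, key=lambda h: len(h.evidence))
    total = sum(len(h.evidence) for h in supported)
    return {
        "root_cause": best.text,
        "confidence": len(best.evidence) / total,
        "remediation": "<placeholder: remediation steps would be generated here>",
    }


def run_once(problem: str, sources: dict) -> Optional[dict]:
    """One pass through diagnose -> retrieve -> evaluate -> analyze."""
    hypotheses = diagnostic_agent(problem)
    for hypothesis in hypotheses:
        hypothesis.evidence = retrieval_agent(hypothesis, sources)
        evaluation_agent(hypothesis)
    return analysis_agent(hypotheses)


# Invented sample telemetry, for illustration only.
sources = {
    "kubelet": ["node worker-2 reports disk pressure",
                "pod checkout-7f evicted from worker-2"],
    "network": ["all interfaces up"],
}
result = run_once("checkout application unreachable", sources)
print(result)
```

Keeping each role as a separate function mirrors the separation of concerns in the talk: hypothesis generation never looks at evidence, and the final analysis only sees hypotheses that have already been evaluated.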
Investigation loop:
- The system usually runs several iterations, often three to nine, sometimes more.
- Each iteration expands the search, retrieves evidence, evaluates hypotheses, and prunes explanations that are unsupported or merely symptomatic.
- Multiple reasoning loops can run in parallel and compare agreement or disagreement.
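
A rough sketch of the loop described above, under stated assumptions: the `EXPANSIONS` and `SUPPORTED` tables are invented stand-ins for hypothesis generation and evidence evaluation, the cap of nine iterations echoes the "three to nine" range from the talk, and majority voting is only one possible way to compare parallel runs.

```python
from collections import Counter
from typing import Optional

# Hypothetical toy data standing in for model calls and telemetry queries.
EXPANSIONS = {
    "checkout unreachable": ["application error", "DNS failure"],
    "application error": ["pod eviction"],
    "pod eviction": ["disk pressure on worker node"],
}
SUPPORTED = {"application error", "pod eviction", "disk pressure on worker node"}


def investigate(problem: str, max_iterations: int = 9) -> tuple:
    """Return (deepest supported explanation, iterations used)."""
    frontier = [problem]
    deepest: Optional[str] = None
    for iteration in range(1, max_iterations + 1):
        # Expand: propose explanations for everything on the current frontier.
        candidates = [h for item in frontier for h in EXPANSIONS.get(item, [])]
        # Retrieve and evaluate: keep only candidates that have supporting evidence.
        survivors = [h for h in candidates if h in SUPPORTED]
        if not survivors:
            # Nothing deeper is supported, so the last survivor is the conclusion.
            return deepest, iteration
        deepest = survivors[0]
        frontier = survivors
    return deepest, max_iterations


def agreement(conclusions: list) -> tuple:
    """Compare parallel runs: return the most common conclusion and its vote share."""
    winner, votes = Counter(conclusions).most_common(1)[0]
    return winner, votes / len(conclusions)


root, iterations = investigate("checkout unreachable")
print(root, iterations)

# Three runs of the same deterministic loop trivially agree; real runs could differ.
runs = [investigate("checkout unreachable")[0] for _ in range(3)]
print(agreement(runs))
```

In this toy setup the loop stops on the fourth iteration, when no deeper supported explanation can be found below the disk-pressure hypothesis.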
Graph-based reasoning model:
- The investigation state is represented as a machine-readable directed acyclic graph.
- Nodes include:
  - problem nodes,
  - hypothesis nodes,
  - evidence nodes,
  - rejected hypothesis nodes,
  - root-cause nodes.
- Edges represent:
  - causal relations,
  - evidential support,
  - generated-from relations,
  - dependency/pruning relations.
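
One way to hold these node and edge types in memory is sketched below. The class and enum names, the edge directions, and the cycle check on insertion are assumptions for illustration; the talk specifies only that the state is a machine-readable directed acyclic graph with these kinds of nodes and edges.

```python
import json
from dataclasses import dataclass, field
from enum import Enum


class NodeKind(Enum):
    PROBLEM = "problem"
    HYPOTHESIS = "hypothesis"
    EVIDENCE = "evidence"
    REJECTED = "rejected_hypothesis"
    ROOT_CAUSE = "root_cause"


class EdgeKind(Enum):
    CAUSES = "causes"
    SUPPORTS = "supports"
    GENERATED_FROM = "generated_from"
    DEPENDS_ON = "depends_on"


@dataclass
class InvestigationGraph:
    nodes: dict = field(default_factory=dict)  # node id -> NodeKind
    edges: list = field(default_factory=list)  # (source, target, EdgeKind)

    def add_node(self, node_id: str, kind: NodeKind) -> None:
        self.nodes[node_id] = kind

    def _reachable(self, start: str, target: str) -> bool:
        """Depth-first search: can target be reached from start along existing edges?"""
        stack, seen = [start], set()
        while stack:
            current = stack.pop()
            if current == target:
                return True
            if current in seen:
                continue
            seen.add(current)
            stack.extend(dst for src, dst, _ in self.edges if src == current)
        return False

    def add_edge(self, source: str, target: str, kind: EdgeKind) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must be added as nodes first")
        # Refuse any edge that would close a cycle, so the graph stays a DAG.
        if self._reachable(target, source):
            raise ValueError(f"edge {source} -> {target} would create a cycle")
        self.edges.append((source, target, kind))

    def to_json(self) -> str:
        """Serialize the whole investigation state to a machine-readable form."""
        return json.dumps({
            "nodes": {node_id: kind.value for node_id, kind in self.nodes.items()},
            "edges": [[src, dst, kind.value] for src, dst, kind in self.edges],
        })


graph = InvestigationGraph()
graph.add_node("checkout unreachable", NodeKind.PROBLEM)
graph.add_node("pod eviction", NodeKind.HYPOTHESIS)
graph.add_node("kubelet log: pod evicted", NodeKind.EVIDENCE)
graph.add_edge("pod eviction", "checkout unreachable", EdgeKind.GENERATED_FROM)
graph.add_edge("kubelet log: pod evicted", "pod eviction", EdgeKind.SUPPORTS)
print(graph.to_json())
```

Rejecting a hypothesis in this sketch would mean changing its kind to `NodeKind.REJECTED` rather than deleting it, so the history of the investigation stays in the graph.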
Why the graph matters:
- It is not just a visualization or report artifact.
- It is the computational state of the investigation.
- It preserves the full reasoning history, making the result auditable and explainable.
Pruning mechanism:
- If hypothesis B depends on hypothesis A, then B is a downstream symptom of A and cannot be the root cause.
- Example: a network symptom may depend on a pod eviction, meaning the network issue is not primary.
- Unsupported hypotheses are also removed.
- This allows the system to reduce a large search space into a smaller causal chain.
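
The two pruning rules above can be sketched as a single pass over the hypothesis set. The function name, the input shapes, and the sample hypotheses are assumptions for illustration. One further assumption is that a dependency only disqualifies a hypothesis when the upstream hypothesis is itself supported by evidence.

```python
def prune(hypotheses: dict, depends_on: dict) -> tuple:
    """Split hypotheses into root-cause candidates and rejected ones.

    hypotheses: name -> True if evidence supports it, False otherwise.
    depends_on: name -> set of hypotheses that this one is a consequence of.
    Returns (root candidates, {rejected name: reason}).
    """
    supported = {name for name, has_evidence in hypotheses.items() if has_evidence}
    # Rule 1: anything without supporting evidence is rejected outright.
    rejected = {name: "unsupported" for name in hypotheses if name not in supported}

    roots = []
    for name in hypotheses:
        if name not in supported:
            continue
        # Rule 2: a hypothesis that depends on another supported one is a symptom.
        upstream = depends_on.get(name, set()) & supported
        if upstream:
            rejected[name] = "symptom of " + ", ".join(sorted(upstream))
        else:
            roots.append(name)
    return roots, rejected


# Invented example following the chain described in the talk.
hypotheses = {
    "network timeout": True,
    "pod eviction": True,
    "disk pressure on worker node": True,
    "DNS failure": False,
}
depends_on = {
    "network timeout": {"pod eviction"},
    "pod eviction": {"disk pressure on worker node"},
}
roots, rejected = prune(hypotheses, depends_on)
print(roots)
print(rejected)
```

Recording a reason for each rejection, rather than dropping the hypothesis silently, is what lets the pruned entries remain available for an audit trail.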
Retrieval strategy:
- The system does not rely only on a vector database or semantic search.
- It chooses the retrieval method suited to the evidence:
  - keyword search for known terms,
  - term frequency–inverse document frequency (TF-IDF) for anomaly hunting,
  - semantic search for fuzzy matching,
  - Structured Query Language (SQL) for structured data,
  - live queries or APIs for real-time state.
- The point is tool selection, not one universal retrieval method.
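
Tool selection can be sketched as a lookup from the kind of evidence needed to a retrieval method. Everything named here is hypothetical: the `Need` categories, the method labels, the table schema, and the log lines. Only three of the five methods are implemented, using the standard library; semantic search and live API calls are omitted because they depend on external services.

```python
import math
import sqlite3
from collections import Counter
from enum import Enum


class Need(Enum):
    KNOWN_TERM = "known term"
    ANOMALY = "anomaly hunting"
    FUZZY = "fuzzy matching"
    STRUCTURED = "structured data"
    LIVE_STATE = "real-time state"


# The routing table: which retrieval method suits which kind of evidence.
METHOD_FOR_NEED = {
    Need.KNOWN_TERM: "keyword",
    Need.ANOMALY: "tfidf",
    Need.FUZZY: "semantic",
    Need.STRUCTURED: "sql",
    Need.LIVE_STATE: "live_api",
}


def keyword_search(lines: list, term: str) -> list:
    """Return the lines that contain the term, ignoring case."""
    return [line for line in lines if term.lower() in line.lower()]


def rarest_lines(lines: list, top: int = 1) -> list:
    """Rank lines by summed TF-IDF so that unusual lines surface first."""
    docs = [line.lower().split() for line in lines]
    doc_freq = Counter(term for doc in docs for term in set(doc))
    total = len(docs)

    def score(doc: list) -> float:
        counts = Counter(doc)
        return sum(
            (count / len(doc)) * math.log(total / doc_freq[term])
            for term, count in counts.items()
        )

    ranked = sorted(range(total), key=lambda i: score(docs[i]), reverse=True)
    return [lines[i] for i in ranked[:top]]


def sql_search(connection: sqlite3.Connection, query: str, params: tuple = ()) -> list:
    """Run a parameterized query against structured data."""
    return connection.execute(query, params).fetchall()


# Invented sample data, for illustration only.
logs = ["GET /cart 200", "GET /cart 200", "GET /cart 200",
        "kubelet evicted pod checkout"]
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE node_state (node TEXT, status TEXT)")
connection.execute("INSERT INTO node_state VALUES ('worker-2', 'DiskPressure')")

print(METHOD_FOR_NEED[Need.ANOMALY])
print(keyword_search(logs, "evicted"))
print(rarest_lines(logs))
print(sql_search(connection, "SELECT node FROM node_state WHERE status = ?",
                 ("DiskPressure",)))
```

The TF-IDF ranking needs no prior knowledge of what to look for, which is why it suits anomaly hunting, whereas keyword search only helps once a specific term is already suspected.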
Example investigation:
- A checkout or voting application becomes unreachable.
- The system considers network, Kubernetes, application, and infrastructure hypotheses.
- It traces the problem across domains: application error → pod eviction → resource limits → infrastructure issue.
- In one example, the root cause is disk pressure on a worker node.
- In the demo, the root cause is different: a specific network interface, VM 22, is administratively taken down.
Demo features:
- The user interface shows:
  - probable root cause,
  - investigation graph,
  - iteration slider,
  - rejected hypotheses,
  - evidence artifacts,
  - audit trail,
  - downloadable investigation report.
- The audit trail records what was checked, what was rejected, and how the final conclusion was reached.
Outcome claimed:
- The system completed an investigation in roughly five minutes.
- It avoided a traditional bridge call or war room.
- It produced both the root cause and recommended resolution steps.
Trust and adoption:
- The speaker notes that building the system is only one challenge.
- Getting an organization to trust autonomous diagnostic reasoning is a separate challenge.
- The talk begins to discuss trust boundaries, but this section is cut short because the session runs out of time.
Main takeaway:
- The presentation argues that the next step in observability is not better alerting, but auditable causal reasoning.
- Multi-agent systems can preserve expert-like troubleshooting as a structured, reusable, machine-readable investigation graph.

Reflection

Citation

BibTeX citation:

@online{bochman2026,
  author = {Bochman, Oren},
  title = {Agentic {AI} for {Autonomous} {Root-Cause} {Analysis}},
  date = {2026-04-28},
  url = {https://orenbochman.github.io/posts/2026/04-30-ODSC-AI-2026-Day-3/talk10.html},
  langid = {en}
}

For attribution, please cite this work as:

Bochman, Oren. 2026. “Agentic AI for Autonomous Root-Cause Analysis.” April 28. https://orenbochman.github.io/posts/2026/04-30-ODSC-AI-2026-Day-3/talk10.html.