HolmesGPT/holmesgpt
Open-source AI agent for investigating production incidents and finding root causes in cloud-native environments.

HolmesGPT is a CNCF sandbox project that provides an autonomous agent for Site Reliability Engineering. It uses LLMs to investigate production incidents by querying data sources such as Kubernetes, VMs, cloud services, databases, and SaaS platforms. In operator mode, the agent monitors continuously in the background, sends Slack alerts automatically, and opens GitHub PRs with fixes. To handle petabyte-scale data, it filters on the server side so that large payloads stay out of the LLM context window.
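
The server-side filtering idea can be illustrated with a minimal Python sketch. This is not HolmesGPT's implementation: the function names `filter_for_context` and `build_prompt`, the regex-based matching, and the line and character budgets are all assumptions chosen for illustration. The point is only that raw data is reduced to a small, relevant slice before anything is placed in a prompt.

```python
import re
from typing import Iterable, List


def filter_for_context(lines: Iterable[str], pattern: str,
                       max_lines: int = 50, max_chars: int = 4000) -> List[str]:
    """Hypothetical helper: keep only lines matching `pattern`,
    capped by a line count and a total character budget."""
    regex = re.compile(pattern, re.IGNORECASE)
    kept: List[str] = []
    used = 0
    for line in lines:
        if not regex.search(line):
            continue  # irrelevant lines never reach the prompt
        if len(kept) >= max_lines or used + len(line) > max_chars:
            break  # budget exhausted: stop before the payload grows further
        kept.append(line)
        used += len(line)
    return kept


def build_prompt(question: str, evidence: List[str]) -> str:
    """Hypothetical helper: combine the question with the filtered evidence."""
    return question + "\n\nRelevant log lines:\n" + "\n".join(evidence)


if __name__ == "__main__":
    logs = ["INFO ok"] * 1000 + ["ERROR db timeout", "error: OOMKilled"]
    evidence = filter_for_context(logs, r"error|oomkilled")
    print(build_prompt("Why did the pod restart?", evidence))
```

Because `filter_for_context` accepts any iterable, it could in principle consume a streamed source without loading it fully into memory; the budgets would need tuning to the context window of whichever model is in use.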