Enterprise AI Platform  ·  DevOps  ·  SRE

Intelligent DevOps & Pipeline Triage

AI Triage Tool — an AI-powered diagnostics platform that cuts MTTR by automating Kubernetes log analysis, GitLab pipeline triage, runbook generation, and incident intelligence at enterprise scale.

Faster MTTR · K8s Clusters Unified · Integration Methods · AI-Driven Analysis
// 01   WHY IT MATTERS

The Modern DevOps Crisis

Enterprise Kubernetes and GitLab pipeline environments generate millions of log lines and pipeline failure events. Manual triage takes hours. AI Triage changes that equation.

Alert Fatigue

On-call engineers receive hundreds of alerts daily. Without intelligent prioritisation and root cause analysis, critical issues get buried in noise at 3 AM.

Context Switching Hell

Diagnosing failures across Primary and Secondary clusters and GitLab pipelines requires navigating multiple kubectl contexts, Splunk dashboards, and Confluence wikis — costing 45+ minutes per incident.

Knowledge Silos

Runbooks scatter across wikis and chat threads. AI Triage uses Claude AI to synthesise context from live pods, GitLab pipeline jobs, APM traces, and source code — generating actionable runbooks in seconds.

// 02   SYSTEM ARCHITECTURE

How It All Connects

A unified platform connecting four user personas, three integration methods, and five data sources through a single AI-powered engine.

USERS
👷 DevOps Eng. · Infra & Ops
🔭 SRE Team · Reliability
💻 Developer · App Teams
⚙️ CI/CD Pipeline · GitLab Runner
INTEGRATION METHODS
🌐 Web Browser UI · Port 5000 · Next.js
⚡ REST API · /api/* endpoints
🔧 CLI / Scripts · curl · GitLab Runner
CORE PLATFORM
🛡️ AI TRIAGE TOOL · Microsoft Entra ID SSO Authentication
ANALYSIS MODES
💬 Chat Interface · Log Analysis
☸️ K8s Live Pods · Pod Logs
📊 Splunk O11y · APM Traces
🏢 ET Platform · Splunk Logs
DATA SOURCES & AI ANALYSIS
🤖 Claude AI · Anthropic
☸️ Primary Clusters · STG / PLB / QAT
🔀 Secondary Clusters · DEV / STG
🦊 GitLab · Source Code
📡 Splunk MCP · Log Store
OUs & NAMESPACES
OU-01 · ou01-dev, ou01-qat…
OU-07 · ou07-dev, ou07-qat…
OU-08 · ou08-dev, ou08-qat1…
OU-04 · ou04-dev, ou04-plb…
// 03   AUTHENTICATION FLOW

Microsoft Entra ID SSO

Zero-trust OAuth2 via Microsoft Entra ID. Every request is session-validated with 2-hour rolling cookies and bearer token introspection.
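
The 2-hour rolling session can be pictured with a minimal sketch like the one below. It assumes "rolling" means that each validated request resets the 2-hour expiry; the `SessionStore` class and its in-memory dictionary are hypothetical stand-ins for the real session store, which the page does not describe.

```python
import time
from dataclasses import dataclass, field
from typing import Dict, Optional

SESSION_TTL_SECONDS = 2 * 60 * 60  # 2-hour window, as stated in the text


@dataclass
class SessionStore:
    """Hypothetical in-memory stand-in for the tool's session store."""

    ttl: int = SESSION_TTL_SECONDS
    _expiry: Dict[str, float] = field(default_factory=dict)  # session id -> expiry time

    def create(self, session_id: str, now: Optional[float] = None) -> None:
        """Register a new session that expires one TTL from now."""
        now = time.time() if now is None else now
        self._expiry[session_id] = now + self.ttl

    def validate(self, session_id: str, now: Optional[float] = None) -> bool:
        """Return True and extend the expiry if the session is still live."""
        now = time.time() if now is None else now
        expires_at = self._expiry.get(session_id)
        if expires_at is None or now >= expires_at:
            self._expiry.pop(session_id, None)  # forget unknown or expired ids
            return False
        # Rolling behaviour (assumed): every valid request resets the clock.
        self._expiry[session_id] = now + self.ttl
        return True
```

Passing `now` explicitly keeps the sketch easy to exercise without waiting for real time to pass; a production store would more likely rely on a key expiry feature of the backing database.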

Participants: USER · BROWSER · TRIAGE TOOL · ENTRA ID · SESSION STORE
1. Navigate to Tool URL
2. GET /
3. ↺ Check Session
[alt] NO VALID SESSION
4. Redirect /auth/login
5. GET /auth/login
6. OAuth2 Authorization
7. 🔑 Enter Credentials
8. ↺ Validate User
9. Authorization Code
10. GET /auth/callback?code=xxx
11. Exchange Code → Token
12. Access Token + User Info
13. Create Session (2hr)
14. Saved
15. Set Session Cookie
✓ AUTHENTICATED
16. Authenticated Request → Main Chat Interface
17. Use Tool Features
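
The redirect and callback steps in the flow above can be sketched as follows. This is an illustration, not the tool's implementation: it uses the public Microsoft identity platform v2.0 authorize endpoint, while the tenant, client ID, redirect URI, scope list, and the use of a `state` parameter are all assumptions not stated on this page.

```python
import secrets
from urllib.parse import parse_qs, urlencode, urlparse

AUTHORITY = "https://login.microsoftonline.com"


def build_authorize_url(tenant_id, client_id, redirect_uri, state=None):
    """Build the URL that /auth/login would redirect the browser to."""
    state = state or secrets.token_urlsafe(16)  # anti-CSRF value (assumed)
    params = {
        "client_id": client_id,
        "response_type": "code",           # authorization code flow
        "redirect_uri": redirect_uri,
        "response_mode": "query",
        "scope": "openid profile email",   # assumed scopes
        "state": state,
    }
    url = f"{AUTHORITY}/{tenant_id}/oauth2/v2.0/authorize?{urlencode(params)}"
    return url, state


def parse_callback(callback_url, expected_state):
    """Extract the authorization code from GET /auth/callback?code=..."""
    query = parse_qs(urlparse(callback_url).query)
    if query.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch; possible forged callback")
    if "code" not in query:
        raise ValueError("callback did not include an authorization code")
    return query["code"][0]
```

The code returned by `parse_callback` is what the tool would then exchange for an access token and user info before creating the session.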
// 04   API INTEGRATION FLOW

External App Integration

Machine-to-machine REST API. External applications authenticate with Entra ID OAuth Bearer tokens, then submit log payloads for AI analysis.
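
One way an external application might obtain its Bearer token is the OAuth2 client credentials grant against the Microsoft identity platform token endpoint. The page does not say which grant the tool expects, so treat the grant type, the scope value, and the function names below as assumptions.

```python
import json
import urllib.request
from urllib.parse import urlencode


def build_token_request(tenant_id, client_id, client_secret, scope):
    """Prepare (but do not send) a client credentials token request."""
    url = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/token"
    body = urlencode({
        "grant_type": "client_credentials",  # assumed grant for machine-to-machine use
        "client_id": client_id,
        "client_secret": client_secret,
        "scope": scope,                      # e.g. "api://<app-id>/.default" (assumed)
    }).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def fetch_access_token(request):
    """Send the prepared request and return the access token string."""
    with urllib.request.urlopen(request, timeout=10) as resp:
        return json.load(resp)["access_token"]
```

Keeping request construction separate from the network call makes the request easy to inspect in tests without contacting Entra ID.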

API INTERACTION SEQUENCE
Participants: EXTERNAL APP · TRIAGE API · INFRA (K8s/AI/DB)
1. Get OAuth Access Token from Entra ID → Bearer Token issued
2. GET /api/health ← 200 OK {"status":"healthy"}
3. POST /api/analyze
   Authorization: Bearer <token>
   {"logs":"…","context":{cluster,ns,pod}}
4. Validate Token & introspect
5. Fetch Pod Context (optional K8s API)
6. Send logs + context → Claude AI Analysis
7. Diagnosis + Runbook ← Claude AI
8. Save Conversation → PostgreSQL
9. 200 OK
   {"analysis":"…","runbook":"…","severity":"high"}
10. GET /api/conversations → History
ENDPOINT REFERENCE
GET /api/health
POST /api/analyze
GET /api/conversations
GET /api/conversations/{id}
AUTH /auth/login → /auth/callback
SAMPLE REQUEST
POST /api/analyze
Authorization: Bearer <entra_token>
Content-Type: application/json

{
  "logs": "OOMKilled: container exceeded...",
  "context": {
    "cluster": "cluster-stg",
    "namespace": "ou01-dev",
    "pod": "api-server-7d9f"
  }
}
SAMPLE RESPONSE
{
  "analysis": "Pod restarted due to OOM...",
  "runbook": "1. Check mem limits\n2. ...",
  "severity": "high",
  "conversation_id": "conv_abc123"
}
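
The sample request and response above could be driven from a small client like this sketch. The field names come from the samples; the base URL, the function names, and the choice to treat `analysis`, `runbook`, and `severity` as required fields are assumptions.

```python
import json
import urllib.request


def build_analyze_request(base_url, token, logs, cluster, namespace, pod):
    """Prepare a POST /api/analyze request matching the sample payload."""
    payload = {
        "logs": logs,
        "context": {"cluster": cluster, "namespace": namespace, "pod": pod},
    }
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/analyze",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


def parse_analysis(raw):
    """Decode a response body and check the fields shown in the sample."""
    data = json.loads(raw)
    missing = {"analysis", "runbook", "severity"} - data.keys()
    if missing:
        raise ValueError(f"response is missing fields: {sorted(missing)}")
    return data


def analyze(request):
    """Send the prepared request and return the parsed analysis."""
    with urllib.request.urlopen(request, timeout=60) as resp:
        return parse_analysis(resp.read())
```

A caller would build the request with a token from Entra ID, pass it to `analyze`, and read `result["runbook"]` and `result["severity"]` from the returned dictionary.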
// 05   CI/CD PIPELINE INTEGRATION

Zero-Touch Incident Response

GitLab pipeline failures automatically trigger log collection, AI analysis, and multi-channel runbook dispatch — no human pager required.

BUILD stage → TEST stage → DEPLOY stage
✓ SUCCESS · deploy done
or
✕ FAILURE · auto-triage ↓
── AUTO-REMEDIATION PIPELINE ──
STEP 1 · kubectl logs · Get Pod Logs · Set Cluster Context
STEP 2 · curl POST /api/analyze · Bearer Token Auth
STEP 3 · Claude AI · AI Triage Tool · Diagnosis + Runbook
NOTIFICATIONS
📨 Post to Slack with Runbook
🎫 Create Jira Ticket
📧 Alert Team via Email
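
The failure path above (collect pod logs, then post a runbook to Slack) might look roughly like the sketch below when run from a GitLab job. The helper names, the 500-line tail, and the message layout are assumptions; the `{"text": ...}` body is the basic Slack incoming-webhook payload shape, and the analysis dictionary mirrors the sample API response.

```python
import subprocess


def kubectl_logs_command(context, namespace, pod, tail=500):
    """Build the kubectl invocation for STEP 1 (cluster context + pod logs)."""
    return [
        "kubectl", "--context", context,
        "logs", pod, "-n", namespace, f"--tail={tail}",
    ]


def collect_logs(context, namespace, pod, tail=500):
    """Run kubectl and return the pod's recent log output as text."""
    result = subprocess.run(
        kubectl_logs_command(context, namespace, pod, tail),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def slack_message(analysis, job_name):
    """Format the analysis response as a Slack incoming-webhook payload."""
    text = (
        f":rotating_light: Pipeline job `{job_name}` failed "
        f"(severity: {analysis['severity']})\n"
        f"*Diagnosis:* {analysis['analysis']}\n"
        f"*Runbook:*\n{analysis['runbook']}"
    )
    return {"text": text}
```

Between these two steps, the job would send the collected logs to POST /api/analyze with its Bearer token and pass the JSON response to `slack_message`.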
// 06   MULTI-CLUSTER INFRA ACCESS

Cross-Cluster Intelligence

A single authentication layer spans Primary and Secondary Kubernetes clusters with namespace-aware RBAC across the OU groups: OU-01, OU-04, OU-07, and OU-08.

👷 User: DevOps Engineer
🛡️ AI Triage Tool
Select Cluster / OU
AUTHENTICATION LAYER — Centralized Microsoft Entra ID / AAD Provider
PRIMARY CLUSTERS
STG — cluster.stg.primary
ou01-dev · ou01-qat · ou07-dev · ou07-qat · ou08-dev · ou08-qat1
PLB — cluster.plb.primary
ou01-plb · ou07-plb · ou08-plb
QAT — cluster.qat.primary
ou01-qat · ou07-qat · ou08-qat
SECONDARY CLUSTERS (INFRA LAYER)
SECONDARY DEV — cluster.dev.secondary
ou04-dev · ou04-plb · ou05-dev · ou05-qat
SECONDARY STG — cluster.stg.secondary
ou04-stg · ou04-plb-stg · ou05-stg
INFRA LAYER AUTH
IL Namespace · SSO Auth · AAD Provider
OU-01 GROUP
ou01-dev · ou01-qat
ou01-plb
OU-07 GROUP
ou07-dev · ou07-qat
ou07-plb
OU-08 GROUP
ou08-dev · ou08-qat1
ou08-plb
OU-04 GROUP
ou04-dev · ou04-plb
ou04-stg
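
Namespace-aware access across the OU groups listed above can be illustrated with a simple lookup. The namespace sets are copied from the groups on this page; the idea that a user's directory group names map one-to-one to these OU labels, and the function names, are assumptions made for the sketch.

```python
# OU group -> namespaces, as listed in the OU groups above.
OU_GROUPS = {
    "OU-01": {"ou01-dev", "ou01-qat", "ou01-plb"},
    "OU-07": {"ou07-dev", "ou07-qat", "ou07-plb"},
    "OU-08": {"ou08-dev", "ou08-qat1", "ou08-plb"},
    "OU-04": {"ou04-dev", "ou04-plb", "ou04-stg"},
}


def allowed_namespaces(user_groups):
    """Return every namespace reachable through the user's OU groups."""
    allowed = set()
    for group in user_groups:
        # Unknown group names grant nothing rather than raising.
        allowed |= OU_GROUPS.get(group, set())
    return allowed


def can_access(user_groups, namespace):
    """Check one namespace against the user's OU group membership."""
    return namespace in allowed_namespaces(user_groups)
```

For example, a user in only `OU-01` could select `ou01-qat` but would be refused `ou04-dev`, regardless of which cluster hosts the namespace.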
// 07   TECHNOLOGY STACK

Built with Enterprise-Grade Tech

AI & ANALYSIS
Anthropic Claude AI
Splunk O11y
APM Traces
INFRASTRUCTURE
Kubernetes
kubectl
Docker
GitLab CI/CD
BACKEND & API
Python / FastAPI
PostgreSQL
Redis
JWT / OAuth2
AUTH & FRONTEND
Microsoft Entra ID
Next.js
TypeScript
TailwindCSS
// 08   BUSINESS IMPACT

Why This Changes Everything

Reduction in MTTR · Minutes Saved Per Incident · K8s Clusters Unified · Auto-Generated Runbooks

"AI Triage Tool represents the next evolution of DevOps intelligence — where AI doesn't just assist engineers, it autonomously triages, diagnoses, and prescribes remediation at machine speed."

$5.6T
Global IT Downtime Cost
Enterprises lose billions annually to unplanned downtime. AI-driven triage directly addresses this.
73%
of SREs Cite Alert Fatigue
Nearly three-quarters of reliability engineers are overwhelmed by noisy, low-signal alerts every day.
4.8x
Faster Incident Resolution
Teams using AI-assisted triage resolve P1 incidents 4.8× faster than those relying on manual processes.