AI SRE v1.0.0-beta

Dashboard

System health overview · prod-us-east-1 · 11:14:30

Active Incidents: 3 (-2 vs last hour)
MTTD: 2.1m (↓ 40% vs last week)
MTTR: 8.4m (↓ 32% vs last week)
SLO Compliance: 99.97% (target: 99.95%)
Error Budget: 82% (remaining this month)
Change Success: 96.8% (last 30 deploys)
Active Incidents (3)
INC-0042 · CRITICAL · INVESTIGATING · 14 min ago
payment-service P99 latency spike to 2.3s
Service: payment-service · Assignee: AI Agent (Diagnosing...)
🤖 AI Diagnosis: Correlated with DB connection pool exhaustion. Recent deploy v2.14.3 changed pool size from 50→30. Rollback recommended.
INC-0041 · WARNING · ACKNOWLEDGED · 47 min ago
user-service elevated 5xx error rate (2.1%)
Service: user-service · Assignee: alex.chen@team
🤖 AI Diagnosis: Downstream Redis cluster node-3 had intermittent timeouts. Failover triggered at 14:22 UTC. Monitoring recovery.
INC-0040 · WARNING · MITIGATED · 2h 13m ago
notification-worker message backlog > 50K
Service: notification-worker · Assignee: sarah.li@team
🤖 AI Diagnosis: Kafka consumer lag due to burst of push notifications. Scaled workers 3→8. Backlog clearing at ~2K/min.
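As a rough check on the INC-0040 diagnosis, the time to clear a backlog can be estimated by dividing the backlog size by the net drain rate. The sketch below assumes a constant net rate of about 2K messages/min and uses the 50K alert threshold as the starting backlog. The dashboard does not show the actual backlog size, and the function name `minutes_to_drain` is hypothetical.

```python
import math


def minutes_to_drain(backlog: int, net_drain_per_min: float) -> float:
    """Estimate minutes until the backlog is empty at a constant net drain rate.

    net_drain_per_min is consumption minus new arrivals. If it is zero or
    negative the backlog never clears, so infinity is returned.
    """
    if backlog <= 0:
        return 0.0
    if net_drain_per_min <= 0:
        return math.inf
    return backlog / net_drain_per_min


# Assumed inputs: 50K is the alert threshold (a lower bound on the real
# backlog) and ~2K/min is the clearing rate quoted in the diagnosis.
print(minutes_to_drain(50_000, 2_000))  # 25.0
```

Because the backlog is reported only as "> 50K", 25 minutes is a lower bound under these assumptions, and a new burst of notifications would extend it.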
Service Health
Service              Status    P99     Error Rate  Replicas
api-gateway          healthy   45ms    0.01%       8/8
user-service         warning   120ms   2.1%        6/6
payment-service      critical  2300ms  0.8%        4/4
order-service        healthy   62ms    0.02%       10/10
notification-worker  warning   N/A     0%          8/8
inventory-service    healthy   38ms    0.01%       4/4
search-service       healthy   55ms    0.01%       6/6
analytics-pipeline   healthy   N/A     0%          3/3
SLO Compliance
Service              SLO Target  Current  Budget
api-gateway          99.95%      99.99%   92%
user-service         99.9%       99.95%   68%
payment-service      99.95%      99.87%   12%
order-service        99.9%       99.99%   95%
notification-worker  99.5%       99.92%   88%
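For readers unfamiliar with the Budget column, one common way to express remaining error budget is to compare the observed failure fraction with the failure fraction the SLO allows. The sketch below shows that textbook formula only. The dashboard does not state how its Budget column is calculated or which window the Current column covers, so this formula is not expected to reproduce the figures shown. The function name and the sample values are hypothetical.

```python
def error_budget_remaining(slo_target: float, availability: float) -> float:
    """Return the fraction of error budget left over a single window.

    Both arguments are fractions (0.999 means 99.9%) measured over the same
    window. The budget is the allowed failure fraction (1 - slo_target) and
    the amount consumed is the observed failure fraction (1 - availability).
    A negative result means the budget is overspent.
    """
    allowed = 1.0 - slo_target
    if allowed <= 0:
        raise ValueError("an SLO target of 100% leaves no error budget")
    consumed = 1.0 - availability
    return 1.0 - consumed / allowed


# Illustrative values only: a 99.9% target with 99.95% observed availability
# has used about half of its budget.
remaining = error_budget_remaining(0.999, 0.9995)
print(f"{remaining:.0%}")
```

Dashboards often compute this from request counts over a rolling or calendar window rather than from a single availability percentage, which is one reason displayed budgets can differ from this simple ratio.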
MTTR / MTTD Trend (7 days): ↓ AI-driven improvement
Error Rate Trend (24h): peak 2.8% at 14:00
Recent Alerts
14:28:15 · [Prometheus] · WARNING · High CPU usage on node ip-10-0-4-23 (92%) · aggregated
14:27:42 · [Datadog] · CRITICAL · payment-service P99 latency > 2s threshold · firing
14:25:03 · [Grafana] · WARNING · Redis cluster-3 connection timeout (3 occurrences) · resolved
14:22:18 · [PagerDuty] · WARNING · user-service error rate > 1% for 5 minutes · acknowledged
14:18:55 · [Prometheus] · INFO · Disk usage on /data volume > 85% · resolved
14:15:30 · [AWS CloudWatch] · WARNING · RDS aurora-prod CPU utilization > 80% · aggregated
14:10:12 · [Custom Webhook] · INFO · SSL certificate expiring in 7 days: api.opscapital.com · acknowledged
Recent Changes
ID        Service          Type    Risk    Status
CHG-0187  payment-service  deploy  72/100  deployed
CHG-0186  user-service     config  18/100  deployed
CHG-0185  api-gateway      deploy  35/100  deployed
CHG-0184  search-service   infra   45/100  approved
Nodes: 46 of 48 total (1 warning, 1 critical)
Pods: 1238 of 1247 total (5 pending, 4 failed)
CPU Usage: 62%
Memory Usage: 71%