Dashboard
System health overview · prod-us-east-1 · 11:14:30
| Metric | Value | Note |
|---|---|---|
| Active Incidents | 3 | -2 vs last hour |
| MTTD | 2.1m | ↓ 40% vs last week |
| MTTR | 8.4m | ↓ 32% vs last week |
| SLO Compliance | 99.97% | Target: 99.95% |
| Error Budget | 82% | Remaining this month |
| Change Success | 96.8% | Last 30 deploys |
Active Incidents (3)
INC-0042 · CRITICAL · INVESTIGATING · 14 min ago
payment-service P99 latency spike to 2.3s
Service: payment-service · Assignee: AI Agent (Diagnosing...)
🤖 AI Diagnosis: Correlated with DB connection pool exhaustion. Recent deploy v2.14.3 changed pool size from 50→30. Rollback recommended.
INC-0041 · WARNING · ACKNOWLEDGED · 47 min ago
user-service elevated 5xx error rate (2.1%)
Service: user-service · Assignee: alex.chen@team
🤖 AI Diagnosis: Downstream Redis cluster node-3 had intermittent timeouts. Failover triggered at 14:22 UTC. Monitoring recovery.
INC-0040 · WARNING · MITIGATED · 2h 13m ago
notification-worker message backlog > 50K
Service: notification-worker · Assignee: sarah.li@team
🤖 AI Diagnosis: Kafka consumer lag due to burst of push notifications. Scaled workers 3→8. Backlog clearing at ~2K/min.
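As a rough check on the INC-0040 figures, the time to clear the backlog can be estimated from the backlog size and the drain rate. This is a minimal sketch, not the dashboard's own calculation: `minutes_to_drain` is a hypothetical helper, the 50,000 figure is only the alert threshold (the real backlog was above it), and the ~2K/min rate is assumed to be the net drain rate after new messages are accounted for.

```python
def minutes_to_drain(backlog: int, net_drain_per_min: float) -> float:
    """Estimate minutes until a message backlog is cleared.

    backlog: number of messages currently queued (assumed value).
    net_drain_per_min: messages cleared per minute, net of new arrivals
    (assumed to be what "~2K/min" means in the diagnosis).
    """
    if net_drain_per_min <= 0:
        # A backlog that is not shrinking never clears.
        raise ValueError("net drain rate must be positive")
    return backlog / net_drain_per_min


if __name__ == "__main__":
    # 50K is the alert threshold, so this is a lower bound on the real time.
    print(minutes_to_drain(50_000, 2_000))  # 25.0
```

Because the backlog was reported only as "> 50K", 25 minutes is a lower bound on the time to clear it.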
Service Health
| Service | Status | P99 | Error Rate | Replicas |
|---|---|---|---|---|
| api-gateway | healthy | 45ms | 0.01% | 8/8 |
| user-service | warning | 120ms | 2.1% | 6/6 |
| payment-service | critical | 2300ms | 0.8% | 4/4 |
| order-service | healthy | 62ms | 0.02% | 10/10 |
| notification-worker | warning | N/A | 0% | 8/8 |
| inventory-service | healthy | 38ms | 0.01% | 4/4 |
| search-service | healthy | 55ms | 0.01% | 6/6 |
| analytics-pipeline | healthy | N/A | 0% | 3/3 |
SLO Compliance
| Service | SLO Target | Current | Budget |
|---|---|---|---|
| api-gateway | 99.95% | 99.99% ↑ | 92% |
| user-service | 99.9% | 99.95% ↓ | 68% |
| payment-service | 99.95% | 99.87% ↓ | 12% |
| order-service | 99.9% | 99.99% ↑ | 95% |
| notification-worker | 99.5% | 99.92% ↑ | 88% |
MTTR / MTTD Trend (7 days) · ↓ AI-driven improvement
Error Rate Trend (24h) · Peak: 2.8% at 14:00
Recent Alerts
| Time | Source | Severity | Alert | Status |
|---|---|---|---|---|
| 14:28:15 | Prometheus | WARNING | High CPU usage on node ip-10-0-4-23 (92%) | aggregated |
| 14:27:42 | Datadog | CRITICAL | payment-service P99 latency > 2s threshold | firing |
| 14:25:03 | Grafana | WARNING | Redis cluster-3 connection timeout (3 occurrences) | resolved |
| 14:22:18 | PagerDuty | WARNING | user-service error rate > 1% for 5 minutes | acknowledged |
| 14:18:55 | Prometheus | INFO | Disk usage on /data volume > 85% | resolved |
| 14:15:30 | AWS CloudWatch | WARNING | RDS aurora-prod CPU utilization > 80% | aggregated |
| 14:10:12 | Custom Webhook | INFO | SSL certificate expiring in 7 days: api.opscapital.com | acknowledged |
Recent Changes
| ID | Service | Type | Risk | Status |
|---|---|---|---|---|
| CHG-0187 | payment-service | deploy | 72/100 | deployed |
| CHG-0186 | user-service | config | 18/100 | deployed |
| CHG-0185 | api-gateway | deploy | 35/100 | deployed |
| CHG-0184 | search-service | infra | 45/100 | approved |
| Resource | Value | Detail |
|---|---|---|
| Nodes | 46 of 48 total | 1 warning, 1 critical |
| Pods | 1238 of 1247 total | 5 pending, 4 failed |
| CPU Usage | 62% | |
| Memory Usage | 71% | |