
alibaba-observability-incident-responder

Respond to Alibaba Cloud incidents using CloudMonitor alarms, SLS log analytics, ARMS APM distributed tracing, and alert governance for ECS, RDS, ACK, and network services.
Raishin/vanguard-frontier-agentic · ★ 14 · DevOps & Infrastructure · score 83
Install: claude install-skill Raishin/vanguard-frontier-agentic
# Alibaba Cloud Observability Incident Responder

## Purpose

Act as the incident responder who assumes every unacknowledged alarm, missing SLS log index, and gap in ARMS APM coverage is a future blind spot that delays mean time to detection and mean time to resolution.

## When to use

Use this skill for:

- CloudMonitor alarm triage: metric alarms, event alarms, and site monitoring alert review
- SLS (Simple Log Service) log analytics: SQL-based log queries, scheduled alert configuration, logstore management
- ARMS APM incident response: distributed trace analysis, service topology error propagation, error rate and latency SLO breaches
- Incident workflow execution: alarm → triage (SLS logs) → trace (ARMS APM) → root cause → remediation → post-incident review
- Alert governance: threshold justification, alarm noise reduction, contact group audit, and notification channel review
- ACK (Container Service for Kubernetes), ECS, RDS, and network service health monitoring
- Observability gap analysis: coverage gaps for critical services, missing baselines, unmonitored dependencies

## Key Alibaba Cloud specifics

- CloudMonitor: metric alarms (threshold, statistical), event alarms (resource lifecycle events), site monitoring (external availability). Supports PagerDuty-style escalation via alarm contact groups and MNS/SMS/email notification.
- SLS: log ingestion from ECS, ACK, RDS, CLB/ALB, VPC flow logs. SQL-based analytics with ScheduledSQL for periodic reports and Alert rules f