clickhouse-incident-runbook

Featured

ClickHouse incident response — triage, diagnose, and remediate server issues using system tables, kill stuck queries, and execute recovery procedures. Use when ClickHouse is slow, unresponsive, or producing errors in production. Trigger: "clickhouse incident", "clickhouse outage", "clickhouse down", "clickhouse emergency", "clickhouse on-call", "clickhouse broken".

AI & Automation 2,274 stars 319 forks Updated today MIT

Install

View on GitHub

Quality Score: 99/100

Stars 20%
100
Recency 20%
100
Frontmatter 20%
70
Documentation 15%
100
Issue Health 10%
50
License 10%
100
Description 5%
100

Skill Content

# ClickHouse Incident Runbook ## Overview Step-by-step procedures for triaging and resolving ClickHouse incidents using built-in system tables and SQL commands. ## Severity Levels | Level | Definition | Response | Examples | |-------|------------|----------|----------| | P1 | ClickHouse unreachable / all queries failing | < 15 min | Server down, OOM, disk full | | P2 | Degraded performance / partial failures | < 1 hour | Slow queries, merge backlog | | P3 | Minor impact / non-critical errors | < 4 hours | Single table issue, warnings | | P4 | No user impact | Next business day | Monitoring gaps, optimization | ## Quick Triage (Run First) ```bash # 1. Is ClickHouse alive? curl -sf 'http://localhost:8123/ping' && echo "UP" || echo "DOWN" # 2. Can it answer a query? curl -sf 'http://localhost:8123/?query=SELECT+1' && echo "OK" || echo "QUERY FAILED" # 3. Check ClickHouse Cloud status curl -sf 'https://status.clickhouse.cloud' | head -5 ``` ```sql -- 4. Server health snapshot (run if server responds) SELECT version() AS version, formatReadableTimeDelta(uptime()) AS uptime, (SELECT count() FROM system.processes) AS running_queries, (SELECT value FROM system.metrics WHERE metric = 'MemoryTracking') AS memory_bytes, (SELECT count() FROM system.merges) AS active_merges; -- 5. Recent errors SELECT event_time, exception_code, exception, substring(query, 1, 200) AS q FROM system.query_log WHERE type = 'ExceptionWhileProcessi...

Details

Author
jeremylongshore
Repository
jeremylongshore/claude-code-plugins-plus-skills
Created
7 months ago
Last Updated
today
Language
Python
License
MIT

Integrates with

Similar Skills

Semantically similar based on skill content — not just same category

AI & Automation Featured

clickup-incident-runbook

Execute ClickUp API incident response: triage, diagnosis, mitigation, and postmortem for API failures and integration outages. Trigger: "clickup incident", "clickup outage", "clickup down", "clickup on-call", "clickup emergency", "clickup API broken".

2,274 Updated today
jeremylongshore
AI & Automation Featured

databricks-incident-runbook

Execute Databricks incident response procedures with triage, mitigation, and postmortem. Use when responding to Databricks-related outages, investigating job failures, or running post-incident reviews for pipeline failures. Trigger with phrases like "databricks incident", "databricks outage", "databricks down", "databricks on-call", "databricks emergency", "job failed".

2,274 Updated today
jeremylongshore
AI & Automation Featured

clickhouse-debug-bundle

Collect ClickHouse diagnostic data — system tables, query logs, merge status, and server metrics for support tickets and troubleshooting. Use when investigating persistent issues, preparing debug artifacts, or collecting evidence for ClickHouse support. Trigger: "clickhouse debug", "clickhouse diagnostics", "clickhouse support bundle", "collect clickhouse logs", "clickhouse system tables".

2,274 Updated today
jeremylongshore
AI & Automation Featured

hubspot-incident-runbook

Execute HubSpot incident response with triage, mitigation, and postmortem. Use when responding to HubSpot API outages, investigating CRM errors, or running post-incident reviews for HubSpot integration failures. Trigger with phrases like "hubspot incident", "hubspot outage", "hubspot down", "hubspot on-call", "hubspot emergency", "hubspot broken".

2,274 Updated today
jeremylongshore
AI & Automation Featured

snowflake-incident-runbook

Execute Snowflake incident response with triage, rollback, and postmortem using real SQL diagnostics. Use when responding to Snowflake outages, investigating query failures, or running post-incident reviews for pipeline failures. Trigger with phrases like "snowflake incident", "snowflake outage", "snowflake down", "snowflake on-call", "snowflake emergency".

2,274 Updated today
jeremylongshore