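Overview

An SLO turns availability and latency targets into something you can measure and alert on. This article provides examples of recording rules and multi-window burn rate alerts, along with routing and inhibition strategies and ways to verify them.

Recording rules (verified)

groups:
  - name: recording
    rules:
      # Aggregate by job so the output series match the "job:" level in the rule name.
      - record: job:http_request_total:rate1m
        expr: sum by (job) (rate(http_requests_total[1m]))
      - record: job:http_error_ratio:rate1m
        expr: sum by (job) (rate(http_requests_errors_total[1m])) / sum by (job) (rate(http_requests_total[1m]))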
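
The recording rules above can also be exercised with a promtool unit test, not only a syntax check. The following is a minimal sketch under stated assumptions: the rule file name `recording.rules.yml`, the `job="api"` label, and the synthetic counter values are all hypothetical and not taken from the original setup. The error counter grows at 1/64 of the request counter, so the expected ratio is a value that floating point represents exactly.

```yaml
# recording.rules.test.yml (hypothetical file name)
rule_files:
  - recording.rules.yml        # assumed to contain the "recording" group above

evaluation_interval: 1m

tests:
  - interval: 15s              # spacing of the synthetic samples below
    input_series:
      # Request counter grows by 64 per sample, error counter by 1 per sample,
      # so the error ratio is 1/64 = 0.015625.
      - series: 'http_requests_total{job="api"}'
        values: '0+64x20'
      - series: 'http_requests_errors_total{job="api"}'
        values: '0+1x20'
    promql_expr_test:
      - expr: job:http_error_ratio:rate1m
        eval_time: 5m
        exp_samples:
          - labels: 'job:http_error_ratio:rate1m{job="api"}'
            value: 0.015625
```

Run it with `promtool test rules recording.rules.test.yml`. The input series carry only a `job` label so that the expected output labels stay simple.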

Multi-window burn rate alerts

groups:
  - name: alerts
    rules:
      - alert: SLOErrorBudgetBurnFast
        expr: (rate(http_requests_errors_total[5m]) / rate(http_requests_total[5m])) > 0.14
        for: 5m
        labels: { severity: critical, slo: availability }
        annotations:
          summary: Error budget is burning fast (5m window)
      - alert: SLOErrorBudgetBurnSlow
        expr: (rate(http_requests_errors_total[1h]) / rate(http_requests_total[1h])) > 0.02
        for: 2h
        labels: { severity: warning, slo: availability }
        annotations:
          summary: Error budget is burning slowly (1h window)
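
Each rule above evaluates one window. A common variant of the multi-window pattern requires a long window and a short window to exceed the same threshold at once, so the alert fires only on a sustained burn and resolves soon after the errors stop. The sketch below assumes a 99.9% target and a 30-day SLO period. It uses a burn rate of 14.4, which spends 2% of the 30-day budget in one hour, so the threshold is 14.4 x (1 - 0.999) = 0.0144. The alert name, the 1h/5m window pair, and `for: 2m` are illustrative assumptions, not values from the original configuration.

```yaml
groups:
  - name: slo-burn-rate-multiwindow        # hypothetical group name
    rules:
      - alert: SLOErrorBudgetBurnFastMultiWindow   # hypothetical alert name
        # Threshold = burn rate x error budget = 14.4 x 0.001 = 0.0144.
        # The long window (1h) shows the burn is significant; the short
        # window (5m) shows it is still happening right now.
        expr: |
          (
            sum by (job) (rate(http_requests_errors_total[1h]))
              / sum by (job) (rate(http_requests_total[1h])) > 0.0144
          )
          and
          (
            sum by (job) (rate(http_requests_errors_total[5m]))
              / sum by (job) (rate(http_requests_total[5m])) > 0.0144
          )
        for: 2m
        labels:
          severity: critical
          slo: availability
        annotations:
          summary: Error budget burning at more than 14.4x (1h and 5m windows)
```

Both sides of `and` are aggregated by `job`, so they match on that label and the resulting alert keeps it for grouping in Alertmanager.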
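
The thresholds in SLOErrorBudgetBurnFast and SLOErrorBudgetBurnSlow are examples for a 99.9% availability target, whose error budget is 0.1%: an error ratio of 0.14 corresponds to a burn rate of 140, and 0.02 to a burn rate of 20. Validate them against your own traffic and business requirements before using them.

Alertmanager routing and inhibition

route:
  group_by: [ alertname, job ]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: oncall
  routes:
    - matchers:
        - severity="critical"
      receiver: pager
    - matchers:
        - slo="availability"
      receiver: slo-channel
receivers:
  - name: oncall
    slack_configs:
      - channel: '#oncall'
  - name: pager
    pagerduty_configs:
      - routing_key: '<redacted>'
  - name: slo-channel
    slack_configs:
      - channel: '#slo'
inhibit_rules:
  - source_matchers: [ severity="critical" ]
    target_matchers: [ severity="warning" ]
    # The fast and slow alerts have different alertnames, so match on the
    # labels they share; with alertname in this list the rule would never apply.
    equal: [ slo, job ]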
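
One detail of the routing tree above is worth checking: Alertmanager stops at the first matching child route by default. A critical alert that also carries slo="availability" is therefore sent to pager only and never reaches slo-channel. If the intent is to notify both (an assumption about the desired behavior, not something the original states), a minimal sketch is to set `continue: true` on the first route:

```yaml
route:
  group_by: [ alertname, job ]
  receiver: oncall
  routes:
    - matchers:
        - severity="critical"
      receiver: pager
      continue: true          # keep evaluating the sibling routes after this match
    - matchers:
        - slo="availability"
      receiver: slo-channel
```

To see which receivers a given label set resolves to, `amtool config routes test --config.file=alertmanager.yml severity=critical slo=availability` prints the matching receivers; the file name here is a placeholder.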
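
Verification and metrics

- Use `promtool check rules` to validate the syntax of the recording and alerting rules.
- Observe how alerts fire and get inhibited, how often notifications repeat, and which routes they match.
- Build an SLO dashboard showing error rate, burn rate, remaining error budget, and trends.

Common pitfalls

- Alerting on a single window only, which tends to cause missed or false alerts.
- Not separating severity levels and routes, which creates too much noise.
- Skipping recording rules, which makes queries more complex and more expensive.

Conclusion

With recording rules and multi-window burn rate alerts at the core, combined with Alertmanager grouping and inhibition, you get reliable SLO alerting and an actionable operations loop.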
