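Overview
SLOs turn availability and latency targets into quantities that can be measured and alerted on. This article gives example recording rules and multi-window burn rate alerts, then covers Alertmanager routing and inhibition and how to validate the setup.

Recording rules (verified)
groups: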

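  - name: recording
    rules:
      # Per-job request rate over the last minute
      - record: job:http_request_total:rate1m
        expr: sum by (job) (rate(http_requests_total[1m]))
      # Per-job ratio of failed requests to all requests over the last minute
      - record: job:http_error_ratio:rate1m
        expr: |
          sum by (job) (rate(http_requests_errors_total[1m]))
            / sum by (job) (rate(http_requests_total[1m]))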
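
The alerts later in this article evaluate 5m and 1h windows. As a rough sketch, the same error ratio could be recorded for those windows so that alert expressions and dashboards read precomputed series instead of repeating the division. The group name and the record names below are hypothetical (they only follow the naming pattern used above), and 0.001 is the error budget for an assumed 99.9% target.

```yaml
groups:
  - name: recording-slo-windows   # hypothetical group name
    rules:
      # Per-job error ratio over 5 minutes
      - record: job:http_error_ratio:rate5m
        expr: |
          sum by (job) (rate(http_requests_errors_total[5m]))
            / sum by (job) (rate(http_requests_total[5m]))
      # Per-job error ratio over 1 hour
      - record: job:http_error_ratio:rate1h
        expr: |
          sum by (job) (rate(http_requests_errors_total[1h]))
            / sum by (job) (rate(http_requests_total[1h]))
      # Burn rate = error ratio / error budget (0.001 assumes a 99.9% target).
      # Rules in a group run in order, so this reads the series recorded just above.
      - record: job:http_error_budget_burn:rate1h   # hypothetical record name
        expr: job:http_error_ratio:rate1h / 0.001
```

With these in place, an alert condition could be written as `job:http_error_ratio:rate5m > 0.14` rather than spelling out both `rate()` calls.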

Multi-window burn rate alerts
groups:

  - name: alerts
    rules:
      # Fast burn: error ratio above 14% over the last 5 minutes
      - alert: SLOErrorBudgetBurnFast
        expr: (rate(http_requests_errors_total[5m]) / rate(http_requests_total[5m])) > 0.14
        for: 5m
        labels: { severity: critical, slo: availability }
        annotations:
          summary: Error budget burning fast (5m window)
      # Slow burn: error ratio above 2% over the last hour
      - alert: SLOErrorBudgetBurnSlow
        expr: (rate(http_requests_errors_total[1h]) / rate(http_requests_total[1h])) > 0.02
        for: 2h
        labels: { severity: warning, slo: availability }
        annotations:
          summary: Error budget burning slowly (1h window)
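
Each rule above looks at a single window. A common way to make a burn rate alert truly multi-window is to require that a long window and a short window both exceed the same threshold, so the alert fires only on sustained burn and clears soon after the errors stop. The sketch below assumes a 99.9% target (error budget 0.001) and the widely used pairings of 1h/5m at burn rate 14.4 and 6h/30m at burn rate 6. In each case the threshold is burn rate × error budget, for example 14.4 × 0.001 = 0.0144. The group and alert names are hypothetical, the metric names are taken from the rules above, and mapping the 6h/30m pair to `warning` is an assumption that follows this article's severity split.

```yaml
groups:
  - name: slo-multiwindow   # hypothetical group name
    rules:
      # Fires only when both the 1h and the 5m error ratio exceed 14.4x the budget
      - alert: SLOErrorBudgetBurnFastMultiWindow   # hypothetical alert name
        expr: |
          (
            sum by (job) (rate(http_requests_errors_total[1h]))
              / sum by (job) (rate(http_requests_total[1h])) > (14.4 * 0.001)
          )
          and
          (
            sum by (job) (rate(http_requests_errors_total[5m]))
              / sum by (job) (rate(http_requests_total[5m])) > (14.4 * 0.001)
          )
        labels: { severity: critical, slo: availability }
        annotations:
          summary: Error budget burning fast (1h and 5m windows)
      # Fires only when both the 6h and the 30m error ratio exceed 6x the budget
      - alert: SLOErrorBudgetBurnSlowMultiWindow   # hypothetical alert name
        expr: |
          (
            sum by (job) (rate(http_requests_errors_total[6h]))
              / sum by (job) (rate(http_requests_total[6h])) > (6 * 0.001)
          )
          and
          (
            sum by (job) (rate(http_requests_errors_total[30m]))
              / sum by (job) (rate(http_requests_total[30m])) > (6 * 0.001)
          )
        labels: { severity: warning, slo: availability }
        annotations:
          summary: Error budget burning slowly (6h and 30m windows)
```

Because the short window already drops below the threshold within minutes of recovery, rules of this shape usually need a much shorter `for` than single-window rules, or none at all as in this sketch.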

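The thresholds are examples for a 99.9% availability target, that is, an error budget of 0.1%: an error ratio of 0.14 corresponds to a burn rate of 140, and 0.02 to a burn rate of 20. Validate them against your own traffic and business requirements before relying on them.

Alertmanager routing and inhibition
route: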

  group_by: [ alertname, job ]
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: oncall
  routes:
    - matchers:
        - severity="critical"
      receiver: pager
    - matchers:
        - slo="availability"
      receiver: slo-channel

receivers:
  # The Slack receivers need slack_api_url under global, or api_url set here
  - name: oncall
    slack_configs:
      - channel: '#oncall'
  - name: pager
    pagerduty_configs:
      - routing_key: '<redacted>'
  - name: slo-channel
    slack_configs:
      - channel: '#slo'

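inhibit_rules:
  # The fast and slow burn alerts have different alertnames, so they are matched
  # on slo and job; with alertname in `equal` this rule would never link them.
  - source_matchers: [ severity="critical" ]
    target_matchers: [ severity="warning" ]
    equal: [ slo, job ]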

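Validation and metrics
- Run `promtool check rules` to validate rule syntax, including the recording rules.
- Observe how alerts fire and get inhibited, and confirm the repeat interval and which routes they match.
- Build an SLO dashboard showing error rate, burn rate, remaining error budget, and their trends.

Common pitfalls
- Alerting on a single window only, which tends to cause missed alerts or false alarms.
- Not distinguishing severity levels and routes, which creates excessive noise.
- Having no recording rules, which makes queries more complex and more expensive.

Conclusion
With recording rules and multi-window burn rate alerts at the core, combined with Alertmanager grouping and inhibition, you get reliable SLO alerting and an operations loop that people can act on.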
