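Overview

An SLO turns availability and latency targets into something that can be measured and alerted on. This article gives example recording rules and multi-window burn-rate alerts, followed by an Alertmanager routing and inhibition strategy and ways to verify the setup.

Recording rules (verified)

    groups:
      - name: recording
        rules:
          # Aggregate to job level so the output matches the "job:" prefix
          - record: job:http_requests_total:rate1m
            expr: sum by (job) (rate(http_requests_total[1m]))
          - record: job:http_error_ratio:rate1m
            expr: |
              sum by (job) (rate(http_requests_errors_total[1m]))
              /
              sum by (job) (rate(http_requests_total[1m]))

Multi-window burn-rate alerts

    groups:
      - name: alerts
        rules:
          - alert: SLOErrorBudgetBurnFast
            # Per-job error ratio over a 5m window
            expr: |
              (
                sum by (job) (rate(http_requests_errors_total[5m]))
                /
                sum by (job) (rate(http_requests_total[5m]))
              ) > 0.14
            for: 5m
            labels: { severity: critical, slo: availability }
            annotations:
              summary: Fast error budget burn (5m window)
          - alert: SLOErrorBudgetBurnSlow
            # Per-job error ratio over a 1h window
            expr: |
              (
                sum by (job) (rate(http_requests_errors_total[1h]))
                /
                sum by (job) (rate(http_requests_total[1h]))
              ) > 0.02
            for: 2h
            labels: { severity: warning, slo: availability }
            annotations:
              summary: Slow error budget burn (1h window)

The thresholds are examples for a 99.9% availability target and need to be validated against your own traffic. For that target the error budget is 0.1%, so the burn rate is the observed error ratio divided by 0.001.

Alertmanager routing and inhibition

    route:
      group_by: [alertname, job]
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
      receiver: oncall
      routes:
        - matchers:
            - severity="critical"
          receiver: pager
        - matchers:
            - slo="availability"
          receiver: slo-channel
    receivers:
      - name: oncall
        slack_configs:
          - channel: '#oncall'
      - name: pager
        pagerduty_configs:
          - routing_key: '<redacted>'
      - name: slo-channel
        slack_configs:
          - channel: '#slo'
    inhibit_rules:
      # The fast and slow alerts have different alertnames, so match on job and slo
      - source_matchers: [severity="critical"]
        target_matchers: [severity="warning"]
        equal: [job, slo]

Verification and metrics

- Use `promtool check rules` to validate the syntax of the rule files, including the recording rules.
- Observe how alerts fire and get inhibited, and check the repeat interval and which routes are matched.
- Build an SLO dashboard showing error ratio, burn rate, remaining error budget, and trend.

Common pitfalls

- Alerting on a single window only, which tends to miss incidents or fire falsely.
- Not separating severity levels and routes, which produces too much noise.
- Having no recording rules, which makes queries more complex and more expensive.

Conclusion

With recording rules and multi-window burn-rate alerts at the core, combined with Alertmanager grouping and inhibition, you get reliable SLO alerting and an actionable operations loop.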
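
The alert thresholds above are written as raw error ratios. As a rough illustration of how such ratios relate to burn rate for a given availability target, the arithmetic can be sketched in a few lines of Python. The function names are hypothetical, the 30-day budget period is an assumption, and the 14.4 factor is a commonly cited fast-burn value rather than something taken from the rules in this article.

```python
"""Relate error-ratio thresholds to SLO burn rates (illustrative sketch)."""


def error_budget(slo_target: float) -> float:
    """Fraction of requests allowed to fail, e.g. about 0.001 for a 99.9% target."""
    return 1.0 - slo_target


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / error_budget(slo_target)


def ratio_threshold(target_burn_rate: float, slo_target: float) -> float:
    """Error-ratio threshold that corresponds to a desired burn rate."""
    return target_burn_rate * error_budget(slo_target)


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = 30 * 24) -> float:
    """Share of the period's budget used if `rate` is sustained for the window.

    The 30-day default period is an assumption, not a value from the article.
    """
    return rate * window_hours / period_hours


if __name__ == "__main__":
    slo = 0.999  # availability target named in the article
    for threshold in (0.14, 0.02):  # thresholds used in the alert rules
        print(f"ratio {threshold} -> burn rate {burn_rate(threshold, slo):.1f}")
    # Assumed fast-burn factor: 14.4x sustained for one hour
    print(f"ratio for 14.4x: {ratio_threshold(14.4, slo):.4f}")
    print(f"budget used in 1h at 14.4x: {budget_consumed(14.4, 1):.2%}")
```

Deriving thresholds from the target in this way, instead of hard-coding them, makes it easier to see what a given number means for the error budget when the SLO changes.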
