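Overview

Goal: use recording rules to cut query cost and standardize metric names, and use alerting rules to implement tiered alerts with inhibition.

Applies to: governance of core metrics such as service latency and error rate, resource utilization, and queue lag.

Core practice

Recording rule example (`rules/recording.yml`):

    groups:
      - name: service-latency
        rules:
          # P95 latency per job. histogram_quantile needs per-second bucket
          # rates, and the `le` label must survive the aggregation.
          - record: job:http_request_duration_seconds:p95
            expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
          # Share of requests per job that returned a 5xx status.
          - record: job:error_rate:ratio
            expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))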
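Prometheus's documented convention for recording rule names is `level:metric:operations`. The sketch below shows two request-rate rules that follow it. The group name `service-traffic` and both rule names are illustrative assumptions, not part of the rules above.

```yaml
groups:
  - name: service-traffic   # hypothetical group name
    rules:
      # level = job, metric = http_requests, operation = rate5m
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Keeping the status code as a second level allows per-code breakdowns.
      - record: job_code:http_requests:rate5m
        expr: sum by (job, code) (rate(http_requests_total[5m]))
```

Naming the aggregation level first makes it clear which labels a dashboard or alert can still filter on.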

Alerting rule example (`rules/alerts.yml`):

    groups:
      - name: service-alerts
        rules:
          - alert: HighErrorRate
            expr: job:error_rate:ratio > 0.05
            for: 10m
            labels:
              severity: critical
              service: api
            annotations:
              summary: "API error rate is elevated"
              description: "Error rate has been above 5% for 10 minutes"
          - alert: HighLatencyP95
            expr: job:http_request_duration_seconds:p95 > 0.8
            for: 5m
            labels:
              severity: warning
              service: api
            annotations:
              summary: "API P95 latency is high"
              description: "P95 has been above 800ms for 5 minutes"
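Tiered alerting often pairs a critical rule with a lower-threshold warning on the same signal, so that a degradation is visible before it pages anyone. Below is a minimal sketch of such a companion to `HighErrorRate`. The alert name `ElevatedErrorRate`, the 1% threshold, and the group name are assumptions chosen for illustration; a real threshold should come from historical data.

```yaml
groups:
  - name: service-alerts-warning   # hypothetical group name
    rules:
      - alert: ElevatedErrorRate   # hypothetical warning tier
        expr: job:error_rate:ratio > 0.01
        for: 10m
        labels:
          severity: warning
          service: api
        annotations:
          summary: "API error rate is above the warning threshold"
          description: "Error rate has been above 1% for 10 minutes"
```

Once the ratio exceeds 5%, both tiers fire, which is the situation Alertmanager inhibition is meant to handle.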

Loading the rules in the Prometheus configuration:

    rule_files:
      - "rules/recording.yml"
      - "rules/alerts.yml"

Alertmanager routing and inhibition (`alertmanager.yml`):

    route:
      receiver: default
      group_by: ['alertname', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 2h
      routes:
        - matchers:
            - severity="critical"
          receiver: pager
    receivers:
      - name: default
        webhook_configs: [{ url: "http://ops:8080/alerts" }]
      - name: pager
        # Requires `slack_api_url` in the global section or `api_url` on this entry.
        slack_configs: [{ channel: "#alerts", send_resolved: true }]
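The configuration above covers routing only. Inhibition is configured in a top-level `inhibit_rules` section of the same file. Here is a minimal sketch, assuming Alertmanager 0.22 or later (the version that introduced the `matchers` syntax already used in the route) and assuming that a critical alert should mute warnings for the same service:

```yaml
inhibit_rules:
  # While a critical alert fires for a service, mute warning alerts
  # that carry the same `service` label value.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['service']
```

With this rule, `HighLatencyP95` (warning, `service: api`) would be muted while `HighErrorRate` (critical, `service: api`) is firing. Whether that is desirable depends on whether the two alerts share a root cause often enough to justify it.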

Verification and monitoring

Rule validation:

    promtool check rules rules/recording.yml
    promtool check rules rules/alerts.yml

Runtime checks:

    curl -s http://prometheus:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type=="recording")'
    curl -s http://prometheus:9090/api/v1/alerts | jq .

Alert flow:

    amtool alert

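Common pitfalls

Inconsistent recording rule names confuse downstream consumers; use a hierarchical prefix and a suffix that states the purpose.

A `for` duration that is too short causes flapping and alert storms; choose the duration from historical data.

Missing routing and inhibition leads to duplicate notifications; group by `severity`/`service` and inhibit related alerts.

Conclusion

With recording rules and a tiered alerting strategy, Prometheus and Alertmanager can form a stable, maintainable monitoring system, with tool-based validation safeguarding quality.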
