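Overview

Goal: use recording rules to cut query cost and standardize metric naming, and use alerting rules to implement tiered alerting and inhibition.

Applies to: governance of core metrics such as service latency and error rate, resource utilization, and queue lag.

Core concepts and practice

Recording rule example (`rules/recording.yml`):

groups:
  - name: service-latency
    rules:
      # P95 latency per job, estimated from the histogram buckets.
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum by (job, le) (rate(http_request_duration_seconds_bucket[5m])))
      # Fraction of requests answered with a 5xx status, per job.
      - record: job:error_rate:ratio
        expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m])) / sum by (job) (rate(http_requests_total[5m]))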
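The Prometheus documentation recommends the general form `level:metric:operations` for recording rule names. As a rough sketch of how the error ratio could be split up under that convention, the group below records the two rates first and then divides them. The group name and all three rule names are hypothetical and do not appear in the rule files of this article; the alert expressions would need to reference the new names if you adopted them.

```yaml
groups:
  - name: service-errors-naming-example   # hypothetical group name
    rules:
      # level = job, metric = http_requests, operation = rate over 5m
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Same aggregation level, restricted to 5xx responses.
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{code=~"5.."}[5m]))
      # Ratio built from the two recorded series above.
      - record: job:http_requests_errors:ratio_rate5m
        expr: job:http_requests_errors:rate5m / job:http_requests:rate5m
```

Recording the numerator and denominator separately keeps each series reusable in dashboards, at the cost of two extra series per job.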
Alerting rule example (`rules/alerts.yml`):

groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: job:error_rate:ratio > 0.05
        for: 10m
        labels:
          severity: critical
          service: api
        annotations:
          summary: "API error rate elevated"
          description: "Error rate above 5% for 10 minutes"
      - alert: HighLatencyP95
        expr: job:http_request_duration_seconds:p95 > 0.8
        for: 5m
        labels:
          severity: warning
          service: api
        annotations:
          summary: "API P95 latency high"
          description: "P95 above 800ms for 5 minutes"
Example of loading the rules in Prometheus:

rule_files:
  - "rules/recording.yml"
  - "rules/alerts.yml"
Alertmanager routing and inhibition (`alertmanager.yml`):

route:
  receiver: default
  group_by: ['alertname', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
  routes:
    - matchers:
        - severity="critical"
      receiver: pager
receivers:
  - name: default
    webhook_configs: [{ url: "http://ops:8080/alerts" }]
  - name: pager
    # Slack receivers also need an api_url here or slack_api_url under global.
    slack_configs: [{ channel: "#alerts", send_resolved: true }]
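The configuration above covers routing only; it contains no inhibition rules. A minimal sketch of an `inhibit_rules` section follows, assuming the intent is for a firing critical alert to mute warning alerts that share the same `service` label. The choice of `service` as the matching label is an assumption based on the labels used in the alerting rules, not something the original configuration specifies.

```yaml
# Top-level section of alertmanager.yml, alongside route and receivers.
inhibit_rules:
  - source_matchers:
      - severity="critical"     # alerts that do the muting
    target_matchers:
      - severity="warning"      # alerts that get muted
    # Only inhibit when both alerts carry the same value for these labels.
    equal: ['service']
```

With this in place, `HighLatencyP95` (warning) would stop notifying while `HighErrorRate` (critical) is firing for the same service, which is usually what you want when latency is a symptom of the same incident.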
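Verification and monitoring

Rule validation:

promtool check rules rules/recording.yml
promtool check rules rules/alerts.yml

Runtime checks:

curl -s http://prometheus:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.type=="recording")'
curl -s http://prometheus:9090/api/v1/alerts | jq .

Alert flow (amtool needs the Alertmanager address, via `--alertmanager.url` or its config file):

amtool alert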
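Beyond syntax checks, the `for` duration can be exercised offline with `promtool test rules`. The sketch below feeds a synthetic five-minute error spike into `HighErrorRate` and asserts that it never fires, since the rule requires ten minutes. The file name `alerts_test.yml`, its location next to the `rules/` directory, and the input values are all assumptions made for illustration.

```yaml
# alerts_test.yml (hypothetical file, placed next to the rules/ directory)
rule_files:
  - rules/alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Synthetic data: ratio is 10% for minutes 0-5, then drops to 1%.
    input_series:
      - series: 'job:error_rate:ratio{job="api"}'
        values: '0.1x5 0.01x10'
    alert_rule_test:
      # After 5 minutes the alert is only pending, so nothing should fire.
      - eval_time: 5m
        alertname: HighErrorRate
        exp_alerts: []
      # The spike ended before 10 minutes elapsed, so it never fired.
      - eval_time: 12m
        alertname: HighErrorRate
        exp_alerts: []
```

Run it with `promtool test rules alerts_test.yml`. A test for the firing case would additionally list the expected labels and annotations under `exp_alerts`, and those must match the rule file exactly.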
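Common pitfalls

Inconsistent recording rule names confuse downstream consumers; use an aggregation-level prefix and an operation suffix, as in `job:error_rate:ratio`.
A `for` duration that is too short causes flapping and alert storms; choose the duration based on historical data.
Missing routing and inhibition configuration leads to duplicate notifications; group by `severity`/`service` and inhibit related alerts.

Conclusion

With recording rules and a tiered alerting strategy, Prometheus and Alertmanager can form a stable, maintainable monitoring setup, with `promtool` and `amtool` checks guarding its quality.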
