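## Overview and core value

Prometheus provides powerful time-series monitoring and rule evaluation. With recording rules, alerting rules, and Alertmanager, you can build reliable monitoring and alerting for service availability, latency, and error rate.

## Key rule examples

### Recording rules (pre-aggregation)

```yaml
groups:
  - name: recording.rules
    rules:
      # P95 request latency per job, computed from histogram buckets.
      - record: job:http_request_duration_seconds:p95
        expr: histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, job))
      # Share of requests per job that returned a 5xx status.
      - record: job:http_error_rate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (job) / sum(rate(http_requests_total[5m])) by (job)
```

### Alerting rules (error rate and availability)

```yaml
groups:
  - name: alerts.rules
    rules:
      - alert: HighErrorRate
        expr: job:http_error_rate > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: High error rate
          description: "{{ $labels.job }} error rate has exceeded 5% for 10 minutes"
      - alert: ServiceUnavailable
        # The status regex matches every 5xx code plus 401 and 403.
        expr: sum(rate(http_requests_total{status=~"5..|4(0[13])"}[5m])) by (job) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: Service availability alert
          description: "{{ $labels.job }} is showing unavailability and needs investigation"
```

### Alertmanager routing

```yaml
route:
  # The root route must name a receiver; Alertmanager rejects the config otherwise.
  receiver: default
  group_by: ['alertname']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 2h
receivers:
  - name: default
    webhook_configs:
      - url: http://ops.example.com/alerts
```

## Parameters and verification

Environment: `Prometheus v2.47+`, `Alertmanager v0.27+`.

Verification points:

- Recording rules reduce query cost, and dashboards respond faster.
- An error rate above 5% sustained for 10m fires the alert, and routing delivers it.
- The P95 latency curve is available, and its trend matches what the business observes.

## Best practices

- Use recording rules to precompute complex aggregations so that dashboards and alerts read the stored result.
- Set `for` on alerts to avoid false positives from transient spikes.
- Align thresholds and windows with the service's indicator and objective definitions (SLI/SLO).

## Conclusion

Recording and alerting rules, combined with Alertmanager routing, give you a stable and reliable monitoring and alerting system whose metrics and thresholds can be verified and audited.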
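
The verification points above can be partly checked offline with promtool's rule unit tests, before anything reaches production. The following is a minimal sketch, not a complete test suite. It assumes that both rule groups from this article are saved under a single top-level `groups:` key in a hypothetical file named `rules.yml`; the `api` job name and the synthetic counter values are also made up for illustration.

```yaml
# tests.yml: run with `promtool test rules tests.yml`.
# Assumption: rules.yml (hypothetical name) holds the recording and
# alerting groups shown in this article.
rule_files:
  - rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      # Synthetic traffic: the 5xx and 2xx counters grow at the same pace,
      # so exactly half of all requests are errors.
      - series: 'http_requests_total{job="api", status="500"}'
        values: '0+30x20'
      - series: 'http_requests_total{job="api", status="200"}'
        values: '0+30x20'

    promql_expr_test:
      # The recording rule should report a 50% error ratio for the job.
      - expr: job:http_error_rate
        eval_time: 10m
        exp_samples:
          - labels: 'job:http_error_rate{job="api"}'
            value: 0.5

    alert_rule_test:
      # At 5m the condition has held for less than `for: 10m`, so the alert
      # is only pending. Leaving exp_alerts empty asserts that nothing fires.
      - eval_time: 5m
        alertname: HighErrorRate
```

To also assert that the alert fires once the `for` window has elapsed, add a second `alert_rule_test` entry at a later `eval_time` with an `exp_alerts` list; its `exp_labels` and `exp_annotations` must match the rule's rendered labels and annotation text exactly.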
