Prometheus Metric Design and Alerting Rules in Practice

YBB · 11 views · 0 comments · 0 likes

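Metric Types and Design

- Counter: a cumulative value that only goes up, resetting to zero when the process restarts. Name it with a `_total` suffix and query it through `rate()` or `increase()` rather than reading the raw value.
- Gauge: a point-in-time value that can go up or down. It suits things like inventory levels and concurrency.
- Histogram: counts observations into buckets and suits latency distributions. Quantiles are computed at query time with `histogram_quantile()`, and buckets can be aggregated across instances.
- Summary: computes quantiles inside the client and suits in-process percentiles. Its precomputed quantiles cannot be meaningfully aggregated across instances.

Collection and Naming

- Use a clear label model, e.g. `service`, `endpoint`, `status`.
- Avoid high-cardinality labels such as user IDs, because every distinct label combination creates a separate time series.

PromQL Examples

Per-second rate of 5xx responses over the last 5 minutes:

```promql
rate(http_requests_total{status=~"5.."}[5m])
```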
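The `rate()` call above turns a monotonically increasing counter into a per-second rate. The Python below is a rough sketch of that idea. It assumes samples are `(timestamp, value)` pairs, and `simple_rate` and the sample data are hypothetical. It treats a drop in value as a counter reset (a restart from zero), which is how Prometheus handles resets. It deliberately omits the extrapolation to the window edges that the real `rate()` performs, so its results will differ slightly from PromQL.

```python
from typing import List, Tuple

# (timestamp in seconds, counter value) as scraped
Sample = Tuple[float, float]


def simple_rate(samples: List[Sample]) -> float:
    """Per-second increase of a counter across the samples in one window.

    Handles counter resets like PromQL's rate(), but skips rate()'s
    extrapolation to the window boundaries (a deliberate simplification).
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    prev = samples[0][1]
    for _, value in samples[1:]:
        if value < prev:
            # Reset (e.g. process restart): the counter restarted at 0,
            # so the whole current value is new increase.
            increase += value
        else:
            increase += value - prev
        prev = value
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical 60 s scrapes of a 5xx counter, with a restart between t=120 and t=180.
scrapes = [(0, 100), (60, 130), (120, 160), (180, 20), (240, 50)]
print(f"{simple_rate(scrapes):.3f} errors/s")  # 110 errors over 240 s
```

Without the reset check, the drop from 160 to 20 would register as a large negative increase. That is why raw counter values should not be subtracted directly.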

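P95 request latency over the last 5 minutes. The bucket rates are summed by `le` before the quantile is estimated:

```promql
histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
```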
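`histogram_quantile()` never sees individual observations. It estimates the quantile from cumulative bucket counts by interpolating linearly inside the bucket that contains the target rank. The Python sketch below follows that approach under some simplifying assumptions: buckets are sorted, upper bounds are positive, the last bucket is `+Inf`, and `0 < q <= 1`. `bucket_quantile` and the bucket counts are hypothetical.

```python
import math
from typing import List, Tuple

# (upper bound "le", cumulative count), sorted by le, last le = +Inf
Bucket = Tuple[float, float]


def bucket_quantile(q: float, buckets: List[Bucket]) -> float:
    """Estimate the q-quantile from cumulative histogram buckets.

    Follows the linear-interpolation idea behind histogram_quantile(),
    assuming positive, sorted upper bounds and a final +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least two buckets, the last with le=+Inf")
    total = buckets[-1][1]
    if total == 0:
        return math.nan  # no observations in the window
    rank = q * total
    # Index of the first finite bucket whose cumulative count reaches the rank.
    b = next(
        (i for i, (_, count) in enumerate(buckets[:-1]) if count >= rank),
        len(buckets) - 1,
    )
    if b == len(buckets) - 1:
        # Rank lies in the +Inf bucket: fall back to the highest finite bound.
        return buckets[-2][0]
    start, below = (0.0, 0.0) if b == 0 else buckets[b - 1]
    end, count = buckets[b]
    # Assume observations are spread evenly inside the bucket.
    return start + (end - start) * (rank - below) / (count - below)


# Hypothetical rate()-summed buckets of http_request_duration_seconds_bucket.
buckets = [(0.1, 50), (0.25, 80), (0.5, 95), (1.0, 99), (math.inf, 100)]
print(bucket_quantile(0.95, buckets))  # 0.5
print(bucket_quantile(0.9, buckets))
```

With these counts the estimated P95 is 0.5 s, the upper bound of the bucket it falls in. Any rank beyond the last finite bucket is clamped to that bucket's bound, 1.0 s here. The estimate can only be as precise as the bucket layout allows, so it usually pays to place a bucket boundary near the 0.8 s threshold of the `HighLatencyP95` alert.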

Alerting Rules

```yaml
groups:
  - name: service-alerts
    rules:
      - alert: HighErrorRate
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: High error rate
          description: 5xx ratio has been above 5% for 10 minutes
      - alert: HighLatencyP95
        expr: histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) > 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: P95 latency too high
          description: P95 latency (5-minute window) has been above 800ms for 10 minutes
```
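With `for: 10m`, the expression must hold on every rule evaluation for 10 minutes before the alert moves from pending to firing. A single false evaluation sends it back to inactive. The sketch below models only that state machine. It assumes a 1-minute evaluation interval and a precomputed error ratio per evaluation. `alert_states` is a hypothetical helper, and details such as restoring state after a Prometheus restart are ignored.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(
    evaluations: Iterable[Tuple[float, float]],  # (evaluation time in s, error ratio)
    threshold: float = 0.05,  # "> 0.05" in HighErrorRate
    for_seconds: float = 600,  # "for: 10m"
) -> List[str]:
    """Simplified inactive -> pending -> firing model of an alerting rule."""
    states: List[str] = []
    active_since: Optional[float] = None
    for ts, ratio in evaluations:
        if ratio > threshold:
            if active_since is None:
                active_since = ts  # condition just became true: alert is pending
            held = ts - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # a single false evaluation resets the timer
            states.append("inactive")
    return states


# Hypothetical 1-minute evaluations: the 5xx ratio is 7% from t=60 s to t=660 s.
evals = [(t, 0.07 if 60 <= t <= 660 else 0.01) for t in range(0, 780, 60)]
for (ts, _), state in zip(evals, alert_states(evals)):
    print(ts, state)
```

In this run the alert is pending from t=60 s through t=600 s. It fires at t=660 s, once the condition has held for 10 minutes, and returns to inactive when the ratio drops at t=720 s. This is why `for` filters out short spikes at the cost of slower notification.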

Alertmanager Routing Example

```yaml
route:
  group_by: ['alertname']
  receiver: 'default'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'https://alert.example.com/hook'
```
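With `group_by: ['alertname']`, Alertmanager batches alerts that share an `alertname` into one notification. For example, `HighErrorRate` firing for several services arrives as a single notification. The `webhook_configs` receiver POSTs that group as a JSON document whose `alerts` array holds the individual alerts. The sketch below shows a minimal receiver built only on the Python standard library. The handler name and the summary format are assumptions, and only a subset of the payload fields is read.

```python
import json
from http.server import BaseHTTPRequestHandler
from typing import Any, Dict, List


def summarize(payload: Dict[str, Any]) -> List[str]:
    """Turn one Alertmanager webhook notification into one line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            f"[{alert.get('status', 'unknown')}] {labels.get('alertname', '?')} "
            f"severity={labels.get('severity', 'none')}: {annotations.get('summary', '')}"
        )
    return lines


class AlertWebhookHandler(BaseHTTPRequestHandler):
    """Hypothetical endpoint standing in for https://alert.example.com/hook."""

    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in summarize(payload):
            print(line)  # replace with a chat or paging integration
        self.send_response(200)
        self.end_headers()


# Example notification trimmed to the fields read above.
sample = {
    "status": "firing",
    "groupLabels": {"alertname": "HighErrorRate"},
    "alerts": [
        {
            "status": "firing",
            "labels": {"alertname": "HighErrorRate", "severity": "critical", "service": "api"},
            "annotations": {"summary": "High error rate"},
        }
    ],
}
for line in summarize(sample):
    print(line)
```

To try it locally, `HTTPServer(("", 9095), AlertWebhookHandler).serve_forever()` (after `from http.server import HTTPServer`) serves the handler. The port is arbitrary, and the receiver URL in the Alertmanager config would need to point at it.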

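Validation and Monitoring

- Check service status with the `/-/healthy` and `/-/ready` endpoints.
- Verify alerting rule syntax with `promtool` (e.g. `promtool check rules <file>`), and confirm that the rules are loaded on the Rules page of the Prometheus UI.

Summary

Well-designed metrics and alerting rules provide stable observability and make fast incident response possible.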


Category: Observability
Tags: prometheus, metric design and alerting rules in practice
Views: 11
Published: 2026-02-13 00:29:36
Permalink: https://www.ybb.press/observability/2734.html
