Higress Observability Architecture: The Complete Guide
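This document describes a production-grade observability architecture for the Higress gateway and how to implement it. It covers the full stack: metrics monitoring, log collection, distributed tracing, alerting and notification, and visualization.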
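Table of contents:
- Observability architecture overview
- Metrics monitoring
- Logging
- Tracing
- Alerting and notification
- Dashboards
- Deployment guide
- Operations best practices
- FAQ
- Performance benchmarks and cost estimates
- Technology selection guide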
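Quick Start
This chapter helps you stand up an observability stack for Higress in about five minutes.
| Requirement | Version | Notes |
|---|---|---|
| Kubernetes | v1.20+ | kubectl configured |
| Helm | 3.x | Package manager |
| Available memory | ≥4GB | Minimum for the monitoring stack |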
```bash
#!/bin/bash
# Quick deployment script for the Higress observability stack

# 1. Create the namespace
kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -

# 2. Deploy the Prometheus Operator stack (includes Grafana)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=7d \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set grafana.adminPassword=admin \
  --set grafana.persistence.enabled=true

# 3. Deploy Loki (log collection)
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true

# 4. Deploy Tempo (distributed tracing)
helm install tempo grafana/tempo --namespace monitoring

# 5. Configure Higress monitoring
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: higress-gateway
  namespace: higress-system
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: higress-gateway
  endpoints:
    - port: http-monitoring
      interval: 30s
      path: /stats/prometheus
EOF

echo "✅ Deployment complete!"
```

```bash
# Check the status of all components
kubectl get pods -n monitoring

# Open Grafana (username: admin, password: admin; change it for anything beyond a local trial)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Visit http://localhost:3000

# Verify Prometheus scraping
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Visit http://localhost:9090/targets and confirm that higress-gateway is UP
```

Deployment verification checklist
- Prometheus is scraping Higress metrics
- Loki is collecting logs
- Tempo is receiving trace data
- Grafana data sources are configured correctly
- Alert rules are active
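The first checklist item can also be checked programmatically. The sketch below parses a response from Prometheus's `/api/v1/targets` endpoint and lists the targets whose health is not `up`. The sample JSON and the function name `unhealthyTargets` are illustrative assumptions; only the field names `activeTargets`, `scrapeUrl`, and `health` come from the Prometheus HTTP API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// targetsResponse models only the fields of /api/v1/targets used here.
type targetsResponse struct {
	Data struct {
		ActiveTargets []struct {
			ScrapeURL string `json:"scrapeUrl"`
			Health    string `json:"health"`
		} `json:"activeTargets"`
	} `json:"data"`
}

// unhealthyTargets returns the scrape URLs of all targets not reporting "up".
func unhealthyTargets(body []byte) ([]string, error) {
	var resp targetsResponse
	if err := json.Unmarshal(body, &resp); err != nil {
		return nil, err
	}
	var down []string
	for _, t := range resp.Data.ActiveTargets {
		if t.Health != "up" {
			down = append(down, t.ScrapeURL)
		}
	}
	return down, nil
}

func main() {
	// Hypothetical response body, trimmed to the relevant fields.
	sample := []byte(`{"status":"success","data":{"activeTargets":[
		{"scrapeUrl":"http://10.0.0.1:15020/stats/prometheus","health":"up"},
		{"scrapeUrl":"http://10.0.0.2:15020/stats/prometheus","health":"down"}]}}`)
	down, err := unhealthyTargets(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println("unhealthy targets:", down)
}
```

In practice the body would come from an HTTP GET against the port-forwarded Prometheus address rather than an inline sample.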
1. Observability Architecture Overview
1.1 Overall architecture
Higress's observability architecture is built on the three cloud-native pillars (metrics, logs, and traces), combined with alerting and visualization:
```mermaid
flowchart TB
    subgraph "Data collection layer"
        GW[Higress Gateway Pods]
        Console[Higress Console]
        subgraph "Collectors inside the gateway"
            Statsd[Statsd Exporter]
            OTEL[OpenTelemetry Collector]
            AL[Access Log]
        end
    end
    subgraph "Data processing layer"
        PROM[Prometheus<br/>Metrics storage]
        LOKI[Loki<br/>Log aggregation]
        TEMPO[Tempo<br/>Trace storage]
    end
    subgraph "Alerting layer"
        ALERTMGR[Alertmanager]
        CHANNELS[Alert channels<br/>DingTalk/Slack/Email]
    end
    subgraph "Visualization layer"
        GRAFANA[Grafana<br/>Unified dashboards]
    end
    GW --> Statsd --> PROM
    GW --> OTEL --> PROM
    OTEL --> TEMPO
    GW --> AL --> LOKI
    PROM --> ALERTMGR --> CHANNELS
    PROM --> GRAFANA
    LOKI --> GRAFANA
    TEMPO --> GRAFANA
```
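1.2 Design principles for the three pillars
| Pillar | Goal | Key technology | Retention | Typical use |
|---|---|---|---|---|
| Metrics | Detect problems and trends | Prometheus + Statsd | 15 days (hot) + 90 days (cold) | Trend analysis, capacity planning |
| Logs | Locate root causes | Loki + Kafka | 7 days (hot) + 30 days (cold) | Troubleshooting, auditing |
| Traces | Analyze call paths | OpenTelemetry + Tempo | 7 days | Performance analysis, dependency analysis |
1.3 Data flow design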
```mermaid
sequenceDiagram
    participant Client as Client
    participant GW as Higress Gateway
    participant OTEL as OTEL Collector
    participant PROM as Prometheus
    participant LOKI as Loki
    participant TEMPO as Tempo
    participant GRAFANA as Grafana
    Client->>GW: HTTP request
    par Parallel collection
        GW->>GW: Collect request metrics
        GW->>GW: Generate access log
        GW->>GW: Inject trace context
    end
    GW->>OTEL: Push telemetry data
    OTEL->>PROM: Write metrics
    OTEL->>LOKI: Write logs
    OTEL->>TEMPO: Write traces
    GW->>Client: Return response
    Note over GRAFANA: Query and analysis
    GRAFANA->>PROM: Query metrics
    GRAFANA->>LOKI: Query logs
    GRAFANA->>TEMPO: Query traces
```
2. Metrics Monitoring
| Component | Purpose | Key configuration | Check command |
|---|---|---|---|
| Statsd Exporter | Metric format conversion | Mapping rules | `curl localhost:9102/metrics` |
| ServiceMonitor | Automatic discovery | Scrape interval | `kubectl get servicemonitor` |
| Prometheus | Metrics storage | Retention time | Open the /targets page |
2.1 Prometheus integration architecture
Higress is built on Envoy and natively supports Prometheus metrics collection. In this architecture, metrics are emitted over the StatsD protocol and converted to Prometheus format by the Statsd Exporter.
```mermaid
flowchart LR
    subgraph "Higress Gateway"
        Envoy[Envoy<br/>Metric generation]
        StatsdExporter[Statsd Exporter<br/>Metric conversion]
    end
    subgraph "Prometheus ecosystem"
        PROM[Prometheus<br/>Metric collection]
        ALERTMGR[Alertmanager]
        GRAFANA[Grafana]
    end
    Envoy -->|statsd UDP| StatsdExporter
    StatsdExporter -->|/metrics| PROM
    PROM --> ALERTMGR
    PROM --> GRAFANA
```
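To make the conversion step concrete, here is a minimal sketch of how a glob-style mapping can turn a dotted StatsD metric name into Prometheus labels. It is not the Statsd Exporter's actual implementation; the function name `mapStatsdName` and the sample metric name are assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// mapStatsdName matches a dotted StatsD name against a glob pattern in which
// "*" stands for exactly one dot-separated component. It returns the captured
// components in order, so captures[0] plays the role of $1, captures[1] of $2.
func mapStatsdName(pattern, name string) ([]string, bool) {
	p := strings.Split(pattern, ".")
	n := strings.Split(name, ".")
	if len(p) != len(n) {
		return nil, false
	}
	var captures []string
	for i := range p {
		switch {
		case p[i] == "*":
			captures = append(captures, n[i])
		case p[i] != n[i]:
			return nil, false
		}
	}
	return captures, true
}

func main() {
	// Hypothetical metric name as the gateway might emit it.
	caps, ok := mapStatsdName("ingress.*.request.*", "ingress.backend-service.request.200")
	if ok {
		fmt.Printf("destination_service=%s response_code=%s\n", caps[0], caps[1])
	}
}
```

Because components are split on dots, a service name that itself contains dots would not match this kind of pattern, which is one reason to keep emitted names simple.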
2.2 Deploying the Statsd Exporter
Core configuration: metric mapping rules
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: higress-statsd-mapper
  namespace: higress-system
data:
  mapper.conf: |
    mappings:
      # Request metrics ($1 and $2 refer to the first and second "*" in match)
      - match: "ingress.*.request.*"
        name: "higress_request_total"
        labels:
          destination_service: "$1"
          response_code: "$2"

      # Latency metrics
      - match: "ingress.*.duration.*"
        name: "higress_request_duration_milliseconds"
        labels:
          destination_service: "$1"
          quantile: "$2"

      # Connection metrics
      - match: "ingress.*.cx.*"
        name: "higress_connection_count"
        labels:
          destination_service: "$1"
          state: "$2"
```
Deployment configuration
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: higress-statsd-exporter
  namespace: higress-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: higress-statsd-exporter
  template:
    metadata:
      labels:
        app: higress-statsd-exporter  # must match spec.selector
    spec:
      containers:
        - name: statsd-exporter
          image: prom/statsd-exporter:latest  # pin a specific version in production
          args:
            - --statsd.mapping-config=/etc/statsd/mapper.conf
            - --statsd.listen-udp=:9125
            - --web.listen-address=:9102
          ports:
            - containerPort: 9125
              protocol: UDP
              name: statsd
            - containerPort: 9102
              protocol: TCP
              name: metrics
          volumeMounts:
            - name: config
              mountPath: /etc/statsd
      volumes:
        - name: config
          configMap:
            name: higress-statsd-mapper
```
2.3 Prometheus scrape configuration
ServiceMonitor configuration (recommended)
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: higress-gateway
  namespace: higress-system
  labels:
    release: prometheus  # must match the Prometheus selector
spec:
  selector:
    matchLabels:
      app: higress-gateway
  endpoints:
    - port: http-monitoring
      interval: 30s
      path: /stats/prometheus
```
Note: make sure the ServiceMonitor's labels match the Prometheus serviceMonitorSelector.
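The matching rule in the note above is plain subset matching: every key/value pair in the selector's `matchLabels` must be present on the object. A minimal sketch, assuming a hypothetical helper `selectorMatches` (this is not Kubernetes client code, and it ignores `matchExpressions`):

```go
package main

import "fmt"

// selectorMatches reports whether every key/value pair in matchLabels is
// present in labels. In this sketch an empty matchLabels matches any label set.
func selectorMatches(matchLabels, labels map[string]string) bool {
	for k, want := range matchLabels {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"release": "prometheus"}

	// Labels on the ServiceMonitor from this section, plus one extra label.
	monitor := map[string]string{"release": "prometheus", "team": "gateway"}
	fmt.Println(selectorMatches(selector, monitor))

	// A ServiceMonitor without the release label is not selected.
	fmt.Println(selectorMatches(selector, map[string]string{"app": "higress-gateway"}))
}
```

Extra labels on the ServiceMonitor are harmless; a missing or different `release` value is the usual reason a target never appears on the /targets page.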
2.4 Key metric definitions
RED metrics (Rate, Errors, Duration)
```promql
# Rate: requests per second
sum(rate(higress_request_total{destination_service="backend-service",response_code="200"}[5m]))

# Errors: 5xx responses per second
rate(higress_request_total{response_code=~"5.."}[5m])

# Duration: P95
histogram_quantile(0.95, sum(rate(higress_request_duration_milliseconds_bucket[5m])) by (le, destination_service))
```
USE metrics (Utilization, Saturation, Errors)
```promql
# CPU usage
rate(process_cpu_seconds_total{pod=~"higress-gateway.*"}[5m])

# Memory usage
container_memory_usage_bytes{container="higress-gateway"}

# Connection count (saturation)
higress_connection_count{state="active"}

# Network traffic
rate(container_network_receive_bytes_total{container="higress-gateway"}[5m])
rate(container_network_transmit_bytes_total{container="higress-gateway"}[5m])
```
```promql
# Request rate by domain
sum(rate(higress_request_total[5m])) by (authority)

# P95 latency by service
histogram_quantile(0.95, sum(rate(higress_request_duration_milliseconds_bucket[5m])) by (le, destination_service))

# Share of requests by status code
sum(rate(higress_request_total[5m])) by (response_code) / sum(rate(higress_request_total[5m]))
```
2.5 Custom business metrics
Reporting custom metrics through a Wasm plugin
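```go
// Wasm plugin example: a custom business metric
package main

import (
	"github.com/tetratelabs/proxy-wasm-go-sdk/proxywasm"
	"github.com/tetratelabs/proxy-wasm-go-sdk/proxywasm/types"
)

func main() {
	proxywasm.SetVMContext(&vmContext{})
}

type vmContext struct {
	types.DefaultVMContext
}

func (*vmContext) NewPluginContext(contextID uint32) types.PluginContext {
	return &pluginContext{}
}

type pluginContext struct {
	types.DefaultPluginContext
}

// HTTP callbacks live on the per-request HTTP context, not the plugin context.
func (*pluginContext) NewHttpContext(contextID uint32) types.HttpContext {
	return &httpContext{}
}

type httpContext struct {
	types.DefaultHttpContext
}

func (*httpContext) OnHttpRequestHeaders(numHeaders int, endOfStream bool) types.Action {
	// Custom metric: count calls per API name.
	// Embedding a header value in the metric name creates one series per
	// distinct value, so restrict x-api-name to a small, known set.
	apiName, err := proxywasm.GetHttpRequestHeader("x-api-name")
	if err == nil && apiName != "" {
		proxywasm.DefineCounterMetric("api_call_count_" + apiName).Increment(1)
	}
	return types.ActionContinue
}
```
2.6 AI plugin metrics configuration
Higress ships several AI-related plugins with built-in Prometheus metrics reporting.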
AI Statistics plugin
```yaml
apiVersion: extensions.higress.io/v1alpha1
kind: WasmPlugin
metadata:
  name: ai-statistics
  namespace: higress-system
spec:
  url: file:///etc/wasm-plugins/ai-statistics.wasm
  phase: AUTHN
  priority: 100
  defaultConfig:
    enableMetrics: true
    metricPrefix: "higress_ai_stats"
    metrics:
      requestCount:
        enabled: true
        name: "request_total"
        type: "counter"
        labels: [route_name, upstream_host, model_name, status_code]
      requestDuration:
        enabled: true
        name: "request_duration_seconds"
        type: "histogram"
        buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
      tokenCount:
        enabled: true
        name: "token_total"
        type: "counter"
        labels: [model_name, token_type, user_id]
```
Example queries:
```promql
# Request rate (grouped by route)
sum(rate(higress_ai_stats_request_total[5m])) by (route_name)

# P95 latency (grouped by model)
histogram_quantile(0.95, sum(rate(higress_ai_stats_request_duration_seconds_bucket[5m])) by (le, model_name))

# Token consumption rate (grouped by user and token type)
sum(rate(higress_ai_stats_token_total[5m])) by (user_id, token_type)
```
2.7 Metrics retention policy
```yaml
# Prometheus configuration (prometheus.yml)
global:
  scrape_interval: 30s
  evaluation_interval: 30s

# Local retention is set with command-line flags rather than in prometheus.yml:
#   --storage.tsdb.retention.time=15d
#   --storage.tsdb.retention.size=50GB

# Remote storage (long-term retention)
remote_write:
  - url: "http://thanos-receiver:19291/api/v1/receive"
    queue_config:
      capacity: 10000
      max_shards: 200
```
3. Logging
| Component | Purpose | Key configuration | Check command |
|---|---|---|---|
| Access log | Request records | JSON format | `kubectl logs -l app=higress-gateway` |
| Promtail | Log collection | Label extraction | `kubectl logs -l app=promtail` |
| Loki | Log storage | Retention policy | `logcli labels namespace` |
3.1 Logging architecture
```mermaid
flowchart TB
    subgraph "Log producers"
        GW1[Gateway Pod 1]
        GW2[Gateway Pod 2]
        GWN[Gateway Pod N]
    end
    subgraph "Log collection"
        Promtail[Promtail DaemonSet]
    end
    subgraph "Storage"
        Loki[Loki<br/>Hot data, 7 days]
        S3[S3/OSS<br/>Cold data, 30 days]
    end
    subgraph "Query"
        Grafana[Grafana Logs]
    end
    GW1 --> Promtail
    GW2 --> Promtail
    GWN --> Promtail
    Promtail --> Loki
    Loki --> S3
    Grafana --> Loki
```
3.2 Access log configuration
JSON-format access log
```yaml
# Helm values
gateway:
  enableAccessLog: true
  accessLogFormat: |
    {
      "time": "$time_iso8601",
      "authority": "$authority",
      "method": "$request_method",
      "path": "$request_uri",
      "protocol": "$protocol",
      "status": "$status",
      "request_time": "$request_time",
      "upstream_response_time": "$upstream_response_time",
      "upstream_addr": "$upstream_addr",
      "upstream_status": "$upstream_status",
      "request_length": "$request_length",
      "bytes_sent": "$bytes_sent",
      "user_agent": "$http_user_agent",
      "x_forwarded_for": "$http_x_forwarded_for",
      "request_id": "$request_id",
      "trace_id": "$opentelemetry_trace_id",
      "span_id": "$opentelemetry_span_id",
      "route_name": "$route_name",
      "upstream_service": "$upstream_service"
    }
```
3.3 Promtail configuration (log collection)
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: logging
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080

    positions:
      filename: /tmp/positions.yaml

    clients:
      - url: http://loki:3100/loki/api/v1/push

    scrape_configs:
      # Higress Gateway logs
      - job_name: higress-gateway
        kubernetes_sd_configs:
          - role: pod
            namespaces:
              names:
                - higress-system
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            regex: higress-gateway
            action: keep
        pipeline_stages:
          - json:
              expressions:
                time: time
                authority: authority
                method: method
                path: path
                status: status
                request_time: request_time
                trace_id: trace_id
          - labels:
              status:
              method:
              authority:
```
3.4 Loki log aggregation configuration
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
  namespace: logging
data:
  loki-config.yaml: |
    auth_enabled: false
    server:
      http_listen_port: 3100

    common:
      path_prefix: /loki
      storage:
        filesystem:
          chunks_directory: /loki/chunks
      replication_factor: 1

    schema_config:
      configs:
        - from: 2024-01-01
          store: boltdb-shipper
          object_store: filesystem
          schema: v11
          index:
            prefix: index_
            period: 24h

    # Retention policy
    limits_config:
      retention_period: 168h  # 7 days
      ingestion_rate_mb: 20
      per_stream_rate_limit: 10MB

    # Compaction
    compactor:
      working_directory: /loki/compactor
      shared_store: filesystem
      retention_enabled: true
```
3.5 Log query examples
LogQL query syntax
```logql
# 5xx error logs
{namespace="higress-system", app="higress-gateway"} | json | status =~ "5.."

# Slow requests (> 1s)
{namespace="higress-system"} | json | request_time > 1

# Count by status code
sum(count_over_time({namespace="higress-system", app="higress-gateway"} | json | status != "" [5m])) by (status)

# All logs carrying a specific trace_id
{namespace="higress-system"} | json | trace_id = "abc123"

# Top 10 slowest paths over the last hour
topk(10, max_over_time({namespace="higress-system"} | json | unwrap request_time [1h]) by (path))
```
4. Tracing
| Component | Purpose | Key configuration | Check command |
|---|---|---|---|
| OTEL Collector | Data collection | Pipeline configuration | `kubectl logs -l app=otel-collector` |
| Tempo | Trace storage | Sampling rate, retention | Query the Tempo API |
| Trace Context | Context propagation | Header format | Inspect request headers |
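When inspecting request headers, it helps to know what a well-formed W3C `traceparent` value looks like: four dash-separated hex fields (version, trace ID, parent span ID, flags) of 2, 32, 16, and 2 characters. The sketch below is a simplified validator for version `00`; the type and function names are illustrative, and it deliberately skips some checks in the specification (such as rejecting uppercase hex or all-zero IDs).

```go
package main

import (
	"encoding/hex"
	"errors"
	"fmt"
	"strings"
)

// TraceParent holds the four fields of a W3C traceparent header.
type TraceParent struct {
	Version  string
	TraceID  string
	ParentID string
	Sampled  bool
}

// parseTraceParent splits and validates "version-traceid-parentid-flags".
func parseTraceParent(h string) (TraceParent, error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 {
		return TraceParent{}, errors.New("traceparent: expected 4 dash-separated fields")
	}
	wantLen := []int{2, 32, 16, 2}
	for i, p := range parts {
		if len(p) != wantLen[i] {
			return TraceParent{}, fmt.Errorf("traceparent: field %d has length %d, want %d", i, len(p), wantLen[i])
		}
		if _, err := hex.DecodeString(p); err != nil {
			return TraceParent{}, fmt.Errorf("traceparent: field %d is not hex: %w", i, err)
		}
	}
	flags, _ := hex.DecodeString(parts[3]) // already validated above
	return TraceParent{
		Version:  parts[0],
		TraceID:  parts[1],
		ParentID: parts[2],
		Sampled:  flags[0]&0x01 == 1, // lowest flag bit marks a sampled trace
	}, nil
}

func main() {
	tp, err := parseTraceParent("00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
	if err != nil {
		fmt.Println("invalid header:", err)
		return
	}
	fmt.Printf("trace=%s span=%s sampled=%v\n", tp.TraceID, tp.ParentID, tp.Sampled)
}
```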
4.1 OpenTelemetry integration architecture
```mermaid
flowchart TB
    subgraph "Application layer"
        Client[Client application]
        Backend[Backend service]
    end
    subgraph "Higress Gateway"
        OTelSDK[OpenTelemetry SDK]
        Tracer[Tracer]
        Propagator[Context Propagator]
    end
    subgraph "OTel Collector"
        Receiver[Receiver<br/>OTLP/gRPC]
        Processor[Processor<br/>Batch/Attributes]
        Exporter[Exporter]
    end
    subgraph "Backend storage"
        Tempo[Tempo]
    end
    subgraph "Visualization"
        Grafana[Grafana Trace UI]
    end
    Client -->|Inject trace context| OTelSDK
    OTelSDK --> Tracer
    Tracer --> Propagator
    Propagator -->|Propagate trace header| Backend
    Tracer -->|OTLP| Receiver
    Receiver --> Processor
    Processor --> Exporter
    Exporter --> Tempo
    Grafana --> Tempo
```
4.2 OTEL Collector configuration
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: higress-system
data:
  otel-collector.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 5s
        send_batch_size: 10000

      attributes:
        actions:
          - key: environment
            value: production
            action: insert

      memory_limiter:
        check_interval: 1s  # required by the memory_limiter processor
        limit_mib: 512

    exporters:
      otlp/tempo:
        endpoint: tempo:4317
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, attributes]
          exporters: [otlp/tempo]
```
4.3 Higress Gateway tracing configuration
```yaml
# Helm values
gateway:
  enableTracing: true
  tracingSampling: 10.0  # 10% sampling rate

telemetry:
  v2:
    enabled: true
    prometheus:
      config:
        latency: latencies
  accessLog:
    - name: otel
      typedConfig:
        "@type": type.googleapis.com/envoy.extensions.access_loggers.open_telemetry.v3.OpenTelemetryAccessLogConfig
        common_config:
          grpc_service:
            envoy_grpc:
              cluster_name: otel-collector
```
Warning: in production, a sampling rate of 10-30% is recommended. Sampling 100% of requests can cause performance problems.
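To illustrate what a 10% sampling rate means in practice, the sketch below makes a deterministic sampling decision from the trace ID, in the spirit of ratio-based samplers. It is an illustration under assumptions, not the gateway's actual sampler: the function name `shouldSample` and the choice of the low 8 bytes of the trace ID are choices made for this example.

```go
package main

import (
	"encoding/binary"
	"encoding/hex"
	"fmt"
)

// shouldSample decides from the trace ID alone, so every component that sees
// the same trace ID and ratio reaches the same verdict.
func shouldSample(traceIDHex string, ratio float64) (bool, error) {
	id, err := hex.DecodeString(traceIDHex)
	if err != nil || len(id) != 16 {
		return false, fmt.Errorf("invalid trace ID %q", traceIDHex)
	}
	if ratio >= 1 {
		return true, nil
	}
	if ratio <= 0 {
		return false, nil
	}
	// Treat the low 8 bytes as a 63-bit number and keep the trace when it
	// falls below ratio * 2^63.
	x := binary.BigEndian.Uint64(id[8:]) >> 1
	bound := uint64(ratio * (1 << 63))
	return x < bound, nil
}

func main() {
	const traceID = "0af7651916cd43dd8448eb211c80319c"
	for _, ratio := range []float64{0.1, 0.3, 1.0} {
		keep, err := shouldSample(traceID, ratio)
		if err != nil {
			fmt.Println(err)
			return
		}
		fmt.Printf("ratio=%.1f sampled=%v\n", ratio, keep)
	}
}
```

Because the decision depends only on the trace ID, raising the ratio never drops a trace that a lower ratio would have kept.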
4.4 Tempo deployment configuration
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
  namespace: logging
data:
  tempo.yaml: |
    server:
      http_listen_port: 3100

    distributor:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317

    storage:
      trace:
        backend: local
        local:
          path: /var/tempo/traces

    compactor:
      compaction:
        block_retention: 168h  # 7 days
```
4.5 Trace context propagation
Higress automatically propagates the following trace headers:
```
# W3C Trace Context (recommended)
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE

# B3 multi-header format
X-B3-TraceId: 0af7651916cd43dd8448eb211c80319c
X-B3-SpanId: b7ad6b7169203331
X-B3-Sampled: 1

# Jaeger format: {trace-id}:{span-id}:{parent-span-id}:{flags}
uber-trace-id: 0af7651916cd43dd8448eb211c80319c:b7ad6b7169203331:0:1
```
4.6 Backend service integration example
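```go
// Go service instrumented with OpenTelemetry
package main

import (
	"context"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/propagation"
	"go.opentelemetry.io/otel/sdk/resource"
	tracesdk "go.opentelemetry.io/otel/sdk/trace"
)

func initTracer(serviceName string) error {
	ctx := context.Background()

	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector.higress-system.svc.cluster.local:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return err
	}

	tp := tracesdk.NewTracerProvider(
		tracesdk.WithBatcher(exporter),
		// Follow the gateway's sampling decision; sample 10% of new root traces.
		tracesdk.WithSampler(tracesdk.ParentBased(tracesdk.TraceIDRatioBased(0.1))),
		tracesdk.WithResource(resource.NewSchemaless(attribute.String("service.name", serviceName))),
	)

	otel.SetTracerProvider(tp)
	// Read the W3C traceparent header injected by the gateway.
	otel.SetTextMapPropagator(propagation.TraceContext{})
	return nil
}

func main() {
	if err := initTracer("backend-service"); err != nil {
		log.Fatalf("init tracer: %v", err)
	}

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("Hello, World!"))
	})

	// Wrap the handler so that every request is traced automatically.
	wrappedHandler := otelhttp.NewHandler(handler, "backend-service")
	log.Fatal(http.ListenAndServe(":8080", wrappedHandler))
}
```
5. Alerting and Notification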
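| Component | Purpose | Key configuration | Check command |
|---|---|---|---|
| PrometheusRule | Alert rules | Expressions, thresholds | Open the /alerts page |
| Alertmanager | Alert routing | Receivers, grouping | `amtool alert` |
| DingTalk webhook | Notification channel | Token, template | Check the DingTalk group messages |
5.1 Alerting architecture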
```mermaid
flowchart TB
    subgraph "Monitoring data sources"
        PROM[Prometheus]
        LOKI[Loki]
    end
    subgraph "Alerting engine"
        ALERTMGR[Alertmanager]
        RULES[Alert rules]
    end
    subgraph "Notification channels"
        DING[DingTalk]
        SLACK[Slack]
        EMAIL[Email]
        WEBHOOK[Webhook]
    end
    subgraph "Alert handling"
        INCIDENT[Ticketing system]
    end
    RULES --> PROM
    PROM --> ALERTMGR
    LOKI --> ALERTMGR
    ALERTMGR --> DING
    ALERTMGR --> SLACK
    ALERTMGR --> EMAIL
    ALERTMGR --> WEBHOOK
    WEBHOOK --> INCIDENT
```
5.2 Core alert rules
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: higress-alerts
  namespace: monitoring
spec:
  groups:
    # ========== Availability alerts ==========
    - name: higress-availability
      interval: 30s
      rules:
        # Gateway Pod down
        - alert: HigressGatewayPodDown
          expr: up{job="higress-gateway"} == 0
          for: 1m
          labels:
            severity: critical
          annotations:
            summary: "Higress Gateway Pod is down"
            description: "{{ $labels.pod }} has been down for more than 1 minute"

        # High error rate
        - alert: HigressHighErrorRate
          expr: |
            (sum(rate(higress_request_total{response_code=~"5.."}[5m])) by (destination_service)
            / sum(rate(higress_request_total[5m])) by (destination_service)) > 0.05
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Service error rate is too high"
            description: "{{ $labels.destination_service }} error rate is {{ $value | humanizePercentage }}"

    # ========== Performance alerts ==========
    - name: higress-performance
      interval: 30s
      rules:
        # High latency
        - alert: HigressHighLatency
          expr: |
            histogram_quantile(0.95,
              sum(rate(higress_request_duration_milliseconds_bucket[5m])) by (le, destination_service)) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "Service latency is too high"
            description: "{{ $labels.destination_service }} P95 latency is {{ $value }}ms"

    # ========== Resource alerts ==========
    - name: higress-resources
      interval: 30s
      rules:
        # High CPU usage
        - alert: HigressHighCPU
          expr: sum(rate(process_cpu_seconds_total{pod=~"higress-gateway.*"}[5m])) by (pod) > 0.8
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Gateway CPU usage is too high"
            description: "{{ $labels.pod }} CPU usage is {{ $value | humanizePercentage }}"

        # High memory usage
        - alert: HigressHighMemory
          expr: |
            container_memory_usage_bytes{container="higress-gateway"}
            / container_spec_memory_limit_bytes{container="higress-gateway"} > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Gateway memory usage is too high"
            description: "{{ $labels.pod }} memory usage is {{ $value | humanizePercentage }}"
```
5.3 Alertmanager configuration
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
      # slack_configs below also need a Slack webhook URL, set either here
      # as slack_api_url or per receiver as api_url.

    # Alert routing tree
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'default'

      routes:
        # Critical alerts
        - match:
            severity: critical
          receiver: 'critical-alerts'

        # Warning alerts
        - match:
            severity: warning
          receiver: 'warning-alerts'

    # Alert receivers
    receivers:
      - name: 'default'
        slack_configs:
          - channel: '#alerts'

      - name: 'critical-alerts'
        webhook_configs:
          - url: 'http://dingtalk-webhook:8060/alert'
            send_resolved: true
        slack_configs:
          - channel: '#critical-alerts'

      - name: 'warning-alerts'
        webhook_configs:
          - url: 'http://dingtalk-webhook:8060/warning'
            send_resolved: true

    # Inhibition rules
    inhibit_rules:
      - source_match:
          alertname: 'HigressGatewayPodDown'
        target_match_re:
          alertname: '(HigressHighErrorRate|HigressHighLatency)'
        equal: ['pod', 'namespace']
```
5.4 Alert severity levels and response
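| Level | Name | Response time | Escalation policy | Typical scenario |
|---|---|---|---|---|
| P0 | Critical | 15 minutes | Escalate immediately to the on-call manager | Service completely unavailable, data loss |
| P1 | Major | 1 hour | Escalate to the team lead | High error rate (>5%), high latency (>2s) |
| P2 | Minor | 4 hours | Handled within the team | Moderate error rate (1-5%) |
| P3 | Informational | Next business day | Record and review | Low error rate (<1%) |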
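The error-rate thresholds in the table can be expressed as a small classification function. The sketch below is one reading of the table, with hypothetical names (`errorRatio`, `classify`); it looks only at availability and error ratio and leaves out the latency criterion for P1.

```go
package main

import "fmt"

// errorRatio returns failed/total as a fraction, or 0 when there is no traffic.
func errorRatio(failed, total float64) float64 {
	if total == 0 {
		return 0
	}
	return failed / total
}

// classify maps availability and the 5xx error ratio to a severity level.
func classify(available bool, ratio float64) string {
	switch {
	case !available:
		return "P0" // service completely unavailable
	case ratio > 0.05:
		return "P1" // high error rate (>5%)
	case ratio >= 0.01:
		return "P2" // moderate error rate (1-5%)
	default:
		return "P3" // low error rate (<1%)
	}
}

func main() {
	// Hypothetical counts over one evaluation window.
	r := errorRatio(120, 2000)
	fmt.Printf("error ratio %.1f%% -> %s\n", r*100, classify(true, r))
}
```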
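6. Dashboards
| Dashboard | Purpose | Key panels | Data source |
|---|---|---|---|
| Overview dashboard | Global status | QPS, error rate, latency | Prometheus |
| Performance dashboard | In-depth analysis | Latency distribution, heatmaps | Prometheus |
| Troubleshooting dashboard | Problem localization | Error logs, traces | Loki + Tempo |
6.1 Grafana data source configuration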
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
      # Prometheus data source
      - name: Prometheus
        type: prometheus
        access: proxy
        url: http://prometheus:9090
        isDefault: true

      # Loki data source
      - name: Loki
        type: loki
        uid: loki  # referenced by tracesToLogs below
        access: proxy
        url: http://loki:3100
        jsonData:
          derivedFields:
            - datasourceUid: tempo
              matcherRegex: "trace_id\": \"([0-9a-f]+)"
              name: TraceID
              url: '$${__value.raw}'

      # Tempo data source
      - name: Tempo
        type: tempo
        uid: tempo  # referenced by derivedFields above
        access: proxy
        url: http://tempo:3100
        jsonData:
          tracesToLogs:
            datasourceUid: loki
            filterByTraceID: true
```
6.2 Core dashboard configuration
```json
{
  "title": "Higress Gateway Overview",
  "panels": [
    {
      "title": "Current request rate",
      "type": "stat",
      "targets": [
        { "expr": "sum(rate(higress_request_total[1m]))", "legendFormat": "QPS" }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqps",
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 5000 },
              { "color": "red", "value": 10000 }
            ]
          }
        }
      }
    },
    {
      "title": "Error rate",
      "type": "gauge",
      "targets": [
        { "expr": "sum(rate(higress_request_total{response_code=~\"5..\"}[5m])) / sum(rate(higress_request_total[5m]))" }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "min": 0,
          "max": 1,
          "thresholds": {
            "steps": [
              { "color": "green", "value": 0 },
              { "color": "yellow", "value": 0.01 },
              { "color": "red", "value": 0.05 }
            ]
          }
        }
      }
    },
    {
      "title": "P95 latency",
      "type": "gauge",
      "targets": [
        { "expr": "histogram_quantile(0.95, sum(rate(higress_request_duration_milliseconds_bucket[5m])) by (le))" }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "thresholds": {
            "steps": [
              { "color": "green", "value": null },
              { "color": "yellow", "value": 500 },
              { "color": "red", "value": 1000 }
            ]
          }
        }
      }
    }
  ]
}
```
6.3 Dashboard best practices
```json
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "query": "label_values(up, namespace)",
        "includeAll": true
      },
      {
        "name": "service",
        "type": "query",
        "query": "label_values(higress_request_total{namespace=\"$namespace\"}, destination_service)",
        "multi": true,
        "includeAll": true
      }
    ]
  }
}
```
Tip: template variables let you build dynamic dashboards and reduce duplicated configuration.
7. Deployment Guide
7.1 Quick deployment
See the Quick Start chapter.
7.2 Production deployment architecture
```mermaid
flowchart TB
    subgraph "Load balancing layer"
        LB[Load Balancer]
    end
    subgraph "Prometheus cluster"
        PROM1[Prometheus 1]
        PROM2[Prometheus 2]
        THANOS[Thanos Sidecar]
    end
    subgraph "Storage layer"
        S3[S3/OSS<br/>Long-term storage]
    end
    subgraph "Query layer"
        THANOS_Q[Thanos Query]
        GRAFANA[Grafana]
    end
    LB --> PROM1
    LB --> PROM2
    PROM1 --> THANOS
    PROM2 --> THANOS
    THANOS --> S3
    THANOS_Q --> GRAFANA
```
7.3 Thanos high-availability configuration
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query  # must match spec.selector
    spec:
      containers:
        - name: thanos
          image: quay.io/thanos/thanos:latest  # pin a specific version in production
          args:
            - query
            - --query.replica-label=replica
            # Recent Thanos releases prefer --endpoint over --store; check the flag for your version.
            - --store=dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc.cluster.local
          ports:
            - containerPort: 10902
              name: http
```
7.4 Data retention policy
| Data type | Hot data | Warm data | Cold data | Storage backend |
|---|---|---|---|---|
| Metrics | 15 days | 90 days | 1 year | Prometheus + Thanos |
| Logs | 7 days | 30 days | - | Loki + S3 |
| Traces | 7 days | - | - | Tempo + S3 |
8. Operations Best Practices
8.1 Performance tuning
Prometheus tuning
```yaml
# Query tuning is done with command-line flags:
#   --query.max-concurrency=20
#   --query.timeout=2m

# Storage tuning (prometheus.yml)
storage:
  tsdb:
    out_of_order_time_window: 30m
```
Loki tuning
```yaml
limits_config:
  ingestion_rate_mb: 20
  per_stream_rate_limit: 10MB
  max_query_parallelism: 32

querier:
  max_concurrent: 32
```
8.2 Troubleshooting script
```bash
#!/bin/bash
# Troubleshooting script for the Higress observability stack

# Resolve the Prometheus pod through the label applied by the Prometheus Operator
PROM_POD=$(kubectl get pods -n monitoring -l app.kubernetes.io/name=prometheus -o name | head -n 1)

echo "=== 1. Check ServiceMonitors ==="
kubectl get servicemonitor -A

echo "=== 2. Check Loki log collection ==="
kubectl logs -n monitoring -l app=loki --tail=50

echo "=== 3. Check Tempo tracing ==="
kubectl logs -n monitoring -l app=tempo --tail=50

echo "=== 4. List scrape targets that are not up ==="
kubectl exec -n monitoring "$PROM_POD" -c prometheus -- \
  wget -qO- http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'

echo "=== 5. Check storage usage ==="
kubectl exec -n monitoring "$PROM_POD" -c prometheus -- df -h /prometheus
```
8.3 Capacity planning
| Scale | Prometheus | Loki | Tempo | Storage (30 days) |
|---|---|---|---|---|
| Small (<100 pods) | 2 vCPU, 4GB | 1 vCPU, 2GB | 1 vCPU, 2GB | 100GB |
| Medium (100-500 pods) | 4 vCPU, 8GB | 2 vCPU, 4GB | 2 vCPU, 4GB | 500GB |
| Large (>500 pods) | 8 vCPU, 16GB | 4 vCPU, 8GB | 4 vCPU, 8GB | 1TB+ |
9. FAQ
Q1: Prometheus scraping fails with "Connection refused"
Symptom:
```
Get http://10.0.0.1:15090/stats/prometheus: dial tcp 10.0.0.1:15090: connect: connection refused
```
Solution:
- Check that the Higress Gateway Pod is running
- Verify that the port is exposed correctly
- Check whether a network policy is blocking the connection
```bash
# Check Pod status
kubectl get pods -n higress-system -l app=higress-gateway

# Verify ports
kubectl get svc -n higress-system -o wide
```
Q2: Loki log queries return no results
Troubleshooting steps:
- Confirm that Promtail is running
- Check that the log labels are correct
- Verify that the log format is JSON
```bash
# Check Promtail status
kubectl logs -n monitoring -l app=promtail

# Check labels
kubectl exec -n monitoring deployment/loki -- logcli labels namespace
```
Q3: Grafana dashboards show no data
Possible causes:
- The data source is misconfigured
- The query time range is wrong
- The metric names do not match
Solution:
```bash
# Test the data source connection
# In the Grafana UI: Configuration -> Data Sources -> Test

# Verify that the metrics exist
curl http://prometheus:9090/api/v1/label/__name__/values | grep higress
```
Q4: Tempo trace data is missing
Troubleshooting steps:
- Confirm that the OTEL Collector configuration is correct
- Check the sampling rate configuration
- Verify trace context propagation
```bash
# Check the OTEL Collector configuration
kubectl get configmap -n higress-system otel-collector-config -o yaml

# Check the sampling rate
kubectl get deployment -n higress-system higress-gateway -o yaml | grep -A5 tracing
```
Q5: Alert rules do not take effect
Troubleshooting steps:
- Check that the PrometheusRule has been created
- Validate the alert rule syntax
- Inspect the alert state
```bash
# Check PrometheusRule objects
kubectl get prometheusrule -n monitoring

# Validate rule syntax
kubectl exec -n monitoring prometheus-prometheus-0 -- \
  promtool check rules /etc/prometheus/rules/*.yaml
```
Q6: Disk space is running out
Solution:
```bash
# Check disk usage
kubectl exec -n monitoring prometheus-prometheus-0 -- df -h /prometheus

# Adjust the retention time
kubectl patch prometheus -n monitoring prometheus --type=merge -p '{"spec":{"retention":"7d"}}'

# Expand the PVC
kubectl patch pvc -n monitoring prometheus-prometheus-prometheus-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'
```
Q7: Performance problems: slow queries
Optimization suggestions:
- Prometheus: increase query concurrency and enable query caching
- Loki: increase query parallelism and enable query splitting
- Grafana: reduce the number of panels, use variables, and enable caching
10. Performance Benchmarks and Cost Estimates
10.1 Performance benchmarks
Prometheus performance
| Series count | Scrape interval | Ingestion rate | Query latency (P95) | Memory usage |
|---|---|---|---|---|
| 10K series | 30s | 333 samples/s | <10ms | 500MB |
| 100K series | 30s | 3.3K samples/s | <50ms | 2GB |
| 1M series | 30s | 33K samples/s | <200ms | 8GB |
Loki performance
| Log volume | Ingestion rate | Query latency (P95) | Compression ratio | Storage required (7 days) |
|---|---|---|---|---|
| 1K logs/s | 1MB/s | <100ms | 10x | 60GB |
| 10K logs/s | 10MB/s | <500ms | 10x | 600GB |
| 100K logs/s | 100MB/s | <2s | 10x | 6TB |
10.2 Cost estimates
Small deployment (100 pods)
| Component | Resources | Monthly cost (AWS) |
|---|---|---|
| Prometheus | 2 vCPU, 4GB | ~$70 |
| Loki | 1 vCPU, 2GB | ~$40 |
| Tempo | 1 vCPU, 2GB | ~$40 |
| Storage | 200GB SSD | ~$50 |
| Total | - | ~$200/month |
Medium deployment (500 pods)
| Component | Resources | Monthly cost (AWS) |
|---|---|---|
| Prometheus | 4 vCPU, 8GB | ~$140 |
| Loki | 2 vCPU, 4GB | ~$70 |
| Tempo | 2 vCPU, 4GB | ~$70 |
| Storage | 1TB SSD | ~$200 |
| Total | - | ~$480/month |
10.3 Cost optimization suggestions
- Use S3 object storage: reduces cost by 60-70%
- Tune the data retention policy: adjust it to actual needs
- Enable downsampling: reduces storage by 80%
- Use Spot instances: reduces the cost of non-critical components by 70%
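11. Technology Selection Guide
11.1 Decision tree for choosing a monitoring stack
```mermaid
flowchart TD
    Start[Start choosing a monitoring stack] --> Q1{Team size?}
    Q1 -->|Small team, fewer than 10 people| Small[Lightweight stack]
    Q1 -->|Medium team, 10-50 people| Medium[Standard stack]
    Q1 -->|Large team, more than 50 people| Large[Enterprise stack]
    Small --> Q2{Need long-term storage?}
    Q2 -->|No| Basic[Prometheus + Grafana]
    Q2 -->|Yes| BasicThanos[Prometheus + Thanos]
    Medium --> Q3{Budget constraints?}
    Q3 -->|Strict| OpenSource[Open-source stack]
    Q3 -->|Relaxed| Hybrid[Hybrid stack]
    Large --> Q4{Compliance requirements?}
    Q4 -->|Yes| Enterprise[Enterprise stack]
    Q4 -->|No| Managed[Managed service]
```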
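The decision tree above can be written as a small function, which is handy if the choice needs to be recorded or automated. The type and function names are hypothetical, and the returned strings simply mirror the leaf nodes of the diagram.

```go
package main

import "fmt"

// Inputs captures the questions asked in the decision tree.
type Inputs struct {
	TeamSize        int
	LongTermStorage bool // asked for small teams
	StrictBudget    bool // asked for medium teams
	Compliance      bool // asked for large teams
}

// recommend walks the tree: team size first, then one follow-up question.
func recommend(in Inputs) string {
	switch {
	case in.TeamSize < 10:
		if in.LongTermStorage {
			return "Prometheus + Thanos"
		}
		return "Prometheus + Grafana"
	case in.TeamSize <= 50:
		if in.StrictBudget {
			return "Open-source stack"
		}
		return "Hybrid stack"
	default:
		if in.Compliance {
			return "Enterprise stack"
		}
		return "Managed service"
	}
}

func main() {
	fmt.Println(recommend(Inputs{TeamSize: 6, LongTermStorage: true}))
	fmt.Println(recommend(Inputs{TeamSize: 30, StrictBudget: true}))
	fmt.Println(recommend(Inputs{TeamSize: 120}))
}
```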
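11.2 Comparison of options
Logging options
| Option | Strengths | Weaknesses | Best suited for | Cost |
|---|---|---|---|---|
| Loki | Lightweight, integrates well with Grafana | Weaker query capabilities | Kubernetes-native environments | Low |
| Elasticsearch | Powerful, mature ecosystem | High resource consumption | Complex query requirements | High |
| Fluentd + S3 | Low cost, scalable | Inconvenient to query | Archival storage | Very low |
Tracing options
| Option | Strengths | Weaknesses | Best suited for | Cost |
|---|---|---|---|---|
| Tempo | Low cost, integrates with Loki | Relatively simple feature set | Small to medium scale | Low |
| Jaeger | Full-featured, CNCF project | High storage cost | Large-scale distributed systems | Medium |
| Zipkin | Lightweight, easy to get started | Limited features | Simple scenarios | Low |
11.3 Recommended configuration templates
Small team (<100 pods)
```yaml
Recommended stack:
  Metrics: Prometheus (single replica)
  Logs: Loki (single replica)
  Tracing: Tempo (single replica)
  Visualization: Grafana

Estimated cost: $200-400/month
Operational complexity: Low
```
Medium team (100-500 pods)
```yaml
Recommended stack:
  Metrics: Prometheus (HA) + Thanos
  Logs: Loki (cluster)
  Tracing: Tempo (cluster)
  Visualization: Grafana

Estimated cost: $500-1000/month
Operational complexity: Medium
```
Large team (>500 pods)
```yaml
Recommended stack:
  Metrics: Prometheus (federation) + Thanos/Cortex
  Logs: Loki (cluster) + S3
  Tracing: Tempo (cluster) + S3
  Visualization: Grafana (HA)
  Alerting: PagerDuty/OpsGenie

Estimated cost: $1000-3000/month
Operational complexity: High
```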