
A Complete Guide to the Higress Observability Architecture

This document describes a production-grade observability architecture for the Higress gateway, covering the full stack: metrics monitoring, log collection, distributed tracing, alerting, and visualization.



Table of contents:

  1. Observability architecture overview
  2. Metrics monitoring
  3. Logging
  4. Tracing
  5. Alerting and notification
  6. Visualization dashboards
  7. Deployment guide
  8. Operations best practices
  9. FAQ
  10. Performance benchmarks and cost estimates
  11. Technology selection guide

This section helps you stand up a Higress observability monitoring stack in about five minutes.

| Requirement | Version | Notes |
|---|---|---|
| Kubernetes | v1.20+ | kubectl configured |
| Helm | 3.x | package manager |
| Available memory | ≥4GB | minimum for the monitoring stack |
#!/bin/bash
# Quick-start deployment script for the Higress observability stack

# 1. Create the namespace
kubectl create namespace monitoring --dry-run=client -o yaml | kubectl apply -f -

# 2. Deploy the Prometheus Operator (includes Grafana)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --set prometheus.prometheusSpec.retention=7d \
  --set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false \
  --set grafana.adminPassword=admin \
  --set grafana.persistence.enabled=true

# 3. Deploy Loki (log collection)
helm repo add grafana https://grafana.github.io/helm-charts
helm install loki grafana/loki-stack \
  --namespace monitoring \
  --set promtail.enabled=true

# 4. Deploy Tempo (tracing)
helm install tempo grafana/tempo --namespace monitoring

# 5. Configure Higress monitoring
kubectl apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: higress-gateway
  namespace: higress-system
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: higress-gateway
  endpoints:
  - port: http-monitoring
    interval: 30s
    path: /stats/prometheus
EOF
echo "✅ Deployment complete!"
# Check the status of all components
kubectl get pods -n monitoring
# Open Grafana (username: admin, password: admin)
kubectl port-forward -n monitoring svc/prometheus-grafana 3000:80
# Visit http://localhost:3000
# Verify Prometheus scraping
kubectl port-forward -n monitoring svc/prometheus-operated 9090:9090
# Visit http://localhost:9090/targets and confirm the higress-gateway target is UP
  • Prometheus scrapes Higress metrics successfully
  • Loki collects logs normally
  • Tempo receives trace data normally
  • Grafana data sources are configured correctly
  • Alerting rules are active

Higress observability is built on the three pillars of cloud-native observability (Metrics, Logs, Traces), combined with alerting and visualization:

flowchart TB
    subgraph "Collection layer"
        GW[Higress Gateway Pods]
        Console[Higress Console]

        subgraph "In-gateway collectors"
            Statsd[Statsd Exporter]
            OTEL[OpenTelemetry Collector]
            AL[Access Log]
        end
    end

    subgraph "Processing layer"
        PROM[Prometheus<br/>metrics storage]
        LOKI[Loki<br/>log aggregation]
        TEMPO[Tempo<br/>trace storage]
    end

    subgraph "Alerting layer"
        ALERTMGR[Alertmanager]
        CHANNELS[Notification channels<br/>DingTalk/Slack/Email]
    end

    subgraph "Visualization layer"
        GRAFANA[Grafana<br/>unified dashboards]
    end

    GW --> Statsd --> PROM
    GW --> OTEL --> PROM
    OTEL --> TEMPO
    GW --> AL --> LOKI

    PROM --> ALERTMGR --> CHANNELS
    PROM --> GRAFANA
    LOKI --> GRAFANA
    TEMPO --> GRAFANA
| Pillar | Goal | Key technology | Retention | Typical use |
|---|---|---|---|---|
| Metrics | detect problems and trends | Prometheus + Statsd | 15 days (hot) + 90 days (cold) | trend analysis, capacity planning |
| Logs | locate root causes | Loki + Kafka | 7 days (hot) + 30 days (cold) | troubleshooting, auditing |
| Traces | analyze call chains | OpenTelemetry + Tempo | 7 days | performance analysis, dependency analysis |
sequenceDiagram
    participant Client
    participant GW as Higress Gateway
    participant OTEL as OTEL Collector
    participant PROM as Prometheus
    participant LOKI as Loki
    participant TEMPO as Tempo
    participant GRAFANA as Grafana

    Client->>GW: HTTP request

    par parallel collection
        GW->>GW: record request metrics
        GW->>GW: emit access log
        GW->>GW: inject trace context
    end

    GW->>OTEL: push telemetry
    OTEL->>PROM: write metrics
    OTEL->>LOKI: write logs
    OTEL->>TEMPO: write traces

    GW->>Client: return response

    Note over GRAFANA: query & analysis
    GRAFANA->>PROM: query metrics
    GRAFANA->>LOKI: query logs
    GRAFANA->>TEMPO: query traces

| Component | Purpose | Key configuration | Check command |
|---|---|---|---|
| Statsd Exporter | metric format conversion | mapping rules | curl localhost:9102/metrics |
| ServiceMonitor | automatic discovery | scrape interval | kubectl get servicemonitor |
| Prometheus | metric storage | retention | visit the /targets page |

Higress is built on Envoy, which natively supports Prometheus metric collection. Metrics are emitted over the statsd protocol and converted to the Prometheus exposition format by the Statsd Exporter.

flowchart LR
    subgraph "Higress Gateway"
        Envoy[Envoy<br/>metric generation]
        StatsdExporter[Statsd Exporter<br/>metric conversion]
    end

    subgraph "Prometheus ecosystem"
        PROM[Prometheus<br/>metric scraping]
        ALERTMGR[Alertmanager]
        GRAFANA[Grafana]
    end

    Envoy -->|statsd UDP| StatsdExporter
    StatsdExporter -->|/metrics| PROM
    PROM --> ALERTMGR
    PROM --> GRAFANA
apiVersion: v1
kind: ConfigMap
metadata:
  name: higress-statsd-mapper
  namespace: higress-system
data:
  mapper.conf: |
    mappings:
    # Request metric mapping ($1/$2 refer to the wildcard captures in match)
    - match: ingress.*.request.*
      name: "higress_request_total"
      labels:
        destination_service: "$1"
        response_code: "$2"
    # Latency metric mapping
    - match: ingress.*.duration.*
      name: "higress_request_duration_milliseconds"
      labels:
        destination_service: "$1"
        quantile: "$2"
    # Connection metric mapping
    - match: ingress.*.cx.*
      name: "higress_connection_count"
      labels:
        destination_service: "$1"
        state: "$2"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: higress-statsd-exporter
  namespace: higress-system
spec:
  replicas: 2
  selector:
    matchLabels:
      app: higress-statsd-exporter
  template:
    metadata:
      labels:
        app: higress-statsd-exporter
    spec:
      containers:
      - name: statsd-exporter
        image: prom/statsd-exporter:latest
        args:
        - --statsd.mapping-config=/etc/statsd/mapper.conf
        - --statsd.listen-udp=:9125
        - --web.listen-address=:9102
        ports:
        - containerPort: 9125
          protocol: UDP
          name: statsd
        - containerPort: 9102
          protocol: TCP
          name: metrics
        volumeMounts:
        - name: config
          mountPath: /etc/statsd
      volumes:
      - name: config
        configMap:
          name: higress-statsd-mapper
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: higress-gateway
  namespace: higress-system
  labels:
    release: prometheus  # must match the Prometheus selector
spec:
  selector:
    matchLabels:
      app: higress-gateway
  endpoints:
  - port: http-monitoring
    interval: 30s
    path: /stats/prometheus

Note: make sure the ServiceMonitor's labels match the Prometheus serviceMonitorSelector.

# Request rate (Rate)
sum(rate(higress_request_total{destination_service="backend-service"}[5m]))
# Error rate (Errors)
rate(higress_request_total{response_code=~"5.."}[5m])
# Latency (Duration), P95
histogram_quantile(0.95, sum(rate(higress_request_duration_milliseconds_bucket[5m])) by (le, destination_service))
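The RED queries above can be precomputed as Prometheus recording rules so that dashboards and alerts evaluate quickly; a minimal sketch (the rule names follow the level:metric:operations convention and are illustrative, not part of Higress):

```yaml
groups:
- name: higress-red
  interval: 30s
  rules:
  # Request rate per service
  - record: service:higress_request_total:rate5m
    expr: sum(rate(higress_request_total[5m])) by (destination_service)
  # 5xx error ratio per service
  - record: service:higress_request_errors:ratio5m
    expr: |
      sum(rate(higress_request_total{response_code=~"5.."}[5m])) by (destination_service)
      / sum(rate(higress_request_total[5m])) by (destination_service)
  # P95 latency per service
  - record: service:higress_request_duration_ms:p95_5m
    expr: histogram_quantile(0.95, sum(rate(higress_request_duration_milliseconds_bucket[5m])) by (le, destination_service))
```

Loaded via a PrometheusRule resource, these precomputed series can then be referenced directly in panels and alert expressions.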

USE metrics (Utilization, Saturation, Errors)

# CPU utilization
rate(process_cpu_seconds_total{pod=~"higress-gateway.*"}[5m])
# Memory usage
container_memory_usage_bytes{container="higress-gateway"}
# Connection count (saturation)
higress_connection_count{state="active"}
# Network traffic
rate(container_network_receive_bytes_total{container="higress-gateway"}[5m])
rate(container_network_transmit_bytes_total{container="higress-gateway"}[5m])
# Requests by host (authority)
sum(rate(higress_request_total[5m])) by (authority)
# P95 latency by service
histogram_quantile(0.95, sum(rate(higress_request_duration_milliseconds_bucket[5m])) by (le, destination_service))
# Share of requests by status code
sum(rate(higress_request_total[5m])) by (response_code) / scalar(sum(rate(higress_request_total[5m])))
// Wasm plugin example: a custom business metric
package main

import (
	"github.com/tetratelabs/proxy-wasm-go-sdk/proxywasm"
	"github.com/tetratelabs/proxy-wasm-go-sdk/proxywasm/types"
)

func main() {
	proxywasm.SetVMContext(&vmContext{})
}

type vmContext struct {
	types.DefaultVMContext
}

func (*vmContext) NewPluginContext(contextID uint32) types.PluginContext {
	return &pluginContext{}
}

type pluginContext struct {
	types.DefaultPluginContext
	apiCalls proxywasm.MetricCounter
}

func (p *pluginContext) OnPluginStart(pluginConfigurationSize int) types.OnPluginStartStatus {
	// Define the custom counter once, at plugin start
	p.apiCalls = proxywasm.DefineCounterMetric("higress_custom_api_call_count")
	return types.OnPluginStartStatusOK
}

func (p *pluginContext) NewHttpContext(contextID uint32) types.HttpContext {
	return &httpContext{plugin: p}
}

type httpContext struct {
	types.DefaultHttpContext
	plugin *pluginContext
}

func (h *httpContext) OnHttpRequestHeaders(numHeaders int, endOfStream bool) types.Action {
	// Custom metric: count API calls that carry an x-api-name header
	if apiName, err := proxywasm.GetHttpRequestHeader("x-api-name"); err == nil && apiName != "" {
		h.plugin.apiCalls.Increment(1)
	}
	return types.ActionContinue
}

Higress ships several AI-related plugins with built-in Prometheus metric reporting.

apiVersion: higress.io/v1
kind: WasmPlugin
metadata:
  name: ai-statistics
  namespace: higress-system
spec:
  url: file:///etc/wasm-plugins/ai-statistics.wasm
  phase: AUTHN
  priority: 100
  config:
    enableMetrics: true
    metricPrefix: "higress_ai_stats"
    metrics:
      requestCount:
        enabled: true
        name: "request_total"
        type: "counter"
        labels: [route_name, upstream_host, model_name, status_code]
      requestDuration:
        enabled: true
        name: "request_duration_seconds"
        type: "histogram"
        buckets: [0.001, 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
      tokenCount:
        enabled: true
        name: "token_total"
        type: "counter"
        labels: [model_name, token_type, user_id]

Example queries:

# Request rate (by route)
rate(higress_ai_stats_request_total[5m])
# P95 latency (by model)
histogram_quantile(0.95, sum(rate(higress_ai_stats_request_duration_seconds_bucket[5m])) by (le, model_name))
# Token consumption rate (by user)
sum(rate(higress_ai_stats_token_total[5m])) by (user_id, token_type)
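The token counters also enable rough cost accounting per model; a sketch (the $0.002-per-1K-tokens price below is a placeholder, substitute your provider's real pricing):

```
# Approximate token spend per model over the last 24h,
# assuming a hypothetical price of $0.002 per 1K tokens
sum(increase(higress_ai_stats_token_total[24h])) by (model_name) / 1000 * 0.002
```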
# Prometheus configuration
global:
  scrape_interval: 30s
  evaluation_interval: 30s
# Retention is set via launch flags, not prometheus.yml:
#   --storage.tsdb.retention.time=15d
#   --storage.tsdb.retention.size=50GB
# Remote storage (long-term retention)
remote_write:
- url: "http://thanos-receiver:19291/api/v1/receive"
  queue_config:
    capacity: 10000
    max_shards: 200

| Component | Purpose | Key configuration | Check command |
|---|---|---|---|
| Access log | request records | JSON format | kubectl logs -l app=higress-gateway |
| Promtail | log collection | label extraction | kubectl logs -l app=promtail |
| Loki | log storage | retention policy | logcli labels namespace |
flowchart TB
    subgraph "Log producers"
        GW1[Gateway Pod 1]
        GW2[Gateway Pod 2]
        GWN[Gateway Pod N]
    end

    subgraph "Collection layer"
        Promtail[Promtail DaemonSet]
    end

    subgraph "Storage layer"
        Loki[Loki<br/>hot data, 7 days]
        S3[S3/OSS<br/>cold data, 30 days]
    end

    subgraph "Query layer"
        Grafana[Grafana Logs]
    end

    GW1 --> Promtail
    GW2 --> Promtail
    GWN --> Promtail
    Promtail --> Loki
    Loki --> S3
    Grafana --> Loki
# Helm values
gateway:
  enableAccessLog: true
  accessLogFormat: |
    {
      "time": "$time_iso8601",
      "authority": "$authority",
      "method": "$request_method",
      "path": "$request_uri",
      "protocol": "$protocol",
      "status": "$status",
      "request_time": "$request_time",
      "upstream_response_time": "$upstream_response_time",
      "upstream_addr": "$upstream_addr",
      "upstream_status": "$upstream_status",
      "request_length": "$request_length",
      "bytes_sent": "$bytes_sent",
      "user_agent": "$http_user_agent",
      "x_forwarded_for": "$http_x_forwarded_for",
      "request_id": "$request_id",
      "trace_id": "$opentelemetry_trace_id",
      "span_id": "$opentelemetry_span_id",
      "route_name": "$route_name",
      "upstream_service": "$upstream_service"
    }
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-config
  namespace: logging
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
    positions:
      filename: /tmp/positions.yaml
    clients:
    - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
    # Higress gateway logs
    - job_name: higress-gateway
      kubernetes_sd_configs:
      - role: pod
        namespaces:
          names:
          - higress-system
      relabel_configs:
      - source_labels: [__meta_kubernetes_pod_label_app]
        regex: higress-gateway
        action: keep
      pipeline_stages:
      - json:
          expressions:
            time: time
            authority: authority
            method: method
            path: path
            status: status
            request_time: request_time
            trace_id: trace_id
      - labels:
          status:
          method:
          authority:
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
  namespace: logging
data:
  loki-config.yaml: |
    auth_enabled: false
    server:
      http_listen_port: 3100
    common:
      path_prefix: /loki
      storage:
        filesystem:
          chunks_directory: /loki/chunks
      replication_factor: 1
    schema_config:
      configs:
      - from: 2024-01-01
        store: boltdb-shipper
        object_store: filesystem
        schema: v11
        index:
          prefix: index_
          period: 24h
    # Retention policy
    limits_config:
      retention_period: 168h  # 7 days
      ingestion_rate_mb: 20
      per_stream_rate_limit: 10MB
    # Log compaction
    compactor:
      working_directory: /loki/compactor
      shared_store: filesystem
      retention_enabled: true
# Find 5xx error logs
{namespace="higress-system", app="higress-gateway"} | json | status =~ "5.."
# Find slow requests (>1s)
{namespace="higress-system"} | json | request_time > 1
# Count requests by status code
sum(count_over_time({namespace="higress-system", app="higress-gateway"} | json | status != "" [5m])) by (status)
# Find all logs carrying a given trace_id
{namespace="higress-system"} |= `"trace_id": "abc123"`
# Top 10 streams by accumulated request time
topk(10, sum_over_time({namespace="higress-system"} | json | unwrap request_time [1h]))

| Component | Purpose | Key configuration | Check command |
|---|---|---|---|
| OTEL Collector | data collection | pipeline configuration | kubectl logs -l app=otel-collector |
| Tempo | trace storage | sampling rate, retention | Tempo API query |
| Trace context | context propagation | header format | inspect request headers |
flowchart TB
    subgraph "Application layer"
        Client[Client application]
        Backend[Backend service]
    end

    subgraph "Higress Gateway"
        OTelSDK[OpenTelemetry SDK]
        Tracer[Tracer]
        Propagator[Context Propagator]
    end

    subgraph "OTel Collector"
        Receiver[Receiver<br/>OTLP/gRPC]
        Processor[Processor<br/>Batch/Attributes]
        Exporter[Exporter]
    end

    subgraph "Backend storage"
        Tempo[Tempo]
    end

    subgraph "Visualization"
        Grafana[Grafana Trace UI]
    end

    Client -->|inject trace context| OTelSDK
    OTelSDK --> Tracer
    Tracer --> Propagator
    Propagator -->|propagate trace headers| Backend

    Tracer -->|OTLP| Receiver
    Receiver --> Processor
    Processor --> Exporter
    Exporter --> Tempo

    Grafana --> Tempo
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
  namespace: higress-system
data:
  otel-collector.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 5s
        send_batch_size: 10000
      attributes:
        actions:
        - key: environment
          value: production
          action: insert
      memory_limiter:
        limit_mib: 512
    exporters:
      otlp/tempo:
        endpoint: tempo:4317
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch, attributes]
          exporters: [otlp/tempo]
# Helm values
gateway:
  enableTracing: true
  tracingSampling: 10.0  # 10% sampling rate
telemetry:
  v2:
    enabled: true
    prometheus:
      config:
        latency: latencies
accessLog:
- name: otel
  typedConfig:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.open_telemetry.v3.OpenTelemetryAccessLogConfig
    common_config:
      grpc_service:
        envoy_grpc:
          cluster_name: otel-collector

Warning: in production, a sampling rate of 10-30% is recommended; 100% sampling can cause performance problems.

apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
  namespace: logging
data:
  tempo.yaml: |
    server:
      http_listen_port: 3100
    distributor:
      receivers:
        otlp:
          protocols:
            grpc:
              endpoint: 0.0.0.0:4317
    storage:
      trace:
        backend: local
        local:
          path: /var/tempo/traces
    compactor:
      compaction:
        block_retention: 168h  # 7 days

Higress automatically propagates the following trace headers:

# W3C Trace Context (recommended)
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
# B3 multi-header format
X-B3-TraceId: 0af7651916cd43dd8448eb211c80319c
X-B3-SpanId: b7ad6b7169203331
X-B3-Sampled: 1
# Jaeger format (trace-id:span-id:parent-span-id:flags)
uber-trace-id: 0af7651916cd43dd8448eb211c80319c:b7ad6b7169203331:0:1
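To sanity-check propagation end to end, it can help for a backend service to parse the W3C traceparent header it receives; a minimal sketch (the parsing follows the W3C Trace Context field layout shown above — it is not a Higress API):

```go
package main

import (
	"fmt"
	"strings"
)

// parseTraceparent splits a W3C traceparent header into its four fields:
// version, trace-id, parent/span-id, and trace-flags.
func parseTraceparent(h string) (version, traceID, spanID, flags string, err error) {
	parts := strings.Split(h, "-")
	if len(parts) != 4 || len(parts[1]) != 32 || len(parts[2]) != 16 {
		return "", "", "", "", fmt.Errorf("malformed traceparent: %q", h)
	}
	return parts[0], parts[1], parts[2], parts[3], nil
}

func main() {
	_, traceID, spanID, flags, err := parseTraceparent(
		"00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01")
	if err != nil {
		panic(err)
	}
	// A trace-flags value of "01" means the sampled bit is set
	fmt.Println(traceID, spanID, flags)
}
```

Logging the parsed trace ID alongside application logs makes it easy to confirm that the ID matches what Grafana shows for the same request.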
// Integrating a Go service with OpenTelemetry
package main

import (
	"context"
	"log"
	"net/http"

	"go.opentelemetry.io/contrib/instrumentation/net/http/otelhttp"
	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	tracesdk "go.opentelemetry.io/otel/sdk/trace"
)

func initTracer(serviceName string) error {
	ctx := context.Background()
	exporter, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("otel-collector.higress-system.svc.cluster.local:4317"),
		otlptracegrpc.WithInsecure(),
	)
	if err != nil {
		return err
	}
	tp := tracesdk.NewTracerProvider(
		tracesdk.WithBatcher(exporter),
		tracesdk.WithSampler(tracesdk.TraceIDRatioBased(0.1)), // 10% sampling
	)
	otel.SetTracerProvider(tp)
	return nil
}

func main() {
	if err := initTracer("backend-service"); err != nil {
		log.Fatalf("init tracer: %v", err)
	}
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Write([]byte("Hello, World!"))
	})
	// Wrap the handler for automatic tracing
	wrappedHandler := otelhttp.NewHandler(handler, "backend-service")
	log.Fatal(http.ListenAndServe(":8080", wrappedHandler))
}

| Component | Purpose | Key configuration | Check command |
|---|---|---|---|
| PrometheusRule | alerting rules | expressions, thresholds | visit the /alerts page |
| Alertmanager | alert routing | receivers, grouping | amtool alert |
| DingTalk webhook | notification channel | token, template | check the DingTalk group |
flowchart TB
    subgraph "Data sources"
        PROM[Prometheus]
        LOKI[Loki]
    end

    subgraph "Alerting engine"
        ALERTMGR[Alertmanager]
        RULES[Alerting rules]
    end

    subgraph "Notification channels"
        DING[DingTalk]
        SLACK[Slack]
        EMAIL[Email]
        WEBHOOK[Webhook]
    end

    subgraph "Alert handling"
        INCIDENT[Ticketing system]
    end

    RULES --> PROM
    PROM --> ALERTMGR
    LOKI --> ALERTMGR
    ALERTMGR --> DING
    ALERTMGR --> SLACK
    ALERTMGR --> EMAIL
    ALERTMGR --> WEBHOOK
    WEBHOOK --> INCIDENT
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: higress-alerts
  namespace: monitoring
spec:
  groups:
  # ========== Availability ==========
  - name: higress-availability
    interval: 30s
    rules:
    # Gateway pod down
    - alert: HigressGatewayPodDown
      expr: up{job="higress-gateway"} == 0
      for: 1m
      labels:
        severity: critical
      annotations:
        summary: "Higress Gateway pod down"
        description: "{{ $labels.pod }} has been down for more than 1 minute"
    # High error rate
    - alert: HigressHighErrorRate
      expr: |
        (sum(rate(higress_request_total{response_code=~"5.."}[5m])) by (destination_service)
        / sum(rate(higress_request_total[5m])) by (destination_service)) > 0.05
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High service error rate"
        description: "{{ $labels.destination_service }} error rate {{ $value | humanizePercentage }}"
  # ========== Performance ==========
  - name: higress-performance
    interval: 30s
    rules:
    # High latency
    - alert: HigressHighLatency
      expr: |
        histogram_quantile(0.95, sum(rate(higress_request_duration_milliseconds_bucket[5m])) by (le, destination_service)) > 1000
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High service latency"
        description: "{{ $labels.destination_service }} P95 latency {{ $value }}ms"
  # ========== Resources ==========
  - name: higress-resources
    interval: 30s
    rules:
    # High CPU usage
    - alert: HigressHighCPU
      expr: sum(rate(process_cpu_seconds_total{pod=~"higress-gateway.*"}[5m])) by (pod) > 0.8
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High gateway CPU usage"
        description: "{{ $labels.pod }} CPU usage {{ $value | humanizePercentage }}"
    # High memory usage
    - alert: HigressHighMemory
      expr: |
        container_memory_usage_bytes{container="higress-gateway"}
        / container_spec_memory_limit_bytes{container="higress-gateway"} > 0.85
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "High gateway memory usage"
        description: "{{ $labels.pod }} memory usage {{ $value | humanizePercentage }}"
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
    # Alert routing tree
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'default'
      routes:
      # Critical alerts
      - match:
          severity: critical
        receiver: 'critical-alerts'
      # Warning alerts
      - match:
          severity: warning
        receiver: 'warning-alerts'
    # Receivers
    receivers:
    - name: 'default'
      slack_configs:
      - channel: '#alerts'
    - name: 'critical-alerts'
      webhook_configs:
      - url: 'http://dingtalk-webhook:8060/alert'
        send_resolved: true
      slack_configs:
      - channel: '#critical-alerts'
    - name: 'warning-alerts'
      webhook_configs:
      - url: 'http://dingtalk-webhook:8060/warning'
        send_resolved: true
    # Inhibition rules
    inhibit_rules:
    - source_match:
        alertname: 'HigressGatewayPodDown'
      target_match_re:
        alertname: '(HigressHighErrorRate|HigressHighLatency)'
      equal: ['pod', 'namespace']
| Level | Name | Response time | Escalation policy | Typical scenario |
|---|---|---|---|---|
| P0 | critical | 15 minutes | escalate immediately to the on-call manager | complete outage, data loss |
| P1 | major | 1 hour | escalate to the team lead | high error rate (>5%), high latency (>2s) |
| P2 | minor | 4 hours | handled within the team | moderate error rate (1-5%) |
| P3 | informational | next business day | record and review | low error rate (<1%) |

| Dashboard | Purpose | Key panels | Data source |
|---|---|---|---|
| Overview | global status | QPS, error rate, latency | Prometheus |
| Performance | deep analysis | latency distributions, heatmaps | Prometheus |
| Troubleshooting | problem localization | error logs, traces | Loki + Tempo |
apiVersion: v1
kind: ConfigMap
metadata:
  name: grafana-datasources
  namespace: monitoring
data:
  datasources.yaml: |
    apiVersion: 1
    datasources:
    # Prometheus data source
    - name: Prometheus
      type: prometheus
      access: proxy
      url: http://prometheus:9090
      isDefault: true
    # Loki data source
    - name: Loki
      type: loki
      access: proxy
      url: http://loki:3100
      jsonData:
        derivedFields:
        - datasourceUid: tempo
          matcherRegex: "trace_id\": \"([0-9a-f]+)"
          name: TraceID
    # Tempo data source
    - name: Tempo
      type: tempo
      access: proxy
      url: http://tempo:3100
      jsonData:
        tracesToLogs:
          datasourceUid: loki
          filterByTraceID: true
{
  "title": "Higress Gateway Overview",
  "panels": [
    {
      "title": "Current request rate",
      "type": "stat",
      "targets": [
        { "expr": "sum(rate(higress_request_total[1m]))", "legendFormat": "QPS" }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "reqps",
          "thresholds": {
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 5000},
              {"color": "red", "value": 10000}
            ]
          }
        }
      }
    },
    {
      "title": "Error rate",
      "type": "gauge",
      "targets": [
        { "expr": "sum(rate(higress_request_total{response_code=~\"5..\"}[5m])) / sum(rate(higress_request_total[5m]))" }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "percentunit",
          "min": 0,
          "max": 1,
          "thresholds": {
            "steps": [
              {"color": "green", "value": 0},
              {"color": "yellow", "value": 0.01},
              {"color": "red", "value": 0.05}
            ]
          }
        }
      }
    },
    {
      "title": "P95 latency",
      "type": "gauge",
      "targets": [
        { "expr": "histogram_quantile(0.95, sum(rate(higress_request_duration_milliseconds_bucket[5m])) by (le))" }
      ],
      "fieldConfig": {
        "defaults": {
          "unit": "ms",
          "thresholds": {
            "steps": [
              {"color": "green", "value": null},
              {"color": "yellow", "value": 500},
              {"color": "red", "value": 1000}
            ]
          }
        }
      }
    }
  ]
}
{
  "templating": {
    "list": [
      {
        "name": "namespace",
        "type": "query",
        "query": "label_values(up, namespace)",
        "includeAll": true
      },
      {
        "name": "service",
        "type": "query",
        "query": "label_values(higress_request_total{namespace=\"$namespace\"}, destination_service)",
        "multi": true,
        "includeAll": true
      }
    ]
  }
}

Tip: template variables let you build dynamic dashboards and avoid duplicated configuration.


See the Quick Start section above for details.

flowchart TB
    subgraph "Load balancing layer"
        LB[Load Balancer]
    end

    subgraph "Prometheus cluster"
        PROM1[Prometheus 1]
        PROM2[Prometheus 2]
        THANOS[Thanos Sidecar]
    end

    subgraph "Storage layer"
        S3[S3/OSS<br/>long-term storage]
    end

    subgraph "Query layer"
        THANOS_Q[Thanos Query]
        GRAFANA[Grafana]
    end

    LB --> PROM1
    LB --> PROM2
    PROM1 --> THANOS
    PROM2 --> THANOS
    THANOS --> S3
    GRAFANA --> THANOS_Q
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thanos-query
  namespace: monitoring
spec:
  replicas: 2
  selector:
    matchLabels:
      app: thanos-query
  template:
    metadata:
      labels:
        app: thanos-query
    spec:
      containers:
      - name: thanos
        image: quay.io/thanos/thanos:latest
        args:
        - query
        - --query.replica-label=replica
        - --store=dnssrv+_grpc._tcp.prometheus-operated.monitoring.svc.cluster.local
        ports:
        - containerPort: 10902
          name: http
| Data type | Hot | Warm | Cold | Storage backend |
|---|---|---|---|---|
| Metrics | 15 days | 90 days | 1 year | Prometheus + Thanos |
| Logs | 7 days | 30 days | - | Loki + S3 |
| Traces | 7 days | - | - | Tempo + S3 |
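With Thanos, the hot/warm/cold metric tiers above map onto compactor retention flags, one per downsampling resolution; a sketch of the relevant container args (the flag values mirror the table, the surrounding deployment is assumed):

```yaml
# Thanos compactor: tiered retention via downsampling resolutions
args:
- compact
- --retention.resolution-raw=15d  # raw samples (hot)
- --retention.resolution-5m=90d   # 5m downsampled (warm)
- --retention.resolution-1h=1y    # 1h downsampled (cold)
- --objstore.config-file=/etc/thanos/objstore.yaml
```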

# Query tuning (Prometheus)
query:
  max-concurrency: 20
  timeout: 2m
# Storage tuning (Prometheus)
storage:
  tsdb:
    out_of_order_time_window: 30m
# Ingestion and query tuning (Loki)
limits_config:
  ingestion_rate_mb: 20
  per_stream_rate_limit: 10MB
  max_query_parallelism: 32
querier:
  max_concurrent_queries: 32
#!/bin/bash
# Higress observability troubleshooting script
echo "=== 1. Check Prometheus scrape configuration ==="
kubectl get servicemonitor -n monitoring
echo "=== 2. Check Loki log collection ==="
kubectl logs -n monitoring -l app=loki --tail=50
echo "=== 3. Check Tempo tracing ==="
kubectl logs -n monitoring -l app=tempo --tail=50
echo "=== 4. List unhealthy scrape targets ==="
kubectl exec -n monitoring deployment/prometheus -- \
  wget -qO- http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | select(.health != "up")'
echo "=== 5. Check storage usage ==="
kubectl exec -n monitoring prometheus-prometheus-0 -- df -h /prometheus
| Scale | Prometheus | Loki | Tempo | Storage (30 days) |
|---|---|---|---|---|
| Small (<100 pods) | 2 vCPU, 4GB | 1 vCPU, 2GB | 1 vCPU, 2GB | 100GB |
| Medium (100-500 pods) | 4 vCPU, 8GB | 2 vCPU, 4GB | 2 vCPU, 4GB | 500GB |
| Large (>500 pods) | 8 vCPU, 16GB | 4 vCPU, 8GB | 4 vCPU, 8GB | 1TB+ |
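As a rough cross-check of the sizing above, Prometheus memory and ingest rate can be estimated from the active series count; a back-of-the-envelope sketch (the ~8KB-per-series constant is a common rule of thumb, not a measured Higress figure):

```go
package main

import "fmt"

// estimatePrometheusMemoryGB gives a rough memory estimate for a given
// number of active series, assuming ~8KB of resident memory per series.
func estimatePrometheusMemoryGB(activeSeries int) float64 {
	const bytesPerSeries = 8 * 1024
	return float64(activeSeries) * bytesPerSeries / (1 << 30)
}

// samplesPerSecond converts an active series count and scrape interval
// into the resulting ingest rate.
func samplesPerSecond(activeSeries int, scrapeIntervalSec int) float64 {
	return float64(activeSeries) / float64(scrapeIntervalSec)
}

func main() {
	// 1M series at a 30s scrape interval, as in the benchmark section
	fmt.Printf("memory ≈ %.1f GB\n", estimatePrometheusMemoryGB(1_000_000))
	fmt.Printf("ingest ≈ %.0f samples/s\n", samplesPerSecond(1_000_000, 30))
}
```

For 1M series this lands near the 8GB / 33K samples/s figures quoted in the benchmark section; real usage varies with label churn and query load.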

Q1: Prometheus scraping fails with "Connection refused"

Symptom:

Get http://10.0.0.1:15090/stats/prometheus: dial tcp 10.0.0.1:15090: connect: connection refused

Solution:

  1. Check that the Higress Gateway pods are running
  2. Verify that the port is exposed correctly
  3. Check whether a network policy is blocking the connection

# Check pod status
kubectl get pods -n higress-system -l app=higress-gateway
# Verify the ports
kubectl get svc -n higress-system -o wide

Q2: Loki is not collecting logs

Troubleshooting steps:

  1. Confirm Promtail is running
  2. Check that the log labels are correct
  3. Verify the log format is JSON

# Check Promtail status
kubectl logs -n monitoring -l app=promtail
# Inspect the labels
kubectl exec -n monitoring deployment/loki -- logcli labels namespace

Q3: Grafana dashboards show no data

Possible causes:

  1. Data source misconfiguration
  2. Incorrect query time range
  3. Metric name mismatch

Solution:

# Test the data source connection:
# Grafana UI -> Configuration -> Data Sources -> Test
# Verify the metrics exist
curl http://prometheus:9090/api/v1/label/__name__/values | grep higress

Q4: Traces are not showing up in Tempo

Troubleshooting steps:

  1. Confirm the OTEL Collector configuration is correct
  2. Check the sampling rate configuration
  3. Verify trace context propagation

# Inspect the OTEL Collector configuration
kubectl get configmap -n higress-system otel-collector-config -o yaml
# Check the sampling rate
kubectl get deployment -n higress-system higress-gateway -o yaml | grep -A5 tracing

Q5: Alerts are not firing

Troubleshooting steps:

  1. Check that the PrometheusRule was created
  2. Validate the alerting rule syntax
  3. Inspect alert state

# Check the PrometheusRule
kubectl get prometheusrule -n monitoring
# Validate rule syntax
kubectl exec -n monitoring prometheus-prometheus-0 -- \
  promtool check rules /etc/prometheus/rules/*.yaml

Q6: Prometheus is running out of disk space

Solution:

# Check disk usage
kubectl exec -n monitoring prometheus-prometheus-0 -- df -h /prometheus
# Reduce the retention period
kubectl patch prometheus -n monitoring prometheus --type=merge -p '
spec:
  retention: 7d
'
# Expand the PVC
kubectl patch pvc -n monitoring prometheus-prometheus-prometheus-0 \
  -p '{"spec":{"resources":{"requests":{"storage":"100Gi"}}}}'

Optimization suggestions:

  1. Prometheus: increase query concurrency, enable query caching
  2. Loki: increase query parallelism, enable query splitting
  3. Grafana: reduce panel counts, use template variables, enable caching

| Series count | Scrape interval | Ingest rate | Query latency (P95) | Memory usage |
|---|---|---|---|---|
| 10K series | 30s | 333 samples/s | <10ms | 500MB |
| 100K series | 30s | 3.3K samples/s | <50ms | 2GB |
| 1M series | 30s | 33K samples/s | <200ms | 8GB |
| Log volume | Ingest rate | Query latency (P95) | Compression ratio | Storage (7 days) |
|---|---|---|---|---|
| 1K logs/s | 1MB/s | <100ms | 10x | 60GB |
| 10K logs/s | 10MB/s | <500ms | 10x | 600GB |
| 100K logs/s | 100MB/s | <2s | 10x | 6TB |
Small deployment (estimate):

| Component | Resources | Monthly cost (AWS) |
|---|---|---|
| Prometheus | 2 vCPU, 4GB | ~$70 |
| Loki | 1 vCPU, 2GB | ~$40 |
| Tempo | 1 vCPU, 2GB | ~$40 |
| Storage | 200GB SSD | ~$50 |
| Total | - | ~$200/month |
Medium deployment (estimate):

| Component | Resources | Monthly cost (AWS) |
|---|---|---|
| Prometheus | 4 vCPU, 8GB | ~$140 |
| Loki | 2 vCPU, 4GB | ~$70 |
| Tempo | 2 vCPU, 4GB | ~$70 |
| Storage | 1TB SSD | ~$200 |
| Total | - | ~$480/month |
  1. Use S3 object storage - cuts storage cost by 60-70%
  2. Tune data retention policies to actual needs
  3. Enable downsampling - reduces storage by up to 80%
  4. Use Spot instances for non-critical components - roughly 70% cheaper

flowchart TD
    Start[Choose a monitoring stack] --> Q1{Team size?}

    Q1 -->|Small, <10 people| Small[Lightweight stack]
    Q1 -->|Medium, 10-50 people| Medium[Standard stack]
    Q1 -->|Large, >50 people| Large[Enterprise stack]

    Small --> Q2{Need long-term storage?}
    Q2 -->|No| Basic[Prometheus + Grafana]
    Q2 -->|Yes| BasicThanos[Prometheus + Thanos]

    Medium --> Q3{Budget constraints?}
    Q3 -->|Tight| OpenSource[Open-source stack]
    Q3 -->|Flexible| Hybrid[Hybrid stack]

    Large --> Q4{Compliance requirements?}
    Q4 -->|Yes| Enterprise[Enterprise-grade stack]
    Q4 -->|No| Managed[Managed service]
| Option | Strengths | Weaknesses | Best fit | Cost |
|---|---|---|---|---|
| Loki | lightweight, integrates well with Grafana | weaker query capabilities | Kubernetes-native environments | |
| Elasticsearch | powerful, mature ecosystem | resource-hungry | complex query requirements | |
| Fluentd + S3 | low cost, scalable | inconvenient to query | archival storage | very low |
| Option | Strengths | Weaknesses | Best fit | Cost |
|---|---|---|---|---|
| Tempo | low cost, integrates with Loki | relatively simple feature set | small to medium scale | |
| Jaeger | full-featured, CNCF project | high storage cost | large-scale distributed systems | |
| Zipkin | lightweight, easy to adopt | limited features | simple scenarios | |
Recommended stack (small teams):
Metrics: Prometheus (single replica)
Logs: Loki (single replica)
Traces: Tempo (single replica)
Visualization: Grafana
Estimated cost: $200-400/month
Ops complexity: low
Recommended stack (medium teams):
Metrics: Prometheus (HA) + Thanos
Logs: Loki (clustered)
Traces: Tempo (clustered)
Visualization: Grafana
Estimated cost: $500-1000/month
Ops complexity: medium
Recommended stack (large teams):
Metrics: Prometheus (federated) + Thanos/Cortex
Logs: Loki (clustered) + S3
Traces: Tempo (clustered) + S3
Visualization: Grafana (HA)
Alerting: PagerDuty/OpsGenie
Estimated cost: $1000-3000/month
Ops complexity: high