The following are the core steps for configuring Kafka monitoring and alerting on Debian, based on the mainstream toolchain (kafka_exporter + Prometheus + Grafana):
# Install Docker and docker-compose (used to deploy kafka_exporter)
sudo apt update && sudo apt install -y docker.io docker-compose
sudo systemctl start docker && sudo systemctl enable docker
# Install Prometheus (metrics collection)
wget https://github.com/prometheus/prometheus/releases/download/v2.44.0/prometheus-2.44.0.linux-amd64.tar.gz
tar -zxvf prometheus-2.44.0.linux-amd64.tar.gz
cd prometheus-2.44.0.linux-amd64 && ./prometheus --config.file=prometheus.yml &
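# Install Grafana (visualization); the grafana package comes from Grafana's APT repository, which must be configured first
sudo apt install -y grafana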
sudo systemctl start grafana-server && sudo systemctl enable grafana-server
# Pull the image and create the docker-compose configuration
docker pull bitnami/kafka-exporter:latest
cat <<EOF > docker-compose.yml
version: '3.1'
services:
  kafka-exporter:
    image: bitnami/kafka-exporter:latest
    command: "--kafka.server=<KAFKA_BROKER_IP>:9092 --kafka.version=3.5.2"
    ports:
      - "9310:9308"
EOF
# Start the service
docker-compose up -d
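Once the container is up, it may help to confirm that the exporter actually serves Kafka metrics before wiring it into Prometheus. The sketch below assumes the host port 9310 from the compose file and that curl is installed; has_kafka_metrics is a hypothetical helper name, and kafka_brokers is the broker-count metric that kafka_exporter exposes.

```shell
# Hypothetical helper: succeeds if the metrics text on stdin contains kafka_brokers.
has_kafka_metrics() {
  grep -q '^kafka_brokers'
}

# Assumption: the exporter is published on host port 9310 (see docker-compose.yml).
if curl -sf http://localhost:9310/metrics | has_kafka_metrics; then
  echo "kafka-exporter is serving Kafka metrics"
else
  echo "no kafka_brokers metric found; inspect: docker-compose logs kafka-exporter"
fi
```

If the check fails, the exporter usually could not reach the broker address passed in --kafka.server, and the container logs are the first place to look.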
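Replace <KAFKA_BROKER_IP> with the actual broker address. If there are multiple brokers, list each one by repeating the --kafka.server flag.
Edit the Prometheus configuration file prometheus.yml: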
scrape_configs:
  - job_name: 'kafka-exporter'
    metrics_path: '/metrics'
    scrape_interval: 15s
    static_configs:
      - targets: ['localhost:9310']  # add an IP:port entry for each additional exporter instance
Add the rule file path to prometheus.yml:
rule_files:
  - "alert-rules.yml"
Create the alert-rules.yml file with the following example rules:
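groups:
  - name: kafka_alerts
    rules:
      # Broker-down alert (up == 0 fires when the kafka-exporter target cannot be scraped)
      - alert: KafkaBrokerDown
        expr: up{job="kafka-exporter"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Kafka broker abnormal"
          description: "Broker {{ $labels.instance }} has been down for more than 2 minutes"
      # Message backlog alert (kafka_exporter labels the consumer group as consumergroup)
      - alert: KafkaMessageBacklog
        expr: sum(kafka_consumergroup_lag_sum) by (consumergroup, topic) > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Message backlog alert"
          description: "Consumer group {{ $labels.consumergroup }} on topic {{ $labels.topic }} has a backlog of more than 5000 messages"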
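Before restarting or reloading Prometheus, the configuration and rule file can be checked with promtool, which ships in the same release tarball as the prometheus binary. The following is a sketch, assuming it is run from the extracted Prometheus directory where prometheus.yml and alert-rules.yml live; validate_config and the PROMTOOL variable are hypothetical names introduced for illustration.

```shell
# Hypothetical helper: validate prometheus.yml and alert-rules.yml with promtool.
# PROMTOOL defaults to the binary shipped in the Prometheus release directory.
validate_config() {
  "${PROMTOOL:-./promtool}" check config prometheus.yml &&
    "${PROMTOOL:-./promtool}" check rules alert-rules.yml
}

if validate_config; then
  # Prometheus reloads its configuration and rule files on SIGHUP.
  pkill -HUP -x prometheus || echo "prometheus is not running; start it to load the rules"
else
  echo "validation failed; fix the reported errors before reloading" >&2
fi
```

Checking first avoids a reload that silently keeps the old rules because the new file did not parse.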
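Use kafka_disk_usage_percentage to monitor disk usage.
Alert status can be checked in the Prometheus web UI (http://localhost:9090).
Edit kafka-server-start.sh and add the JMX parameters:
export JMX_PORT=9999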
export KAFKA_JMX_OPTS="-Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false"
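After the broker is restarted with these settings, a quick way to see whether the JMX port is accepting connections is sketched below. It assumes bash (for its /dev/tcp pseudo-device) and the port 9999 configured above; jmx_port_open is a hypothetical helper name.

```shell
# Hypothetical helper: succeeds if a TCP connection to host:port can be opened.
# Relies on bash's /dev/tcp redirection; the subshell keeps the descriptor local.
jmx_port_open() {
  (exec 3<>"/dev/tcp/${1:-localhost}/${2:-9999}") 2>/dev/null
}

if jmx_port_open localhost 9999; then
  echo "JMX port 9999 is reachable"
else
  echo "JMX port 9999 is not reachable; check that the broker was restarted with JMX_PORT set"
fi
```

This only confirms that something is listening on the port, not that the JMX endpoint is usable.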
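Use jconsole or the Prometheus JMX Exporter to collect more detailed JVM metrics.
Deploy alertmanager and connect it to notification channels such as DingTalk and WeCom via the Webhook protocol.
Open the Prometheus web UI (http://localhost:9090) and query the kafka_* metrics to confirm that data collection is working.
remote_write can be used to connect remote storage.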
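The Alertmanager webhook hookup could look roughly like the following sketch. It writes a minimal alertmanager.yml in the same heredoc style used for docker-compose.yml. The receiver name ops-webhook and the WEBHOOK_URL endpoint are hypothetical placeholders, and a relay service that translates Alertmanager's webhook payload into the chat platform's message format is assumed to be listening at that address.

```shell
# Assumption: WEBHOOK_URL points at a hypothetical relay that forwards alerts
# to DingTalk or WeCom; replace it with the real endpoint.
WEBHOOK_URL="${WEBHOOK_URL:-http://localhost:8060/alert}"

cat <<EOF > alertmanager.yml
route:
  receiver: 'ops-webhook'
  group_by: ['alertname']
receivers:
  - name: 'ops-webhook'
    webhook_configs:
      - url: '${WEBHOOK_URL}'
        send_resolved: true
EOF
```

Prometheus also needs an alerting section in prometheus.yml that points at Alertmanager (port 9093 by default) before alerts reach this receiver.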