@@ -0,0 +1,174 @@
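+Promtail JSON Filtered Log Collection Scheme v1.3
+=================================================
+
+Zero intrusion · explosion-safe · production-grade
+
+Applies to: K8s 1.20+ / bare metal / mixed container environments
+
+Goal: applications emit only single-line JSON; the collection layer (Promtail/Vector) handles parsing, sampling, labeling, and alerting.
+
+---
+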
+1. Unified log format (the only constraint on applications)
+
+Applications should all emit single-line JSON (minimum required fields):
+
+```json
+{
+  "ts":"2026-01-23T14:23:45.123Z",
+  "level":"INFO|WARN|ERROR",
+  "logger":"c.x.Foo",
+  "msg":"...",
+  "traceId":"a1b2c3",
+  "uri":"/api/order",
+  "duration":120,
+  "userId":123456,
+  "event":"order_create|login|APP_START|security|slow_sql",
+  "error":"NullPointerException: xxx"
+}
+```
+
+- No exception → `error=""`
+- Slow call → `duration>500` (milliseconds)
+- Audit / security / lifecycle → `event=xxx`
+
+logback-spring.xml template (AsyncAppender recommended)
+
+```xml
+<!-- The composite encoder emits only the fields named in the pattern, so the keys are not duplicated by LogstashEncoder's default fields -->
+<appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
+  <encoder class="net.logstash.logback.encoder.LoggingEventCompositeJsonEncoder">
+    <providers>
+      <pattern>
+        <!-- UTC makes the literal 'Z' suffix correct; #asLong emits duration as a JSON number -->
+        <pattern>{"ts":"%d{yyyy-MM-dd'T'HH:mm:ss.SSS'Z',UTC}","level":"%level","logger":"%logger","msg":"%msg","traceId":"%X{traceId:-}","uri":"%X{uri:-}","duration":"#asLong{%X{duration:-0}}","userId":"%X{userId:-}","event":"%X{event:-}","error":"%X{error:-}"}</pattern>
+      </pattern>
+    </providers>
+  </encoder>
+</appender>
+<!-- AsyncAppender keeps application threads from blocking on console I/O -->
+<appender name="ASYNC_JSON" class="ch.qos.logback.classic.AsyncAppender">
+  <appender-ref ref="JSON" />
+</appender>
+<root level="INFO"><appender-ref ref="ASYNC_JSON"/></root>
+```
+
+---
+
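+2. Promtail configuration overview (core idea)
+
+File: `/etc/promtail/config.yml`
+
+Core idea: three-stage pressure relief → low-cardinality labels → hash sampling → separate branches for exceptions, slow calls, and audit events
+
+Key points:
+- Level 0: drop DEBUG/TRACE outright (a cheap match on the raw line, so no parsing cost is paid)
+- Level 1: keyword-based keep (retain only ERROR, audit events, slow calls, and the like)
+- Level 2: sample WARN (for example, 10%)
+- Finish dropping and sampling before deep JSON parsing wherever possible, so that fewer lines are parsed
+- Promote only low-cardinality fields to labels (`level`, `event`, `namespace`, `exception_type`)
+
+Illustrative pipeline (conveys the idea; not a configuration to copy verbatim):
+
+```yaml
+# Pseudo-config: quick drop by level -> keyword keep -> WARN sampling -> JSON parsing -> timestamp -> labels
+```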
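+
+The three relief stages described above could be wired into `pipeline_stages` roughly as follows. This is a minimal sketch under stated assumptions, not the production config: the job name `app-json`, the log path, and the drop-reason strings are placeholders; the filters assume compact JSON with no space after the colon; and the `sampling` stage, available only in newer Promtail releases, samples randomly rather than by hash. Slow-call detection is left out because comparing `duration` requires the parsed value, and on Kubernetes a `cri` or `docker` stage would normally come first.
+
+```yaml
+scrape_configs:
+  - job_name: app-json                  # hypothetical job name
+    static_configs:
+      - targets: [localhost]
+        labels:
+          job: app-json
+          __path__: /var/log/app/*.log  # hypothetical path
+    pipeline_stages:
+      # Level 0: regex on the raw line, so DEBUG/TRACE never reach the JSON parser
+      - drop:
+          expression: '"level":"(DEBUG|TRACE)"'
+          drop_counter_reason: level_debug_trace
+      # Level 1: INFO lines survive only when they carry a non-empty event
+      - match:
+          selector: '{job="app-json"} |= "\"level\":\"INFO\"" |= "\"event\":\"\""'
+          action: drop
+          drop_counter_reason: info_without_event
+      # Level 2: keep roughly 10% of WARN lines
+      - match:
+          selector: '{job="app-json"} |= "\"level\":\"WARN\""'
+          stages:
+            - sampling:
+                rate: 0.1
+      # Only lines that survived the stages above are parsed
+      - json:
+          expressions:
+            ts: ts
+            level: level
+            event: event
+      - timestamp:
+          source: ts
+          format: RFC3339Nano
+      # Low-cardinality fields only
+      - labels:
+          level:
+          event:
+```
+
+Keeping the `drop` and `match` stages ahead of `json` is the point of the design: lines that are discarded or sampled out never incur parsing cost.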
+
+---
+
+3. Resource usage reference (4C8G node)
+
+Stage | CPU | Memory | Notes
+---|---:|---:|---
+Raw full JSON parsing | 150-200% | 400 MB | No filtering
+With the three-stage relief valve | 30-40% | 120 MB | Measured on the same cluster
+Parsing moved to Vector | 10-15% | 100 MB | Promtail only forwards
+
+---
+
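+4. Storage and explosion safeguards
+
+1. Label cardinality gate: before release, a script scans 24h of logs; if any label has more than 5000 unique values, the release is blocked.
+
+2. Loki rate-limiting example:
+
+```yaml
+limits_config:
+  per_stream_rate_limit: 3MB
+  per_stream_rate_limit_burst: 5MB
+  ingestion_rate_mb: 10
+  ingestion_burst_size_mb: 20
+```
+
+Traffic over the limit is dropped outright and surfaces in the `rate_limit_discarded_bytes` metric (in stock Loki the corresponding counter is `loki_discarded_bytes_total`, with the cause in its `reason` label).
+
+3. Retention: the `audit` stream is kept for 365d, everything else for 7d, stored in S3/OSS.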
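+
+The retention split in item 3 could be expressed with Loki's compactor-based retention, sketched below. It assumes audit lines carry a `log_type="audit"` label (the selector the alert templates in section 5 use) and that the compactor handles retention. Depending on the Loki version, the compactor needs further settings (working directory, store for delete requests) that are omitted here.
+
+```yaml
+compactor:
+  retention_enabled: true        # retention is not applied unless the compactor enables it
+limits_config:
+  retention_period: 168h         # default for all streams: 7d
+  retention_stream:
+    - selector: '{log_type="audit"}'
+      priority: 1
+      period: 8760h              # audit streams: 365d
+```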
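+
+Cardinality scan example (CI / pre-deployment):
+
+```bash
+# Count distinct `event` values over 24h and fail above 5000 (the --limit value is an assumed cap; size it to your log volume)
+logcli query '{env="prod"}' --since=24h --limit=1000000 --output=raw --quiet | jq -r '.event // empty' | sort -u | wc -l | awk '$1>5000{print "event cardinality:", $1; exit 1}'
+```
+
+---
+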
+5. LogQL alert templates (can be imported into Grafana directly)
+
+```logql
+# 1. Exception surge (compared with 1h earlier)
+sum by (exception_type) (
+  rate({level="ERROR"}[5m])
+) > 1.5 * sum by (exception_type) (
+  rate({level="ERROR"}[1h] offset 1h)
+)
+
+# 2. Slow-call P99 above 1s (duration is in ms; `| json` extracts duration and uri from the line)
+quantile_over_time(0.99,
+  {slow="true"} | json | unwrap duration [5m]
+) by (uri) > 1000
+
+# 3. Audit event rate drops
+sum(rate({log_type="audit"}[5m])) by (event) < 0.5 * sum(rate({log_type="audit"}[1h] offset 1h)) by (event)
+```
+
+Supplementary alerts (parsing and discard monitoring):
+
+```promql
+# These query Prometheus metrics scraped from Promtail and Loki, not log streams
+sum(rate(promtail_parsing_errors_total[5m])) > 0
+rate(loki_ingester_discarded_bytes_total[5m]) > 0
+rate(promtail_dropped_bytes_total[5m]) > 0
+```
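+
+Because these expressions query Prometheus metrics rather than log streams, they would normally live in a Prometheus rule file. A minimal sketch, assuming the metric names above are actually exposed in the deployment; the group name, alert names, `for` durations, and severity labels are placeholders:
+
+```yaml
+groups:
+  - name: log-pipeline-health          # hypothetical group name
+    rules:
+      - alert: PromtailParsingErrors
+        expr: sum(rate(promtail_parsing_errors_total[5m])) > 0
+        for: 10m
+        labels:
+          severity: warning
+      - alert: LokiDiscardingBytes
+        expr: sum(rate(loki_ingester_discarded_bytes_total[5m])) > 0
+        for: 5m
+        labels:
+          severity: critical
+      - alert: PromtailDroppingBytes
+        expr: sum(rate(promtail_dropped_bytes_total[5m])) > 0
+        for: 5m
+        labels:
+          severity: critical
+```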
+
+---
+
+6. Operations checklist
+
+Daily
+- `kubectl top pod -l app=promtail`: CPU stays below 500m (half of one core)
+- Cardinality warning (run before deployment / in CI)
+
+Weekly
+- Check `rate(promtail_dropped_bytes_total[5m]) > 0`
+
+Monthly
+- Review and adjust `per_stream_rate_limit` and the sampling policy
+
+New: add the cardinality check script to the Helm/Chart CI so that releases introducing high-cardinality labels are blocked and sample values are shown in the PR.
+
+---
+
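+7. Upgrade order and migration advice
+
+Upgrade order (recommended): Vector (optional) → Loki → Promtail → Grafana
+
+Migration steps (brief)
+1. In staging, dual-write to the old and new streams for 72h and compare alerts and queries;
+2. Canary (10% of nodes), expanding gradually while monitoring parsing_errors and dropped_bytes;
+3. Switch over fully and observe for 24-72h; on any anomaly, roll back immediately and restore dual-writing.
+
+Rollback: use Git/Helm release to revert to the last successful version.
+
+---
+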
+8. Developer & operations appendix
+
+Consider adding the following example files to the repo:
+- `logback-spring.xml` template (AsyncAppender)
+- `LoggingFilter.java` (Servlet Filter example)
+- `ci/check_cardinality.sh` (CI gate script)