
Add logsys Helm chart: Vector+Loki log collection stack with LoadBalancer Loki service

demo-user, 2 months ago
parent
commit
1f89440acf

+ 6 - 0
k8s/helm/logsys/Chart.yaml

@@ -0,0 +1,6 @@
+apiVersion: v2
+name: logsys
+description: Vector-based log collection chart for shop-recycle services (Vector -> Loki + Prometheus)
+type: application
+version: 0.1.0
+appVersion: "0.1.0"

+ 48 - 0
k8s/helm/logsys/README.md

@@ -0,0 +1,48 @@
+# logsys Helm chart
+
+This chart deploys a complete log collection and storage stack:
+- **Loki** — Log storage engine (StatefulSet with PersistentVolume)
+- **Vector** — Log collector agent (DaemonSet)
+
+The Vector agent collects container JSON logs from the `shop-recycle` namespace and ships parsed labels to Loki and metrics to a Prometheus exporter. Configuration and parsing rules are adapted from the Log.md monitoring guide.
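+
+For reference, a log line the parsing rules can consume might look like this (field names follow the Vector transform in this chart; the values are illustrative):
+
+```json
+{"ts": "2024-05-12T09:30:00Z", "level": "INFO", "app": "shop-recycle-order-service",
+ "env": "prod", "traceId": "abc123", "uri": "/order/42", "uri_group": "/order/:id",
+ "duration": 87, "userId": "u-1001", "event": "order_created",
+ "status": "ok", "event_class": "order"}
+```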
+
+## Install
+
+```bash
+helm upgrade --install logsys ./k8s/helm/logsys -n shoprecycle --create-namespace
+```
+
+This single command deploys:
+- Loki StatefulSet with persistent storage (10Gi default)
+- Vector DaemonSet for log collection
+- All ConfigMaps for configuration
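+
+To check that everything came up, something along these lines (namespace and `app` labels as used by this chart):
+
+```bash
+kubectl -n shoprecycle get pods -l app=loki
+kubectl -n shoprecycle get pods -l app=vector
+kubectl -n shoprecycle get svc loki
+```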
+
+## Configuration
+
+**Loki:**
+- `loki.enabled` — Enable/disable Loki deployment (default: true)
+- `loki.namespace` — Namespace label (note: templates render into the Helm release namespace; default: shoprecycle)
+- `loki.replicas` — Number of Loki replicas (default: 1)
+- `loki.persistence.size` — Storage size (default: 10Gi)
+- `loki.retention.days` — Log retention in days (default: 30)
+
+**Vector:**
+- `vector.enabled` — Enable/disable Vector deployment (default: true)
+- `vector.loki.endpoint` — Loki push endpoint (default: `http://loki:3100`)
+- `vector.logSelector` — List of app names to collect (defaults: gateway/order/payment/web)
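+
+Values can be overridden at install time; for example, to grow storage and shorten retention (a sketch using the keys above):
+
+```bash
+helm upgrade --install logsys ./k8s/helm/logsys -n shoprecycle \
+  --set loki.persistence.size=20Gi \
+  --set loki.retention.days=14
+```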
+
+## Usage
+
+After deployment, access Loki via:
+```bash
+kubectl port-forward -n shoprecycle svc/loki 3100:3100
+```
+
+Then configure Grafana to use datasource `http://loki:3100`.
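+
+From Grafana Explore (or `logcli`), the labels shipped by this chart support queries such as (label values assume the defaults in values.yaml):
+
+```logql
+{app="shop-recycle-order-service", level="ERROR"} | json | duration_ms > 500
+```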
+
+## Notes
+
+- This chart mounts `/var/log` from the host on each node.
+- The `kubernetes_logs` source needs RBAC permission to list and watch pods; the chart currently runs under the `default` service account, so grant it those verbs (or add a dedicated ServiceAccount and ClusterRoleBinding).
+- Loki uses filesystem storage with the BoltDB Shipper index for production-style log indexing.
+- Follow the migration steps in Log.md: start with staging dual-write, then canary, then full roll-out.
+

+ 52 - 0
k8s/helm/logsys/scripts/cardinality-gate.sh

@@ -0,0 +1,52 @@
+#!/bin/bash
+# Cardinality Gate - CI stage validation per Log.md
+# Run this script in CI before deploying Vector to ensure labels stay within cardinality limits
+
+set -euo pipefail
+
+LOKI_ENDPOINT="${LOKI_ENDPOINT:-http://loki:3100}"
+MAX_CARDINALITY=5000
+
+echo "🔍 Checking label cardinality limits for Vector-Loki deployment..."
+
+# Helper function to get cardinality from Loki
+check_label_cardinality() {
+  local label=$1
+  local max=$2
+  
+  echo -n "  Checking ${label}... "
+  
+  # Count the unique values of a label over the last 24h.
+  # Requires logcli: https://github.com/grafana/loki/releases
+  # (`logcli labels <name>` lists the values of that label.)
+  COUNT=$(logcli --addr="${LOKI_ENDPOINT}" labels "${label}" --since=24h 2>/dev/null | wc -l) || COUNT=0
+  
+  if [ "$COUNT" -gt "$max" ]; then
+    echo "❌ FAILED (${COUNT} > ${max})"
+    return 1
+  else
+    echo "✅ OK (${COUNT} <= ${max})"
+    return 0
+  fi
+}
+
+# Check critical low-cardinality labels (should be <<5000)
+FAILED=false
+check_label_cardinality "env" 10 || FAILED=true
+check_label_cardinality "app" 50 || FAILED=true
+check_label_cardinality "level" 10 || FAILED=true
+check_label_cardinality "event_class" 20 || FAILED=true
+check_label_cardinality "uri_group" 100 || FAILED=true
+check_label_cardinality "status" 10 || FAILED=true
+
+if [ "$FAILED" = true ]; then
+  echo ""
+  echo "❌ Cardinality check failed! Some labels exceed safe limits."
+  echo "   Review the Vector transform rules and reduce uri_group cardinality if needed."
+  exit 1
+else
+  echo ""
+  echo "✅ All label cardinality checks passed!"
+  echo "   Safe to deploy."
+  exit 0
+fi

+ 59 - 0
k8s/helm/logsys/templates/configmap-loki.yaml

@@ -0,0 +1,59 @@
+{{- if .Values.loki.enabled }}
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: loki-config
+  namespace: {{ .Release.Namespace }}
+  labels:
+    app: loki
+data:
+  loki-config.yaml: |
+    auth_enabled: false
+    
+    ingester:
+      chunk_idle_period: 3m
+      chunk_retain_period: 1m
+      chunk_encoding: gzip
+      max_chunk_age: 2h
+      lifecycler:
+        ring:
+          kvstore:
+            store: inmemory
+          replication_factor: 1
+    
+    limits_config:
+      enforce_metric_name: false
+      reject_old_samples: true
+      reject_old_samples_max_age: 168h
+      ingestion_rate_mb: 256
+      ingestion_burst_size_mb: 512
+      # compactor-based retention reads this limit
+      retention_period: {{ mul .Values.loki.retention.days 24 }}h
+
+    schema_config:
+      configs:
+        - from: 2020-10-24
+          store: boltdb-shipper
+          object_store: filesystem
+          schema: v11
+          index:
+            prefix: index_
+            period: 24h
+
+    server:
+      http_listen_port: 3100
+      log_level: info
+
+    storage_config:
+      boltdb_shipper:
+        active_index_directory: /loki/boltdb-shipper-active
+        shared_store: filesystem
+        cache_location: /loki/boltdb-shipper-cache
+      filesystem:
+        directory: {{ .Values.loki.storage.filesystem.directory }}
+
+    chunk_store_config:
+      max_look_back_period: 0s
+
+    # with boltdb-shipper, retention is enforced by the compactor rather than
+    # the table manager
+    compactor:
+      working_directory: /loki/compactor
+      shared_store: filesystem
+      retention_enabled: {{ .Values.loki.retention.enabled }}
+      retention_delete_delay: 2h
+{{- end }}

+ 138 - 0
k8s/helm/logsys/templates/configmap-vector.yaml

@@ -0,0 +1,138 @@
+apiVersion: v1
+kind: ConfigMap
+metadata:
+  name: vector-config
+  namespace: {{ .Release.Namespace }}
+  labels:
+    app: vector
+data:
+  vector.toml: |
+    # Vector config adapted from Log.md
+    [sources.kubernetes_logs]
+    type = "kubernetes_logs"
+    # kubernetes_logs reads pod logs from /var/log/pods; restrict collection
+    # to the target namespace with a field selector
+    extra_field_selector = "metadata.namespace={{ .Values.vector.namespace }}"
+
+    [transforms.parse_json]
+    type = "remap"
+    inputs = ["kubernetes_logs"]
+    source = '''
+    parsed, err = parse_json(.message)
+    if err == null {
+      .ts = parsed.ts
+      .level = parsed.level
+      .app = parsed.app
+      .env = parsed.env
+      .traceId = parsed.traceId
+      .uri = parsed.uri
+      .uri_group = parsed.uri_group
+      .duration_ms = to_int(parsed.duration) ?? 0
+      .userId = parsed.userId
+      .event = parsed.event
+      .error = parsed.error
+      .status = parsed.status
+      .event_class = parsed.event_class
+    }
+
+    # fallback: derive event_class from uri_group only when it was not set
+    if !exists(.event_class) {
+      uri_group = to_string(.uri_group) ?? ""
+      if starts_with(uri_group, "/order") {
+        .event_class = "order"
+      } else if starts_with(uri_group, "/payment") {
+        .event_class = "payment"
+      } else {
+        .event_class = "api"
+      }
+    }
+
+    # keep kubernetes metadata as label fields
+    .k8s_ns = .kubernetes.pod_namespace
+    .k8s_pod = .kubernetes.pod_name
+    .k8s_labels = .kubernetes.pod_labels
+    '''
+
+    [transforms.filter_services]
+    type = "filter"
+    inputs = ["parse_json"]
+    # keep only selected apps (based on k8s pod_labels.app or parsed .app)
+    condition = 'includes({{ toJson .Values.vector.logSelector }}, .kubernetes.pod_labels.app) || includes({{ toJson .Values.vector.logSelector }}, .app)'
+
+    [transforms.filter_levels]
+    type = "filter"
+    inputs = ["filter_services"]
+    condition = '.level != "DEBUG" && .level != "TRACE"'
+
+    [sinks.loki]
+    type = "loki"
+    inputs = ["filter_levels"]
+    endpoint = "{{ .Values.vector.loki.endpoint }}"
+    encoding.codec = "json"
+    # the label values below are Vector event templates; they are wrapped in
+    # backtick-quoted Helm expressions so Helm emits them literally
+    [sinks.loki.labels]
+    env = "{{ `{{ env }}` }}"
+    app = "{{ `{{ app }}` }}"
+    level = "{{ `{{ level }}` }}"
+    event_class = "{{ `{{ event_class }}` }}"
+    uri_group = "{{ `{{ uri_group }}` }}"
+    status = "{{ `{{ status }}` }}"
+
+    # log_to_metric has no per-metric filter option, so split the stream with
+    # dedicated filter transforms first
+    [transforms.errors_only]
+    type = "filter"
+    inputs = ["filter_levels"]
+    condition = '.status == "server_error" || .status == "client_error"'
+
+    [transforms.orders_only]
+    type = "filter"
+    inputs = ["filter_levels"]
+    condition = '.event_class == "order"'
+
+    [transforms.orders_failed_only]
+    type = "filter"
+    inputs = ["orders_only"]
+    condition = '.status == "server_error" || .status == "client_error"'
+
+    [transforms.payments_only]
+    type = "filter"
+    inputs = ["filter_levels"]
+    condition = '.event_class == "payment"'
+
+    # Counter: total requests + Histogram: request duration (latency)
+    [transforms.request_metrics]
+    type = "log_to_metric"
+    inputs = ["filter_levels"]
+
+    [[transforms.request_metrics.metrics]]
+    type = "counter"
+    field = "message"
+    name = "requests_total"
+    tags.app = "{{ `{{ app }}` }}"
+    tags.env = "{{ `{{ env }}` }}"
+    tags.uri_group = "{{ `{{ uri_group }}` }}"
+
+    [[transforms.request_metrics.metrics]]
+    type = "histogram"
+    field = "duration_ms"
+    name = "request_duration_ms"
+    tags.app = "{{ `{{ app }}` }}"
+    tags.uri_group = "{{ `{{ uri_group }}` }}"
+    tags.env = "{{ `{{ env }}` }}"
+
+    # Counter: HTTP request errors
+    [transforms.error_metrics]
+    type = "log_to_metric"
+    inputs = ["errors_only"]
+
+    [[transforms.error_metrics.metrics]]
+    type = "counter"
+    field = "message"
+    name = "requests_errors_total"
+    tags.app = "{{ `{{ app }}` }}"
+    tags.env = "{{ `{{ env }}` }}"
+    tags.status = "{{ `{{ status }}` }}"
+
+    # Counter: total orders
+    [transforms.order_metrics]
+    type = "log_to_metric"
+    inputs = ["orders_only"]
+
+    [[transforms.order_metrics.metrics]]
+    type = "counter"
+    field = "message"
+    name = "orders_total"
+    tags.app = "{{ `{{ app }}` }}"
+    tags.env = "{{ `{{ env }}` }}"
+
+    # Counter: failed orders
+    [transforms.order_failed_metrics]
+    type = "log_to_metric"
+    inputs = ["orders_failed_only"]
+
+    [[transforms.order_failed_metrics.metrics]]
+    type = "counter"
+    field = "message"
+    name = "orders_failed_total"
+    tags.app = "{{ `{{ app }}` }}"
+    tags.env = "{{ `{{ env }}` }}"
+
+    # Counter: payment events
+    [transforms.payment_metrics]
+    type = "log_to_metric"
+    inputs = ["payments_only"]
+
+    [[transforms.payment_metrics.metrics]]
+    type = "counter"
+    field = "message"
+    name = "payment_events_total"
+    tags.app = "{{ `{{ app }}` }}"
+    tags.env = "{{ `{{ env }}` }}"
+
+    [sinks.prometheus]
+    type = "prometheus_exporter"
+    inputs = ["request_metrics", "error_metrics", "order_metrics", "order_failed_metrics", "payment_metrics"]
+    address = "{{ .Values.vector.prometheus.exporterAddress }}"
+    default_namespace = "shop_recycle"

+ 50 - 0
k8s/helm/logsys/templates/daemonset-vector.yaml

@@ -0,0 +1,50 @@
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: vector
+  namespace: {{ .Release.Namespace }}
+  labels:
+    app: vector
+spec:
+  selector:
+    matchLabels:
+      app: vector
+  template:
+    metadata:
+      labels:
+        app: vector
+    spec:
+      serviceAccountName: default
+      nodeSelector: {{ toYaml .Values.nodeSelector | nindent 8 }}
+      tolerations: {{ toYaml .Values.tolerations | nindent 8 }}
+      affinity: {{ toYaml .Values.affinity | nindent 8 }}
+      containers:
+        - name: vector
+          image: "{{ .Values.vector.image.repository }}:{{ .Values.vector.image.tag }}"
+          imagePullPolicy: IfNotPresent
+          resources:
+            limits:
+              cpu: {{ .Values.vector.resources.limits.cpu }}
+              memory: {{ .Values.vector.resources.limits.memory }}
+            requests:
+              cpu: {{ .Values.vector.resources.requests.cpu }}
+              memory: {{ .Values.vector.resources.requests.memory }}
+          env:
+            - name: VECTOR_CONFIG
+              value: /etc/vector/vector.toml
+            # the kubernetes_logs source reads the node name from this variable
+            # to find the pods scheduled on the same node
+            - name: VECTOR_SELF_NODE_NAME
+              valueFrom:
+                fieldRef:
+                  fieldPath: spec.nodeName
+          ports:
+            # prometheus_exporter sink (default exporterAddress 0.0.0.0:9598)
+            - name: metrics
+              containerPort: 9598
+              protocol: TCP
+          volumeMounts:
+            - name: varlog
+              mountPath: /var/log
+              readOnly: true
+            - name: vector-config
+              mountPath: /etc/vector
+              readOnly: true
+            # writable checkpoint directory for the kubernetes_logs source
+            - name: data-dir
+              mountPath: /var/lib/vector
+      volumes:
+        - name: varlog
+          hostPath:
+            path: /var/log
+            type: DirectoryOrCreate
+        - name: vector-config
+          configMap:
+            name: vector-config
+        - name: data-dir
+          hostPath:
+            path: /var/lib/vector
+            type: DirectoryOrCreate
+      terminationGracePeriodSeconds: 30

+ 18 - 0
k8s/helm/logsys/templates/service-loki.yaml

@@ -0,0 +1,18 @@
+{{- if .Values.loki.enabled }}
+apiVersion: v1
+kind: Service
+metadata:
+  name: loki
+  namespace: {{ .Release.Namespace }}
+  labels:
+    app: loki
+spec:
+  type: LoadBalancer
+  ports:
+    - name: http
+      port: 3100
+      targetPort: http
+      protocol: TCP
+  selector:
+    app: loki
+{{- end }}

+ 77 - 0
k8s/helm/logsys/templates/statefulset-loki.yaml

@@ -0,0 +1,77 @@
+{{- if .Values.loki.enabled }}
+apiVersion: apps/v1
+kind: StatefulSet
+metadata:
+  name: loki
+  namespace: {{ .Release.Namespace }}
+  labels:
+    app: loki
+spec:
+  serviceName: loki
+  replicas: {{ .Values.loki.replicas }}
+  selector:
+    matchLabels:
+      app: loki
+  template:
+    metadata:
+      labels:
+        app: loki
+    spec:
+      serviceAccountName: default
+      nodeSelector: {{ toYaml .Values.nodeSelector | nindent 8 }}
+      tolerations: {{ toYaml .Values.tolerations | nindent 8 }}
+      affinity: {{ toYaml .Values.affinity | nindent 8 }}
+      containers:
+        - name: loki
+          image: "{{ .Values.loki.image.repository }}:{{ .Values.loki.image.tag }}"
+          imagePullPolicy: IfNotPresent
+          args:
+            # point Loki at the mounted ConfigMap (the image default is
+            # /etc/loki/local-config.yaml)
+            - -config.file=/etc/loki/loki-config.yaml
+          ports:
+            - name: http
+              containerPort: 3100
+              protocol: TCP
+          livenessProbe:
+            httpGet:
+              path: /ready
+              port: http
+            initialDelaySeconds: 45
+            timeoutSeconds: 1
+            periodSeconds: 10
+            successThreshold: 1
+            failureThreshold: 3
+          readinessProbe:
+            httpGet:
+              path: /ready
+              port: http
+            initialDelaySeconds: 45
+            timeoutSeconds: 1
+            periodSeconds: 10
+            successThreshold: 1
+            failureThreshold: 3
+          resources:
+            limits:
+              cpu: {{ .Values.loki.resources.limits.cpu }}
+              memory: {{ .Values.loki.resources.limits.memory }}
+            requests:
+              cpu: {{ .Values.loki.resources.requests.cpu }}
+              memory: {{ .Values.loki.resources.requests.memory }}
+          volumeMounts:
+            - name: loki-config
+              mountPath: /etc/loki
+              readOnly: true
+            - name: loki-storage
+              mountPath: /loki
+      volumes:
+        - name: loki-config
+          configMap:
+            name: loki-config
+  volumeClaimTemplates:
+    - metadata:
+        name: loki-storage
+      spec:
+        accessModes:
+          - ReadWriteOnce
+        storageClassName: {{ .Values.loki.persistence.storageClassName }}
+        resources:
+          requests:
+            storage: {{ .Values.loki.persistence.size }}
+{{- end }}

+ 53 - 0
k8s/helm/logsys/values.yaml

@@ -0,0 +1,53 @@
+loki:
+  enabled: true
+  namespace: shoprecycle
+  image:
+    repository: grafana/loki
+    tag: 2.9.3
+  replicas: 1
+  storage:
+    type: filesystem
+    filesystem:
+      directory: /loki/chunks
+  persistence:
+    enabled: true
+    size: 10Gi
+    storageClassName: standard
+  resources:
+    limits:
+      cpu: 500m
+      memory: 512Mi
+    requests:
+      cpu: 100m
+      memory: 128Mi
+  retention:
+    enabled: true
+    days: 30
+
+vector:
+  enabled: true
+  namespace: shoprecycle
+  image:
+    repository: timberio/vector
+    tag: 0.36.1-debian
+  loki:
+    endpoint: http://loki:3100
+  prometheus:
+    exporterAddress: 0.0.0.0:9598
+  # pod label/app names to collect. Matches against kubernetes.labels.app
+  logSelector:
+    - shop-recycle-gateway
+    - shop-recycle-order-service
+    - shop-recycle-payment-service
+    - shop-recycle-web
+  resources:
+    limits:
+      cpu: 500m
+      memory: 512Mi
+    requests:
+      cpu: 100m
+      memory: 128Mi
+
+nodeSelector: {}
+tolerations: []
+affinity: {}