Testing OpenShift Autoscaling with Grafana, Loki and a Promtail sidecar

OKD Promtail Loki NFS Kubernetes Grafana OpenShift


I was eager to explore OpenShift’s autoscaling capabilities, specifically the Horizontal Pod Autoscaler (HPA).

But I did not just want to watch replicas scale up and down in the cluster view. I wanted clear, observable proof that those replicas were alive, serving traffic, and generating logs that could be traced back to individual pods and nodes.

The goal was simple. Build a demo that behaves the way a real deployment should, including production-correct storage using dynamically provisioned PersistentVolumeClaims (PVCs).

Core Requirements

  • Loki on RWX PVC (nfs-dynamic)
  • Grafana on RWX PVC (nfs-dynamic)
  • Promtail sidecar shipping Grafana’s own logs to Loki
  • Promtail injecting labels: pod, pod_ip, node, namespace
  • HPA functioning correctly because both containers define CPU requests
  • PSA restricted-compliant load pod

The intent was to validate autoscaling in a way that is observable, repeatable, and aligned with how workloads should be built from the start, not something bolted on afterward.

After I finished this up, I realized it would make a great first real AWX play, or a good candidate for Helm, so I will probably roll that into a post at some point.

Let's create a project:

oc new-project grafana-demo
oc project grafana-demo

Create the RWX PVC claims (dynamic NFS)

oc apply -f - <<'YAML'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: loki-data
  namespace: grafana-demo
spec:
  accessModes: [ReadWriteMany]
  storageClassName: nfs-dynamic
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: grafana-data
  namespace: grafana-demo
spec:
  accessModes: [ReadWriteMany]
  storageClassName: nfs-dynamic
  resources:
    requests:
      storage: 5Gi
YAML

Wait until both are Bound:

oc get pvc -n grafana-demo

Loki (PVC-backed, internal use only, no Route)

oc apply -f - <<'YAML'
apiVersion: v1
kind: ConfigMap
metadata:
  name: loki-config
  namespace: grafana-demo
data:
  loki.yaml: |
    auth_enabled: false

    server:
      http_listen_port: 3100

    common:
      path_prefix: /loki
      replication_factor: 1
      ring:
        kvstore:
          store: inmemory

    ingester:
      lifecycler:
        ring:
          kvstore:
            store: inmemory
          replication_factor: 1
      chunk_idle_period: 1h
      chunk_retain_period: 30s

    schema_config:
      configs:
        - from: 2024-01-01
          store: boltdb-shipper
          object_store: filesystem
          schema: v11
          index:
            prefix: index_
            period: 24h

    storage_config:
      filesystem:
        directory: /loki/chunks

    compactor:
      working_directory: /loki/compactor
      shared_store: filesystem
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: loki
  namespace: grafana-demo
spec:
  replicas: 1
  selector:
    matchLabels: {app: loki}
  template:
    metadata:
      labels: {app: loki}
    spec:
      securityContext:
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: loki
        image: docker.io/grafana/loki:2.9.3
        args: ["-config.file=/etc/loki/loki.yaml"]
        ports: [{containerPort: 3100}]
        resources:
          requests:
            cpu: 50m
            memory: 128Mi
          limits:
            cpu: 300m
            memory: 512Mi
        securityContext:
          allowPrivilegeEscalation: false
          runAsNonRoot: true
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: config
          mountPath: /etc/loki
        - name: data
          mountPath: /loki
      volumes:
      - name: config
        configMap: {name: loki-config}
      - name: data
        persistentVolumeClaim: {claimName: loki-data}
---
apiVersion: v1
kind: Service
metadata:
  name: loki
  namespace: grafana-demo
spec:
  selector: {app: loki}
  ports:
  - port: 3100
    targetPort: 3100
YAML

Verify Loki is ready. Note that it takes a minute to become ready; trust me, I can be impatient.

oc exec -n grafana-demo deploy/loki -- wget -qO- http://localhost:3100/ready
# expected: ready
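Since readiness takes a moment, you could also let Kubernetes gate the pod on that same endpoint with a readinessProbe instead of polling by hand. A sketch to drop under the loki container spec (not part of the deploy above; the delay and period values are my guesses, but the /ready endpoint and port 3100 are the ones used here):

```yaml
        # Probe the same /ready endpoint checked manually above
        readinessProbe:
          httpGet:
            path: /ready
            port: 3100
          initialDelaySeconds: 15
          periodSeconds: 10
```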

Promtail sidecar config (adds node/pod/ip labels)

oc apply -f - <<'YAML'
apiVersion: v1
kind: ConfigMap
metadata:
  name: promtail-sidecar-config
  namespace: grafana-demo
data:
  promtail.yaml: |
    server:
      http_listen_port: 9080
      grpc_listen_port: 0

    positions:
      filename: /tmp/positions.yaml

    clients:
      - url: http://loki:3100/loki/api/v1/push

    scrape_configs:
      - job_name: grafana
        static_configs:
          - targets: [localhost]
            labels:
              app: grafana
              namespace: ${POD_NAMESPACE}
              pod: ${POD_NAME}
              pod_ip: ${POD_IP}
              node: ${NODE_NAME}
              __path__: /var/log/grafana/grafana.log
YAML
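The ${POD_NAME}-style placeholders above are not native Promtail label syntax; they only resolve because the sidecar is started with -config.expand-env=true (set in the Deployment args below), which substitutes environment variables into the config file much like shell parameter expansion. A quick shell illustration of the semantics (the variable values here are made up):

```shell
# Simulate what -config.expand-env=true does to the labels block:
# each ${VAR} is replaced with the value from the environment.
POD_NAME=grafana-6c9f7b-abcde
POD_NAMESPACE=grafana-demo
echo "pod: ${POD_NAME}"
echo "namespace: ${POD_NAMESPACE}"
```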

Grafana + RWX PVC + Promtail sidecar (HPA-safe)

Critical lesson here: when I first deployed this, I could not get Grafana to autoscale. Every container in the pod must define CPU requests, and I had missed that part. Once both containers had CPU requests defined, things started to work as expected.

This is the chunk of YAML that got it done.

resources:
  requests:
    cpu: 10m
    memory: 32Mi
  limits:
    cpu: 50m
    memory: 128Mi

On to the deploy:

oc apply -f - <<'YAML'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: grafana
  namespace: grafana-demo
spec:
  replicas: 1
  selector:
    matchLabels: {app: grafana}
  template:
    metadata:
      labels: {app: grafana}
    spec:
      securityContext:
        runAsNonRoot: true
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: grafana
        image: quay.io/openshift/origin-grafana:latest
        ports: [{containerPort: 3000}]
        env:
        - name: GF_SECURITY_ADMIN_USER
          value: admin
        - name: GF_SECURITY_ADMIN_PASSWORD
          value: admin
        - name: GF_LOG_MODE
          value: file
        - name: GF_LOG_LEVEL
          value: info
        - name: GF_LOG_FILE
          value: /var/log/grafana/grafana.log
        resources:
          requests:
            cpu: 50m
            memory: 128Mi
          limits:
            cpu: 300m
            memory: 512Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: grafana-data
          mountPath: /var/lib/grafana
        - name: grafana-logs
          mountPath: /var/log/grafana

      - name: promtail
        image: docker.io/grafana/promtail:2.9.3
        args:
        - -config.expand-env=true
        - -config.file=/etc/promtail/promtail.yaml
        env:
        - name: POD_NAME
          valueFrom: {fieldRef: {fieldPath: metadata.name}}
        - name: POD_NAMESPACE
          valueFrom: {fieldRef: {fieldPath: metadata.namespace}}
        - name: POD_IP
          valueFrom: {fieldRef: {fieldPath: status.podIP}}
        - name: NODE_NAME
          valueFrom: {fieldRef: {fieldPath: spec.nodeName}}
        resources:
          requests:
            cpu: 10m
            memory: 32Mi
          limits:
            cpu: 50m
            memory: 128Mi
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
        - name: grafana-logs
          mountPath: /var/log/grafana
        - name: promtail-config
          mountPath: /etc/promtail

      volumes:
      - name: grafana-data
        persistentVolumeClaim: {claimName: grafana-data}
      - name: grafana-logs
        emptyDir: {}
      - name: promtail-config
        configMap: {name: promtail-sidecar-config}
---
apiVersion: v1
kind: Service
metadata:
  name: grafana
  namespace: grafana-demo
spec:
  selector: {app: grafana}
  ports:
  - port: 3000
    targetPort: 3000
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: grafana
  namespace: grafana-demo
spec:
  to:
    kind: Service
    name: grafana
  port:
    targetPort: 3000
YAML

Verify Grafana pod is 2/2 Running:

oc get pods -n grafana-demo

Get URL:

oc get route grafana -n grafana-demo

Login: admin / admin

Add Loki datasource (Grafana UI)

oc get route grafana
NAME      HOST/PORT                                 PATH   SERVICES   PORT   TERMINATION   WILDCARD
grafana   grafana-grafana-demo.apps.okd.vv-int.io          grafana    3000                 None

Grafana → Data sources → Add → Loki

URL:

http://loki:3100

Save & Test.
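If you would rather not click through the UI, Grafana also supports file-based datasource provisioning. A sketch of what that could look like, mounted into the Grafana container under /etc/grafana/provisioning/datasources (the file location and mounting are my assumptions, not part of the deploy above; the URL is the same in-cluster Service address):

```yaml
# Hypothetical provisioning file, e.g. loki-datasource.yaml
apiVersion: 1
datasources:
  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    isDefault: true
```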

Explore query:

{app="grafana"}

You should see logs immediately.

And you can prove routing/placement:

sum by (pod, node, pod_ip) (count_over_time({app="grafana"}[1m]))

Autoscaling

I am intentionally aggressive here because I want to see it scale. Kubernetes scales up quickly and scales down slowly by design. I can tune stabilization windows and policies later. For now, the objective is visibility. I want to watch the replicas spin up.

The advantage of OpenShift is that the metrics stack is already wired in. There is no metrics server installation and no adapter plumbing. You are simply defining the HPA and letting the controller do its job.

Create HPA

oc autoscale deployment grafana -n grafana-demo \
  --cpu-percent=20 \
  --min=1 \
  --max=6

Make scale-up aggressive

oc patch hpa grafana -n grafana-demo -p '{
  "spec": {
    "behavior": {
      "scaleUp": {
        "stabilizationWindowSeconds": 0,
        "policies": [{
          "type": "Percent",
          "value": 100,
          "periodSeconds": 15
        }]
      }
    }
  }
}'
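The oc autoscale command plus the behavior patch above are equivalent to a single declarative autoscaling/v2 manifest, which is easier to keep in git. A sketch combining both:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: grafana
  namespace: grafana-demo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: grafana
  minReplicas: 1
  maxReplicas: 6
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 20
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
```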

Confirm the HPA is no longer unknown:

oc get hpa grafana -n grafana-demo

You should see cpu: X%/20% (a number, not unknown).

PSA-compliant load pod (internal hammer)

This is basically a CLI pod that hammers Grafana with curl.

oc apply -f - <<'YAML'
apiVersion: v1
kind: Pod
metadata:
  name: grafana-load
  namespace: grafana-demo
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: loadgen
    image: quay.io/openshift/origin-cli:latest
    command:
      - sh
      - -c
      - |
        while true; do
          for i in $(seq 1 50); do
            curl -s http://grafana:3000 >/dev/null &
          done
          wait
        done
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop: ["ALL"]
    resources:
      requests:
        cpu: 200m
        memory: 64Mi
      limits:
        cpu: 500m
        memory: 128Mi
YAML

Watch scaling:

oc get hpa grafana -n grafana-demo -w

and:

oc get pods -n grafana-demo -w

You should see replicas climb: 1 → 2 → 4 → 6 (or similar).

Prove autoscaling in Loki logs

In Grafana, go to Dashboards → Import.

For this step I am including an importable Grafana dashboard:

{
  "annotations": {
    "list": [
      {
        "builtIn": 1,
        "datasource": "-- Grafana --",
        "enable": true,
        "hide": true,
        "iconColor": "rgba(0, 211, 255, 1)",
        "name": "Annotations & Alerts",
        "target": {
          "limit": 100,
          "matchAny": false,
          "tags": [],
          "type": "dashboard"
        },
        "type": "dashboard"
      }
    ]
  },
  "editable": true,
  "fiscalYearStartMonth": 0,
  "graphTooltip": 0,
  "id": null,
  "links": [],
  "liveNow": false,
  "panels": [
    {
      "fieldConfig": {
        "defaults": {
          "color": {
            "mode": "thresholds"
          },
          "custom": {
            "align": "auto",
            "displayMode": "auto"
          },
          "mappings": [],
          "thresholds": {
            "mode": "absolute",
            "steps": [
              { "color": "green", "value": null },
              { "color": "red", "value": 80 }
            ]
          }
        },
        "overrides": [
          {
            "matcher": { "id": "byName", "options": "Time" },
            "properties": [
              { "id": "custom.width", "value": 235 }
            ]
          }
        ]
      },
      "gridPos": { "h": 18, "w": 22, "x": 0, "y": 0 },
      "id": 2,
      "options": {
        "footer": {
          "fields": "",
          "reducer": ["sum"],
          "show": false
        },
        "showHeader": true,
        "sortBy": []
      },
      "targets": [
        {
          "datasource": "Loki",
          "expr": "sum by (node, pod, pod_ip) (\n  count_over_time({app=\"grafana\"}[15m])\n)",
          "refId": "A"
        }
      ],
      "title": "Autoscale-Validation",
      "transformations": [
        { "id": "labelsToFields", "options": {} },
        {
          "id": "reduce",
          "options": {
            "reducers": ["last"]
          }
        }
      ],
      "type": "table"
    }
  ],
  "refresh": "",
  "schemaVersion": 34,
  "style": "dark",
  "tags": [],
  "templating": { "list": [] },
  "time": { "from": "now-6h", "to": "now" },
  "timepicker": {},
  "timezone": "",
  "title": "Autoscale-Validation",
  "uid": "eAHCVDvDz",
  "version": 1,
  "weekStart": ""
}

Stop load and watch scale down

Kubernetes scales up fast and scales down slowly, and I did not want to wait on my video above 😀

oc delete pod grafana-load -n grafana-demo
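If you would rather tune the scale-down than delete the load pod, the same HPA behavior field also accepts a scaleDown section; the default scale-down stabilization window is five minutes, which is the slowness you see here. A sketch of an HPA spec fragment that shortens it (the specific values are illustrative, not something I tested in this demo):

```yaml
  behavior:
    scaleDown:
      # Default stabilization window is 300s; shorten it for demos
      stabilizationWindowSeconds: 30
      policies:
      - type: Pods
        value: 2
        periodSeconds: 15
```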

If something still refuses to scale

The one truth command:

oc describe hpa grafana -n grafana-demo

That's about it. Again, pretty cool stuff. None of this is new in Kubernetes land, but not having to wire in all the pieces to make something like this go really is a selling point for OpenShift.

Thanks for reading, -Christian
