Welcome back to the Kubernetes Homelab Series! 🚀
In the previous post, we set up persistent storage with Longhorn and MinIO. Today, we’re enhancing our cluster with a full monitoring and observability stack using Prometheus, Grafana, and AlertManager. We’ll also use a GitOps approach with ArgoCD to deploy and manage these tools.
Monitoring is critical in any Kubernetes environment, whether in production or in your homelab. A robust monitoring stack gives you real-time insights into resource consumption, application performance, and potential failures. By the end of this guide, you’ll have a fully functional monitoring stack that will help you answer questions like:
- How much CPU and memory are my applications consuming?
- Are my nodes and workloads operating correctly?
- Is my storage nearing its capacity limits?
- Are there any anomalies that could impact performance?
Kubernetes Upgrade: I’ve expanded my cluster with two additional worker nodes as VMs, bringing the total to four nodes. With plans to deploy more applications and services, this upgrade was essential to ensure scalability and performance.
The Monitoring Stack at a Glance
The stack we’ll deploy includes:
- Prometheus — Collects and stores metrics from Kubernetes and your applications.
- Grafana — Visualizes those metrics on customizable dashboards.
- AlertManager — Sends notifications when metrics breach defined thresholds.
- Node Exporter & Other Exporters — Pre-configured exporters that gather node- and pod-level metrics. (Prometheus discovers these, and any custom targets you add later, through ServiceMonitor resources; see the sketch after this list.)
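Under the hood, the Prometheus Operator shipped with the stack discovers scrape targets declaratively: each exporter is paired with a ServiceMonitor, and you can write your own for applications you deploy later. Here is a minimal sketch; the app name, namespace, and port name are placeholders, and the release: monitoring label assumes the chart's default selector behavior and the release name we use later in this post:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app                  # placeholder name
  namespace: monitoring
  labels:
    release: monitoring         # matches the operator's default ServiceMonitor selector
spec:
  selector:
    matchLabels:
      app: my-app               # must match the labels on the app's Service
  namespaceSelector:
    matchNames:
      - default                 # namespace where the app's Service lives
  endpoints:
    - port: metrics             # named port on the Service exposing /metrics
      interval: 30s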
Step 1: Deploy Prometheus and Grafana
Assuming you already have ArgoCD installed (refer to Part 2 for details), we’ll use the kube-prometheus-stack Helm chart.
Add the Helm Repository
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
Download the kube-prometheus-stack chart for customization:
helm fetch prometheus-community/kube-prometheus-stack --untar
This downloads the chart locally so you can modify the values as needed.
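If you want to see every option the chart exposes before overriding anything, you can also dump its default values to a file for reference:
helm show values prometheus-community/kube-prometheus-stack > default-values.yaml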
Step 2: Customize the Monitoring Stack
We need to configure persistent storage (using Longhorn) for Prometheus and Grafana and set up our alerting rules.
Create a custom-values.yaml file with the following content:
crds:
  create: false

grafana:
  service:
    type: LoadBalancer
    port: 80
  persistence:
    enabled: true
    accessModes:
      - ReadWriteOnce
    size: 8Gi
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: kubernetes.io/hostname
                operator: In
                values:
                  - beelink

prometheus:
  prometheusSpec:
    remoteWriteDashboards: false
    affinity:
      nodeAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          nodeSelectorTerms:
            - matchExpressions:
                - key: kubernetes.io/hostname
                  operator: In
                  values:
                    - beelink
    serverSideApply: true
    retention: 12h
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: longhorn
          accessModes:
            - ReadWriteOnce
          resources:
            requests:
              storage: 25Gi
    resources:
      requests:
        memory: 3Gi
        cpu: 500m
      limits:
        memory: 6Gi
        cpu: 2
This configuration gives both Prometheus and Grafana persistent storage (backed by Longhorn), pins them to your preferred node (the Beelink mini PC) via node affinity, caps Prometheus retention at 12 hours, and sets explicit resource requests and limits so the stack cannot starve the rest of the cluster.
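Before handing these values to ArgoCD, a quick local render catches YAML or templating mistakes early. A minimal sketch, assuming the chart was untarred into ./kube-prometheus-stack as in Step 1 and the values file sits next to it:
helm template monitoring ./kube-prometheus-stack -f custom-values.yaml > /dev/null && echo "values render cleanly"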
Step 3: Deploy the Stack with ArgoCD
We’ll now define an ArgoCD application to deploy the monitoring stack. Save the following manifest as monitoring-application.yaml:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: monitoring
  namespace: argocd
spec:
  project: default
  sources:
    - repoURL: 'https://prometheus-community.github.io/helm-charts'
      chart: kube-prometheus-stack
      targetRevision: 67.9.0
      helm:
        valueFiles:
          - $values/apps/kube-prometheus-stack/custom-values.yaml
    - repoURL: 'https://github.com/pablodelarco/kubernetes-homelab'
      targetRevision: main
      ref: values
  destination:
    server: 'https://kubernetes.default.svc'
    namespace: monitoring
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - CreateNamespace=true
This multi-source Application tells ArgoCD to pull the kube-prometheus-stack chart from the Prometheus community Helm repository and the customized values from your Git repository (make sure custom-values.yaml is committed at apps/kube-prometheus-stack/custom-values.yaml), then keep everything in sync automatically with pruning and self-healing enabled.
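Apply the manifest and check that ArgoCD picks it up and the pods come online (the exact pod names depend on your release and chart version):
kubectl apply -f monitoring-application.yaml
kubectl get application monitoring -n argocd
kubectl get pods -n monitoring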
Step 4: Set Up Grafana Dashboards
Grafana is the visualization layer. One of the key dashboards we’ll use is the community Kubernetes dashboard with ID 15757, which displays:
- Cluster Resource Usage: CPU and RAM usage (actual, requested, and limits) across the cluster.
- Kubernetes Objects Overview: the number of nodes, namespaces, running pods, and other resources.
- Performance Metrics: CPU and memory utilization trends over time.
- Namespace Breakdown: CPU and memory usage per namespace.
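You can import dashboard 15757 manually through the Grafana UI (Dashboards → Import → enter the ID), or provision it declaratively through the chart. Below is a sketch of the declarative route, added to custom-values.yaml; the provider name and folder are arbitrary choices, and the datasource name assumes the default Prometheus datasource the chart creates:
grafana:
  dashboardProviders:
    dashboardproviders.yaml:
      apiVersion: 1
      providers:
        - name: default
          orgId: 1
          folder: ""
          type: file
          disableDeletion: false
          editable: true
          options:
            path: /var/lib/grafana/dashboards/default
  dashboards:
    default:
      k8s-views-global:
        gnetId: 15757        # dashboard ID on grafana.com
        revision: 1          # pin a specific revision
        datasource: Prometheus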
Grafana Login Credentials
- Default username: admin
- Retrieve password:
kubectl get secret -n monitoring monitoring-grafana -o jsonpath="{.data.admin-password}" | base64 --decode
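Because the Grafana service is of type LoadBalancer, the UI is reachable at the service's external IP. The service name follows the <release>-grafana pattern, so with the release name monitoring it should be:
kubectl get svc monitoring-grafana -n monitoring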
Step 5: Configure AlertManager
Defining Alert Rules
To monitor critical cluster events, we’ll define Prometheus alert rules for:
- High CPU Usage (>80% for 2 minutes)
- High Memory Usage (>80% for 2 minutes)
- Node Down (Unreachable for 5 minutes)
- CrashLoopBackOff (Pod stuck for 5 minutes)
Create a file named alerts.yaml:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cluster-alerts
  namespace: monitoring
spec:
  groups:
    - name: cluster-rules
      rules:
        - alert: HighCPUUsage
          expr: (1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance)) > 0.8
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "High CPU usage on {{ $labels.instance }}"
            description: "CPU usage is {{ humanizePercentage $value }} for 2 minutes."
        - alert: HighMemoryUsage
          expr: (1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)) > 0.8
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "High memory usage on {{ $labels.instance }}"
            description: "Memory usage is {{ humanizePercentage $value }} for 2 minutes."
        - alert: NodeDown
          expr: up{job="node-exporter"} == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Node {{ $labels.instance }} is down"
            description: "Node has been unreachable for 5 minutes."
        - alert: CrashLoopBackOff
          expr: kube_pod_container_status_waiting_reason{reason="CrashLoopBackOff"} == 1
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Pod {{ $labels.pod }} is crashing"
            description: "Pod {{ $labels.pod }} in {{ $labels.namespace }} is in CrashLoopBackOff."
Configuring AlertManager for Email Notifications
To receive alerts via email, define the AlertManager configuration in alertmanager-configmap.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: alertmanager-config
  namespace: monitoring
data:
  alertmanager.yml: |
    global:
      resolve_timeout: 5m
      smtp_smarthost: 'smtp.example.com:587'
      smtp_from: 'alertmanager@example.com'
      smtp_auth_username: 'alertmanager@example.com'
      smtp_auth_password: 'yourpassword'
      smtp_require_tls: true
    route:
      receiver: 'email-notifications'
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 3h
    receivers:
      - name: 'email-notifications'
        email_configs:
          - to: 'your-email@example.com'
            send_resolved: true
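Apply the ConfigMap and verify that AlertManager is running. One note, since it depends on how your stack is wired: the Prometheus Operator normally reads AlertManager configuration from a Secret or from the chart's alertmanager.config value rather than from an arbitrary ConfigMap, so you may need to reference this file from custom-values.yaml (or move its contents into that chart value) for it to take effect.
kubectl apply -f alertmanager-configmap.yaml
kubectl get pods -n monitoring | grep alertmanager
# AlertManager UI via port-forward (service name may differ; check with: kubectl get svc -n monitoring)
kubectl port-forward -n monitoring svc/monitoring-kube-prometheus-alertmanager 9093:9093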
Conclusion
With Prometheus, Grafana, and AlertManager fully integrated, your Kubernetes homelab now has a robust monitoring and alerting stack. This setup ensures real-time observability, allowing you to detect and respond to high resource usage, node failures, and pod crashes before they become critical.
Next, we’ll explore Kubernetes networking and ingress, focusing on how to simplify load balancing with MetalLB and enhance remote access using Tailscale. This will provide seamless connectivity and improved security for your homelab. Stay tuned!
If you’re building your own Kubernetes homelab, let’s connect on LinkedIn and exchange insights! You can also check out all my other posts on Medium.