Kubernetes monitoring tools: a practical guide without the YAML avalanche
Kubernetes monitoring tools explained for small clusters: kubectl, metrics-server, Prometheus, Lens, K9s, and log tailing with DockLog before you buy a platform.
Kubernetes monitoring tools split into four jobs: is the cluster healthy, are pods running, are users happy (metrics), and what did the app print (logs). Enterprise vendors sell one pane for all four. Small teams usually assemble two or three cheap pieces.
This guide assumes a homelab, staging cluster, or modest prod (not 500 nodes). If you are at platform-team scale, you already have opinions. If you are not, read this before installing everything in the CNCF landscape at once.
Layer 0: kubectl (free, required)
kubectl get pods -A
kubectl describe pod my-app-7f9c8
kubectl logs -f deploy/my-app
kubectl top nodes # needs metrics-server
kubectl top pods
Enough for the person who owns kubeconfig. Not enough for the developer who should see one namespace, or on-call without a laptop.
A kubectl workflow that scales to one person
When a pod misbehaves, this sequence answers most questions before you install anything else:
kubectl get pods -n staging -l app=my-app
kubectl describe pod my-app-7f9c8-abcde -n staging
kubectl logs my-app-7f9c8-abcde -n staging --previous # crashed container
kubectl get events -n staging --sort-by='.lastTimestamp'
--previous is the one people forget. If the container restarted, current logs may be empty while the last crash reason lives in the previous instance.
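One way to stop forgetting --previous is to let a small wrapper decide. The sketch below is illustrative only: crash_logs is a hypothetical helper name, and it assumes a single-container pod (it reads index 0 of containerStatuses).

```shell
# Sketch: print logs from the crashed instance when the pod has restarted,
# otherwise print the current logs. crash_logs is a hypothetical helper.
# Assumes a single-container pod (index 0 in containerStatuses).
crash_logs() {
  pod="$1"
  ns="$2"
  restarts=$(kubectl get pod "$pod" -n "$ns" \
    -o jsonpath='{.status.containerStatuses[0].restartCount}')
  if [ "${restarts:-0}" -gt 0 ]; then
    # At least one restart: the crash output lives in the previous instance.
    kubectl logs "$pod" -n "$ns" --previous
  else
    kubectl logs "$pod" -n "$ns"
  fi
}
```

Usage would look like: crash_logs my-app-7f9c8-abcde staging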
kubectl get events surfaces scheduling failures, image pull errors, and eviction warnings. Those never appear in application stdout. More on that below.
Layer 1: Cluster plumbing
metrics-server
Enables kubectl top and the HPA. Install once per cluster. Small, usually fine on k3s and kind with a tweak or two.
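For reference, a minimal install sketch. It assumes the upstream release manifest and the commonly used kind workaround; the function names are hypothetical. k3s bundles metrics-server by default, so this mostly applies to kind or kubeadm clusters.

```shell
# Sketch: install metrics-server from the upstream release manifest.
# Function names are hypothetical wrappers, not part of any tool.
install_metrics_server() {
  kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
}

# The usual tweak on kind: kubelets serve self-signed certificates, so
# metrics-server needs --kubelet-insecure-tls. Use on local clusters only.
allow_insecure_kubelet_tls() {
  kubectl patch deployment metrics-server -n kube-system --type=json \
    -p='[{"op":"add","path":"/spec/template/spec/containers/0/args/-","value":"--kubelet-insecure-tls"}]'
}
```

Run install_metrics_server once, then allow_insecure_kubelet_tls only if kubectl top keeps reporting that metrics are unavailable on a local cluster.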
Without it, you are flying blind on memory pressure until the OOM killer arrives. On a 4 GB homelab node, run kubectl top pods -A weekly to learn which Deployment actually needs limits.
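The weekly check can be reduced to a short filter. This is a sketch: high_memory_pods is a hypothetical helper, the 200 Mi default is an arbitrary example, and it assumes kubectl top reports memory in Mi (rows in other units are skipped).

```shell
# Sketch: list pods whose memory usage exceeds a threshold in Mi.
# Reads `kubectl top pods -A --no-headers` output on stdin, where the
# columns are NAMESPACE NAME CPU MEMORY. Rows not reported in Mi are skipped.
high_memory_pods() {
  limit_mi="${1:-200}"
  awk -v limit="$limit_mi" '
    $4 ~ /Mi$/ {
      mem = $4
      sub(/Mi$/, "", mem)
      if (mem + 0 > limit) print $1 "/" $2, $4
    }'
}
```

Usage would look like: kubectl top pods -A --no-headers | high_memory_pods 200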
Kubernetes Dashboard
Official web UI for resources. Useful for "what is CrashLoopBackOff." Less loved for log tailing at scale. Lock it behind auth and network policy.
Dashboard is not a substitute for log RBAC. Giving every developer dashboard access with a shared token is the same problem as sharing cluster-admin.
Lens / OpenLens
Desktop IDE for clusters. Popular with people who want GUI without running another in-cluster service. Logs and metrics depend on what the cluster already exposes.
Lens shines when you manage three clusters and do not want three browser bookmarks. It still assumes kubeconfig on your laptop, same trust boundary as K9s.
Layer 2: Terminal UIs
K9s
The default power-user terminal for Kubernetes. Pods, logs, port-forward, CRDs, plugins. Requires kubeconfig on the machine. DockLog vs K9s compares when a web UI with RBAC beats a local binary.
K9s is hard to beat for the person who deploys. It is a poor fit for the frontend developer who needs staging logs but should never hold cluster-admin credentials.
stern
Multi-pod log tail from the terminal:
stern my-app -n staging
stern . -n staging --since 10m
Great for deploy day. Same SSH/kubeconfig assumptions as kubectl.
Pair stern with a namespace-scoped Role if you want developers to self-serve without full K9s access. Stern still needs credentials on the machine.
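A namespace-scoped Role for log reading might look like the sketch below. The role name log-reader, the binding name, and the grant_log_access helper are all hypothetical; the verbs are the read-only set a log tailer typically needs to list pods and stream their logs.

```shell
# Sketch: read-only log access for one user in one namespace.
# "log-reader" and the binding name are hypothetical; adjust to your naming.
grant_log_access() {
  ns="$1"
  user="$2"
  kubectl create role log-reader -n "$ns" \
    --verb=get,list,watch --resource=pods,pods/log
  kubectl create rolebinding "log-reader-$user" -n "$ns" \
    --role=log-reader --user="$user"
}
```

Usage would look like: grant_log_access staging alice. Because this creates a Role rather than a ClusterRole, the grant stops at the namespace boundary.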
Layer 3: In-cluster observability stacks
Prometheus + Grafana (+ Alertmanager)
The open-source metrics standard. Scrape targets through ServiceMonitors, chart RED metrics, page on burn rates. kube-prometheus-stack bundles a lot of YAML.
Worth it when:
- You have SLOs
- Multiple services need the same dashboards
- Someone will maintain Prometheus upgrades
Heavy for "three microservices on k3s." Monitoring tools roundup covers when to stay lighter.
On a 2 GB node, skip the full stack. On an 8 GB homelab with five services you care about, a minimal Prometheus plus one Grafana dashboard for HTTP error rate is reasonable.
Loki (+ Grafana)
Log aggregation aligned with Prometheus labels. Powerful, ops-heavy. DockLog vs Grafana/Loki explains pairing live tail with Loki instead of replacing one with the other.
Install Loki when tickets say "find this trace ID from last Tuesday across all pods." Until then, tailing is enough.
Datadog / New Relic / Grafana Cloud
Agents, hosted backends, credit card. Fastest path if budget exists and you do not want to run Thanos.
Reasonable when on-call rotation, mobile escalation, and APM are day-one requirements. Overkill when the cluster is a staging k3s on a NUC.
Log tailing without a logging stack
Many teams reach for Loki on day one because logs feel urgent. You can defer that if the real need is "see staging pod output from a browser."
DockLog mounts kubeconfig or runs in-cluster, scopes users by namespace pattern, and tails the same way it does for Docker. K8s log tailing guide has compose and ingress notes.
When to use DockLog vs K9s vs stern:
| Situation | Tool |
|---|---|
| Platform engineer at laptop | K9s or stern |
| Developer, one namespace, no kubeconfig | DockLog with RBAC |
| On-call from phone | DockLog native apps |
| Search last month across 40 services | Loki or SaaS |
Namespace RBAC in practice
A developer who owns staging should see staging-* pods, not prod-*. DockLog allowed_containers patterns map to namespace prefixes. Full patterns in the RBAC guide.
Platform engineers keep kubeconfig. Everyone else gets a DockLog login. That split prevents the "we gave them cluster-admin because logs were hard" anti-pattern.
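On the Kubernetes side, the boundary can be spot-checked with kubectl auth can-i. The sketch below checks cluster RBAC only, not DockLog's own patterns; the helper names are hypothetical, and the caller needs permission to impersonate users for --as to work.

```shell
# Sketch: confirm a user can read pod logs in one namespace but not another.
# Relies on kubectl auth can-i exiting non-zero when the answer is "no".
can_read_logs() {
  kubectl auth can-i get pods --subresource=log -n "$2" --as "$1" >/dev/null 2>&1
}

check_log_boundary() {
  user="$1"
  allowed_ns="$2"
  denied_ns="$3"
  if can_read_logs "$user" "$allowed_ns" && ! can_read_logs "$user" "$denied_ns"; then
    echo "ok: $user reads $allowed_ns but not $denied_ns"
  else
    echo "check RBAC: $user boundary between $allowed_ns and $denied_ns is wrong"
    return 1
  fi
}
```

Usage would look like: check_log_boundary alice staging prod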
Events and alerts people forget
Kubernetes emits Warning events (failed scheduling, backoff, eviction). They are not container stdout. DockLog can surface K8s warning events and route alerts alongside log rules. Useful when the pod never started and there is nothing to tail.
Common event-driven failures with no application logs:
- FailedScheduling because CPU requests exceed node capacity
- FailedMount because a PVC or secret is missing
- ImagePullBackOff because registry credentials expired
- Evicted because the node ran out of ephemeral storage
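With plain kubectl, the failures listed above can be pulled out of the event stream. A minimal sketch, with hypothetical helper names; the grep matches anywhere on the event line, so it catches these strings in either the reason or the message column.

```shell
# Sketch: show only Warning events in a namespace, sorted by last timestamp.
warning_events() {
  kubectl get events -n "$1" --field-selector type=Warning --sort-by=.lastTimestamp
}

# Narrow event lines on stdin to the startup failures listed above.
startup_failures() {
  grep -E 'FailedScheduling|FailedMount|ImagePullBackOff|Evicted'
}
```

Usage would look like: warning_events staging | startup_failures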
For HTTP uptime, something external (Uptime Kuma, Better Stack, Pingdom) still matters. In-cluster health checks do not see DNS or CDN issues. Self-hosted monitoring covers pairing external checks with in-cluster tail.
Resource limits: the silent monitoring layer
kubectl top shows usage. Limits and requests define what the scheduler believes. A pod can show "Running" and still be unhealthy because it keeps hitting its memory limit and restarting.
Checklist:
- Set requests so the scheduler can place pods honestly
- Set limits so one leaky pod cannot take the node down
- Alert on restart count, not just pod phase
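The restart-count item can be checked without any alerting stack. The sketch below is a hypothetical filter over kubectl get pods output for a single namespace; the threshold of 3 is an arbitrary example, and with -A the columns shift by one because the namespace comes first.

```shell
# Sketch: flag pods whose restart count exceeds a threshold.
# Reads `kubectl get pods -n <namespace> --no-headers` output on stdin:
# NAME READY STATUS RESTARTS AGE
restarting_pods() {
  max="${1:-3}"
  awk -v max="$max" '$4 + 0 > max { print $1, $3, "restarts=" ($4 + 0) }'
}
```

Usage would look like: kubectl get pods -n staging --no-headers | restarting_pods 3. A pod that prints here while showing Running is the case the pod phase alone would miss.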
Docker health checks principles transfer to liveness and readiness probes. Cheap liveness, stricter readiness, do not hammer dependencies on every probe.
A sane stack for a small cluster
Homelab / staging
- metrics-server
- DockLog or K9s for daily logs
- One external uptime check on the ingress URL
Small production
- Everything above
- Prometheus + Grafana OR managed metrics
- Loki or log SaaS when search tickets appear weekly
- RBAC and audit on anything with cluster credentials (RBAC guide)
Mistakes we see
- Installing kube-prometheus-stack "for logs" (it is metrics-first; add Loki separately)
- Giving every developer cluster-admin because logs are hard
- Tail-only tools on prod without disk rotation on noisy pods (fix the app or limit log driver size on nodes)
- Ignoring control plane and etcd on self-managed clusters (k3s hides some of this; bring-your-own-k8s does not)
- Treating Running as healthy without checking restart count
- Logging everything at DEBUG in prod because "we might need it" (disk and viewer noise)
Day-one install order
If you are standing up a new k3s cluster this weekend:
- metrics-server (15 minutes)
- Ingress with TLS (Caddy or nginx, reverse proxy guide)
- DockLog in-cluster or with kubeconfig on a trusted host
- One Slack alert for pod BackOff or a log pattern you already grep
- Defer Prometheus until you have a metric you would chart twice
Related reading
- Self-hosted container monitoring
- Docker health checks and monitoring
- Docker log management
- Production reverse proxy
- Native apps for on-call
Kubernetes monitoring tools are not one product. Start with visibility you will use tomorrow, then add Prometheus when metrics drive decisions, and Loki when search drives incidents.