Observability Hub
Observability Hub is a comprehensive platform for managing Kubernetes environments with full observability and operational memory.
Observability Hub
What is this?
Observability Hub is an end-to-end infrastructure platform for a self-hosted Kubernetes environment.
The project is designed from source of truth to runtime operations: infrastructure definition, deployment automation, runtime observability, incident diagnosis, safe remediation, and operational memory.
Git and infrastructure definitions describe the intended state. Host and cluster runtimes execute that state. Telemetry systems expose behavior. MCP tools and dashboards support diagnosis. Remediation flows apply controlled fixes. ADRs, RCAs, and notes preserve what was learned.
- provision infrastructure declaratively
- deploy services through GitOps
- collect logs, metrics, traces, and network signals
- diagnose failures with dashboards, runbooks, and MCP tools
- analyze resource utilization, capacity pressure, and efficiency trends
- remediate safely through bounded operational paths
- preserve decisions and incidents as operational memory
The core loop is:
flowchart TB
Source["Source of Truth<br/>Git, OpenTofu, Kustomize, systemd"]
Runtime["Runtime<br/>K3s, host services, databases"]
Signals["Signals<br/>OTel, Prometheus, Loki, Tempo, Hubble"]
Decisions["Decisions<br/>Grafana, MCP tools, runbooks"]
Actions["Actions<br/>GitOps sync, pod repair, service restart"]
Memory["Memory<br/>ADRs, RCAs, notes, workflows"]
Source --> Runtime
Runtime --> Signals
Signals --> Decisions
Decisions --> Actions
Actions --> Source
Decisions --> Memory
Memory --> Source
🔍 What This Builds (Quick Proof)
- Kubernetes (K3s) homelab running 10+ platform components
- GitOps deployment using Argo CD (App-of-Apps pattern)
- Full observability: logs, metrics, traces (OpenTelemetry + Grafana stack)
- Agent-readable operations through MCP tools for telemetry, pods, host health, and network flows
- High-availability PostgreSQL with automated failover (CloudNativePG)
- Centralized dashboards for monitoring and debugging
- Secrets management without hardcoding credentials
- Trivy-backed container and Kubernetes manifest hardening
- Infrastructure as Code using OpenTofu (layered architecture)
- Resource and capacity analysis using Kubernetes, host, and telemetry signals
- Data ingestion pipeline with worker-based processing
- eBPF-based networking and visibility using Cilium
- Backup and storage integration with Azure Blob Storage + MinIO
📦 Platform Projects
This platform is built as connected ownership domains:
| Domain | What It Proves |
|---|---|
| GitOps Deployment | Declarative cluster management with Argo CD self-healing |
| Observability Stack | Prometheus, Grafana, Loki, Tempo dashboards and alerts |
| Telemetry Pipeline | OpenTelemetry logs, metrics, and traces across services |
| High Availability Database | PostgreSQL failover with Azure Blob Storage backups |
| Secrets Management | Dynamic secrets and policy management with OpenBao |
| Workload Security | Trivy-scanned Dockerfiles and Kubernetes security contexts |
| Infrastructure as Code | Layered OpenTofu architecture for infrastructure ownership |
| Networking | Cilium eBPF visibility, policy control, and flow debugging |
| CI/CD | GitHub Actions, image publication, and GitOps reconciliation |
| Incident Response | Diagnostics, bounded repair actions, RCAs, and runbooks |
| Resource Efficiency | Kubernetes and host telemetry used for capacity and cost-aware analysis |
| Data Ingestion | Worker-based batch processing and analytics jobs |
🧠 Problems Solved
| Problem | Solution |
|---|---|
| Manual deployments | GitOps automation with Argo CD and webhook-triggered reconciliation |
| No visibility into systems | Logs, metrics, traces, network flows, and Grafana dashboards |
| Secrets stored in code | Dynamic secret management with OpenBao |
| Containers running with weak defaults | Non-root images, read-only root filesystems, dropped capabilities, and Trivy scans |
| Single point of failure | High-availability PostgreSQL and backup paths |
| Hard-to-debug issues | MCP diagnostics, dashboards, runbooks, and incident reports |
| Infrastructure drift | Declarative source of truth with OpenTofu, Kustomize, and GitOps |
| Unclear resource pressure | Kubernetes, host, and workload telemetry correlated for capacity decisions |
| Operational knowledge loss | Versioned ADRs, RCAs, notes, and workflow docs |
Documentation Map
| Area | Purpose |
|---|---|
| Full Documentation | Central docs index |
| Architecture | System design and service boundaries |
| Ownership Model | End-to-end operating model for the platform |
| ADRs | Architecture decisions and tradeoffs |
| RCAs | Incidents, failures, and recovery notes |
| Operations Notes | Runbooks and implementation notes |
| Workflows | CI/CD and GitOps workflow reality |
| Visual Gallery | Dashboards and platform screenshots |
🛠️ Tech Stack
Platform & Infrastructure
- Kubernetes (K3s), Helm, Docker
- Argo CD (GitOps)
- OpenTofu (Terraform alternative)
Observability
- OpenTelemetry
- Prometheus, Grafana
- Loki (logs), Tempo (traces), Thanos (metrics scaling)
Data & Storage
- PostgreSQL (CloudNativePG)
- MinIO (S3-compatible)
- Azure Blob Storage
Networking & Security
- Cilium (eBPF networking)
- OpenBao (Secrets Management)
- Tailscale
- Trivy (container and Kubernetes misconfiguration scanning)
Languages
- Go (backend services)
⚠️ Challenges
One challenge was debugging service communication with Cilium networking.
- Problem: Services were unreachable even though pods were running
- Cause: Incorrect network policies blocking traffic
- Fix: Used logs and metrics to identify dropped packets and corrected policies
🚀 Project Evolution
This platform evolved through multiple phases:
- Foundations: Docker, Go services, host-level visibility
- Kubernetes Migration: Moved workloads to K3s + GitOps
- SRE Maturity: Full observability (logs, metrics, traces)
- Infrastructure: OpenTofu layered architecture
- Advanced Networking: Cilium (eBPF)
- Operational Maturity: Argo CD orchestration + HA systems
🚀 Getting Started
<details> <summary><b>Local Setup</b></summary>Prerequisites
- Go
- K3s
- Helm
- Make
- Nix
Setup
cp .env.example .env
Deploy Infrastructure
cd tofu
tofu init
tofu apply
Run Services
make proxy-build
make mcp-build
make install-services
Verify
- Grafana: http://localhost:30000
- Check logs via Loki
📌 Summary
This project demonstrates how to build a production-like DevOps platform using:
- Kubernetes + GitOps
- Full observability (logs, metrics, traces)
- Infrastructure as Code
- Trivy-verified workload hardening
- High availability systems
- Capacity and cost-aware infrastructure analysis
- Real-world debugging and failure handling
It reflects practical infrastructure ownership: designing the system, running it, observing it, debugging it, and using telemetry to make better operational and cost-aware decisions.