Observability Hub

Observability Hub is a comprehensive platform for managing Kubernetes environments with full observability and operational memory.

What is this?

Observability Hub is an end-to-end infrastructure platform for a self-hosted Kubernetes environment.

The project covers the full path from source of truth to runtime operations: infrastructure definition, deployment automation, runtime observability, incident diagnosis, safe remediation, and operational memory.

Git and infrastructure definitions describe the intended state. Host and cluster runtimes execute that state. Telemetry systems expose behavior. MCP tools and dashboards support diagnosis. Remediation flows apply controlled fixes. ADRs, RCAs, and notes preserve what was learned.

In practice, the platform is built to:

  • provision infrastructure declaratively
  • deploy services through GitOps
  • collect logs, metrics, traces, and network signals
  • diagnose failures with dashboards, runbooks, and MCP tools
  • analyze resource utilization, capacity pressure, and efficiency trends
  • remediate safely through bounded operational paths
  • preserve decisions and incidents as operational memory

The core loop is:

flowchart TB
    Source["Source of Truth<br/>Git, OpenTofu, Kustomize, systemd"]
    Runtime["Runtime<br/>K3s, host services, databases"]
    Signals["Signals<br/>OTel, Prometheus, Loki, Tempo, Hubble"]
    Decisions["Decisions<br/>Grafana, MCP tools, runbooks"]
    Actions["Actions<br/>GitOps sync, pod repair, service restart"]
    Memory["Memory<br/>ADRs, RCAs, notes, workflows"]

    Source --> Runtime
    Runtime --> Signals
    Signals --> Decisions
    Decisions --> Actions
    Actions --> Source
    Decisions --> Memory
    Memory --> Source
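
The first edge of the loop (Source to Runtime) can be read as a drift check: compare the state declared in Git with the state observed at runtime, then derive the actions needed to converge. The Go sketch below is illustrative only. `State`, `Action`, and `PlanActions` are hypothetical names, not part of this repository or of Argo CD, and versions are assumed to be simple strings such as image tags.

```go
package main

import "sort"

// State maps a workload name to its version (for example an image tag).
// Hypothetical type, used only for illustration.
type State map[string]string

// Action is one step needed to converge the runtime on the declared state.
type Action struct {
	Kind     string // "create", "update", or "delete"
	Workload string
	Version  string // target version; empty for "delete"
}

// PlanActions compares desired and observed state and returns the actions
// that would make observed match desired, sorted by workload name.
func PlanActions(desired, observed State) []Action {
	var actions []Action
	for name, want := range desired {
		have, ok := observed[name]
		switch {
		case !ok:
			// Declared in the source of truth but missing at runtime.
			actions = append(actions, Action{Kind: "create", Workload: name, Version: want})
		case have != want:
			// Running, but at a different version than declared.
			actions = append(actions, Action{Kind: "update", Workload: name, Version: want})
		}
	}
	for name := range observed {
		if _, ok := desired[name]; !ok {
			// Running but no longer declared.
			actions = append(actions, Action{Kind: "delete", Workload: name})
		}
	}
	sort.Slice(actions, func(i, j int) bool {
		return actions[i].Workload < actions[j].Workload
	})
	return actions
}
```

Calling `PlanActions` with a desired state of `proxy: v2, mcp: v1` and an observed state of `proxy: v1, worker: v1` yields three actions: create `mcp`, update `proxy`, and delete `worker`. An empty result means the runtime already matches the source of truth.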

🔍 What This Builds (Quick Proof)

  • Kubernetes (K3s) homelab running 10+ platform components
  • GitOps deployment using Argo CD (App-of-Apps pattern)
  • Full observability: logs, metrics, traces (OpenTelemetry + Grafana stack)
  • Agent-readable operations through MCP tools for telemetry, pods, host health, and network flows
  • High-availability PostgreSQL with automated failover (CloudNativePG)
  • Centralized dashboards for monitoring and debugging
  • Secrets management without hardcoding credentials
  • Trivy-backed container and Kubernetes manifest hardening
  • Infrastructure as Code using OpenTofu (layered architecture)
  • Resource and capacity analysis using Kubernetes, host, and telemetry signals
  • Data ingestion pipeline with worker-based processing
  • eBPF-based networking and visibility using Cilium
  • Backup and storage integration with Azure Blob Storage + MinIO
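
The resource and capacity analysis item above comes down to comparing what workloads request against what a node can provide. The following is a minimal Go sketch of that comparison, assuming CPU is expressed in millicores. The `Node` type, the `Utilization` and `UnderPressure` functions, and any threshold passed in are illustrative assumptions, not values or code taken from this project.

```go
package main

// Node holds a node's allocatable CPU and the sum of pod CPU requests,
// both in millicores. Hypothetical type, used only for illustration.
type Node struct {
	Name             string
	AllocatableMilli int64
	RequestedMilli   int64
}

// Utilization returns requested/allocatable as a fraction.
// It returns 0 for a node that reports no allocatable CPU.
func (n Node) Utilization() float64 {
	if n.AllocatableMilli <= 0 {
		return 0
	}
	return float64(n.RequestedMilli) / float64(n.AllocatableMilli)
}

// UnderPressure returns the names of nodes whose utilization is at or
// above the given threshold (for example 0.8 for 80%).
func UnderPressure(nodes []Node, threshold float64) []string {
	var names []string
	for _, n := range nodes {
		if n.Utilization() >= threshold {
			names = append(names, n.Name)
		}
	}
	return names
}
```

In a real setup the inputs would come from cluster and host telemetry rather than hand-built structs; the sketch only shows the shape of the calculation.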

📦 Platform Projects

This platform is built as connected ownership domains:

| Domain | What It Proves |
| --- | --- |
| GitOps Deployment | Declarative cluster management with Argo CD self-healing |
| Observability Stack | Prometheus, Grafana, Loki, Tempo dashboards and alerts |
| Telemetry Pipeline | OpenTelemetry logs, metrics, and traces across services |
| High Availability Database | PostgreSQL failover with Azure Blob Storage backups |
| Secrets Management | Dynamic secrets and policy management with OpenBao |
| Workload Security | Trivy-scanned Dockerfiles and Kubernetes security contexts |
| Infrastructure as Code | Layered OpenTofu architecture for infrastructure ownership |
| Networking | Cilium eBPF visibility, policy control, and flow debugging |
| CI/CD | GitHub Actions, image publication, and GitOps reconciliation |
| Incident Response | Diagnostics, bounded repair actions, RCAs, and runbooks |
| Resource Efficiency | Kubernetes and host telemetry used for capacity and cost-aware analysis |
| Data Ingestion | Worker-based batch processing and analytics jobs |
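
As an illustration of what "bounded repair actions" can mean in the Incident Response domain, the Go sketch below permits only allowlisted actions and caps how many may run. It is a sketch under stated assumptions, not this project's remediation code: `Remediator`, `NewRemediator`, `Run`, the two error values, and the action names are all hypothetical.

```go
package main

import "errors"

// Errors returned when a repair is refused. Hypothetical names.
var (
	ErrNotAllowed      = errors.New("action is not on the allowlist")
	ErrBudgetExhausted = errors.New("remediation budget exhausted")
)

// Remediator permits only allowlisted actions and limits how many may run.
type Remediator struct {
	allowed   map[string]bool
	remaining int
}

// NewRemediator builds a Remediator that may run at most budget actions,
// chosen from the given allowlist.
func NewRemediator(budget int, actions ...string) *Remediator {
	allowed := make(map[string]bool, len(actions))
	for _, a := range actions {
		allowed[a] = true
	}
	return &Remediator{allowed: allowed, remaining: budget}
}

// Run executes fn only if the action is allowlisted and budget remains.
// The budget is spent on every attempt, whether or not fn succeeds, so a
// failing repair cannot be retried without limit.
func (r *Remediator) Run(action string, fn func() error) error {
	if !r.allowed[action] {
		return ErrNotAllowed
	}
	if r.remaining <= 0 {
		return ErrBudgetExhausted
	}
	r.remaining--
	return fn()
}
```

With `NewRemediator(1, "restart-service")`, the first `Run("restart-service", ...)` executes, a second call returns `ErrBudgetExhausted`, and any action outside the allowlist returns `ErrNotAllowed` without running.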

🧠 Problems Solved

| Problem | Solution |
| --- | --- |
| Manual deployments | GitOps automation with Argo CD and webhook-triggered reconciliation |
| No visibility into systems | Logs, metrics, traces, network flows, and Grafana dashboards |
| Secrets stored in code | Dynamic secret management with OpenBao |
| Containers running with weak defaults | Non-root images, read-only root filesystems, dropped capabilities, and Trivy scans |
| Single point of failure | High-availability PostgreSQL and backup paths |
| Hard-to-debug issues | MCP diagnostics, dashboards, runbooks, and incident reports |
| Infrastructure drift | Declarative source of truth with OpenTofu, Kustomize, and GitOps |
| Unclear resource pressure | Kubernetes, host, and workload telemetry correlated for capacity decisions |
| Operational knowledge loss | Versioned ADRs, RCAs, notes, and workflow docs |
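
The "weak defaults" row names three concrete hardening properties: non-root execution, a read-only root filesystem, and dropped capabilities. The Go sketch below shows how such a check could look. It is an assumption-laden illustration: `SecurityContext` is a simplified local stand-in rather than the Kubernetes API type, `Findings` is a hypothetical helper, and none of this reflects how Trivy itself works.

```go
package main

// SecurityContext is a simplified, hypothetical stand-in for the container
// security settings named in the table; it is not the Kubernetes API type.
type SecurityContext struct {
	RunAsNonRoot           bool
	ReadOnlyRootFilesystem bool
	DroppedCapabilities    []string
}

// Findings lists the hardening rules the context violates.
// An empty result means all three checks passed.
func Findings(sc SecurityContext) []string {
	var out []string
	if !sc.RunAsNonRoot {
		out = append(out, "container may run as root")
	}
	if !sc.ReadOnlyRootFilesystem {
		out = append(out, "root filesystem is writable")
	}
	// Treat capabilities as dropped only when "ALL" is listed.
	dropsAll := false
	for _, c := range sc.DroppedCapabilities {
		if c == "ALL" {
			dropsAll = true
		}
	}
	if !dropsAll {
		out = append(out, "capabilities are not dropped")
	}
	return out
}
```

A context with all three settings in place produces no findings; a context that only sets `RunAsNonRoot` produces two.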

Documentation Map

| Area | Purpose |
| --- | --- |
| Full Documentation | Central docs index |
| Architecture | System design and service boundaries |
| Ownership Model | End-to-end operating model for the platform |
| ADRs | Architecture decisions and tradeoffs |
| RCAs | Incidents, failures, and recovery notes |
| Operations Notes | Runbooks and implementation notes |
| Workflows | CI/CD and GitOps workflow reality |
| Visual Gallery | Dashboards and platform screenshots |

🛠️ Tech Stack

Platform & Infrastructure

  • Kubernetes (K3s), Helm, Docker
  • Argo CD (GitOps)
  • OpenTofu (Terraform alternative)

Observability

  • OpenTelemetry
  • Prometheus, Grafana
  • Loki (logs), Tempo (traces), Thanos (metrics scaling)

Data & Storage

  • PostgreSQL (CloudNativePG)
  • MinIO (S3-compatible)
  • Azure Blob Storage

Networking & Security

  • Cilium (eBPF networking)
  • OpenBao (Secrets Management)
  • Tailscale
  • Trivy (container and Kubernetes misconfiguration scanning)

Languages

  • Go (backend services)

⚠️ Challenges

One challenge was debugging service communication with Cilium networking.

  • Problem: Services were unreachable even though pods were running
  • Cause: Incorrect network policies blocking traffic
  • Fix: Used logs and metrics to identify the dropped packets, then corrected the policies
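
The diagnosis described above amounts to grouping dropped traffic by source and destination so that the blocking policy stands out. The Go sketch below shows that grouping under simplifying assumptions: `Flow`, `PairCount`, and `TopDrops` are hypothetical and do not match the Cilium or Hubble APIs, and a real flow record carries far more detail than a dropped flag.

```go
package main

import "sort"

// Flow is a hypothetical, simplified flow record.
type Flow struct {
	Source      string
	Destination string
	Dropped     bool
}

// PairCount is the number of dropped flows between one source and destination.
type PairCount struct {
	Source      string
	Destination string
	Drops       int
}

// TopDrops counts dropped flows per (source, destination) pair and returns
// the pairs ordered by drop count, highest first. Ties are ordered by
// source name, then destination name, so the output is deterministic.
func TopDrops(flows []Flow) []PairCount {
	counts := map[[2]string]int{}
	for _, f := range flows {
		if f.Dropped {
			counts[[2]string{f.Source, f.Destination}]++
		}
	}
	out := make([]PairCount, 0, len(counts))
	for pair, n := range counts {
		out = append(out, PairCount{Source: pair[0], Destination: pair[1], Drops: n})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Drops != out[j].Drops {
			return out[i].Drops > out[j].Drops
		}
		if out[i].Source != out[j].Source {
			return out[i].Source < out[j].Source
		}
		return out[i].Destination < out[j].Destination
	})
	return out
}
```

The pair at the top of the result is the first place to look for a policy that denies traffic it should allow.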

🚀 Project Evolution

This platform evolved through multiple phases:

  • Foundations: Docker, Go services, host-level visibility
  • Kubernetes Migration: Moved workloads to K3s + GitOps
  • SRE Maturity: Full observability (logs, metrics, traces)
  • Infrastructure: OpenTofu layered architecture
  • Advanced Networking: Cilium (eBPF)
  • Operational Maturity: Argo CD orchestration + HA systems

👉 View Full Evolution Log


🚀 Getting Started

<details> <summary><b>Local Setup</b></summary>

Prerequisites

  • Go
  • K3s
  • Helm
  • Make
  • Nix

Setup

cp .env.example .env

Deploy Infrastructure

cd tofu
tofu init
tofu apply

Run Services

make proxy-build
make mcp-build
make install-services

Verify

</details>

📌 Summary

This project demonstrates how to build a production-like DevOps platform using:

  • Kubernetes + GitOps
  • Full observability (logs, metrics, traces)
  • Infrastructure as Code
  • Trivy-verified workload hardening
  • High availability systems
  • Capacity and cost-aware infrastructure analysis
  • Real-world debugging and failure handling

It reflects practical infrastructure ownership: designing the system, running it, observing it, debugging it, and using telemetry to make better operational and cost-aware decisions.