The goal: fault-tolerant infrastructure I can rebuild from a single nix build.
The homelab is a real production environment, just for one person: a Proxmox cluster on bare metal, UniFi networking, Splunk indexers, Cribl Edge collectors, Home Assistant, and a docker-host VM for the necessary evil of vendor-locked containers.

Hardware footprint

| Layer | What’s there | Notes |
| --- | --- | --- |
| Compute | Custom Proxmox host + two Dell PowerEdge servers (R410, R710) joining the cluster | Heterogeneous mix: single-engineer homelab, parts opportunistically combined |
| Local LLM | Dedicated bare-metal NixOS box (Ryzen 9 + ROCm-capable GPU, ~12 TB RAIDZ1 model library) | Outside the Proxmox cluster; GPU-bound workload kept off the hypervisor to avoid passthrough overhead |
| Storage | ZFS on Proxmox hosts; SAS backplane on R710 for cluster bulk storage; NVMe for hot tiers | Mixed-tier by accident, kept by design: bulk on SAS, working sets on NVMe |
| Networking | UniFi end-to-end: gateway, switches (with 10G SFP+ uplinks), APs | Single-pane management; 10G fiber backbone where it matters |
| Power | Rack UPS (Eaton 5P750R) for servers; separate UPS for the Home Assistant Pi | Active NUT monitoring planned once the LLM box is built |
| Rack management | Raspberry Pi running Home Assistant; iDRAC vKVM jump VM in cluster for Java Web Start console access | Old BMC firmware needs a Java Web Start client; the jump VM keeps Java off the laptop |

Network topology

Solid green edges are physical / network. WireGuard tunnels traverse the Internet → UniFi edge. The UniFi gateway is the centre of the LAN; Proxmox, personal devices, and the bare-metal LLM box all hang off it.

Data flow

Coral dashed edges are telemetry; the amber dotted edge to AWS is the disaster-recovery replication path. Same nodes as the network diagram, different concern — split per Rule 1.

Container philosophy

LXC by default. Native packages where possible. Docker is the exception — high-volume network traffic must never cross Docker’s virtualized networking. The decision tree:
  1. Vendor ships Docker-only image with no native path → Docker on the dedicated docker-host VM. Documented exception at the top of the repo’s CLAUDE.md.
  2. Single binary or native package → LXC + Ansible role.
  3. CI/automation → Docker on the docker-host VM, isolated ci_runners network.
  4. Dev / test → Docker on the docker-host VM, Swarm overlay.
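
The decision tree above can be sketched as a small function. This is a minimal illustration, not the repo's actual tooling: the `Workload` fields, the `Placement` names, and the example workloads are all hypothetical, and the rules are simply checked in the order they are listed.

```python
from dataclasses import dataclass
from enum import Enum


class Placement(Enum):
    LXC = "LXC + Ansible role"
    DOCKER_VENDOR = "Docker on docker-host VM (documented exception)"
    DOCKER_CI = "Docker on docker-host VM, isolated ci_runners network"
    DOCKER_DEV = "Docker on docker-host VM, Swarm overlay"


@dataclass(frozen=True)
class Workload:
    # All field names are hypothetical, chosen for this sketch.
    name: str
    docker_only: bool = False     # vendor ships only a Docker image, no native path
    native_package: bool = False  # a single binary or native package exists
    purpose: str = "service"      # "service", "ci", or "dev"


def place(w: Workload) -> Placement:
    """Walk the four rules in the order they are listed."""
    if w.docker_only:
        return Placement.DOCKER_VENDOR
    if w.native_package:
        return Placement.LXC
    if w.purpose == "ci":
        return Placement.DOCKER_CI
    if w.purpose == "dev":
        return Placement.DOCKER_DEV
    # Nothing matched: fall back to the stated default.
    return Placement.LXC


if __name__ == "__main__":
    examples = (
        Workload("haproxy", native_package=True),
        Workload("vendor-appliance", docker_only=True),  # hypothetical workload
        Workload("ci-job", purpose="ci"),
    )
    for w in examples:
        print(f"{w.name}: {place(w).value}")
```

The sketch deliberately leaves out the network-volume rule and any per-workload judgment calls; those would need extra inputs that the four listed rules do not spell out.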

What runs where

| Workload | Where | Why |
| --- | --- | --- |
| Proxmox host | Bare metal | Hypervisor |
| HAProxy | LXC | Lightweight, native systemd unit |
| Cribl Edge | LXC | Native package, network-heavy |
| Splunk Enterprise | Bare-metal-ish VM | Vendor-only Docker option ruled out for network volume |
| Home Assistant | LXC | Native install via supervised path |
| docker-host | VM | Isolated landing pad for vendor Docker images |
| GitHub Actions runners | Docker on docker-host VM + dedicated runner on LLM box | Ephemeral container-per-job, isolated ci_runners network; the LLM-box runner handles workflows that need live access to homelab infrastructure |
| Qdrant (vector DB) | LXC (nesting) | Vendor Docker image, lightweight, RAG workload |
| Local LLM inference | Bare-metal NixOS | GPU-bound; kept off Proxmox to avoid passthrough overhead and to run whatever OS gives the fastest ROCm path |

Provisioning + configuration

terraform-proxmox builds the VMs and LXCs. ansible-proxmox configures the host. ansible-proxmox-apps configures everything on top. For the rationale on LXC defaults vs the Docker exception, see LXC vs Docker; for the macOS counterpart that runs the monitoring stack on Kubernetes, see Kubernetes overview and orbstack-kubernetes.
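
The layering of the three repositories can be sketched as an ordered, dry-run plan. The repository names come from the text; everything else is an assumption for illustration, including the checkout layout under `root`, the `inventory` paths, and the `site.yml` playbook names. The script only prints the commands and never executes them.

```python
import shlex
from typing import NamedTuple


class Stage(NamedTuple):
    repo: str        # repository named in the text
    purpose: str     # what this layer is responsible for
    argv: list[str]  # command line; paths and playbook names are hypothetical


def provisioning_plan(root: str = ".") -> list[Stage]:
    """Return the three layers in the order they build on each other."""
    return [
        Stage(
            "terraform-proxmox",
            "build the VMs and LXCs",
            ["terraform", f"-chdir={root}/terraform-proxmox", "apply"],
        ),
        Stage(
            "ansible-proxmox",
            "configure the Proxmox host",
            ["ansible-playbook", "-i", f"{root}/ansible-proxmox/inventory",
             f"{root}/ansible-proxmox/site.yml"],
        ),
        Stage(
            "ansible-proxmox-apps",
            "configure everything on top",
            ["ansible-playbook", "-i", f"{root}/ansible-proxmox-apps/inventory",
             f"{root}/ansible-proxmox-apps/site.yml"],
        ),
    ]


if __name__ == "__main__":
    # Dry run: show what would be executed, in order, without running it.
    for stage in provisioning_plan():
        print(f"# {stage.repo}: {stage.purpose}")
        print(shlex.join(stage.argv))
```

Keeping the plan as data rather than a shell script makes the ordering explicit and easy to check before anything touches the cluster.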

DR plan

terraform-aws defines a cold AWS footprint sized to take a Splunk failover. Cribl Edge routes can be flipped to the AWS HEC endpoint via config change; the AI-observability dashboards keep working because they target the same indexes. The full cross-stack map of every collector and where it runs lives at Monitoring agents.
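
As a rough illustration of the failover decision, the sketch below probes Splunk's HEC health endpoint (`/services/collector/health`) on the homelab indexer and reports which endpoint the Cribl Edge routes should target. Both hostnames are placeholders, and the actual flip remains a config change as described above; this only shows the selection logic, not how the routes are edited.

```python
from typing import Callable
from urllib.request import urlopen

# Placeholder endpoints; the real hostnames are not given in this document.
HOMELAB_HEC = "https://splunk.homelab.example:8088"
AWS_HEC = "https://splunk-dr.aws.example:8088"


def hec_is_healthy(base_url: str, timeout: float = 3.0) -> bool:
    """Return True if the HEC health endpoint answers with HTTP 200."""
    try:
        with urlopen(f"{base_url}/services/collector/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection failures, timeouts, TLS errors, and HTTP error codes.
        return False


def choose_hec_target(probe: Callable[[str], bool] = hec_is_healthy) -> str:
    """Prefer the homelab indexer; fall back to the cold AWS footprint."""
    return HOMELAB_HEC if probe(HOMELAB_HEC) else AWS_HEC


if __name__ == "__main__":
    print(f"Cribl Edge routes should target: {choose_hec_target()}")
```

Passing the probe in as a parameter keeps the selection logic testable without a network; a self-signed certificate on the indexer would make the default probe report unhealthy unless the trust store is set up accordingly.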