1. Introduction: The Paradigm Shift to Localized Intelligence

The computational landscape of Artificial Intelligence (AI) is undergoing a significant bifurcation. While hyperscalers continue to train trillion-parameter models in massive data centers, a concurrent revolution is occurring at the edge. The capability to run high-fidelity Large Language Models (LLMs) on consumer-grade hardware has transitioned from a theoretical possibility to a practical reality, driven by advances in quantization techniques and efficient inference runtimes. This report details the architectural design and implementation of a robust, self-hosted AI platform that mirrors the functionality of commercial APIs while retaining absolute data sovereignty.

The objective of this infrastructure is to provide a unified, internet-accessible interface for local LLMs. This system must serve two distinct consumer types: human users interacting via a rich chat interface (Open WebUI), and programmatic clients consuming an OpenAI-compatible API (LM Studio). To achieve this while mitigating the severe security risks associated with exposing local hardware to the public internet, we employ a zero-trust networking architecture.

This report serves as a definitive implementation guide for DevOps engineers and system architects. It integrates LM Studio as the high-performance inference engine, Docker for containerized application management, Tailscale for secure, NAT-traversing networking, and Caddy as an intelligent reverse proxy and security gateway. By the conclusion of this document, the reader will possess a production-ready blueprint for a "Personal AI Cloud" that is secure, scalable, and accessible from anywhere in the world.

1.1 Architectural Overview

The proposed stack is composed of four distinct layers, each selected for its specific strengths in the modern DevOps ecosystem:

  • Network Overlay: Tailscale (Mesh VPN & Ingress). Eliminates the need for dangerous router port forwarding; provides "Funnel" capabilities for public ingress; manages TLS certificates via MagicDNS.1

  • Security Gateway: Caddy (Reverse Proxy). Native integration with Tailscale for automatic HTTPS; simplifies complex routing logic (API vs. UI); enforces authentication barriers.3

  • Inference Runtime: LM Studio (Model Server). Superior support for GGUF quantization; robust "headless" CLI operation; creates an OpenAI-compliant API endpoint for universal compatibility.5

  • Presentation: Open WebUI (Chat Interface). A feature-rich, containerized UI that mimics commercial chat experiences; supports RAG (Retrieval Augmented Generation) and multi-model management.7

The interaction model is designed to minimize latency while maximizing security. External requests originating from the public internet are encrypted and routed via Tailscale's relay servers directly to the local host. Caddy intercepts these requests, terminating the TLS connection. Based on the request path, Caddy routes traffic either to the Docker container hosting Open WebUI or directly to the LM Studio API port, enforcing Basic Authentication for the latter to prevent unauthorized API consumption.

1.2 Hardware Requirements and Capacity Planning

Before deploying the software stack, one must validate the underlying hardware. Self-hosting LLMs is memory-bound rather than compute-bound for single-user scenarios.

VRAM Considerations:

The primary constraint is Video Random Access Memory (VRAM). The model weights must fit entirely within VRAM to achieve acceptable token generation speeds (tokens per second, or tok/s).

  • 7B - 8B Parameter Models (e.g., Llama 3, Mistral): require approximately 6-8 GB of VRAM at 4-bit quantization (Q4_K_M).

  • 14B - 20B Parameter Models: require 12-16 GB VRAM.

  • 70B Parameter Models: require 24 GB+ (often necessitating dual GPUs like RTX 3090/4090).

System Memory (RAM): While GPU offloading is preferred, LM Studio can offload layers to the system CPU and RAM if VRAM is exhausted. However, this incurs a significant performance penalty, often dropping speeds from 50+ tok/s to <5 tok/s. A minimum of 32 GB system RAM is recommended to support the operating system, Docker containers, and partial model offloading.9
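The sizing figures above follow from simple arithmetic: the weights occupy roughly (parameters × bits-per-weight / 8) bytes, plus headroom for the KV cache and runtime buffers. A minimal sketch of that back-of-envelope calculation (the 20% overhead factor and the 4.5 bits-per-weight average for Q4_K_M are illustrative assumptions, not measured constants):

```python
def estimate_vram_gb(params_billions: float, bits_per_weight: float,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate for a quantized model.

    weights_gb = params * bits / 8; `overhead` (assumed 20%) stands in
    for the KV cache and runtime buffers, and grows with context length.
    """
    weights_gb = params_billions * bits_per_weight / 8
    return weights_gb * overhead

# Q4_K_M averages roughly 4.5 bits per weight
print(f"8B  @ Q4_K_M: ~{estimate_vram_gb(8, 4.5):.1f} GB")
print(f"14B @ Q4_K_M: ~{estimate_vram_gb(14, 4.5):.1f} GB")
print(f"70B @ Q4_K_M: ~{estimate_vram_gb(70, 4.5):.1f} GB")
```

The results land broadly within the ranges quoted above; long contexts push the overhead factor higher.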

2. Infrastructure Foundation: Operating System and Networking

This section establishes the bedrock of our deployment. While LM Studio and Docker are cross-platform, Linux (specifically Ubuntu LTS) is the recommended host operating system for this architecture due to its superior handling of container networking and background service management.10

2.1 Host Environment Preparation

NVIDIA Driver Installation:

For LM Studio to utilize the GPU, the proprietary NVIDIA drivers must be installed. On a fresh Ubuntu installation:

Bash

sudo apt update && sudo apt upgrade -y
sudo apt install -y ubuntu-drivers-common
sudo ubuntu-drivers autoinstall

Verify the installation with nvidia-smi. The output must show the Driver Version and CUDA Version. Crucially, the CUDA version displayed here is the maximum supported version; LM Studio includes its own bundled CUDA runtime libraries, but they rely on the host driver being present and compatible.

2.2 Docker Engine and Compose

Open WebUI is distributed as a Docker image. Using Docker ensures that the complex Python dependencies and frontend frameworks required by the UI do not interfere with the host system.

Installation Strategy:

Do not use the docker.io package from standard repositories, as it is often outdated. Install from the official Docker repository to ensure support for the latest Compose features.

Bash

# Set up Docker's apt repository.
# Add Docker's official GPG key:
sudo apt-get update
sudo apt-get install ca-certificates curl
sudo install -m 0755 -d /etc/apt/keyrings
sudo curl -fsSL https://download.docker.com/linux/ubuntu/gpg -o /etc/apt/keyrings/docker.asc
sudo chmod a+r /etc/apt/keyrings/docker.asc

# Add the repository to Apt sources:
echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.asc] https://download.docker.com/linux/ubuntu \
  $(. /etc/os-release && echo "$VERSION_CODENAME") stable" | \
  sudo tee /etc/apt/sources.list.d/docker.list > /dev/null
sudo apt-get update

# Install Docker packages.
sudo apt-get install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin

User Permissions:

To avoid running Docker commands as root (which can create root-owned files in persistent volumes and cause permission problems), add the current user to the docker group:

Bash

sudo usermod -aG docker $USER
newgrp docker

2.3 Tailscale Network Layer

Tailscale creates a secure mesh network (Tailnet) that overlays the public internet. It allows devices to communicate as if they were on the same physical LAN, regardless of their actual location.

Why Tailscale? Traditional remote access requires opening ports (Port Forwarding) on the home router. This exposes the device to port scanners and botnets. Tailscale uses NAT traversal techniques (STUN/TURN) to punch through firewalls securely, establishing encrypted WireGuard tunnels.1

Installation:

Bash

curl -fsSL https://tailscale.com/install.sh | sh
sudo tailscale up

Upon running tailscale up, a URL is generated. Visit this URL to authenticate the machine and add it to your Tailnet.

DNS Configuration (MagicDNS): To use SSL certificates effectively, "MagicDNS" must be enabled in the Tailscale Admin Console. This assigns a stable hostname (e.g., gpu-server.tailnet-name.ts.net) to the machine, which is resolvable from anywhere inside the Tailnet.11
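Once MagicDNS is enabled, the node's fully qualified name can be confirmed from the CLI. The check below reads the DNSName field from tailscale status --json and falls back to a hint on machines where the CLI is not yet installed:

```shell
# Print this node's MagicDNS name if the tailscale CLI is present.
if command -v tailscale >/dev/null 2>&1; then
  DNS_NAME=$(tailscale status --json | grep -m1 '"DNSName"')
else
  DNS_NAME="tailscale CLI not installed on this machine"
fi
echo "$DNS_NAME"
```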

3. The Inference Runtime: Deep Dive into LM Studio

LM Studio has evolved from a simple GUI application into a robust server platform capable of headless operation via its Command Line Interface (CLI), lms. This section details the configuration of the inference engine.5

3.1 Installation and the lms CLI

While LM Studio offers an AppImage for Linux, the CLI workflow is preferred for server deployments. The CLI tool lms allows for scripting, automation, and service management.

Bootstrapping the CLI:

The lms CLI ships with the LM Studio application. With LM Studio installed, bootstrap the CLI onto your PATH with the documented helper command (Node.js is required for npx):

Bash

npx lmstudio install-cli

After installation, verify the binary is in your path:

Bash

lms --help

The output should list subcommands like server, ls, load, and get.13

3.2 Model Management: Selection and Quantization

The effectiveness of the entire stack depends on the model loaded. LM Studio natively supports the GGUF format, which is optimized for running on CPUs and consumer GPUs via Apple Metal or CUDA.

Searching and Downloading:

Use the lms get command to query the Hugging Face hub.

Bash

lms get llama-3-8b-instruct

The interface will present a list of available quantization levels.

  • Q8_0: Highest accuracy, largest size (near float16 performance).

  • Q4_K_M: The "sweet spot" for most users. Balanced perplexity (accuracy) and size.

  • Q2_K: Significant degradation in reasoning capabilities; not recommended unless hardware is severely constrained.

Listing Local Models:

Once downloaded, models reside in the internal cache. To see available models and—crucially—their "Model Key" or path:

Bash

lms ls

This command returns the identifier required to load the model programmatically.13

3.3 Server Architecture and Configuration

The lms server command spins up a local HTTP server that mimics the OpenAI API specification. This compatibility is vital, as it allows tools designed for OpenAI (like Open WebUI) to plug into LM Studio seamlessly.

The Binding Address (--host): By default, lms server start binds to 127.0.0.1 (localhost).5

  • Implication: Only processes on the same physical machine can access it.

  • Docker Nuance: A Docker container on a Linux host cannot easily reach 127.0.0.1 of the host without specific networking flags (host.docker.internal).

  • Security: We want to bind to localhost. We do not want this port (1234) exposed to the LAN or internet directly. Caddy (our proxy) will handle the ingress.

CORS (Cross-Origin Resource Sharing): Web-based interfaces (like Open WebUI running in a browser) often strictly enforce CORS. Even though Open WebUI runs server-side, enabling CORS on LM Studio is a best practice to prevent connection rejections during complex fetch operations or client-side plugin interactions.14

Command Construction:

The robust command to start the server manually is:

Bash

lms server start --port 1234 --cors

However, this is a foreground process. For a production report, we must configure this as a background daemon.
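Before daemonizing, it is worth smoke-testing the foreground server. The check below assumes the port chosen above and prints a hint instead of failing if nothing is listening yet:

```shell
# Query the OpenAI-compatible model list on the local port (1234, per above).
RESPONSE=$(curl -fsS http://127.0.0.1:1234/v1/models 2>/dev/null \
  || echo "LM Studio API not reachable on port 1234 - is 'lms server start' running?")
echo "$RESPONSE"
```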

3.4 Persistence: Systemd Service Implementation

Relying on a terminal window to keep the server running is fragile. We must create a Systemd service to manage the LM Studio process, ensuring it starts on boot and restarts on failure.10

Service File Creation:

Create a file at /etc/systemd/system/lmstudio.service.

Ini

[Unit]
Description=LM Studio Inference Server
After=network-online.target
Wants=network-online.target

[Service]
Type=simple
User=your_username
# Set the home directory to ensure lms finds the model cache
Environment="HOME=/home/your_username"
WorkingDirectory=/home/your_username

# Pre-start command to ensure a specific model is loaded into memory
# This uses the 'load' subcommand to prep the GPU before the server listens
# --gpu max: Offload all layers to GPU
# --context-length: Set to 8192 or model max
# --identifier: Sets the API name (e.g., "local-model")
ExecStartPre=/usr/local/bin/lms load <model_key_from_lms_ls> --gpu max --context-length 8192 --identifier local-llm

# Main process: The HTTP server
ExecStart=/usr/local/bin/lms server start --port 1234 --cors

# Restart logic
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Key Configuration Details:

  • ExecStartPre: This is critical. The server itself is just a listener. If no model is loaded, API calls will fail or trigger a lazy load (which causes latency). Pre-loading ensures readiness.9

  • --identifier local-llm: This abstracts the complex GGUF filename (e.g., llama-3-8b-instruct-v2-q4_k_m.gguf) into a simple string local-llm. API clients will request this clean name.9

  • User: Must be your non-root user who downloaded the models, as models are stored in ~/.cache/lm-studio.

Enabling the Service:

Bash

sudo systemctl daemon-reload
sudo systemctl enable lmstudio
sudo systemctl start lmstudio

Check status with systemctl status lmstudio to ensure the model loaded successfully and the port is listening.10

4. The Presentation Layer: Open WebUI via Docker

Open WebUI (formerly Ollama WebUI) provides a polished, ChatGPT-like interface. It supports chat history, user accounts, RAG (document upload), and web search.

4.1 Docker Networking Strategy

A critical challenge in containerizing Open WebUI on Linux is establishing connectivity to the LM Studio service running on the host.

  • The Issue: On Linux, Docker containers do not resolve host.docker.internal by default.16

  • The Fix: We must inject this mapping using the extra_hosts directive in Docker Compose. This maps the internal hostname to the special host-gateway IP (usually 172.17.0.1), allowing the container to route traffic to the host's localhost ports.
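The mapping can be verified before writing any Compose file. This one-liner starts a throwaway container with the same host-gateway mapping and resolves the name (it requires Docker and pulls the small alpine image; it degrades to a hint elsewhere):

```shell
# Resolve host.docker.internal from inside a container using the host-gateway mapping.
RESULT=$(docker run --rm --add-host=host.docker.internal:host-gateway alpine \
  getent hosts host.docker.internal 2>/dev/null \
  || echo "Docker not available here - run this on the deployment host")
echo "$RESULT"
```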

4.2 Docker Compose Configuration

Create a directory for the project:

Bash

mkdir ~/ai-stack
cd ~/ai-stack
touch docker-compose.yml

docker-compose.yml Content:

YAML

version: '3.8'

services:
  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    restart: always
    ports:
      - "3000:8080"  # Expose Container Port 8080 to Host Port 3000
    environment:
      # Connection to LM Studio
      # We use the /v1 suffix because LM Studio is OpenAI-compatible
      - OPENAI_API_BASE_URL=http://host.docker.internal:1234/v1
     
      # API Key: LM Studio doesn't strictly enforce it locally, but the client requires a value.
      - OPENAI_API_KEY=lm-studio
     
      # Security settings
      - WEBUI_AUTH=true  # Enable login screen
      - ENABLE_SIGNUP=true # Enable for initial setup, set to false later for security
     
      # UI Customization
      - DEFAULT_MODELS=local-llm # Matches the identifier set in LM Studio Service
     
    volumes:
      - open-webui-data:/app/backend/data
     
    # CRITICAL: Enable host resolution on Linux
    extra_hosts:
      - "host.docker.internal:host-gateway"

volumes:
  open-webui-data:

Environment Variable Analysis:

  • OPENAI_API_BASE_URL: This defines the upstream backend. Since we are using the generic OpenAI driver in Open WebUI, we point it to the LM Studio API.8

  • WEBUI_AUTH: Ensures that the UI itself is password-protected. This is distinct from the Basic Auth we will add to the API later. This protects the chat history and administrative settings.18

  • DEFAULT_MODELS: Pre-selects the model we loaded via the lms load command.19

4.3 Deployment and Verification

Launch the container:

Bash

docker compose up -d

Monitor the logs to ensure it connects to the database and initializes:

Bash

docker compose logs -f

Once running, navigate to http://localhost:3000 in a local browser. Create the first admin account. Go to Settings > Connections and verify that the connection to http://host.docker.internal:1234/v1 is verified (green checkmark). If it fails, check the extra_hosts configuration and ensure the LM Studio service is active.20

5. The Zero-Trust Gateway: Tailscale Funnel & Caddy

At this stage, we have a working local stack: Open WebUI on port 3000 talking to LM Studio on port 1234. Now, we must expose this securely to the internet.

5.1 Tailscale Funnel vs. Serve

Tailscale offers two exposure modes:

  1. Serve: Exposes a service only to other devices inside the Tailnet (private mesh). This is the most secure method but requires the client device (e.g., a phone) to have the Tailscale app installed and active.21

  2. Funnel: Exposes a service to the public internet. Tailscale accepts traffic on a public relay server and tunnels it to your machine.1

The user request specifies "accessible over the Internet," which implies Funnel. However, exposing internal ports directly to the public web is risky. Therefore, we will use Caddy as an intermediary to secure the traffic.

Enabling Funnel:

  1. Go to the Tailscale Admin Console > Access Controls.

  2. Ensure the "Funnel" attribute is allowed for your user/node.

  3. On the host machine, configure Funnel to route public port 443 traffic to local port 443 (where Caddy will listen).

Bash

sudo tailscale funnel 443 on

This command reserves the https://<node-name>.<tailnet>.ts.net domain for this machine and routes traffic to it.2 Note that the Funnel CLI syntax has changed across Tailscale releases; if this form is rejected, consult tailscale funnel --help for the equivalent on your installed version.

5.2 Caddy: The Reverse Proxy Manager

Caddy is chosen over Nginx because of its automatic TLS integration. Caddy can interface directly with the Tailscale socket to provision certificates for .ts.net domains automatically.4

Installation on Host: Install Caddy using the official .deb packages from Caddy's apt repository. Do not run Caddy in Docker unless necessary, as mounting the Tailscale socket into Docker introduces permission complexities.22 Running it on the host is cleaner for this specific architecture.

5.3 Designing the Caddyfile

The Caddyfile is the configuration map. It needs to handle two scenarios:

  1. Scenario A (Browser Chat): The user visits the root URL. Caddy forwards to Open WebUI. Open WebUI handles its own authentication.

  2. Scenario B (API Call): A script requests /v1/chat/completions. Caddy forwards to LM Studio. Since LM Studio has no auth, Caddy MUST enforce Basic Authentication here.

Generating Passwords:

Bash

caddy hash-password --plaintext "my-secret-api-password"
# Output: $2a$14$.... (hash string)

The Configuration (/etc/caddy/Caddyfile):

Code snippet

# Define the Tailscale domain. Caddy detects the .ts.net suffix
# and attempts to fetch the cert from the local Tailscale daemon.
machine-name.tailnet-name.ts.net {

    # 1. API Protection Block
    # Match any request starting with /v1 (the OpenAI standard path).
    # Note: 'handle' is used rather than 'handle_path', because handle_path
    # strips the /v1 prefix and LM Studio expects the path intact.
    handle /v1* {
        # Security: Require Username/Password for API access
        basic_auth {
            # Format: <username> <hashed_password>
            api_user $2a$14$Zkx19XLiW6VYouLHR5NmfOFU0z2GTNmpkT/5qqR7hx4IjWJPDhjvG
        }

        # Proxy to LM Studio on the host
        reverse_proxy localhost:1234 {
            # Essential for some API clients to respect the proxy
            header_up Host {upstream_hostport}
        }
    }

    # 2. Chat Interface Block (Default Fallback)
    handle {
        # Proxy to Docker container
        reverse_proxy localhost:3000
    }

    # Logging for observability
    log {
        output file /var/log/caddy/access.log
    }
}

Understanding the Logic:

  • handle /v1* vs handle: Caddy evaluates specific matchers first. If the path matches /v1..., it enters the API block. This block enforces basic_auth.23 If authentication passes, it forwards to port 1234 (LM Studio).

  • Security Implication: This effectively adds a password to your LM Studio API. Without this, anyone who guessed your URL could use your GPU to generate text, potentially incurring costs or blocking your usage.

  • Socket Access: For Caddy to fetch certificates from Tailscale, the caddy user needs permission to talk to the tailscaled socket, which is root-owned by default. Tailscale supports delegating certificate fetching to a specific system user via the TS_PERMIT_CERT_UID environment variable; on Debian/Ubuntu, add it to the daemon's environment file and restart:
    Bash
    echo 'TS_PERMIT_CERT_UID=caddy' | sudo tee -a /etc/default/tailscaled
    sudo systemctl restart tailscaled

    If this is difficult, Caddy can also function as a plain HTTP server with Tailscale Funnel terminating TLS. However, for end-to-end encryption and proper certificate handling, letting Caddy manage the cert via the socket is best.4

Applying Changes:

Bash

sudo systemctl reload caddy

6. End-to-End Integration and Testing

We now have the complete pipeline:

Internet -> Tailscale Funnel -> Caddy (Auth/Route) -> Open WebUI (port 3000) or LM Studio API (port 1234)

6.1 Verifying the Chat Interface

  1. Disconnect your testing device (laptop/phone) from the home Wi-Fi. Connect via a different network (e.g., 5G hotspot).

  2. Open a browser and navigate to https://machine-name.tailnet-name.ts.net.

  3. Checkpoint: You should see the Open WebUI login page.

  4. Log in. Select the model "local-llm" from the dropdown.

  5. Send a message: "What is the capital of France?"

  6. Success: The system should respond. Check docker logs open-webui to see the request hit the container, and journalctl -u lmstudio to see the inference happening on the GPU.

6.2 Verifying the Secure API

To test the API endpoint, we use curl. This simulates an external app (like a VS Code extension or a custom Python script) connecting to your backend.

Test 1: Unauthenticated (Should Fail)

Bash

curl -i https://machine-name.tailnet-name.ts.net/v1/models

Expected Result: HTTP/2 401 Unauthorized. This confirms Caddy is protecting your API.23

Test 2: Authenticated (Should Succeed)

Bash

curl -X GET https://machine-name.tailnet-name.ts.net/v1/models \
  -u "api_user:my-secret-api-password"

Expected Result: HTTP/2 200 OK and a JSON list of models, including local-llm.

Test 3: Chat Completion

Bash

curl -X POST https://machine-name.tailnet-name.ts.net/v1/chat/completions \
  -u "api_user:my-secret-api-password" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-llm",
    "messages": [
      { "role": "user", "content": "Hello, world!" }
    ]
  }'

Expected Result: A JSON response containing the generated text "Hello! How can I help you today?".
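The same calls can be made programmatically. The stdlib-only sketch below builds the request that Test 3 sends, including the Basic Authorization header Caddy checks; the final urlopen line is commented out so the snippet can be inspected offline. The hostname and credentials are the placeholders used throughout this guide:

```python
import base64
import json
import urllib.request

BASE = "https://machine-name.tailnet-name.ts.net"
USER, PASSWORD = "api_user", "my-secret-api-password"

# Basic auth header, equivalent to curl's -u flag
token = base64.b64encode(f"{USER}:{PASSWORD}".encode()).decode()

payload = json.dumps({
    "model": "local-llm",  # the identifier set via 'lms load'
    "messages": [{"role": "user", "content": "Hello, world!"}],
}).encode()

req = urllib.request.Request(
    BASE + "/v1/chat/completions",
    data=payload,
    headers={
        "Content-Type": "application/json",
        "Authorization": f"Basic {token}",
    },
)
print(req.get_header("Authorization"))
# reply = json.load(urllib.request.urlopen(req))  # uncomment to send for real
```

Any OpenAI-compatible client library can be pointed at the same base URL, provided it allows injecting the Basic auth header.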

7. Advanced Security and Maintenance

7.1 Securing Open WebUI

While Open WebUI has built-in auth, for enterprise-grade security, consider disabling the native auth and using an SSO provider or simply relying on Tailscale identity if switching to "Serve" mode.

  • Disable Sign-up: Once you have created your admin account, update docker-compose.yml:
    YAML
    - ENABLE_SIGNUP=false

    Redeploy with docker compose up -d. This prevents strangers who stumble upon your URL from creating accounts.18

7.2 Rate Limiting via Caddy

To prevent Denial of Service (DoS) attacks on your GPU, configure rate limiting in Caddy.

Code snippet

handle /v1* {
    rate_limit {
        # Syntax follows the mholt/caddy-ratelimit plugin; verify against
        # the plugin's documentation for your Caddy build.
        zone api_zone {
            key    {remote_host}
            events 10
            window 1m
        }
    }
    #... rest of config
}

Note: Rate limiting requires the rate-limit module for Caddy, which may require a custom build. Alternatively, use Fail2Ban on the host to monitor Caddy access logs for 401 errors.
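A Fail2Ban jail for repeated 401s might look like the sketch below. Treat it as an assumption-laden template: Caddy writes JSON access logs, but the field carrying the client address varies by Caddy version, so inspect /var/log/caddy/access.log and adjust the regex and file names (both chosen here for illustration) before enabling it.

```ini
# /etc/fail2ban/filter.d/caddy-401.conf (hypothetical filter name)
[Definition]
# Matches JSON log lines containing a 401 status; change "client_ip"
# to whatever address field your Caddy version actually emits.
failregex = ^.*"client_ip":"<HOST>".*"status":401.*$

# /etc/fail2ban/jail.d/caddy-401.conf
[caddy-401]
enabled  = true
port     = http,https
filter   = caddy-401
logpath  = /var/log/caddy/access.log
maxretry = 5
findtime = 10m
bantime  = 1h
```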

7.3 Auto-Updating Models

GGUF models are updated frequently. Create a cron job or a simple script to update your model:

Bash

#!/bin/bash
# update_model.sh - run with root privileges (e.g., from root's crontab)
set -u
systemctl stop lmstudio
# Re-download the model; restart the service even if the download fails
lms get llama-3-8b-instruct --force || echo "model update failed" >&2
systemctl start lmstudio

8. Troubleshooting Guide

8.1 Docker Container Cannot Connect to Host

  • Symptom: Open WebUI shows "Connection Error" or "Offline".

  • Diagnosis: The container cannot resolve host.docker.internal.

  • Fix: Check docker-compose.yml. Ensure extra_hosts is present. Verify the gateway IP by running docker network inspect bridge and ensuring it matches the gateway (default 172.17.0.1).16

8.2 Tailscale Funnel Not Working

  • Symptom: Public URL times out.

  • Diagnosis: Relay servers are blocked or ACLs restrict Funnel.

  • Fix:

  1. Run tailscale funnel status to check if it's active.

  2. Check Tailscale Admin Console > Access Controls. Ensure your user is allowed to use Funnel.

  3. Restart Tailscale: sudo systemctl restart tailscaled.

8.3 Caddy 502 Bad Gateway

  • Symptom: Caddy returns 502 when accessing the site.

  • Diagnosis: The upstream service (LM Studio or Docker) is down.

  • Fix:

  1. Check if LM Studio is running: systemctl status lmstudio.

  2. Check if Docker is running: docker ps.

  3. Verify ports: ss -tlnp (netstat -tuln also works where the legacy net-tools are installed). Ensure ports 1234 and 3000 are listening on 127.0.0.1 (or ::1).

8.4 GPU Not Used

  • Symptom: Slow generation speed (< 5 tok/s).

  • Diagnosis: LM Studio is running on CPU.

  • Fix:

  1. Check nvidia-smi to see if lms process is using VRAM.

  2. Check journalctl -u lmstudio for logs indicating "CUDA not found" or "Offloading 0 layers".

  3. Ensure the lms load command in the systemd service includes --gpu max.
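The first diagnostic step can be scripted. This check lists the processes currently holding VRAM (the lms process should appear among them); it requires the NVIDIA driver from section 2.1 and prints a hint on machines without it:

```shell
# List compute processes and their VRAM usage; 'lms' should appear here.
GPU_PROCS=$(nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv 2>/dev/null \
  || echo "nvidia-smi not available on this machine")
echo "$GPU_PROCS"
```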

9. Conclusion

This architecture represents a mature, sovereign alternative to cloud AI APIs. By leveraging Tailscale Funnel, we have eliminated the most dangerous aspect of self-hosting—opening firewall ports—while maintaining global accessibility. Caddy provides the necessary security layer, wrapping a raw internal API in HTTPS and Basic Authentication. Docker and LM Studio decouple the application logic from the inference runtime, allowing for independent scaling and updates.

The result is a robust, personal AI cloud that is secure by design, privacy-preserving by default, and capable of delivering commercial-grade LLM experiences on your own terms.
