OpenVPN Production Reference Architecture (No-Surprises Hardened Design)
Purpose
This document defines a production-ready OpenVPN architecture with explicit operational failure modes, high availability design, and safe default configurations. It is designed to eliminate common misinterpretations that lead to outages, security gaps, and false assumptions in enterprise deployments.
This is not a compliance certification. It is a hardened engineering reference implementation aligned with Zero Trust principles and with the control objectives of NIST SP 800-53 and SP 800-207 as they are commonly interpreted.
1. High-Level Architecture
                      ┌──────────────────────────┐
                      │  Active Directory (AD)   │
                      └─────────────┬────────────┘
                                    │
                              ┌─────▼─────┐
                              │    NPS    │ (RADIUS Authentication)
                              └─────┬─────┘
                                    │
                                    │ RADIUS
                                    ▼
┌──────────────┐     TLS     ┌──────────────┐
│  VPN Client  │────────────►│ OpenVPN Pool │◄────────────┐
└──────────────┘             │  (HA Nodes)  │             │
       │                     └──────┬───────┘             │
       │ Cert + MFA                 │                     │
       ▼                            ▼                     │
 MFA Provider               Internal Network              │
       │                                          PKI (Step-CA HA)
       ▼                                                  │
Authentication Layer                                      ▼
                                           Certificate Issuance / Renewal
2. OpenVPN Production Baseline Configuration (Hardened)
Server Configuration (Validated Safe Baseline)
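port 1194
proto udp
dev tun
# Server mode and tunnel address pool (example range; adjust per site)
topology subnet
server 10.8.0.0 255.255.255.0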
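# PKI
ca ca.crt
cert server.crt
key server.key
# ECDHE key exchange only; no static DH parameter file (OpenVPN 2.4+)
dh none
# Reject revoked client certificates; the file must stay readable after the privilege drop
crl-verify crl.pem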
# Cryptography baseline
tls-version-min 1.2
auth SHA256
cipher AES-256-GCM
# OpenVPN 2.5+ option name; on OpenVPN 2.4 use ncp-ciphers instead
data-ciphers AES-256-GCM:AES-128-GCM
# TLS protection (mandatory)
tls-crypt ta.key
# Identity enforcement
remote-cert-tls client
# Accept only client certificates whose CN starts with "client"
verify-x509-name client name-prefix
# Privilege drop and restart persistence
user nobody
group nogroup
persist-key
persist-tun
# Stability
keepalive 10 60
explicit-exit-notify 1
# Logging
verb 3
status /var/log/openvpn-status.log
log-append /var/log/openvpn.log
# Auth integration (choose one)
# PAM
# plugin /usr/lib/openvpn/openvpn-plugin-auth-pam.so openvpn
# RADIUS (NPS)
# plugin /usr/lib/openvpn/radiusplugin.so /etc/openvpn/radiusplugin.cnf
Client Configuration (Hardened)
client
dev tun
proto udp
remote vpn.example.com 1194
resolv-retry infinite
nobind
persist-key
persist-tun
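# PKI and control-channel key (ta.key must be identical to the server's)
ca ca.crt
cert client.crt
key client.key
tls-crypt ta.key
remote-cert-tls server
cipher AES-256-GCM
auth SHA256
# Username/password (and MFA) prompt; required when the server enables the PAM or RADIUS plugin
auth-user-pass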
verb 3
3. High Availability Dependency Chain
Critical Path
Client → DNS → Load Balancer → OpenVPN Node → Auth (NPS/PAM) → AD/MFA
                                    │
                                    ▼
                            PKI (Step-CA HA)
HA Targets (Operational Objectives)
| Metric | Target |
|---|---|
| VPN RTO | < 15 minutes |
| VPN RPO | 0 (no session persistence required) |
| Auth availability | 99.9%+ |
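Because the critical path is a serial chain, end-to-end availability is roughly the product of the availability of every hop, so a 99.9% auth tier does not by itself give a 99.9% VPN service. The following is a minimal sketch of that arithmetic. It assumes independent failures, and the per-component figures are illustrative placeholders, not measurements.

```python
from math import prod

# Illustrative availability per hop on the critical path (assumed values).
CHAIN = {
    "dns": 0.9999,
    "load_balancer": 0.9999,
    "openvpn_pool": 0.999,
    "auth_nps": 0.999,
    "ad_mfa": 0.999,
}

MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes


def chain_availability(components: dict[str, float]) -> float:
    """Availability of a serial chain, assuming independent failures."""
    return prod(components.values())


def downtime_minutes(availability: float, period_minutes: int = MINUTES_PER_30_DAYS) -> float:
    """Expected downtime in minutes over the given period."""
    return (1.0 - availability) * period_minutes


if __name__ == "__main__":
    total = chain_availability(CHAIN)
    print(f"end-to-end availability: {total:.4%}")
    print(f"downtime per 30 days:    {downtime_minutes(total):.1f} min")
    print(f"99.9% alone allows:      {downtime_minutes(0.999):.1f} min")
```

At 99.9%, the 30-day budget is 43.2 minutes, which is fewer than three outages at the 15-minute RTO. With the assumed figures the whole chain lands below 99.9%, which is why each dependency needs its own redundancy target.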
PKI High Availability Design
Primary Step-CA node
Secondary standby CA (hot-standby or replicated DB)
Offline root CA (air-gapped)
Automated backup of the CA state every 24h
NPS Redundancy
At least 2 NPS servers
Load-balanced via DNS round-robin or RADIUS failover list
AD replication is required across sites
4. Failure Mode & Impact Table
| Failure | Impact | Behavior | Recovery |
|---|---|---|---|
| NPS down | Auth failure | Immediate denial | Failover NPS |
| MFA down | Auth failure | Immediate denial | Break-glass only |
| PKI down | No new certs | Existing users OK | Restore CA |
| CRL stale | Revocation delayed | Revoked certificates accepted until the updated CRL is in place | Publish updated CRL; force reconnect of revoked sessions |
| OpenVPN node failure | Partial outage | Failover required | LB reroute |
| DNS failure | Full outage | No connection | DNS restore |
| tls-crypt key mismatch | Total failure | No handshake | Redeploy key |
5. Certificate & MFA Failure Scenarios
Certificate
Expiration Failure
Bulk outage at expiry boundary
Prevented via automated renewal monitoring
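Revocation Delay
CRL is NOT real-time
A revoked certificate keeps working until the updated CRL reaches the VPN node and the client reconnects or renegotiates
An expired CRL is a separate failure: OpenVPN 2.4 and later reject connections once the CRL is past its nextUpdate date, so CRL regeneration must itself be monitored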
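The size of the revocation risk window can be estimated as a sum of delays. The sketch below assumes the 24-hour CRL cycle from the checklist, a hypothetical 15-minute job that copies the CRL to the VPN nodes, and a session that is re-verified at renegotiation with `reneg-sec` left at its 3600-second default. All three inputs should be replaced with measured values.

```python
from datetime import timedelta


def worst_case_revocation_window(
    crl_publish_interval: timedelta,
    crl_distribution_delay: timedelta,
    session_recheck_interval: timedelta,
) -> timedelta:
    """Upper bound on how long a revoked certificate can keep working.

    The terms add up because each stage starts only after the previous one
    has finished: the CA publishes the CRL, the CRL reaches the VPN node,
    and the client's session is next re-verified.
    """
    return crl_publish_interval + crl_distribution_delay + session_recheck_interval


if __name__ == "__main__":
    window = worst_case_revocation_window(
        crl_publish_interval=timedelta(hours=24),      # checklist: CRL cycle of 24 hours
        crl_distribution_delay=timedelta(minutes=15),  # hypothetical sync job interval
        session_recheck_interval=timedelta(hours=1),   # assumed: reneg-sec default of 3600 s
    )
    print(f"worst-case revocation window: {window}")
```

If that bound is too long for a given identity group, the levers are a shorter CRL cycle, shorter certificate lifetimes, or actively disconnecting the session at revocation time.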
MFA
Provider outage
Full authentication failure
Requires break-glass access
Time drift
TOTP validation fails for all users
Prevented via strict NTP enforcement
6. Break-Glass Access (FULL DESIGN)
Method 1: Offline Emergency Account
Disabled by default
Stored in a secure vault (offline or PAM vault)
Requires console activation
Method 2: Certificate-only VPN profile
No MFA dependency
Restricted IP range
Firewall-limited access
Method 3: Out-of-band access
iDRAC / iLO / hypervisor console
Independent of the VPN stack
Controls
All break-glass usage logged
Time-bound activation (auto-expire)
Requires dual approval in enterprise environments
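The time-bound and dual-approval controls can be expressed as one validity check. This is a minimal sketch, not a vault or PAM integration: the `BreakGlassGrant` record, the account name, and the four-hour lifetime are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone


@dataclass
class BreakGlassGrant:
    """Hypothetical record of one emergency-access activation."""
    account: str
    activated_at: datetime
    ttl: timedelta
    approvers: set[str] = field(default_factory=set)

    def is_valid(self, now: datetime, required_approvers: int = 2) -> bool:
        """Valid only with enough distinct approvers and before auto-expiry."""
        if len(self.approvers) < required_approvers:
            return False
        return self.activated_at <= now < self.activated_at + self.ttl


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
    grant = BreakGlassGrant("bg-admin", start, timedelta(hours=4), {"alice", "bob"})
    for offset in (timedelta(hours=1), timedelta(hours=5)):
        now = start + offset
        # Print every check so that each use of the grant leaves a log line.
        print(f"{now.isoformat()} account={grant.account} valid={grant.is_valid(now)}")
```

Storing approvers as a set means the same person approving twice still counts once, which is the point of dual approval.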
7. Identity-Based Segmentation (Correct Model)
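With the routed dev tun setup used here, OpenVPN cannot apply VLAN tags (the 802.1Q support added in OpenVPN 2.5 applies to TAP mode only), so segmentation is enforced at layer 3.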
Enforcement Methods
CCD (Client Config Directory)
Static IP pools per identity group
Firewall policy enforcement
Example
| Group | IP Range | Policy |
|---|---|---|
| Admins | 10.8.0.0/24 | Full |
| Users | 10.8.1.0/24 | Limited |
| Contractors | 10.8.2.0/24 | Restricted |
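The table translates into one CCD file per client, each holding an `ifconfig-push` line with a fixed address from the group's pool. The sketch below only generates that line. It assumes `topology subnet` (address plus netmask form) and reserves the first address of each pool as a gateway, which is an assumption of this example. The server-side `client-config-dir` setting and the routing for pools outside the server's own subnet are not shown.

```python
import ipaddress

# Group-to-pool mapping copied from the table above.
GROUP_POOLS = {
    "Admins": ipaddress.ip_network("10.8.0.0/24"),
    "Users": ipaddress.ip_network("10.8.1.0/24"),
    "Contractors": ipaddress.ip_network("10.8.2.0/24"),
}


def ccd_entry(group: str, offset: int) -> str:
    """Return the ifconfig-push line for one client in the given group.

    `offset` is the position inside the pool. Offsets 0 (network address),
    1 (assumed gateway) and the broadcast address are rejected.
    """
    pool = GROUP_POOLS[group]
    if not 2 <= offset <= pool.num_addresses - 2:
        raise ValueError(f"offset {offset} is outside the usable range of {pool}")
    address = pool.network_address + offset
    return f"ifconfig-push {address} {pool.netmask}"


if __name__ == "__main__":
    # The file name inside the CCD directory must equal the certificate CN.
    for cn, group, offset in [("client-alice", "Admins", 10), ("client-bob", "Contractors", 10)]:
        print(f"{cn}: {ccd_entry(group, offset)}")
```

Because the address now identifies the group, the firewall can enforce the Full, Limited, and Restricted policies by source subnet.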
8. Operational Runbooks
VPN Node Failure
Remove node from LB
Redirect traffic to the healthy node
Investigate logs
Auth Failure (NPS)
Failover to secondary NPS
Check AD replication
Validate RADIUS health
PKI Failure
Stop issuance
Restore from backup
Restart CA services
9. Monitoring & Health Checks
Required Metrics
VPN active sessions
Auth success/failure rate
CPU/memory on VPN nodes
Disk usage (log safety)
Certificate expiration window
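The active-session metric can be read from the file written by the `status` directive. This is a minimal sketch that assumes the default comma-separated status layout (version 1), in which client rows sit between the "Common Name," header and the "ROUTING TABLE" marker. The sample text is made up for illustration, and the layout should be checked against the OpenVPN version in use.

```python
SAMPLE_STATUS = """\
OpenVPN CLIENT LIST
Updated,Mon Jan  1 12:00:00 2024
Common Name,Real Address,Bytes Received,Bytes Sent,Connected Since
client-alice,198.51.100.10:51000,10240,20480,Mon Jan  1 11:00:00 2024
client-bob,198.51.100.11:51001,512,1024,Mon Jan  1 11:30:00 2024
ROUTING TABLE
Virtual Address,Common Name,Real Address,Last Ref
10.8.0.10,client-alice,198.51.100.10:51000,Mon Jan  1 11:59:00 2024
GLOBAL STATS
Max bcast/mcast queue length,0
END
"""


def active_sessions(status_text: str) -> list[str]:
    """Return the common names listed in the client section of a status file."""
    names: list[str] = []
    in_clients = False
    for line in status_text.splitlines():
        if line.startswith("Common Name,"):
            in_clients = True       # header row: client rows follow
            continue
        if line.startswith("ROUTING TABLE"):
            break                   # end of the client section
        if in_clients and line.strip():
            names.append(line.split(",", 1)[0])
    return names


if __name__ == "__main__":
    sessions = active_sessions(SAMPLE_STATUS)
    print(f"active sessions: {len(sessions)} ({', '.join(sessions)})")
```

In a deployment the text would come from /var/log/openvpn-status.log on each node, and the counts would be summed across the pool.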
Alert Thresholds
Auth failure rate > 10% in 5 min → critical
Disk usage > 80% → warning
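Certificate expiry < 14 days → alert (server and CA certificates; short-lived client certificates need a threshold relative to their lifetime)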
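The three thresholds can be evaluated as one pure function over a metrics snapshot. The `Snapshot` fields below are hypothetical names. How the values are collected and where alerts are delivered is outside this sketch.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical metrics sample for one evaluation cycle."""
    auth_attempts_5m: int
    auth_failures_5m: int
    disk_used_pct: float
    cert_days_left: float


def evaluate(s: Snapshot) -> list[tuple[str, str]]:
    """Return (severity, message) pairs for every threshold that is breached."""
    alerts: list[tuple[str, str]] = []
    # Guard against division by zero when nobody tried to authenticate.
    if s.auth_attempts_5m > 0 and s.auth_failures_5m / s.auth_attempts_5m > 0.10:
        alerts.append(("critical", "auth failure rate above 10% in 5 min"))
    if s.disk_used_pct > 80:
        alerts.append(("warning", "disk usage above 80%"))
    if s.cert_days_left < 14:
        alerts.append(("alert", "certificate expires in under 14 days"))
    return alerts


if __name__ == "__main__":
    sample = Snapshot(auth_attempts_5m=200, auth_failures_5m=30,
                      disk_used_pct=85.0, cert_days_left=40)
    for severity, message in evaluate(sample):
        print(f"[{severity}] {message}")
```

A ratio-only rule is noisy at low volume, where one failure out of five attempts already exceeds 10%, so a minimum attempt count may be worth adding.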
10. Production Security Baseline Checklist (FULL)
This checklist defines the minimum enforceable controls required for a production OpenVPN deployment. Every item must be implemented and validated in staging before production release.
Identity
Client certificate authentication is mandatory for all VPN connections (no password-only authentication permitted)
The certificate authority is stored in a restricted administrative environment with an offline backup of the root CA
MFA is enforced for all interactive authentication flows via PAM or RADIUS (NPS)
AD or a centralized identity provider is the source of truth for all user lifecycle management
Group membership is used for authorization decisions (not local OpenVPN config logic)
Break-glass accounts exist, are stored offline, and are tested quarterly
Identity deprovisioning removes VPN access within 24 hours of account disablement
Cryptography
TLS-crypt is enabled for all VPN connections to protect the control channel traffic
TLS minimum version is enforced at 1.2 or higher; TLS 1.0 and 1.1 are disabled
Cipher suite is restricted to AES-256-GCM (preferred) or AES-128-GCM (fallback)
HMAC authentication uses SHA256 or stronger
Forward secrecy is enabled through modern TLS negotiation
Weak ciphers, static keys, and legacy renegotiation are disabled
Certificate key strength is a minimum of RSA 2048-bit or ECDSA P-256
PKI
Certificate authority architecture includes an offline root CA and an online intermediate CA
Certificate issuance is automated via Step-CA or an equivalent secure CA system
Certificate validity periods are short-lived (recommended 7–30 days for clients)
The automated renewal process is tested and monitored for failure conditions
CRL generation and distribution are automated and validated at least every 24 hours
The revocation process is documented and tested with measured propagation delay
CA private keys are stored in encrypted storage or hardware-backed security modules (HSM, where possible)
Backup and disaster recovery procedures for PKI are tested quarterly
Network
VPN IP pools are explicitly defined and segmented per identity group
No overlapping IP ranges exist across VPN pools or internal subnets
CCD (Client Config Directory) or equivalent policy-based assignment is used for segmentation
Firewall rules enforce least-privilege access between VPN user groups
DNS resolution for VPN clients is restricted to internal resolvers only
Split tunneling is explicitly defined and disabled unless there is a documented business justification
Routing tables are explicitly documented and validated for each VPN segment
High Availability (HA)
At least two OpenVPN nodes are deployed in active/active or active/passive configuration
A load balancer or failover mechanism (HAProxy or keepalived) is implemented and tested
NPS/RADIUS authentication servers are deployed in a redundant configuration
Active Directory replication is validated across authentication sites
PKI system includes a recovery or standby capability for certificate issuance
DNS redundancy is configured for VPN endpoint resolution
Failure of any single VPN node does not result in a full service outage
HA failover has been tested under simulated real-world outage conditions
Logging & Monitoring
Authentication success and failure events are logged and forwarded to SIEM
VPN session start, stop, and duration logs are collected centrally
MFA success and failure events are logged and correlated with authentication attempts
Certificate issuance, renewal, and revocation events are logged
Logs are retained for a minimum of 90 days in hot storage and longer in archive storage per policy
Log storage is protected against tampering using append-only or immutable storage mechanisms
Time synchronization (NTP) is enforced across all infrastructure components
Disk usage monitoring is implemented to prevent log exhaustion outages
Alerts are configured for authentication anomalies, brute-force attempts, and MFA failures
Operations
All configuration changes are managed through formal change control processes
Infrastructure-as-code or version-controlled configuration management is enforced where possible
OpenVPN configuration changes are tested in staging before production deployment
Restart and reload behavior has been validated for the specific OpenVPN version in use
Rollback procedures are documented, tested, and executable within the defined RTO
Dependency failures (MFA, NPS, PKI, DNS) are documented with known impact behavior
Patch management for VPN servers is scheduled and tested for compatibility
Administrative access to the VPN infrastructure is restricted and audited
Recovery
Break-glass access mechanisms are defined, tested, and secured offline
Out-of-band management access (iDRAC, iLO, hypervisor console) is available for all VPN nodes
The full VPN outage recovery procedure is documented step-by-step
Recovery procedures include DNS restoration, authentication failover, and PKI restoration steps
Recovery drills are performed at least annually to validate operational readiness
Backup restoration procedures for PKI and VPN configuration are tested regularly
Emergency access usage is logged, reviewed, and time-restricted
11. Key Engineering Warnings (DO NOT IGNORE)
OpenVPN does NOT enforce AD policies directly
CRL is NOT real-time revocation
TLS-crypt failure causes total outage
MFA dependency failure causes a full lockout
IP segmentation requires a CCD or firewall enforcement
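Reload behavior depends on the OpenVPN version, the signal used (SIGHUP or SIGUSR1), and the privilege-drop settings, so validate it in staging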
Final Statement
This architecture eliminates ambiguity by explicitly defining:
What fails
How it fails
How to recover
What dependencies exist
It is intended as a production engineering standard, not a theoretical security model.