# SwiftDeploy: Building an Observable, Policy-Driven Deployment Engine with OPA
## Introduction

As part of the HNG Internship DevOps Track Stage 4B, I extended my Stage 4A project, SwiftDeploy, into a fully observable, policy-aware deployment platform.

In Stage 4A, SwiftDeploy could:

- generate infrastructure files from a declarative manifest
- deploy containers using Docker Compose
- manage deployment modes (stable/canary)
- configure Nginx automatically

Stage 4B transformed it into something much closer to a real production deployment system by adding:

- Prometheus instrumentation
- Open Policy Agent (OPA) policy enforcement
- live operational dashboards
- deployment safety gates
- audit logging and reporting
- chaos engineering validation

The result is a deployment tool that not only deploys services but also decides whether a deployment is safe enough to proceed.

## The Core Philosophy: One Manifest, Everything Else Generated

SwiftDeploy is built around a single principle: `manifest.yaml` is the only file you should ever edit manually. Everything else is generated from it.

Here is the manifest structure:

```yaml
services:
  name: app
  image: swift-deploy-1-node:latest
  port: 3000
  version: "1.0.0"
  mode: stable
nginx:
  image: nginx:latest
  port: 8080
  proxy_timeout: 30
network:
  name: swiftdeploy-net
  driver_type: bridge
```

From this manifest, the CLI generates:

- `generated/nginx.conf`
- `generated/docker-compose.yml`
- OPA runtime configuration

This design provides:

- consistency
- reproducibility
- environment portability
- infrastructure-as-code discipline

The grader can delete all generated files and rerun:

```bash
./swiftdeploy init
```

and the entire stack regenerates correctly.

## Architecture Overview

The system architecture consists of a handful of cooperating components:

```
User
  ↓
Nginx Reverse Proxy
  ↓
Flask API Service
  ↓
Prometheus Metrics
  ↓
SwiftDeploy CLI
  ↓
OPA Policy Engine
```

The deployment stack includes:

- a Flask application container
- an Nginx reverse proxy
- Open Policy Agent (OPA)
- an internal Docker network
- named log volumes

## The SwiftDeploy CLI

The heart of the project is the `swiftdeploy` executable, a Python-based CLI tool that manages the entire deployment lifecycle.

### Supported Commands

| Command | Purpose |
|---------|---------|
| `init` | Generate config files from templates |
| `validate` | Run pre-flight validation checks |
| `deploy` | Start the stack |
| `promote canary` | Switch deployment into canary mode |
| `promote stable` | Return deployment to stable mode |
| `status` | Live metrics dashboard |
| `audit` | Generate audit report |
| `teardown` | Destroy containers and networks |

## The API Service

The API service is a Flask application that supports both stable and canary deployment modes. The deployment mode is controlled through the `MODE` environment variable.

### Endpoints

#### Root Endpoint

`GET /`

Returns:

- deployment mode
- version
- timestamp

Example:

```json
{
  "message": "Welcome to SwiftDeploy",
  "mode": "stable",
  "version": "1.0.0"
}
```

#### Health Endpoint

`GET /healthz`

Returns:

- health status
- application uptime

#### Chaos Endpoint

`POST /chaos`

Available only in canary mode. Supports:

```json
{ "mode": "slow", "duration": 3 }
{ "mode": "error", "rate": 0.5 }
{ "mode": "recover" }
```

This endpoint was used to simulate:

- degraded latency
- random failures
- recovery workflows

## Instrumentation: The /metrics Endpoint

One of the biggest upgrades in Stage 4B was observability. I instrumented the Flask service using the `prometheus_client` library. The service now exposes `GET /metrics` in Prometheus text format; the sketch below shows the shape of that wiring.
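The service source isn't reproduced in this post, but a minimal sketch of the `prometheus_client` wiring might look like the following. The metric names follow the list in the next section; the route and variable names are illustrative, not SwiftDeploy's exact code:

```python
import os
import time

from flask import Flask, Response, request
from prometheus_client import (
    CONTENT_TYPE_LATEST,
    Counter,
    Gauge,
    Histogram,
    generate_latest,
)

app = Flask(__name__)
START_TIME = time.time()

# Metric names match the ones documented below.
REQUESTS = Counter(
    "http_requests_total",
    "Total HTTP requests served",
    ["method", "path", "status_code"],
)
LATENCY = Histogram(
    "http_request_duration_seconds",
    "Request latency in seconds",
    ["method", "path"],
)
UPTIME = Gauge("app_uptime_seconds", "Seconds since the process started")
MODE = Gauge("app_mode", "Deployment mode: 0 = stable, 1 = canary")
MODE.set(1 if os.environ.get("MODE") == "canary" else 0)


@app.before_request
def start_timer():
    # Stash the start time on the request so after_request can compute latency.
    request.start_time = time.time()


@app.after_request
def record_request(response):
    elapsed = time.time() - request.start_time
    REQUESTS.labels(request.method, request.path, str(response.status_code)).inc()
    LATENCY.labels(request.method, request.path).observe(elapsed)
    return response


@app.route("/metrics")
def metrics():
    UPTIME.set(time.time() - START_TIME)
    return Response(generate_latest(), mimetype=CONTENT_TYPE_LATEST)


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=3000)  # port 3000, per the manifest
```

Hooking the counter and histogram into `before_request`/`after_request` keeps every route instrumented without touching individual handlers.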
## Metrics Collected

### Request Throughput

`http_requests_total`

Labels:

- `method`
- `path`
- `status_code`

Example:

```
http_requests_total{method="GET",path="/",status_code="200"} 152
```

### Request Latency

`http_request_duration_seconds`

A histogram used for:

- latency analysis
- P99 calculation

### Application Uptime

`app_uptime_seconds` tracks process uptime.

### Deployment Mode

`app_mode`

Values:

- 0 = stable
- 1 = canary

### Chaos State

`chaos_active`

Values:

- 0 = none
- 1 = slow
- 2 = error

## Why Metrics Matter

Without metrics:

- deployments are blind
- failures become invisible
- canary safety cannot be enforced

Metrics became the foundation for:

- policy decisions
- dashboards
- auditing
- promotion safety

## Open Policy Agent (OPA): The Brain of SwiftDeploy

The most important design principle in Stage 4B was:

> The CLI must never make allow/deny decisions itself.

All decision-making lives entirely inside OPA. SwiftDeploy only:

- gathers data
- sends context to OPA
- acts on the response

This separation makes the system:

- modular
- secure
- maintainable
- extensible

## OPA Policy Domains

I separated policies into independent domains. Each policy:

- answers one question
- owns its own logic
- operates independently

### Infrastructure Policy

Runs before deployment. Blocks deployment when:

- free disk space is below 10 GB
- CPU load exceeds 2.0

Rego example:

```rego
package infra

default allow = false

allow {
    input.disk_free_gb >= data.thresholds.disk_free_gb
    input.cpu_load <= data.thresholds.cpu_load
}
```

### Canary Safety Policy

Runs before promotion. Blocks promotion when:

- the error rate exceeds 1%
- P99 latency exceeds 500ms

Rego example:

```rego
package canary

default allow = false

allow {
    input.error_rate <= data.thresholds.error_rate
    input.p99_latency_ms <= data.thresholds.p99_latency_ms
}
```

## Policy Thresholds

Thresholds are stored separately in `policies/data.json`:

```json
{
  "thresholds": {
    "disk_free_gb": 10,
    "cpu_load": 2.0,
    "error_rate": 0.01,
    "p99_latency_ms": 500
  }
}
```

This prevents:

- hardcoded values
- duplicated configuration
- policy coupling

## OPA Isolation

The OPA container runs on an internal Docker network. It is intentionally NOT exposed through Nginx. Only the CLI can access OPA directly, via:

```
http://localhost:8181
```

This prevents external users from:

- querying policies
- bypassing deployment logic
- inspecting internal rules

This mirrors real production security architecture.

## Pre-Deploy Policy Enforcement

Before deployment, SwiftDeploy collects:

- CPU load
- available disk space

Example payload:

```json
{
  "disk_free_gb": 8.5,
  "cpu_load": 2.4
}
```

OPA evaluates the payload. If policies fail:

```
Deployment blocked: Infrastructure policy violation
```

The deployment never proceeds.

## Canary Safety Enforcement

Before promotion, SwiftDeploy:

- scrapes `/metrics`
- calculates the error rate
- calculates P99 latency
- submits the metrics to OPA

If the canary is unhealthy:

- promotion is blocked
- the rollout is prevented

This introduces production-grade deployment safety, as the sketch below illustrates.
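The post doesn't include the gate's source, so here is a minimal sketch of how that flow could look in Python, assuming the metric names above, OPA's standard Data API on `localhost:8181`, and the `canary` package from the Rego example. The helper names are mine, not the project's:

```python
from collections import defaultdict

import requests
from prometheus_client.parser import text_string_to_metric_families

METRICS_URL = "http://localhost:8080/metrics"      # the app, via the Nginx proxy
OPA_URL = "http://localhost:8181/v1/data/canary"   # OPA's standard Data API


def scrape(url=METRICS_URL):
    """Fetch /metrics and parse the Prometheus text format."""
    return list(text_string_to_metric_families(requests.get(url, timeout=5).text))


def iter_samples(families, name):
    for family in families:
        for sample in family.samples:
            if sample.name == name:
                yield sample


def error_rate(families):
    """Share of requests that returned a 5xx status."""
    total = errors = 0.0
    for s in iter_samples(families, "http_requests_total"):
        total += s.value
        if s.labels.get("status_code", "").startswith("5"):
            errors += s.value
    return errors / total if total else 0.0


def p99_latency_ms(families):
    """Approximate P99 from cumulative histogram buckets."""
    by_le = defaultdict(float)  # sum bucket counts across label sets
    for s in iter_samples(families, "http_request_duration_seconds_bucket"):
        by_le[float(s.labels["le"])] += s.value
    if not by_le:
        return 0.0
    total = by_le[float("inf")]  # the +Inf bucket counts every observation
    for le, count in sorted(by_le.items()):
        if count >= 0.99 * total:
            return le * 1000.0  # first bucket boundary past the 99th percentile
    return float("inf")


def canary_allowed(err, p99):
    """Ask OPA's canary policy for a verdict; fail closed if OPA is unreachable."""
    payload = {"input": {"error_rate": err, "p99_latency_ms": p99}}
    try:
        resp = requests.post(OPA_URL, json=payload, timeout=5)
        resp.raise_for_status()
        return bool(resp.json().get("result", {}).get("allow", False))
    except (requests.RequestException, ValueError):
        return False


if __name__ == "__main__":
    fams = scrape()
    err, p99 = error_rate(fams), p99_latency_ms(fams)
    verdict = "allowed" if canary_allowed(err, p99) else "blocked"
    print(f"error_rate={err:.1%} p99={p99:.0f}ms -> promotion {verdict}")
```

Failing closed when OPA is down or returns a malformed response is my assumption, but it matches the requirement, described later, that the CLI degrade gracefully rather than crash when OPA becomes unavailable.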
## The Status Dashboard

The `status` command provides a live operational dashboard:

```bash
./swiftdeploy status
```

The dashboard:

- refreshes continuously
- scrapes live metrics
- calculates the request rate
- calculates P99 latency
- evaluates policy compliance
- appends results to `history.jsonl`

Example output:

```
SwiftDeploy Status Dashboard
==================================================
Mode: canary
Chaos: error
Error Rate: 52%
P99 Latency: 430ms

Policy Compliance:
✓ Infrastructure policy: PASSING
✗ Canary safety policy: FAILING
```

## Chaos Engineering

This was one of the most interesting parts of the project. I intentionally injected:

- high error rates
- slow responses

Example:

```bash
curl -X POST http://localhost:8080/chaos -d '{"mode":"error","rate":0.9}'
```

Immediately:

- metrics reflected the failures
- policies began failing
- promotions were blocked

This validated that:

- metrics were accurate
- policies were functional
- safety gates worked correctly

## Audit Logging

Every:

- deploy
- promote
- status scrape
- policy violation

is appended to `history.jsonl`. Example entry:

```json
{
  "timestamp": "2026-05-06T12:00:00",
  "mode": "canary",
  "error_rate": 0.52
}
```

## Audit Report Generation

Running:

```bash
./swiftdeploy audit
```

generates `audit_report.md`. The report includes:

- the deployment timeline
- mode changes
- chaos injections
- policy violations

Example:

| Timestamp | Policy | Details |
|-----------|--------|---------|
| 2026-05-06T00:47:10Z | Canary Safety | error_rate=50% |

## Challenges Faced

### a. Python Virtual Environment Issues

Ubuntu's externally managed Python environment caused repeated package installation failures. The solution was:

- recreating the virtual environment
- installing dependencies inside the venv only

### b. Nginx Validation Problems

Generated Nginx configs initially failed validation due to unresolved upstream references. The fix:

- validate only inside the container context
- avoid host-side upstream resolution

### c. Metrics Parsing

Calculating the error rate and P99 latency from the Prometheus text format required careful parsing and aggregation.

### d. OPA Failure Handling

The CLI had to gracefully handle:

- OPA downtime
- connection failures
- malformed responses

The system never crashes when OPA becomes unavailable.

## Lessons Learned

### Declarative Systems Scale Better

A single source of truth drastically reduces configuration drift.

### Observability Is Mandatory

Without metrics:

- policy enforcement becomes impossible
- deployments become blind

### Policy Engines Should Be Isolated

Keeping OPA internal-only mirrors real enterprise architectures.

### Chaos Engineering Builds Confidence

Breaking the system intentionally proved that:

- metrics were accurate
- policies were effective
- safety mechanisms worked

### Automation Must Be Explainable

Every policy response included human-readable reasoning, which made debugging and operational decisions much easier.

## Final Thoughts

Stage 4B transformed SwiftDeploy from a deployment generator into a lightweight deployment platform with:

- observability
- governance
- auditing
- deployment safety

The project demonstrated how:

- metrics
- policy engines
- infrastructure generation
- deployment orchestration

can work together to create reliable deployment systems.

Most importantly, it reinforced a key DevOps principle:

> Safe automation is more valuable than fast automation.