271 lines
9.3 KiB
Markdown
271 lines
9.3 KiB
Markdown
# vSphere Backup Manager — Enterprise Roadmap
|
||
|
||
## Current State ✅
|
||
|
||
The backup engine is working. It connects to vCenter, creates crash-consistent snapshots, downloads full VMDK flat disk data and VMX configs, and runs scheduled recurring jobs — all accessible via a modern Flask web UI.
|
||
|
||
---
|
||
|
||
## Priority 1 — Core Reliability & Persistence
|
||
|
||
These are **non-negotiable** for a production system. Without them, the tool is still a "hobby project."
|
||
|
||
### 1.1 — Persistent Job Store
|
||
|
||
> **Why:** Currently everything is in RAM. A PM2 restart wipes all job history and kills all schedules.
|
||
|
||
- Save `jobs` dict to `jobs.json` on every state change (create, status update, completion)
|
||
- On app startup, load `jobs.json` and re-register all `scheduled` jobs into APScheduler
|
||
- Impact: **Zero job loss across restarts**
|
||
|
||
### 1.2 — Backup Retention Policies
|
||
|
||
> **Why:** Without retention, the backup disk fills up forever.
|
||
|
||
- Per-job retention rules: keep last **N** full backups, or keep backups no older than **X days**
|
||
- Auto-purge old backup directories after a new backup completes successfully
|
||
- Show retention info and countdown on the Jobs dashboard
|
||
- Impact: **Prevents disk exhaustion**, critical for unattended operation
|
||
|
||
### 1.3 — Email / Webhook Notifications
|
||
|
||
> **Why:** Admins can't watch a dashboard 24/7.
|
||
|
||
- Send email (SMTP) on: backup success, failure, or warning
|
||
- Send webhook (Slack, Teams, generic HTTP) on job completion
|
||
- Configurable per-job or globally
|
||
- Impact: **Instant alerting** on failures
|
||
|
||
---
|
||
|
||
## Priority 2 — Backup Integrity & Verification
|
||
|
||
A backup that can't be verified is a liability, not an asset.
|
||
|
||
### 2.1 — Checksum Verification
|
||
|
||
> **Why:** Bit-rot, network corruption, or a partial write can silently corrupt a backup.
|
||
|
||
- After each file download, compute **SHA-256** of the downloaded file
|
||
- Store checksums in a `manifest.json` next to each backup
|
||
- Optionally verify checksums before an upload or restore
|
||
|
||
### 2.2 — Backup Manifest & Catalog
|
||
|
||
> **Why:** You need a machine-readable record of every backup for audit and restore.
|
||
|
||
Each backup produces a `manifest.json`:
|
||
|
||
```json
|
||
{
|
||
"job_id": "...",
|
||
"vm_name": "Nakivo",
|
||
"started": "2026-06-22T01:52:00Z",
|
||
"finished": "2026-06-22T03:10:44Z",
|
||
"vcenter": "vcsa.noc.pens.ac.id",
|
||
"snapshot": "backup-1782067446",
|
||
"files": [
|
||
{ "path": "Nakivo/Nakivo.vmdk", "size_bytes": 491, "sha256": "..." },
|
||
{
|
||
"path": "Nakivo/Nakivo-flat.vmdk",
|
||
"size_bytes": 17179869184,
|
||
"sha256": "..."
|
||
},
|
||
{ "path": "Nakivo/Nakivo.vmx", "size_bytes": 3065, "sha256": "..." }
|
||
]
|
||
}
|
||
```
|
||
|
||
### 2.3 — Test Restore (Dry-Run)
|
||
|
||
> **Why:** The only way to know a backup works is to try restoring it.
|
||
|
||
- "Verify Backup" button in the UI
|
||
- Checks: manifest exists, all files present, SHA-256 matches, disk size matches vCenter
|
||
- Optionally: power on the VM in an isolated network (advanced)
|
||
|
||
---
|
||
|
||
## Priority 3 — Backup Strategies (Storage Efficiency)
|
||
|
||
### 3.1 — Incremental / Changed Block Tracking (CBT)
|
||
|
||
> **Why:** Downloading a full 16 GB disk every night is inefficient. CBT lets you only transfer **changed blocks**.
|
||
|
||
- Enable VMware CBT (`changeTrackingEnabled`) on the VM
|
||
- Use `vim.VirtualDisk.QueryChangedDiskAreas()` to get only changed extents
|
||
- Download only the changed byte ranges from the flat VMDK (HTTP Range requests)
|
||
- Store deltas alongside the full base backup
|
||
- Impact: **80–99% reduction** in daily backup transfer size
|
||
|
||
> [!IMPORTANT]
|
||
> This is the #1 differentiator between amateur and enterprise backup tools.
|
||
|
||
### 3.2 — Deduplication
|
||
|
||
> **Why:** Multiple VMs often share identical OS blocks.
|
||
|
||
- Block-level deduplication using content hashing (e.g., SHA-256 per 4 MB block)
|
||
- Store a deduplicated block store; backups reference blocks by hash
|
||
- Tools: integrate with `zfs send` (if on ZFS) or implement a simple local content-addressable store
|
||
|
||
### 3.3 — Compression
|
||
|
||
> Already implemented (`zstd`), but integrate tighter with CBT deltas for per-block compression.
|
||
|
||
---
|
||
|
||
## Priority 4 — Security & Multi-User
|
||
|
||
### 4.1 — Encrypted Credential Storage
|
||
|
||
> **Why:** Currently vCenter passwords are in Flask signed cookies (not encrypted).
|
||
|
||
- Store credentials in server-side encrypted store (e.g., using `cryptography.fernet`)
|
||
- Never transmit plaintext passwords to frontend JavaScript
|
||
- Support environment variable injection (`VCENTER_PASSWORD`)
|
||
|
||
### 4.2 — Role-Based Access Control (RBAC)
|
||
|
||
> **Why:** In an enterprise, not everyone should have the same access.
|
||
|
||
| Role | Permissions |
|
||
| -------- | ---------------------------------------------------------------- |
|
||
| Admin | Full access — create/delete jobs, manage schedules, view all VMs |
|
||
| Operator | Start/stop jobs, view logs, cannot change schedules |
|
||
| Viewer | Read-only dashboard access |
|
||
|
||
- Local user accounts stored in a SQLite database with bcrypt-hashed passwords
|
||
- Simple session-based auth or JWT tokens
|
||
|
||
### 4.3 — Audit Log
|
||
|
||
> **Why:** Who ran a backup? Who deleted a job? Essential for compliance.
|
||
|
||
- Persistent append-only audit log
|
||
- Records: user, action, VM, timestamp, result
|
||
- Viewable in the UI with filtering
|
||
|
||
---
|
||
|
||
## Priority 5 — Operations & Monitoring
|
||
|
||
### 5.1 — REST API
|
||
|
||
> **Why:** Integrate with Ansible, Terraform, CI/CD pipelines, or your own monitoring system.
|
||
|
||
Expose a full REST API:
|
||
|
||
```
|
||
GET /api/v1/jobs — list all jobs
|
||
POST /api/v1/jobs — create job
|
||
GET /api/v1/jobs/{id} — job status + progress
|
||
POST /api/v1/jobs/{id}/cancel — cancel job
|
||
GET /api/v1/vms — list VMs
|
||
GET /api/v1/backups — list completed backups with manifests
|
||
POST /api/v1/backups/{id}/verify — trigger checksum verify
|
||
```
|
||
|
||
Include API key authentication (`X-API-Key` header).
|
||
|
||
### 5.2 — Metrics & Dashboard (Prometheus/Grafana)
|
||
|
||
> **Why:** At-a-glance health visibility across all backup jobs.
|
||
|
||
- Expose a `/metrics` endpoint (Prometheus format)
|
||
- Metrics: `backup_duration_seconds`, `backup_size_bytes`, `backup_success_total`, `backup_failure_total`
|
||
- Build a Grafana dashboard for the backup operations team
|
||
|
||
### 5.3 — Multi-vCenter Support
|
||
|
||
> **Why:** Enterprises run multiple vCenter clusters.
|
||
|
||
- Support multiple saved vCenter connections (not just session-based)
|
||
- Jobs can target VMs across different vCenter instances
|
||
- Unified jobs dashboard across all vCenters
|
||
|
||
### 5.4 — Storage Backend Plugins
|
||
|
||
> **Why:** Not everyone stores backups on local NFS.
|
||
|
||
| Backend | Use Case |
|
||
| ---------------- | --------------------------------- |
|
||
| NFS (current) | On-prem NAS |
|
||
| S3 / MinIO | Object storage (on-prem or cloud) |
|
||
| Azure Blob | Azure-hosted environments |
|
||
| Rclone (generic) | 60+ cloud providers |
|
||
|
||
---
|
||
|
||
## Priority 6 — Disaster Recovery Features
|
||
|
||
### 6.1 — Instant VM Recovery
|
||
|
||
> **Why:** RTO (Recovery Time Objective) of minutes, not hours.
|
||
|
||
- Register the downloaded VMDK directly back to vCenter without full copy
|
||
- Use `RegisterVM_Task` on the downloaded `.vmx` pointing to the backup directory
|
||
- If backup is on NFS, this is near-instant (no copy needed)
|
||
|
||
### 6.2 — Restore Wizard
|
||
|
||
> Add a "Restore" tab to the UI
|
||
|
||
- Browse backup catalog → select VM → select restore point → choose target host/datastore
|
||
- Options: restore in-place (overwrite) or restore as new VM (clone)
|
||
- Track restore progress like backup progress
|
||
|
||
### 6.3 — Off-site Replication
|
||
|
||
> **Why:** 3-2-1 backup rule: 3 copies, 2 different media, **1 offsite**.
|
||
|
||
- After backup completes, replicate to a secondary NFS, S3, or SFTP target
|
||
- Run replication in parallel or sequential
|
||
- Alert if replication fails even if backup succeeded
|
||
|
||
---
|
||
|
||
## Priority 7 — UI/UX Polish
|
||
|
||
### 7.1 — Backup Calendar View
|
||
|
||
- Visual calendar showing which VMs were backed up on which days
|
||
- Color-coded: green = success, red = failure, yellow = warning
|
||
|
||
### 7.2 — Storage Analytics
|
||
|
||
- Pie chart / bar chart: backup size per VM, storage growth over time
|
||
- Alert when NFS mount is above 80% full
|
||
|
||
### 7.3 — Live Progress Streaming (SSE/WebSocket)
|
||
|
||
> **Why:** Currently the log page requires polling. Server-Sent Events provide true live streaming.
|
||
|
||
- Replace AJAX polling with `EventSource` (SSE) for real-time log updates
|
||
- Show a live progress bar with phase labels: Connecting → Snapshot → Downloading → Compressing → Done
|
||
|
||
---
|
||
|
||
## Recommended Implementation Order
|
||
|
||
```mermaid
|
||
graph LR
|
||
A["1.1 Persistent Jobs"] --> B["1.2 Retention Policies"]
|
||
B --> C["1.3 Notifications"]
|
||
C --> D["2.1 Checksums"]
|
||
D --> E["3.1 CBT Incremental ⭐"]
|
||
E --> F["4.1 Encrypted Creds"]
|
||
F --> G["5.1 REST API"]
|
||
G --> H["6.2 Restore Wizard"]
|
||
```
|
||
|
||
| Phase | Features | Effort | Impact |
|
||
| ----------- | --------------- | -------- | ------------------------------------- |
|
||
| **Phase 1** | 1.1 + 1.2 + 1.3 | ~2 days | Survives restarts, alerts on failures |
|
||
| **Phase 2** | 2.1 + 2.2 + 5.1 | ~3 days | Trusted backups, API integration |
|
||
| **Phase 3** | 3.1 (CBT) | ~1 week | Game-changer: 90% less bandwidth |
|
||
| **Phase 4** | 4.1 + 4.2 + 4.3 | ~1 week | Enterprise security & compliance |
|
||
| **Phase 5** | 6.2 + 5.4 + 5.2 | ~2 weeks | Full DR capability |
|
||
|
||
notification with timeout (in case its failed the snapshot is not cleaned)
|