# vSphere Backup Manager — Enterprise Roadmap
## Current State ✅
The backup engine is working. It connects to vCenter, creates crash-consistent snapshots, downloads full VMDK flat disk data and VMX configs, and runs scheduled recurring jobs — all accessible via a modern Flask web UI.
---
## Priority 1 — Core Reliability & Persistence
These are **non-negotiable** for a production system. Without them, the tool is still a "hobby project."
### 1.1 — Persistent Job Store
> **Why:** Currently everything is in RAM. A PM2 restart wipes all job history and kills all schedules.
- Save `jobs` dict to `jobs.json` on every state change (create, status update, completion)
- On app startup, load `jobs.json` and re-register all `scheduled` jobs into APScheduler
- Impact: **Zero job loss across restarts**
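
The save-and-reload cycle described above could look roughly like the sketch below. It assumes each job is a JSON-serializable dict with a `status` field; the helper names (`save_jobs`, `load_jobs`, `jobs_to_reschedule`) are hypothetical, and re-registering the returned IDs with APScheduler is left to the application.

```python
import json
import os
import tempfile

JOBS_FILE = "jobs.json"  # file name taken from the roadmap; adjust as needed


def save_jobs(jobs: dict, path: str = JOBS_FILE) -> None:
    """Write the jobs dict atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(jobs, fh, indent=2)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)  # atomic rename within one filesystem
    except BaseException:
        os.unlink(tmp_path)  # clean up the temporary file on any failure
        raise


def load_jobs(path: str = JOBS_FILE) -> dict:
    """Return the saved jobs, or an empty dict on first start."""
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def jobs_to_reschedule(jobs: dict) -> list:
    """IDs of jobs whose (assumed) "status" field is "scheduled"."""
    return [job_id for job_id, job in jobs.items() if job.get("status") == "scheduled"]
```

Writing to a temporary file and renaming it means a restart in the middle of a save leaves the previous `jobs.json` intact rather than a truncated one.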
### 1.2 — Backup Retention Policies
> **Why:** Without retention, old backups accumulate until the backup disk is full.
- Per-job retention rules: keep last **N** full backups, or keep backups no older than **X days**
- Auto-purge old backup directories after a new backup completes successfully
- Show retention info and countdown on the Jobs dashboard
- Impact: **Prevents disk exhaustion**, critical for unattended operation
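
A minimal sketch of the selection logic, under two assumptions that the roadmap does not state: when both rules are set, a backup is purged if it violates either one, and the newest backup is never purged. The function name and the `(name, finished_at)` tuple shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def select_for_purge(backups, keep_last: Optional[int] = None,
                     max_age_days: Optional[int] = None,
                     now: Optional[datetime] = None):
    """Return the names of backups that the retention rules would delete.

    `backups` is a list of (name, finished_at) tuples with timezone-aware datetimes.
    """
    now = now or datetime.now(timezone.utc)
    newest_first = sorted(backups, key=lambda b: b[1], reverse=True)
    doomed = set()
    if keep_last is not None:
        # Everything beyond the N newest backups is a purge candidate.
        doomed.update(name for name, _ in newest_first[keep_last:])
    if max_age_days is not None:
        cutoff = now - timedelta(days=max_age_days)
        doomed.update(name for name, ts in newest_first if ts < cutoff)
    # Assumption: never purge the most recent backup, whatever the rules say.
    if newest_first:
        doomed.discard(newest_first[0][0])
    return [name for name, _ in newest_first if name in doomed]
```

Keeping selection separate from deletion makes it straightforward to show the same list on the Jobs dashboard before anything is removed.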
### 1.3 — Email / Webhook Notifications
> **Why:** Admins can't watch a dashboard 24/7.
- Send email (SMTP) on: backup success, failure, or warning
- Send webhook (Slack, Teams, generic HTTP) on job completion
- Configurable per-job or globally
- Impact: **Instant alerting** on failures
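
As a rough illustration of the webhook path, the sketch below builds a JSON POST with the standard library. The payload fields are assumptions: Slack incoming webhooks read a `text` field, while other receivers may expect a different shape. The function names are hypothetical.

```python
import json
import urllib.request


def build_webhook_request(url: str, job_id: str, vm_name: str, status: str,
                          message: str = "") -> urllib.request.Request:
    """Build (but do not send) a JSON POST describing a finished job."""
    summary = f"[{status.upper()}] backup of {vm_name} (job {job_id}) {message}".strip()
    payload = {
        "text": summary,       # human-readable line for chat-style receivers
        "job_id": job_id,      # the remaining fields are for generic HTTP receivers
        "vm_name": vm_name,
        "status": status,
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_webhook(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the request and return the HTTP status code."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```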
---
## Priority 2 — Backup Integrity & Verification
A backup that can't be verified is a liability, not an asset.
### 2.1 — Checksum Verification
> **Why:** Bit-rot, network corruption, or a partial write can silently corrupt a backup.
- After each file download, compute **SHA-256** of the downloaded file
- Store checksums in a `manifest.json` next to each backup
- Optionally verify checksums before an upload or restore
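
Hashing a multi-gigabyte flat VMDK should not load the file into memory. One way to do it, sketched here with a hypothetical helper name and an arbitrary 4 MiB chunk size:

```python
import hashlib


def sha256_file(path: str, chunk_size: int = 4 * 1024 * 1024) -> str:
    """Return the hex SHA-256 of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps calling read() until it returns b"".
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

If the download loop already sees every chunk, updating the digest there avoids a second pass over the file.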
### 2.2 — Backup Manifest & Catalog
> **Why:** You need a machine-readable record of every backup for audit and restore.
Each backup produces a `manifest.json`:
```json
{
"job_id": "...",
"vm_name": "Nakivo",
"started": "2026-06-22T01:52:00Z",
"finished": "2026-06-22T03:10:44Z",
"vcenter": "vcsa.noc.pens.ac.id",
"snapshot": "backup-1782067446",
"files": [
{ "path": "Nakivo/Nakivo.vmdk", "size_bytes": 491, "sha256": "..." },
{
"path": "Nakivo/Nakivo-flat.vmdk",
"size_bytes": 17179869184,
"sha256": "..."
},
{ "path": "Nakivo/Nakivo.vmx", "size_bytes": 3065, "sha256": "..." }
]
}
```
### 2.3 — Test Restore (Dry-Run)
> **Why:** The only way to know a backup works is to try restoring it.
- "Verify Backup" button in the UI
- Checks: manifest exists, all files present, SHA-256 matches, disk size matches vCenter
- Optionally: power on the VM in an isolated network (advanced)
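
The offline part of these checks (manifest present, files present, sizes and SHA-256 matching) can be sketched as below. It assumes `manifest.json` sits in the backup directory and that each `path` entry is relative to that directory; comparing disk sizes against vCenter is not covered. `verify_backup` is a hypothetical name.

```python
import hashlib
import json
import os


def verify_backup(backup_dir: str) -> list:
    """Return a list of problems; an empty list means the backup passed."""
    manifest_path = os.path.join(backup_dir, "manifest.json")
    if not os.path.isfile(manifest_path):
        return ["manifest.json is missing"]
    with open(manifest_path, encoding="utf-8") as fh:
        manifest = json.load(fh)

    problems = []
    for entry in manifest.get("files", []):
        path = os.path.join(backup_dir, entry["path"])
        if not os.path.isfile(path):
            problems.append(f"{entry['path']}: file is missing")
            continue
        # Cheap size check first, so obviously truncated files skip hashing.
        if os.path.getsize(path) != entry["size_bytes"]:
            problems.append(f"{entry['path']}: size mismatch")
            continue
        digest = hashlib.sha256()
        with open(path, "rb") as data:
            for chunk in iter(lambda: data.read(1 << 20), b""):
                digest.update(chunk)
        if digest.hexdigest() != entry["sha256"]:
            problems.append(f"{entry['path']}: SHA-256 mismatch")
    return problems
```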
---
## Priority 3 — Backup Strategies (Storage Efficiency)
### 3.1 — Incremental / Changed Block Tracking (CBT)
> **Why:** Downloading a full 16 GB disk every night is inefficient. CBT lets you only transfer **changed blocks**.
- Enable VMware CBT (`changeTrackingEnabled`) on the VM
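- Use `vim.VirtualMachine.QueryChangedDiskAreas()` (called per disk with the snapshot, the disk's device key, a start offset, and the previous backup's `changeId`) to get only the changed extents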
- Download only the changed byte ranges from the flat VMDK (HTTP Range requests)
- Store deltas alongside the full base backup
- Impact: **80-99% reduction** in daily backup transfer size
> [!IMPORTANT]
> This is the #1 differentiator between amateur and enterprise backup tools.
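
Once the changed extents are known, they have to be turned into byte ranges for the download step. The sketch below shows only that conversion, using plain `(start, length)` tuples; in pyVmomi the changed areas are objects with `start` and `length` attributes, so a small adapter would be needed. Whether the datastore file endpoint honours `Range` requests is taken from the roadmap and not verified here. Both function names are hypothetical.

```python
def merge_extents(extents):
    """Merge overlapping or adjacent (start, length) extents, sorted by start."""
    merged = []
    for start, length in sorted(extents):
        if length <= 0:
            continue  # ignore empty extents
        if merged and start <= merged[-1][0] + merged[-1][1]:
            # This extent touches or overlaps the previous one: extend it.
            prev_start, prev_len = merged[-1]
            end = max(prev_start + prev_len, start + length)
            merged[-1] = (prev_start, end - prev_start)
        else:
            merged.append((start, length))
    return merged


def range_header(start, length):
    """HTTP Range header value for one extent (the end offset is inclusive)."""
    return f"bytes={start}-{start + length - 1}"
```

Merging adjacent extents first reduces the number of HTTP requests when many small neighbouring blocks have changed.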
### 3.2 — Deduplication
> **Why:** Multiple VMs often share identical OS blocks.
- Block-level deduplication using content hashing (e.g., SHA-256 per 4 MB block)
- Store a deduplicated block store; backups reference blocks by hash
- Tools: integrate with `zfs send` (if on ZFS) or implement a simple local content-addressable store
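
A simple local content-addressable store, as mentioned in the last bullet, might look like the sketch below. It is only an illustration: blocks are stored as one file per hash in a flat directory, with no locking, compression, or garbage collection, and the function names are hypothetical.

```python
import hashlib
import os

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MB blocks, as suggested in the roadmap


def store_file(src_path, store_dir, block_size=BLOCK_SIZE):
    """Split a file into blocks, store each unique block once, return the hash list."""
    os.makedirs(store_dir, exist_ok=True)
    recipe = []
    with open(src_path, "rb") as src:
        for block in iter(lambda: src.read(block_size), b""):
            digest = hashlib.sha256(block).hexdigest()
            block_path = os.path.join(store_dir, digest)
            if not os.path.exists(block_path):  # already stored: deduplicated
                with open(block_path, "wb") as out:
                    out.write(block)
            recipe.append(digest)
    return recipe


def restore_file(recipe, store_dir, dst_path):
    """Rebuild a file by concatenating its blocks in recipe order."""
    with open(dst_path, "wb") as dst:
        for digest in recipe:
            with open(os.path.join(store_dir, digest), "rb") as block:
                dst.write(block.read())
```

The returned recipe (an ordered list of block hashes) is what a backup would keep in place of the full disk image.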
### 3.3 — Compression
> Already implemented (`zstd`), but it should be integrated more tightly with CBT deltas for per-block compression.
---
## Priority 4 — Security & Multi-User
### 4.1 — Encrypted Credential Storage
> **Why:** Currently vCenter passwords are stored in Flask's signed session cookies, which are protected against tampering but not encrypted.
- Store credentials in server-side encrypted store (e.g., using `cryptography.fernet`)
- Never transmit plaintext passwords to frontend JavaScript
- Support environment variable injection (`VCENTER_PASSWORD`)
### 4.2 — Role-Based Access Control (RBAC)
> **Why:** In an enterprise, not everyone should have the same access.
| Role | Permissions |
| -------- | ---------------------------------------------------------------- |
| Admin | Full access — create/delete jobs, manage schedules, view all VMs |
| Operator | Start/stop jobs, view logs, cannot change schedules |
| Viewer | Read-only dashboard access |
- Local user accounts stored in a SQLite database with bcrypt-hashed passwords
- Simple session-based auth or JWT tokens
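
The role table above maps naturally onto a lookup of allowed actions per role. In the sketch below the action names are hypothetical and chosen only to mirror the table; user storage and password hashing are out of scope.

```python
# Hypothetical action names, derived from the permissions table above.
PERMISSIONS = {
    "admin": {
        "view_dashboard", "view_logs", "start_job", "stop_job",
        "create_job", "delete_job", "manage_schedule",
    },
    "operator": {"view_dashboard", "view_logs", "start_job", "stop_job"},
    "viewer": {"view_dashboard"},
}


def is_allowed(role: str, action: str) -> bool:
    """True if the role may perform the action; unknown roles get nothing."""
    return action in PERMISSIONS.get(role.lower(), set())
```

Denying by default for unknown roles or actions keeps a typo in a role name from silently granting access.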
### 4.3 — Audit Log
> **Why:** Who ran a backup? Who deleted a job? Essential for compliance.
- Persistent append-only audit log
- Records: user, action, VM, timestamp, result
- Viewable in the UI with filtering
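
One lightweight way to get a persistent, filterable log is a JSON Lines file opened in append mode, sketched below with hypothetical function names. Append mode is a convention here, not an enforcement mechanism: making the log truly tamper-resistant would need file permissions or an external log sink.

```python
import json
from datetime import datetime, timezone


def append_audit(path, user, action, vm, result):
    """Append one audit record as a single JSON line and return it."""
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "action": action,
        "vm": vm,
        "result": result,
    }
    with open(path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record


def read_audit(path, **filters):
    """Return records whose fields equal every given filter, oldest first."""
    with open(path, encoding="utf-8") as fh:
        records = [json.loads(line) for line in fh if line.strip()]
    return [r for r in records if all(r.get(k) == v for k, v in filters.items())]
```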
---
## Priority 5 — Operations & Monitoring
### 5.1 — REST API
> **Why:** Integrate with Ansible, Terraform, CI/CD pipelines, or your own monitoring system.
Expose a full REST API:
```
GET /api/v1/jobs — list all jobs
POST /api/v1/jobs — create job
GET /api/v1/jobs/{id} — job status + progress
POST /api/v1/jobs/{id}/cancel — cancel job
GET /api/v1/vms — list VMs
GET /api/v1/backups — list completed backups with manifests
POST /api/v1/backups/{id}/verify — trigger checksum verify
```
Include API key authentication (`X-API-Key` header).
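
The key check itself can be kept framework-independent, as in this sketch. It takes any mapping of headers (Flask's `request.headers` is case-insensitive, a plain dict is not) and compares in constant time; `is_authorized` is a hypothetical helper, and where the expected key is stored is left open.

```python
import hmac


def is_authorized(headers, expected_key: str) -> bool:
    """Check the X-API-Key header against the expected key in constant time."""
    supplied = headers.get("X-API-Key", "")
    # An empty expected key must never authorize anyone.
    return bool(expected_key) and hmac.compare_digest(
        supplied.encode("utf-8"), expected_key.encode("utf-8")
    )
```

`hmac.compare_digest` avoids leaking how many leading characters of a guessed key were correct through response timing.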
### 5.2 — Metrics & Dashboard (Prometheus/Grafana)
> **Why:** At-a-glance health visibility across all backup jobs.
- Expose a `/metrics` endpoint (Prometheus format)
- Metrics: `backup_duration_seconds`, `backup_size_bytes`, `backup_success_total`, `backup_failure_total`
- Build a Grafana dashboard for the backup operations team
### 5.3 — Multi-vCenter Support
> **Why:** Enterprises run multiple vCenter clusters.
- Support multiple saved vCenter connections (not just session-based)
- Jobs can target VMs across different vCenter instances
- Unified jobs dashboard across all vCenters
### 5.4 — Storage Backend Plugins
> **Why:** Not everyone stores backups on local NFS.
| Backend | Use Case |
| ---------------- | --------------------------------- |
| NFS (current) | On-prem NAS |
| S3 / MinIO | Object storage (on-prem or cloud) |
| Azure Blob | Azure-hosted environments |
| Rclone (generic) | 60+ cloud providers |
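
A plugin interface for these backends could be as small as four operations. The sketch below is one possible shape, not an existing API of the project: `StorageBackend` and `DirectoryBackend` are hypothetical names, and the directory implementation stands in for an NFS mount.

```python
import os
import shutil
from abc import ABC, abstractmethod


class StorageBackend(ABC):
    """Minimal contract every backend plugin would implement."""

    @abstractmethod
    def put(self, local_path: str, key: str) -> None: ...

    @abstractmethod
    def get(self, key: str, local_path: str) -> None: ...

    @abstractmethod
    def delete(self, key: str) -> None: ...

    @abstractmethod
    def list_keys(self, prefix: str = "") -> list: ...


class DirectoryBackend(StorageBackend):
    """Stores objects as files under a root directory (e.g. an NFS mount)."""

    def __init__(self, root: str):
        self.root = root

    def _path(self, key: str) -> str:
        return os.path.join(self.root, *key.split("/"))

    def put(self, local_path, key):
        dest = self._path(key)
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.copyfile(local_path, dest)

    def get(self, key, local_path):
        shutil.copyfile(self._path(key), local_path)

    def delete(self, key):
        os.remove(self._path(key))

    def list_keys(self, prefix=""):
        keys = []
        for dirpath, _dirs, filenames in os.walk(self.root):
            for name in filenames:
                rel = os.path.relpath(os.path.join(dirpath, name), self.root)
                key = rel.replace(os.sep, "/")  # keys always use forward slashes
                if key.startswith(prefix):
                    keys.append(key)
        return sorted(keys)
```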
---
## Priority 6 — Disaster Recovery Features
### 6.1 — Instant VM Recovery
> **Why:** RTO (Recovery Time Objective) of minutes, not hours.
- Register the downloaded VMDK directly back to vCenter without a full copy
- Use `RegisterVM_Task` on the downloaded `.vmx` pointing to the backup directory
- If the backup directory is on NFS storage that the ESXi host can mount as a datastore, this is near-instant (no copy needed)
### 6.2 — Restore Wizard
> Add a "Restore" tab to the UI
- Browse backup catalog → select VM → select restore point → choose target host/datastore
- Options: restore in-place (overwrite) or restore as new VM (clone)
- Track restore progress like backup progress
### 6.3 — Off-site Replication
> **Why:** 3-2-1 backup rule: 3 copies, 2 different media, **1 offsite**.
- After backup completes, replicate to a secondary NFS, S3, or SFTP target
- Run replication in parallel or sequentially
- Alert if replication fails even if backup succeeded
---
## Priority 7 — UI/UX Polish
### 7.1 — Backup Calendar View
- Visual calendar showing which VMs were backed up on which days
- Color-coded: green = success, red = failure, yellow = warning
### 7.2 — Storage Analytics
- Pie chart / bar chart: backup size per VM, storage growth over time
- Alert when NFS mount is above 80% full
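
The 80% check can be done with the standard library alone. In this sketch the function names are hypothetical, and the percentage is computed as used over total, which may differ slightly from what `df` reports on filesystems with reserved blocks.

```python
import shutil


def usage_percent(path: str) -> float:
    """Percentage of the filesystem containing `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def is_above_threshold(path: str, threshold: float = 80.0) -> bool:
    """True when usage exceeds the threshold (80% by default, per the roadmap)."""
    return usage_percent(path) > threshold
```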
### 7.3 — Live Progress Streaming (SSE/WebSocket)
> **Why:** Currently the log page requires polling. Server-Sent Events provide true live streaming.
- Replace AJAX polling with `EventSource` (SSE) for real-time log updates
- Show a live progress bar with phase labels: Connecting → Snapshot → Downloading → Compressing → Done
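
On the server side, SSE comes down to writing specially formatted text frames on a long-lived response with the `text/event-stream` content type. The sketch below covers only the frame formatting; the `progress` event name and the payload fields are assumptions, and wiring it into a Flask streaming response is left out.

```python
import json

# Phase labels taken from the roadmap's progress bar description.
PHASES = ["Connecting", "Snapshot", "Downloading", "Compressing", "Done"]


def format_sse(data, event=None) -> str:
    """Format one Server-Sent Events frame.

    Non-string data is JSON-encoded; multi-line strings become several
    `data:` lines, and a blank line terminates the frame.
    """
    lines = []
    if event:
        lines.append(f"event: {event}")
    payload = data if isinstance(data, str) else json.dumps(data)
    for line in payload.splitlines() or [""]:
        lines.append(f"data: {line}")
    return "\n".join(lines) + "\n\n"
```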
---
## Recommended Implementation Order
```mermaid
graph LR
A["1.1 Persistent Jobs"] --> B["1.2 Retention Policies"]
B --> C["1.3 Notifications"]
C --> D["2.1 Checksums"]
D --> E["3.1 CBT Incremental ⭐"]
E --> F["4.1 Encrypted Creds"]
F --> G["5.1 REST API"]
G --> H["6.2 Restore Wizard"]
```
| Phase | Features | Effort | Impact |
| ----------- | --------------- | -------- | ------------------------------------- |
| **Phase 1** | 1.1 + 1.2 + 1.3 | ~2 days | Survives restarts, alerts on failures |
| **Phase 2** | 2.1 + 2.2 + 5.1 | ~3 days | Trusted backups, API integration |
| **Phase 3** | 3.1 (CBT) | ~1 week | Game-changer: 90% less bandwidth |
| **Phase 4** | 4.1 + 4.2 + 4.3 | ~1 week | Enterprise security & compliance |
| **Phase 5** | 6.2 + 5.4 + 5.2 | ~2 weeks | Full DR capability |

- TODO: notification with timeout (in case a job has failed and its snapshot was not cleaned up)