Rizqi 57963274e1 feat: implement pre-flight and post-flight disk consolidation and orphaned snapshot cleanup logic

2026-06-23 01:51:53 +07:00

9.3 KiB

Raw Permalink Blame History

vSphere Backup Manager — Enterprise Roadmap

Current State ✅

The backup engine is working. It connects to vCenter, creates crash-consistent snapshots, downloads full VMDK flat disk data and VMX configs, and runs scheduled recurring jobs — all accessible via a modern Flask web UI.

Priority 1 — Core Reliability & Persistence

These are non-negotiable for a production system. Without them, the tool is still a "hobby project."

1.1 — Persistent Job Store

Why: Currently everything is in RAM. A PM2 restart wipes all job history and kills all schedules.

Save jobs dict to jobs.json on every state change (create, status update, completion)
On app startup, load jobs.json and re-register all scheduled jobs into APScheduler
Impact: Zero job loss across restarts

1.2 — Backup Retention Policies

Why: Without retention, the backup disk fills up forever.

Per-job retention rules: keep last N full backups, or keep backups no older than X days
Auto-purge old backup directories after a new backup completes successfully
Show retention info and countdown on the Jobs dashboard
Impact: Prevents disk exhaustion, critical for unattended operation

1.3 — Email / Webhook Notifications

Why: Admins can't watch a dashboard 24/7.

Send email (SMTP) on: backup success, failure, or warning
Send webhook (Slack, Teams, generic HTTP) on job completion
Configurable per-job or globally
Impact: Instant alerting on failures

Priority 2 — Backup Integrity & Verification

A backup that can't be verified is a liability, not an asset.

2.1 — Checksum Verification

Why: Bit-rot, network corruption, or a partial write can silently corrupt a backup.

After each file download, compute SHA-256 of the downloaded file
Store checksums in a manifest.json next to each backup
Optionally verify checksums before an upload or restore

2.2 — Backup Manifest & Catalog

Why: You need a machine-readable record of every backup for audit and restore.

Each backup produces a manifest.json:

{
  "job_id": "...",
  "vm_name": "Nakivo",
  "started": "2026-06-22T01:52:00Z",
  "finished": "2026-06-22T03:10:44Z",
  "vcenter": "vcsa.noc.pens.ac.id",
  "snapshot": "backup-1782067446",
  "files": [
    { "path": "Nakivo/Nakivo.vmdk", "size_bytes": 491, "sha256": "..." },
    {
      "path": "Nakivo/Nakivo-flat.vmdk",
      "size_bytes": 17179869184,
      "sha256": "..."
    },
    { "path": "Nakivo/Nakivo.vmx", "size_bytes": 3065, "sha256": "..." }
  ]
}

2.3 — Test Restore (Dry-Run)

Why: The only way to know a backup works is to try restoring it.

"Verify Backup" button in the UI
Checks: manifest exists, all files present, SHA-256 matches, disk size matches vCenter
Optionally: power on the VM in an isolated network (advanced)

Priority 3 — Backup Strategies (Storage Efficiency)

3.1 — Incremental / Changed Block Tracking (CBT)

Why: Downloading a full 16 GB disk every night is inefficient. CBT lets you only transfer changed blocks.

Enable VMware CBT (changeTrackingEnabled) on the VM
Use vim.VirtualDisk.QueryChangedDiskAreas() to get only changed extents
Download only the changed byte ranges from the flat VMDK (HTTP Range requests)
Store deltas alongside the full base backup
Impact: 80–99% reduction in daily backup transfer size

Important

This is the #1 differentiator between amateur and enterprise backup tools.

3.2 — Deduplication

Why: Multiple VMs often share identical OS blocks.

Block-level deduplication using content hashing (e.g., SHA-256 per 4 MB block)
Store a deduplicated block store; backups reference blocks by hash
Tools: integrate with zfs send (if on ZFS) or implement a simple local content-addressable store

3.3 — Compression

Already implemented (zstd), but integrate tighter with CBT deltas for per-block compression.

Priority 4 — Security & Multi-User

4.1 — Encrypted Credential Storage

Why: Currently vCenter passwords are in Flask signed cookies (not encrypted).

Store credentials in server-side encrypted store (e.g., using cryptography.fernet)
Never transmit plaintext passwords to frontend JavaScript
Support environment variable injection (VCENTER_PASSWORD)

4.2 — Role-Based Access Control (RBAC)

Why: In an enterprise, not everyone should have the same access.

Role	Permissions
Admin	Full access — create/delete jobs, manage schedules, view all VMs
Operator	Start/stop jobs, view logs, cannot change schedules
Viewer	Read-only dashboard access

Local user accounts stored in a SQLite database with bcrypt-hashed passwords
Simple session-based auth or JWT tokens

4.3 — Audit Log

Why: Who ran a backup? Who deleted a job? Essential for compliance.

Persistent append-only audit log
Records: user, action, VM, timestamp, result
Viewable in the UI with filtering

Priority 5 — Operations & Monitoring

5.1 — REST API

Why: Integrate with Ansible, Terraform, CI/CD pipelines, or your own monitoring system.

Expose a full REST API:

GET  /api/v1/jobs              — list all jobs
POST /api/v1/jobs              — create job
GET  /api/v1/jobs/{id}         — job status + progress
POST /api/v1/jobs/{id}/cancel  — cancel job
GET  /api/v1/vms               — list VMs
GET  /api/v1/backups           — list completed backups with manifests
POST /api/v1/backups/{id}/verify — trigger checksum verify

Include API key authentication (X-API-Key header).

5.2 — Metrics & Dashboard (Prometheus/Grafana)

Why: At-a-glance health visibility across all backup jobs.

Expose a /metrics endpoint (Prometheus format)
Metrics: backup_duration_seconds, backup_size_bytes, backup_success_total, backup_failure_total
Build a Grafana dashboard for the backup operations team

5.3 — Multi-vCenter Support

Why: Enterprises run multiple vCenter clusters.

Support multiple saved vCenter connections (not just session-based)
Jobs can target VMs across different vCenter instances
Unified jobs dashboard across all vCenters

5.4 — Storage Backend Plugins

Why: Not everyone stores backups on local NFS.

Backend	Use Case
NFS (current)	On-prem NAS
S3 / MinIO	Object storage (on-prem or cloud)
Azure Blob	Azure-hosted environments
Rclone (generic)	60+ cloud providers

Priority 6 — Disaster Recovery Features

6.1 — Instant VM Recovery

Why: RTO (Recovery Time Objective) of minutes, not hours.

Register the downloaded VMDK directly back to vCenter without full copy
Use RegisterVM_Task on the downloaded .vmx pointing to the backup directory
If backup is on NFS, this is near-instant (no copy needed)

6.2 — Restore Wizard

Add a "Restore" tab to the UI

Browse backup catalog → select VM → select restore point → choose target host/datastore
Options: restore in-place (overwrite) or restore as new VM (clone)
Track restore progress like backup progress

6.3 — Off-site Replication

Why: 3-2-1 backup rule: 3 copies, 2 different media, 1 offsite.

After backup completes, replicate to a secondary NFS, S3, or SFTP target
Run replication in parallel or sequential
Alert if replication fails even if backup succeeded

Priority 7 — UI/UX Polish

7.1 — Backup Calendar View

Visual calendar showing which VMs were backed up on which days
Color-coded: green = success, red = failure, yellow = warning

7.2 — Storage Analytics

Pie chart / bar chart: backup size per VM, storage growth over time
Alert when NFS mount is above 80% full

7.3 — Live Progress Streaming (SSE/WebSocket)

Why: Currently the log page requires polling. Server-Sent Events provide true live streaming.

Replace AJAX polling with EventSource (SSE) for real-time log updates
Show a live progress bar with phase labels: Connecting → Snapshot → Downloading → Compressing → Done

Recommended Implementation Order

graph LR
    A["1.1 Persistent Jobs"] --> B["1.2 Retention Policies"]
    B --> C["1.3 Notifications"]
    C --> D["2.1 Checksums"]
    D --> E["3.1 CBT Incremental ⭐"]
    E --> F["4.1 Encrypted Creds"]
    F --> G["5.1 REST API"]
    G --> H["6.2 Restore Wizard"]

Phase	Features	Effort	Impact
Phase 1	1.1 + 1.2 + 1.3	~2 days	Survives restarts, alerts on failures
Phase 2	2.1 + 2.2 + 5.1	~3 days	Trusted backups, API integration
Phase 3	3.1 (CBT)	~1 week	Game-changer: 90% less bandwidth
Phase 4	4.1 + 4.2 + 4.3	~1 week	Enterprise security & compliance
Phase 5	6.2 + 5.4 + 5.2	~2 weeks	Full DR capability

notification with timeout (in case its failed the snapshot is not cleaned)

9.3 KiB Raw Permalink Blame History Unescape Escape