# vSphere Backup Manager

An enterprise-ready web interface and CLI tool to automate, schedule, and manage snapshot-based backups for virtual machines on VMware vCenter/ESXi. Designed for performance, reliability, and security.

---

## Key Features

- **Grouped Sequential Batch Backups**: Select multiple VMs to back up sequentially in a single job. Execution logs and progress indicators are merged into a single view. Multiple batch jobs can run **simultaneously** without interference: each has isolated thread-local logging, independent progress tracking, and separate SQLite state.
- **SHA-256 Checksum Verification & Cataloging**: Computes a SHA-256 checksum immediately after each VMDK/VMX file download and generates a machine-readable `manifest.json` catalog alongside each backup run.
- **Pre-Upload Validation**: Automatically re-validates local checksums before remote transfers (e.g., SFTP), keeping files corrupted by silent write errors or incomplete downloads out of the storage vault.
- **On-the-Fly ZST Verification**: Stream-decompresses `.zst` archives to verify them against the original manifest checksums without extracting them to local disk.
- **Safe Force Stop (Cancellation)**: Safely halt running backups via the Web UI. The engine immediately aborts socket downloads and **automatically cleans up the VM snapshot** on the ESXi host before gracefully terminating.
- **Automated Retention Policies**: Define count-based (`keep_count`: keep the last N backups) or age-based (`keep_days`: clean up backups older than N days) retention policies per VM to manage storage automatically.
- **Resilient Scheduling**: Uses APScheduler to schedule daily, weekly, monthly, interval, 3-monthly, 6-monthly, or yearly backups. Schedules are persisted in `jobs.db` and automatically re-registered on app restarts.
- **Telegram Bot Alerts**: Send rich formatted backup status notifications via a Telegram Bot directly to any group or channel, with no open SMTP ports required. Configurable per alert level (all / failures only).
- **SMTP & Sendmail Notifications**: Send HTML-formatted backup completion emails via an SMTP relay or the system `sendmail` binary.
- **Reports & Analytics Dashboard**: Chart.js trend charts for backup size and duration over time, with a per-run history log table and success-rate statistics.
- **Integrated NFS Mount Manager**: View, mount, and manage NFS/CIFS shares directly from the Web GUI, showing real-time mount status, total size, used capacity, and free disk space.
- **CBT Incremental Backups**: Optional Changed Block Tracking (CBT) mode reduces transfer size for recurring scheduled jobs by downloading only changed disk extents.
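
To make the two retention modes concrete, here is a minimal sketch of how a policy could pick the backups to prune. It relies on the `backup-YYYYMMDDHHMMSS` directory naming shown later in this README; the function name, its signature, and the choice to prune a backup when either rule matches are assumptions for illustration, not the tool's actual code.

```python
from datetime import datetime, timedelta

BACKUP_NAME_FORMAT = "backup-%Y%m%d%H%M%S"  # matches backup-YYYYMMDDHHMMSS


def select_backups_to_delete(backup_names, keep_count=None, keep_days=None, now=None):
    """Return the backup directory names a retention policy would prune.

    Hypothetical helper: a backup is pruned if it falls outside the newest
    `keep_count` runs OR is older than `keep_days` days.
    """
    now = now or datetime.now()
    # Parse the timestamp out of each name and sort newest first.
    dated = sorted(
        ((datetime.strptime(name, BACKUP_NAME_FORMAT), name) for name in backup_names),
        reverse=True,
    )
    doomed = set()
    if keep_count is not None:
        doomed.update(name for _, name in dated[keep_count:])
    if keep_days is not None:
        cutoff = now - timedelta(days=keep_days)
        doomed.update(name for stamp, name in dated if stamp < cutoff)
    return sorted(doomed)
```

For example, with three backups and `keep_count=2`, only the oldest directory name is returned.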

---

## Requirements

- Python 3.8+
- Python packages listed in `requirements.txt`:
  - `pyvmomi`: VMware vSphere API Python SDK
  - `requests`: vCenter HTTPS folder API transfers
  - `paramiko`: SFTP remote storage replication
  - `zstandard`: High-ratio backup compression
  - `APScheduler`: Recurring backup scheduling
  - `flask`: Web UI framework
  - `gunicorn`: Production WSGI server

---

## Installation

1. **Clone the repository**:
   ```bash
   git clone <repository_url>
   cd backupvmware
   ```

2. **Set up a Python Virtual Environment**:
   - **Linux**:
     ```bash
     python3 -m venv venv
     source venv/bin/activate
     ```
   - **Windows**:
     ```powershell
     python -m venv venv
     .\venv\Scripts\Activate.ps1
     ```

3. **Install dependencies**:
   ```bash
   pip install -r requirements.txt
   ```

---

## Web GUI Setup

The web interface is built on Flask and uses a glassmorphic dark theme. It manages backups, schedules, mounts, and real-time logs.

### Running with PM2 (Recommended for Production)

PM2 can run Python applications and keeps the server running across restarts or process crashes.

1. **Install PM2** (requires Node.js):
   ```bash
   npm install -g pm2
   ```

2. **Start the Web GUI**:
   Using the provided `ecosystem.config.js`:
   ```bash
   pm2 start ecosystem.config.js
   ```

   *(Optional)* If you are running inside a Python virtual environment (e.g. `venv`), edit `ecosystem.config.js` to point the `interpreter` to your venv's Python executable:
   ```javascript
   interpreter: './venv/bin/python3'
   ```

3. **Useful PM2 Commands**:
   - **Status Dashboard**: `pm2 status`
   - **Real-time Console Logs**: `pm2 logs vsphere-backup-manager`
   - **Restart Application**: `pm2 restart vsphere-backup-manager`
   - **Stop Application**: `pm2 stop vsphere-backup-manager`
   - **Enable Auto-start on Boot**: Run `pm2 startup` and execute the command it prints, followed by `pm2 save`.

---

## Notification Setup

### Telegram Bot (Recommended: works on port 443, no SMTP server needed)

1. Create a bot via [@BotFather](https://t.me/BotFather) on Telegram. It will give you a **Bot Token**.
2. Add the bot to a group or channel and send any message to it.
3. Find your **Chat ID** using the Telegram API:
   - Open in browser: `https://api.telegram.org/bot<YOUR_TOKEN>/getUpdates`
   - Look for `"chat":{"id": -xxxxxxxxxx}` in the response.
4. Open **Settings → Notifications** in the Web UI:
   - Set **Webhook Payload Format** to `Telegram Bot Alert`.
   - Enter your **Bot Token** and **Chat ID**.
   - Click **Send Test Notification** to verify.
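
If you want to check the Bot Token and Chat ID outside the Web UI, the sketch below sends a message through the Telegram Bot API `sendMessage` method using only the standard library. The helper names are hypothetical and this is not the application's own notification code.

```python
import json
import urllib.parse
import urllib.request


def build_send_message_request(bot_token, chat_id, text):
    """Build (but do not send) a form-encoded POST to the sendMessage endpoint."""
    url = f"https://api.telegram.org/bot{bot_token}/sendMessage"
    payload = urllib.parse.urlencode({"chat_id": chat_id, "text": text}).encode("utf-8")
    return urllib.request.Request(url, data=payload, method="POST")


def send_test_message(bot_token, chat_id, text="vSphere Backup Manager test"):
    """Send the message and return True if Telegram reports success."""
    request = build_send_message_request(bot_token, chat_id, text)
    with urllib.request.urlopen(request, timeout=10) as response:
        return bool(json.load(response).get("ok"))
```

Calling `send_test_message("<YOUR_TOKEN>", "<CHAT_ID>")` needs outbound HTTPS access to `api.telegram.org`.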

### SMTP Email

1. Open **Settings → Notifications** in the Web UI.
2. Enable **Email Notifications** and fill in your SMTP host, port, credentials, sender, and recipient.
3. Click **Send Test Email** to verify before saving.

### Webhook (Generic HTTP POST)

1. Open **Settings → Notifications** in the Web UI.
2. Enter a webhook URL (Slack, Teams, Discord, custom endpoint, etc.).
3. Choose the payload format (`JSON`, `Form`, or `Slack`).
4. Click **Send Test Notification** to verify.

---

## Alert Levels

Configure in **Settings** to control when notifications are sent:

| Level | Triggers on |
|---|---|
| `all` | Every backup completion (success, warning, or failure) |
| `failures` | Only on `failed` or `finished with errors` status |
| `disabled` | Never send notifications |
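
The table maps to a small decision function. The sketch below is illustrative only: `should_notify` is a hypothetical name, and any status string other than the two failure values named in the table (for example `success`) is an assumption.

```python
# Statuses that the `failures` level treats as alert-worthy (from the table above).
FAILURE_STATUSES = {"failed", "finished with errors"}


def should_notify(alert_level, job_status):
    """Decide whether a finished job should trigger a notification."""
    if alert_level == "all":
        return True
    if alert_level == "failures":
        return job_status in FAILURE_STATUSES
    # "disabled" and any unrecognized level send nothing.
    return False
```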

---

## CLI Usage

You can also run standalone backups directly from the command line:

### Basic Backup

```bash
python vsphere_backup.py --host vc.example.com --user administrator@vsphere.local --vm MyVM --dest /mnt/nfs-backup --compress
```

### Backup with Remote SFTP Replication

```bash
python vsphere_backup.py --host vc.example.com --user administrator@vsphere.local --vm MyVM --dest /tmp/backups --sftp-host backup-vault.local --sftp-user vault-user --sftp-password vault-pass
```

---

## Manual Restore & Clone

Backups are stored in **native VMware format** (VMDK + VMX), so they can be restored directly to vCenter/ESXi without any conversion.

### Backup File Structure

```
backups/<VM_NAME>/backup-YYYYMMDDHHMMSS/
├── manifest.json                ← SHA-256 checksums + metadata
├── <VM_NAME>.vmx                ← VM configuration (CPU, RAM, network, etc.)
└── <datastore_name>/
    └── <VM_NAME>/
        ├── <VM_NAME>.vmdk       ← Disk descriptor (~500 bytes, plain text)
        └── <VM_NAME>-flat.vmdk  ← Actual disk data (full size)
```

With compression enabled, files are stored as `.vmdk.zst` / `-flat.vmdk.zst`.

### Restoring a VM (In-Place)

#### Step 1: Decompress (if compressed)

```bash
zstd -d <VM_NAME>.vmdk.zst
zstd -d <VM_NAME>-flat.vmdk.zst
```

#### Step 2: Verify Checksum

```bash
# Compare the output with the value in manifest.json
sha256sum <VM_NAME>-flat.vmdk
```
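
The comparison can also be scripted. The sketch below hashes each file in chunks and, for `.zst` archives, hashes the decompressed stream without writing it to disk. The manifest layout it expects (`{"files": {"<relative path>": "<sha256 hex>"}}`, with hashes of the uncompressed content) is an assumption, since the real `manifest.json` schema is not documented here, and the function names are hypothetical.

```python
import hashlib
import json
from pathlib import Path

CHUNK_SIZE = 1024 * 1024  # 1 MiB reads keep memory flat even for large -flat.vmdk files


def sha256_of_stream(stream):
    """Hash a binary file-like object chunk by chunk."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(CHUNK_SIZE), b""):
        digest.update(chunk)
    return digest.hexdigest()


def sha256_of_file(path):
    """Hash a file; .zst archives are decompressed on the fly, never extracted."""
    path = Path(path)
    with path.open("rb") as raw:
        if path.suffix == ".zst":
            import zstandard  # third-party, only needed for compressed backups

            with zstandard.ZstdDecompressor().stream_reader(raw) as reader:
                return sha256_of_stream(reader)
        return sha256_of_stream(raw)


def verify_against_manifest(backup_dir, manifest_name="manifest.json"):
    """Return the relative paths whose hash differs from the (assumed) manifest."""
    backup_dir = Path(backup_dir)
    manifest = json.loads((backup_dir / manifest_name).read_text())
    return [
        rel_path
        for rel_path, expected in manifest["files"].items()
        if sha256_of_file(backup_dir / rel_path) != expected
    ]
```

An empty list from `verify_against_manifest("backups/<VM_NAME>/backup-YYYYMMDDHHMMSS")` means every listed file matched.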

#### Step 3: Upload to Datastore

**Option A: vSphere Web Client** (easiest)

1. Navigate to **Storage** → select the target datastore
2. Create or navigate to the VM folder
3. Upload the `.vmx`, `.vmdk`, and `-flat.vmdk` files

**Option B: SCP to ESXi host**

```bash
# Enable SSH on the ESXi host first, then:
scp -r ./backup-20260623020000/<datastore>/<VM_NAME>/ \
    root@esxi-host:/vmfs/volumes/<datastore>/<VM_NAME>/
# The .vmx sits at the backup root, so copy it separately:
scp ./backup-20260623020000/<VM_NAME>.vmx root@esxi-host:/vmfs/volumes/<datastore>/<VM_NAME>/
```

**Option C: PowerCLI**

```powershell
# Copy files to the ESXi datastore through the PowerCLI datastore provider.
# The vmstore: path segments are placeholders; adjust them to your inventory.
Copy-DatastoreItem -Item ".\*.vmdk" -Destination "vmstore:\<datacenter>\<datastore>\<VM_NAME>\"
```

#### Step 4: Register the VM

Right-click the `.vmx` file in the datastore browser → **Register VM**, or use PowerCLI:

```powershell
New-VM -VMFilePath "[datastore1] <VM_NAME>/<VM_NAME>.vmx" -VMHost "esxi-host"
```

#### Step 5: Power On

```powershell
Start-VM "<VM_NAME>"
```

### Cloning from Backup (New VM)

To restore a backup as a **separate new VM** without affecting the original:

1. Upload files to a **new folder** on the datastore (e.g. `<VM_NAME>-clone/`)

2. Edit the `.vmx` file and change these lines to avoid UUID/MAC conflicts:

   ```
   displayName = "<VM_NAME>-clone"
   uuid.bios = "generate a new UUID"
   ethernet0.generatedAddress = "00:0c:29:xx:xx:xx"
   ```

3. Remove any snapshot references if present:

   ```
   # Delete or comment out lines starting with:
   snapshot.redoNotWithParent =
   ```

4. Register and power on:

   ```powershell
   New-VM -VMFilePath "[datastore1] <VM_NAME>-clone/<VM_NAME>.vmx" -VMHost "esxi-host"
   Start-VM "<VM_NAME>-clone"
   ```
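
For step 2, the sketch below generates a fresh `uuid.bios` value and rewrites keys in the `.vmx` text. It assumes the usual VMX layout of sixteen space-separated hex bytes with a dash after the eighth; both helper names are hypothetical, and the result should be reviewed before registering the VM.

```python
import re
import uuid


def new_vmx_uuid():
    """Format a random UUID as 'xx xx xx xx xx xx xx xx-xx xx xx xx xx xx xx xx'."""
    hex_bytes = [f"{byte:02x}" for byte in uuid.uuid4().bytes]
    return " ".join(hex_bytes[:8]) + "-" + " ".join(hex_bytes[8:])


def set_vmx_value(vmx_text, key, value):
    """Replace an existing `key = "..."` line, or append one if the key is absent."""
    line = f'{key} = "{value}"'
    pattern = re.compile(rf"^{re.escape(key)}\s*=.*$", re.MULTILINE)
    if pattern.search(vmx_text):
        # A function replacement avoids backslash processing in `line`.
        return pattern.sub(lambda _match: line, vmx_text)
    return vmx_text.rstrip("\n") + "\n" + line + "\n"
```

Typical use is to read the `.vmx` as text, call `set_vmx_value(text, "displayName", "<VM_NAME>-clone")` and `set_vmx_value(text, "uuid.bios", new_vmx_uuid())`, then write the result back.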

### Best Practices

- **Keep a copy**: never restore over your only backup copy
- **Test restore quarterly**: verify backups actually work before you need them
- **Isolated network first**: always boot cloned VMs on an isolated port group to check for IP conflicts before connecting to production
- **CBT resets on clone**: the first backup of a cloned VM will be a full backup (CBT state does not carry over)
- **Snapshot cleanup**: if the backup was taken with snapshots still active, remove orphaned snapshots after restore

---

## Safety & Architecture

### 1. Snapshot Isolation

The backup engine creates a temporary snapshot on the target VM, downloads the locked base files (`.vmdk` descriptors, `-flat.vmdk` disk data, and `.vmx` configurations) directly from the vCenter Datastore HTTP gateway, and deletes the snapshot immediately afterwards. The snapshot cleanup routine also runs when a backup is force-stopped.
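
The guarantee that cleanup runs even on a forced stop is the classic `try`/`finally` pattern. The sketch below shows it with injected callables so it stays independent of any SDK; in a pyvmomi-based engine they would wrap calls such as `CreateSnapshot_Task` and `RemoveSnapshot_Task`. The function and parameter names are illustrative, not the engine's actual code.

```python
def backup_with_snapshot(create_snapshot, download_files, remove_snapshot):
    """Run a download under a temporary snapshot and always remove the snapshot.

    All three arguments are caller-supplied callables (hypothetical interface):
    create_snapshot() -> snapshot handle
    download_files(snapshot) -> None, may raise on error or cancellation
    remove_snapshot(snapshot) -> None
    """
    snapshot = create_snapshot()
    try:
        download_files(snapshot)
    finally:
        # Runs on success, on failure, and when a cancellation exception
        # propagates out of the download, so no snapshot is left behind.
        remove_snapshot(snapshot)
```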

### 2. Thread-Safe Concurrent Job Execution

Two separate concurrency safeguards are in place:

**a) Multiple different jobs running simultaneously**

Each job runs in its own background thread. Log output uses a **thread-local path registry** (`threading.local()` in `backup_core.py`). The overridden `print()` function checks the calling thread's registered log path and writes directly to that file, bypassing any global `sys.stdout` redirection. This eliminates the classic `ValueError: I/O operation on closed file` race condition, where one job closing its log file would crash another job's write.
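
A minimal sketch of that idea follows. It is not the code in `backup_core.py`: `register_log_path` and `job_print` are hypothetical names, and the file is reopened on every call purely to keep the example short.

```python
import builtins
import threading

# Each thread sees its own attributes on this object.
_log_context = threading.local()


def register_log_path(path):
    """Bind a log file to the calling thread only."""
    _log_context.path = path


def job_print(*args, **kwargs):
    """print() replacement that writes to the calling thread's own log file."""
    path = getattr(_log_context, "path", None)
    if path is None:
        # No job registered on this thread: fall back to normal printing.
        builtins.print(*args, **kwargs)
        return
    kwargs.pop("file", None)  # the per-thread file always wins
    with open(path, "a", encoding="utf-8") as handle:
        builtins.print(*args, file=handle, **kwargs)
```

Because no thread ever touches another thread's file handle, one job finishing cannot invalidate a handle that another job is still writing to.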

**b) Duplicate runs of the same job prevented**

An in-memory `active_job_threads` dictionary tracks which job IDs are currently executing and in which thread. Before starting execution, `run_job_thread` checks this registry. If the same job is already alive in another thread (e.g., a scheduled trigger fires at the exact same moment as a manual "Run Now" click), the duplicate is **silently aborted** without affecting the primary run.
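
The check-then-register step has to be atomic, otherwise two triggers could both pass the check. The sketch below reuses the names `active_job_threads` and `run_job_thread` from the description, but the lock, the signature, and the boolean return value are assumptions made for illustration.

```python
import threading

active_job_threads = {}  # job_id -> Thread currently executing that job
_registry_lock = threading.Lock()


def run_job_thread(job_id, job_func):
    """Run job_func unless job_id is already running; return True if it ran."""
    current = threading.current_thread()
    with _registry_lock:
        owner = active_job_threads.get(job_id)
        if owner is not None and owner is not current and owner.is_alive():
            return False  # duplicate trigger: abort quietly
        active_job_threads[job_id] = current
    try:
        job_func()
        return True
    finally:
        with _registry_lock:
            # Only the owning thread clears its own registration.
            if active_job_threads.get(job_id) is current:
                del active_job_threads[job_id]
```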

### 3. SQLite Persistence & Multi-Worker Sync

Job records, status, schedules, and configuration settings are stored in `jobs.db` (SQLite). The application supports running behind Gunicorn with multiple worker processes:

- **Real-time progress writes**: Every progress callback update from an active backup job writes directly to SQLite (`save_job_to_db_direct`), not just on completion.
- **Route-level refresh**: The `/jobs`, `/job/<id>`, and `/api/job/<id>/status` routes call `load_jobs_db()` before rendering, syncing state from SQLite across all Gunicorn workers.
- **In-place merge strategy**: When loading from DB, running jobs in the current process are never overwritten by older DB snapshots from other workers.
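
The write-through and merge behavior can be sketched as follows. The function names come from the description above, but their signatures, the single-table JSON schema, and the `locally_running` argument are assumptions; the real `jobs.db` layout may differ.

```python
import json
import sqlite3

_SCHEMA = "CREATE TABLE IF NOT EXISTS jobs (id TEXT PRIMARY KEY, data TEXT NOT NULL)"


def save_job_to_db_direct(conn, job):
    """Persist one job record immediately, e.g. from a progress callback."""
    conn.execute(_SCHEMA)
    conn.execute(
        "INSERT OR REPLACE INTO jobs (id, data) VALUES (?, ?)",
        (job["id"], json.dumps(job)),
    )
    conn.commit()


def load_jobs_db(conn, jobs, locally_running):
    """Merge DB rows into `jobs` in place, skipping jobs this process is running."""
    conn.execute(_SCHEMA)
    for job_id, data in conn.execute("SELECT id, data FROM jobs"):
        if job_id in locally_running:
            continue  # the in-memory copy is newer than any DB snapshot
        jobs[job_id] = json.loads(data)
    return jobs
```

In a multi-worker setup each worker would open its own connection to the same database file and call `load_jobs_db` at the start of a request.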

### 4. SSL Configuration

Certificate verification can be disabled (`--no-verify-ssl` on the CLI or the matching checkbox in the Web UI) to connect to environments that use self-signed vCenter certificates.
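
In Python, turning verification off usually comes down to the SSL context handed to the client library. The sketch below builds such a context with the standard `ssl` module; `make_ssl_context` is a hypothetical helper, and how the application wires the flag internally is not shown in this README.

```python
import ssl


def make_ssl_context(verify=True):
    """Return a default context, or one that accepts self-signed certificates."""
    context = ssl.create_default_context()
    if not verify:
        # check_hostname must be cleared before verify_mode can be CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context
```

A context built with `verify=False` could then be passed to pyvmomi's `SmartConnect` through its `sslContext` argument. Keep verification enabled wherever vCenter presents a certificate from a trusted CA.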

### 5. Pre-flight & Post-flight Disk Checks

Before every backup, the engine checks for and resolves `consolidationNeeded` conditions on the VM. After snapshot removal, another consolidation check runs automatically to keep the datastore clean.
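
With pyvmomi, such a check can be built on the VM's `runtime.consolidationNeeded` flag and its `ConsolidateVMDisks_Task()` method. The sketch below assumes a `vm` object with that shape and a caller-supplied `wait_for_task` callable (for example `pyVim.task.WaitForTask`); the helper name is hypothetical and this is not the engine's actual routine.

```python
def consolidate_if_needed(vm, wait_for_task):
    """Consolidate the VM's disks when vSphere flags it; return True if it ran."""
    if not vm.runtime.consolidationNeeded:
        return False
    # Block until consolidation finishes so the next step sees clean disks.
    wait_for_task(vm.ConsolidateVMDisks_Task())
    return True
```

The same helper can be called once before the snapshot is created and once after it is removed.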