Summary
VMware Tanzu Kubernetes Grid (TKG) is a product for managing the lifecycle of Kubernetes clusters.
Since version v2.1.0, TKG has provided a solution for backing up and restoring the cluster objects on a management cluster. If a disaster makes the management cluster unavailable while the workload clusters remain accessible, the customer can provision a new management cluster instance, restore the cluster objects, and continue managing the existing workload clusters from the new instance. For more details, refer to https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/2.2/tkg-deploy-mc-22/mgmt-manage-br-infra.html
In the context of this solution, "drift" refers to a mismatch between the resources recorded in the backup and the actual state of the infrastructure. This mismatch can cause problems during restoration. For details on handling drift, refer to the "Handling Drift" section of the documentation: https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/2.2/tkg-deploy-mc-22/mgmt-manage-br-infra.html#drift
To address drift, the Drift Detector has been introduced as a command-line (CLI) tool. It compares the contents of a backup with the current state of the infrastructure and generates a report. The report helps users identify potential issues and perform any necessary manual steps to mitigate the drift before starting the restore workflow, which makes restoration smoother.
Requirements
The Drift Detector tool works with VMware Tanzu Kubernetes Grid (TKG) v2.3.0 and above.
More specific version matrix:
| Drift Detector | TKG Version |
|---|---|
| 0.1.0 | v2.3.0 |
Instructions
Background
When you back up and restore the management and workload cluster infrastructure on vSphere, if cluster objects change after the most recent backup, the state of the system after a restore does not match its desired, most recent state. This problem is called "drift". Drift is complicated and hard to fix. We recommend taking a backup immediately after any action that changes the clusters (for example, scaling a cluster up or down, or creating new clusters) and scheduling regular backups to mitigate drift.
We provide a drift detection tool to help users find possible drift before performing a restore. Because drift is complicated, the detector works on a best-effort basis: it may not cover every case, and its report should be used only as a reference.
How it works
The drift detector parses the infrastructure objects from the backup tarball, compares them with the real infrastructure resources, and detects the differences between them. The tool also tries to read Kubernetes node information from the API servers of the workload clusters to help detect drift.
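The comparison described above can be pictured as a set difference between two inventories. The following is a minimal sketch, not the tool's actual implementation: the function name `classify_machines` and the machine IDs are hypothetical, and the status names are taken from the machine statuses described later on this page.

```python
from enum import Enum


class MachineStatus(Enum):
    HEALTHY = "Healthy"  # recorded in the backup and present on the infrastructure
    STALE = "Stale"      # recorded in the backup, missing from the infrastructure
    GHOST = "Ghost"      # present on the infrastructure, missing from the backup


def classify_machines(backup_ids, infra_ids):
    """Compare machine IDs from a backup with IDs found on the infrastructure.

    Hypothetical helper for illustration only.
    """
    backup_ids, infra_ids = set(backup_ids), set(infra_ids)
    statuses = {}
    for machine_id in backup_ids & infra_ids:
        statuses[machine_id] = MachineStatus.HEALTHY
    for machine_id in backup_ids - infra_ids:
        statuses[machine_id] = MachineStatus.STALE
    for machine_id in infra_ids - backup_ids:
        statuses[machine_id] = MachineStatus.GHOST
    return statuses


# Made-up IDs: vm-a exists only in the backup, vm-c only on the infrastructure.
result = classify_machines(
    backup_ids=["vm-a", "vm-b"],
    infra_ids=["vm-b", "vm-c"],
)
for machine_id in sorted(result):
    print(machine_id, result[machine_id].value)
```

The real detector works with richer objects than plain IDs and also consults the workload cluster API servers, so this sketch only conveys the basic idea.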
How to install
Download and unzip the DOWNLOAD file linked above. It contains binaries for Linux, macOS, and Windows that you can run as shell commands without any installation process.
How to use
Before performing the restore, run the drift detector by following these steps:
Download the backup tarball
Download the backup tarball either directly from the backup store portal or with the Velero CLI:
velero backup download backup-name
Detect the drift with the detector
All the available options of the drift-detector command are as follows:
drift-detector detect -h
Detect the drifts between the backup and infrastructure
Usage:
drift-detector detect [flags]
Flags:
--backup string The local path of the backup tarball file. Required
--format string The report format. One of: (json) (default "json")
-h, --help help for detect
--ignore-healthy-resources Ignore the healthy resources in the report
--insecure-skip-verify Skip the verification of an infra server’s certificate during a connection.
-o, --output string Report output file. Required
--skip-access-apiserver Specify whether skip accessing the API servers of workload clusters during the detection
Global Flags:
-D, --debug Enable debug mode
The "--backup" option is required; it sets the local file system path of the backup tarball. The "-o" ("--output") option is also required; it sets the report output file.
If the management cluster manages many workload clusters, the detector output contains so much information that the drifted resources are hard to locate. Set the "--ignore-healthy-resources" option to omit healthy resources from the report.
Connecting to the API servers of the workload clusters is helpful but not required. If the API servers of the workload clusters are not accessible, set the "--skip-access-apiserver" option to skip this step.
Run the drift detector command:
drift-detector detect --backup my-backup-data.tar.gz --insecure-skip-verify -o report.json
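For scripted use, the invocation above can be assembled programmatically. The sketch below is only an illustration: the helper `build_detect_command` and its parameter names are hypothetical, and it uses only flags that appear in the help text above. It builds the argument list without running anything.

```python
import shlex


def build_detect_command(backup, output, ignore_healthy=False,
                         skip_apiserver=False, insecure_skip_verify=False):
    """Assemble a drift-detector invocation (hypothetical helper).

    Only flags listed in the `drift-detector detect -h` output are used.
    """
    if not backup or not output:
        raise ValueError("--backup and --output are both required")
    cmd = ["drift-detector", "detect", "--backup", backup, "-o", output]
    if ignore_healthy:
        cmd.append("--ignore-healthy-resources")
    if skip_apiserver:
        cmd.append("--skip-access-apiserver")
    if insecure_skip_verify:
        cmd.append("--insecure-skip-verify")
    return cmd


cmd = build_detect_command("my-backup-data.tar.gz", "report.json",
                           ignore_healthy=True, insecure_skip_verify=True)
print(shlex.join(cmd))  # shell-quoted form of the command, for logging
```

To execute the command, pass the list to `subprocess.run(cmd, check=True)` on a machine where the drift-detector binary is on the PATH.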
The output of the drift-detector command has three main parts that describe how the Kubernetes objects in the backup match the VMs and other infrastructure resources:
- Summary: An overview of the detect result, including the overall status, total cluster count in each status, total infrastructure machines that need further confirmation, and whether the detection process generated any errors.
- Resource list: The list of clusters and infrastructure machines which are marked with different statuses.
- JSON report: A detailed description of the detection result.
The Machine listings have four possible statuses:
- Healthy: The object matches an infrastructure resource.
- Stale: The object has no corresponding infrastructure resource. No manual remediation is required; the TKG controller will take care of it.
- Ghost: An infrastructure resource is found that has no corresponding object in the backup. Requires manual remediation.
- Unknown: Error during detection. Requires further investigation.
ControlPlane, Workers, Cluster, and overall Summary listings have the following possible statuses:
- Healthy: If all the sub resources are healthy, the resource itself is marked as healthy.
- ManualRemediationNotRequired: The resource contains sub resources that are Stale. No manual remediation is required; the TKG controller will take care of it.
- ManualRemediationRequired: The resource contains sub resources that are Ghost. Requires manual remediation.
- NeedFurtherConfirmationInfraMachines: The resource matches no object and is not referenced by any cluster, so the detector cannot determine whether it is a Ghost TKG machine or has no connection to TKG. Requires further investigation.
- NeedFurtherConfirmation: Error during detection. Requires further investigation.
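The relationship between machine statuses and the status of their parent resource can be sketched as a simple roll-up. This is an illustration under stated assumptions, not the tool's code: the function `roll_up` is hypothetical, and the precedence used when several conditions apply at once (Unknown first, then Ghost, then Stale) is an assumption, since the descriptions above only define what each status means.

```python
def roll_up(machine_statuses):
    """Derive a parent status from the statuses of its machines.

    Hypothetical helper. The precedence order is an assumption; the
    documentation does not say how mixed cases are resolved.
    """
    found = set(machine_statuses)
    if "Unknown" in found:
        return "NeedFurtherConfirmation"
    if "Ghost" in found:
        return "ManualRemediationRequired"
    if "Stale" in found:
        return "ManualRemediationNotRequired"
    # No drifted or unknown sub-resources: the parent is healthy.
    return "Healthy"


print(roll_up(["Healthy", "Healthy"]))
print(roll_up(["Healthy", "Stale"]))
print(roll_up(["Stale", "Ghost"]))
```

NeedFurtherConfirmationInfraMachines is left out of the sketch because it describes infrastructure machines that are not referenced by any cluster, so it does not follow from a cluster's own machines.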
Remediation
Follow the guide to remediate all the ghost machines after performing the restoration.