Drift Detector for Tanzu Kubernetes Grid Management Cluster

version 0.1.0 — July 12, 2023

Summary

VMware Tanzu Kubernetes Grid (TKG) is a product for managing the lifecycle of Kubernetes clusters.

Since v2.1.0, TKG has provided a way to back up and restore the cluster objects on a management cluster. If a disaster makes the management cluster unavailable while the workload clusters remain accessible, the customer can provision a new management cluster instance, restore the cluster objects, and continue managing the existing workload clusters through the new instance. For more details, refer to https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/2.2/tkg-deploy-mc-22/mgmt-manage-br-infra.html

In the context of this solution, "drift" refers to a mismatch between the resources recorded in the backup and the actual state of the infrastructure. This mismatch can cause problems during restoration. For details, see the "Handling Drift" section of the documentation: https://docs.vmware.com/en/VMware-Tanzu-Kubernetes-Grid/2.2/tkg-deploy-mc-22/mgmt-manage-br-infra.html#drift

To address drift, the Drift Detector has been introduced as a command-line (CLI) tool. It compares the content of a backup with the current state of the infrastructure and generates a detailed report. The report helps users identify potential issues and perform any necessary manual steps to mitigate the drift before starting the restore workflow, which makes for a smoother restoration.

Requirements

The Drift Detector tool works with VMware Tanzu Kubernetes Grid (TKG) v2.3.0 and above.

Version matrix:

Drift Detector    TKG Version
0.1.0             v2.3.0

Instructions

Background

When backing up and restoring the management and workload cluster infrastructure on vSphere, if cluster objects change after the most recent backup, the state of the system after a restore does not match its desired, most recent state. This problem is called "drift". Drift is complicated and hard to fix. We recommend taking a backup immediately after any action that changes the clusters (for example, scaling a cluster up or down, or creating new clusters) and also scheduling regular backups to mitigate drift.

We provide a drift detection tool to help users find possible drift before performing a restore. Because drift is complicated, the detector works on a best-effort basis: it may not cover every case and should be used only as a reference.

How it works

The drift detector parses the infrastructure objects from the backup tarball, compares them with the real infrastructure resources, and detects the differences between them. The tool also tries to read Kubernetes node information from the API servers of the workload clusters to help detect drift.
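As a rough illustration of the first step (reading objects out of a backup tarball), the Python sketch below counts the Kubernetes object kinds found in the JSON files of an archive. It is not the detector's implementation: the function name count_kinds is hypothetical, and the sketch assumes only that the tarball holds one JSON-encoded object per file, each with a "kind" field.

```python
import json
import tarfile
from collections import Counter


def count_kinds(tarball_path):
    """Count Kubernetes object kinds in the JSON files of a tarball.

    Assumption: each *.json member holds one object with a "kind" field.
    Members that are not valid JSON objects are skipped.
    """
    kinds = Counter()
    # "r:*" lets tarfile detect the compression (gzip, bzip2, none, ...).
    with tarfile.open(tarball_path, "r:*") as tar:
        for member in tar:
            if not (member.isfile() and member.name.endswith(".json")):
                continue
            handle = tar.extractfile(member)
            if handle is None:
                continue
            try:
                obj = json.load(handle)
            except ValueError:
                continue  # malformed JSON or bad encoding: ignore this member
            if isinstance(obj, dict) and "kind" in obj:
                kinds[obj["kind"]] += 1
    return kinds
```

Calling count_kinds("my-backup-data.tar.gz") gives a quick view of what a backup contains before the detector is run against it.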

How to install

Download and unzip the file from the Fling's DOWNLOAD link. The file contains binaries for Linux, macOS, and Windows that you can run directly from a shell; no installation is required.

How to use

Use the drift detector tool before performing the restoration by following these steps:

Download the backup tarball

Download the backup tarball either directly from the backup store portal or with the Velero CLI:

velero backup download backup-name
            

Detect the drift with the detector

All the available options of the drift-detector command are as follows:


drift-detector detect -h
Detect the drifts between the backup and infrastructure

Usage:
  drift-detector detect [flags]

Flags:
      --backup string              The local path of the backup tarball file. Required
      --format string              The report format. One of: (json) (default "json")
  -h, --help                       help for detect
      --ignore-healthy-resources   Ignore the healthy resources in the report
      --insecure-skip-verify       Skip the verification of an infra server’s certificate during a connection.
  -o, --output string              Report output file. Required
      --skip-access-apiserver      Specify whether skip accessing the API servers of workload clusters during the detection

Global Flags:
  -D, --debug   Enable debug mode
            

The "--backup" option is required; it sets the local file system path of the backup tarball. The "--output" ("-o") option is also required; it sets the report output file.

If the management cluster manages many workload clusters, the detector output contains a lot of information, which makes the drifted resources hard to locate. Set the "--ignore-healthy-resources" option to make the output contain only the drifted resources.

Connecting to the API servers of the workload clusters is helpful but not required. If the API servers of the workload clusters are not accessible, set the "--skip-access-apiserver" option to skip this step.
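When the detector is driven from a script, the options described above can be assembled programmatically. The following Python sketch builds the argument list using only the flags shown in the help output; the function names build_detect_command and run_detect are hypothetical, and the sketch assumes the drift-detector binary is on the PATH.

```python
import subprocess


def build_detect_command(backup, output, ignore_healthy=False,
                         skip_apiserver=False, insecure_skip_verify=False,
                         debug=False):
    """Return the argv list for `drift-detector detect`.

    Only flags listed in the tool's help output are used.
    """
    # --backup and --output are the two required options.
    cmd = ["drift-detector", "detect", "--backup", backup, "--output", output]
    if ignore_healthy:
        cmd.append("--ignore-healthy-resources")
    if skip_apiserver:
        cmd.append("--skip-access-apiserver")
    if insecure_skip_verify:
        cmd.append("--insecure-skip-verify")
    if debug:
        cmd.append("--debug")
    return cmd


def run_detect(**options):
    """Run the detector and raise CalledProcessError on a non-zero exit."""
    return subprocess.run(build_detect_command(**options), check=True)
```

Passing the arguments as a list, rather than as a single shell string, avoids quoting problems when the backup path contains spaces.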

Run the drift detector command:

drift-detector detect --backup my-backup-data.tar.gz --insecure-skip-verify -o report.json

The command output has three main parts that describe how the Kubernetes objects in the backup match the VMs and other infrastructure resources:

  • Summary: An overview of the detection result, including the overall status, the number of clusters in each status, the number of infrastructure machines that need further confirmation, and whether the detection process generated any errors.
  • Resource list: The list of clusters and infrastructure machines which are marked with different statuses.
  • JSON report: A detailed description of the detection result.

The Machine listings have four possible statuses:

  • Healthy: The object matches an infrastructure resource.
  • Stale: The object has no corresponding infrastructure resource. No manual remediation is required; the TKG controller will take care of it.
  • Ghost: An infrastructure resource is found that has no corresponding object in the backup. Requires manual remediation.
  • Unknown: Error during detection. Requires further investigation.
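The Healthy, Stale, and Ghost statuses amount to a set comparison between the machines recorded in the backup and the machines found on the infrastructure. The Python sketch below illustrates that idea only. It assumes that machines are matched by name, which the text above does not specify, and the function name classify_machines is hypothetical. The Unknown status (an error during detection) is not modeled.

```python
def classify_machines(backup_machines, infra_machines):
    """Map each machine name to Healthy, Stale, or Ghost.

    Assumption: a backup object and an infrastructure resource
    correspond when they share the same name.
    """
    backup = set(backup_machines)
    infra = set(infra_machines)
    statuses = {}
    for name in backup & infra:
        statuses[name] = "Healthy"  # present in both places
    for name in backup - infra:
        statuses[name] = "Stale"    # in the backup, missing on the infrastructure
    for name in infra - backup:
        statuses[name] = "Ghost"    # on the infrastructure, missing in the backup
    return statuses
```

For example, a worker added by a scale-up after the last backup exists only on the infrastructure, so it would be reported as Ghost.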

ControlPlane, Workers, Cluster, and overall Summary listings have five possible statuses:

  • Healthy: If all the sub resources are healthy, the resource itself is marked as healthy.
  • ManualRemediationNotRequired: The resource contains sub resources that are Stale. No manual remediation is required; the TKG controller will take care of it.
  • ManualRemediationRequired: The resource contains sub resources that are Ghost. Requires manual remediation.
  • NeedFurtherConfirmationInfraMachines: The resource matches no object and is not referenced by any cluster, so the detector cannot determine whether it is a Ghost TKG machine or has no connection to TKG. Requires further investigation.
  • NeedFurtherConfirmation: Error during detection. Requires further investigation.
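One way to read the list above is as a roll-up from Machine statuses to the status of their parent resource. The Python sketch below shows such a roll-up under two assumptions that the text does not confirm: that the most severe sub-resource status wins, and that severity increases in the order Healthy, ManualRemediationNotRequired, ManualRemediationRequired, NeedFurtherConfirmation. The names SEVERITY, MACHINE_TO_PARENT, and roll_up are hypothetical, and NeedFurtherConfirmationInfraMachines is left out because it concerns machines that are not referenced by any cluster.

```python
# Assumed severity order, least to most severe (not confirmed by the source).
SEVERITY = [
    "Healthy",
    "ManualRemediationNotRequired",
    "ManualRemediationRequired",
    "NeedFurtherConfirmation",
]

# Machine status -> status contributed to the parent resource.
MACHINE_TO_PARENT = {
    "Healthy": "Healthy",
    "Stale": "ManualRemediationNotRequired",
    "Ghost": "ManualRemediationRequired",
    "Unknown": "NeedFurtherConfirmation",
}


def roll_up(machine_statuses):
    """Return the parent status for a collection of Machine statuses."""
    contributed = [MACHINE_TO_PARENT[status] for status in machine_statuses]
    # The most severe contribution wins; no machines at all counts as Healthy.
    return max(contributed, key=SEVERITY.index, default="Healthy")
```

Under these assumptions, a cluster with one Stale and one Ghost machine rolls up to ManualRemediationRequired.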

Remediation

Follow the guide to remediate all the ghost machines after performing the restoration.
