vSAN Maintenance Mode Considerations

There are 3 options when putting a host in maintenance mode when that host is a member of a vSphere Cluster with vSAN enabled. You follow the normal process to put a host in maintenance mode, but if vSAN is enabled, these options will pop up:

Ensure accessibility
Full data migration
No data migration

There’s a 4^th consideration that I’ll describe at the end.

I would expect most virtualization administrators to pick “Ensure accessibility” almost every time.

Ensure accessibility

Before we investigate, I want to reinforce that vSAN, by default, is designed to work and continue to provide VM’s access to data even if a host disappears. The default vSAN policy is “Number of Failures To Tolerate” equal to 1 (#FTT=1), which means a HDD, SSD, or whole host (thus all the SSD and HDD on that host) can be unavailable, and data is available somewhere else on another host in the cluster. If a host is in maintenance mode, then it is down, but vSAN by default has another copy of the data on another host.

VMware documents the options here:

Place a Member of Virtual SAN Cluster in Maintenance Mode

Ensure accessibility

This option will check to make sure that putting the particular host in maintenance mode will not take away the only data copy of any VM. There are two scenarios I can think of that this would happen:

In Storage Policy Based Management, you created a Storage Policy based on vSAN with #FTT=0 and attached at least 1 VM to that policy and that VM has data on the host going into maintenance mode.
Somewhere in the cluster you have failed drives or hosts and vSAN self-healing rebuilds haven’t completed. You then put a host into maintenance mode and that host has the only good copy of data remaining.

As rare as these scenarios are, they are possible. By choosing the “Ensure accessibility” option, vSAN will find the single copies of data on that host and regenerate them on other hosts. Now when the host goes into maintenance mode, all VM data is available. This is not a full migration of all the data off that host, its just a migration of the necessary data to “ensure accessibility” by all the VM’s in the cluster. When the host goes into maintenance mode, it may take a little bit of time to complete the migration but you’ll know that VM’s won’t be impacted. During the maintenance of this host, some VM’s will likely be running in a degraded state with 1 less copy that the policy specifies. Personally, I think this choice makes the most sense most of the time, it is the default selection, and I expect vSphere administrators to choose this option almost every time.

No data migration

This option puts the host in maintenance mode no matter what’s going on in the cluster. I would expect virtualization administrators to almost never pick this option unless:

You know the cluster is completely healthy (no disk or host failures anywhere else)
The VM’s that would be impacted aren’t critical.
All the VM’s in the cluster are powered off.

For the reasons explained in the “Ensure accessibility” above, its possible that the host going into maintenance mode has the only good copy of the data. If this is not a problem, then choose this option for the fastest way to put a host into maintenance mode. Otherwise, choose “Ensure accessibility”.

Full data migration

I would expect virtualization administrators to choose this option less frequently than “Ensure Accessibility” but will choose it for a couple of reasons:

The host is being replaced by a new one.
The host will be down for a long time, longer than the normal maintenance window of applying a patch and rebooting.
You want to maintain the #FTT availability for all VM’s during the maintenance window

Keep in mind, if you choose this option you must have 4 or more hosts in your cluster, and you don’t mind waiting for the data migration to complete. The time to complete the data migration is dependent on the amount of capacity consumed on the host going into maintenance mode. Yes, this could take some time. The laws of physics apply. 10GbE helps to move more data in the same amount of time. And it helps if the overall environment is not too busy.

When the migration is complete, the host is essentially evacuated out of the cluster and all it’s data is spread across the remaining hosts. VM’s will not be running in a degraded state during the maintenance window and will be able to tolerate the failures per their #FTT policy.

4^th consideration

I mentioned there is a 4^th consideration. For the VM’s that you want protected with at least two copies of data (#FTT=1) even during maintenance windows, you have two options. One is to set the #FTT=2 for those VM’s so they have 3 copies on 3 different hosts. If one of those hosts is in maintenance mode and you didn’t choose “Full Data Migration” then you still have 2 copies on other hosts, thus the VM’s could tolerate another failure of a disk or host. You could choose to create a storage policy based on vSAN with #FTT=2 and attach your most critical VM’s to it. For more information on running business critical applications on vSAN see:

Running Microsoft Business Critical Application on Virtual SAN 6.0

I hope this helps in your decision making while administering vSAN. I recommend testing the scenarios prior to implementing a cluster in production so you get a feel for the various options.

Ensure accessibility

No data migration

Full data migration

4th consideration

Share this:

Related

Leave a comment Cancel reply

4^th consideration