2-Node Virtual SAN Software Defined Self Healing

I continue to think one of the hidden gem features of VMware Virtual SAN (VSAN) is its software defined self healing ability.  I recently received a request for a description of 2-Node self healing. I wrote about our self healing capabilities for 3-Node, 4-Node, and larger clusters here, and I wrote about Virtual SAN 6 Rack Awareness Software Defined Self Healing with Failure Domains here. I suggest you check out both before reading the rest of this. I also suggest you check out these two posts on 2-Node VSAN for a description of how it works here and how it is licensed here.

For VSAN, protection levels are defined through VMware’s Storage Policy Based Management (SPBM), which is built into vSphere and managed through vCenter.  VM objects can be assigned to different policies, which dictate the protection level they receive on VSAN. With a 2-Node Virtual SAN there is only one option for protection: the default # Failures To Tolerate (#FTT) of 1 using RAID 1 mirroring. In other words, each VM writes to both hosts; if one host fails, the data exists on the other host and remains accessible as long as the VSAN Witness VM is available.

Now that we support 2-Node VSAN, the smallest VSAN configuration possible is 2 physical nodes, each with 1 caching device (SSD, PCIe, or NVMe) and 1 capacity device (HDD, SSD, PCIe, or NVMe), plus one virtual node (the VSAN Witness VM) to hold all the witness components. Let’s focus on a single VM with the default # Failures To Tolerate (#FTT) equal to 1.  A VM has at least 3 objects (namespace, swap, vmdk).  Each object has at least 3 components (data mirror 1, data mirror 2, witness) to satisfy #FTT=1.  Let’s just focus on the vmdk object and say that the VM sits on host 1, with mirror components of its vmdk data on hosts 1 and 2 and the witness component on the virtual Witness VM (host 3).

[Figure 01: 2-Node VSAN minimum configuration]

OK, let’s start causing some trouble.  With the default # Failures To Tolerate equal to 1, VM data on VSAN should remain available if a single caching device, a single capacity device, or an entire host fails.  If a single capacity device fails, let’s say the one on esxi-02, no problem: another copy of the vmdk is available on esxi-01 and the witness is available on the Witness VM, so all is good.  There is no outage and no downtime; VSAN has tolerated one failure causing the loss of one mirror and is doing its job per the defined policy, providing access to the remaining mirror copy of the data.  Each object still has more than 50% of its components available (one mirror plus the witness is 2 out of 3, i.e., 66% of the components), so data will continue to be available unless there is a second failure of the caching device, the capacity device, or the esxi-01 host.

The situation is the same if the caching device on esxi-02 fails or the whole esxi-02 host fails: VM data on VSAN would still be available and accessible. If the VM happened to be running on esxi-02, HA would fail it over to esxi-01 and the data would remain available. In this configuration there is no automatic self healing because there is nowhere to self heal to. Host esxi-02 would need to be repaired or replaced for self healing to kick in and get back to compliance with both mirrors and the witness component available.

[Figure 02: 2-Node VSAN minimum configuration with a failure on esxi-02]
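To make the “more than 50% of components” rule concrete, here is a minimal sketch in plain Python (not any VSAN API; the component and host names are purely illustrative) that checks whether the vmdk object described above stays accessible as components become unreachable:

```python
# Minimal sketch of the availability rule for an #FTT=1 / RAID 1 object:
# the object stays accessible while more than half of its components
# (assumed here to carry equal votes) remain reachable.
# Component and host names are illustrative only.

VMDK_COMPONENTS = {
    "mirror-1": "esxi-01",      # first data mirror
    "mirror-2": "esxi-02",      # second data mirror
    "witness":  "witness-vm",   # witness component on the VSAN Witness VM
}

def object_accessible(failed_hosts):
    """Return True if more than 50% of the components survive."""
    surviving = [c for c, host in VMDK_COMPONENTS.items()
                 if host not in failed_hosts]
    return len(surviving) > len(VMDK_COMPONENTS) / 2

# One failure (esxi-02): 2 of 3 components (66%) remain -> still accessible.
print(object_accessible({"esxi-02"}))                 # True
# A second failure before repair (esxi-02 plus the Witness VM): 1 of 3 -> inaccessible.
print(object_accessible({"esxi-02", "witness-vm"}))   # False
```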

Self Healing Upon Repair

How can we get back to the point where we are able to tolerate another failure?  We must repair or replace the failed caching device, capacity device, or failed host.  Once it is repaired or replaced, the data will resync and the VSAN Datastore will be back in compliance, able to tolerate one failure again.  With this minimum VSAN configuration, self healing happens only when the failed component is repaired or replaced.

[Figure 03: 2-Node VSAN minimum configuration after repair and resync]

2-Node VSAN Self Healing Within Hosts and Across Cluster

To get self healing within hosts and across the hosts in the cluster, you must configure your hosts with more disks. Let’s investigate what happens when there are 2 SSDs and 4 HDDs per host and 4 hosts in a cluster, and the policy is set to # Failures To Tolerate equal to 1 using the RAID 1 (mirroring) protection method.

[Figure: 2-Node VSAN with multiple disk groups per host]

If one of the capacity devices on esxi-02 fails, then VSAN could choose to self heal to:

  1. Other disks in the same disk group
  2. Other disks on other disk groups on the same host

The green disks in the diagram below are eligible targets for the new mirror copy of the vmdk:

[Figure: eligible self healing targets after a capacity device failure on esxi-02]

This is not an all-encompassing and thorough explanation of all the possible scenarios.  There are dependencies on how large the vmdk is, how much spare capacity is available on the disks, and other factors.  But this should give you a good idea of how failures are tolerated and how self healing can kick in to get back to policy compliance.
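As a rough illustration of that selection process, the short Python sketch below (purely illustrative; the disk group layout, disk names, and free-space figures are made up, and this is not VSAN’s actual placement logic) filters the surviving capacity devices on esxi-02 down to the eligible targets per the two options listed above, skipping any disk without enough spare capacity for the new mirror copy:

```python
# Illustrative only: candidate capacity devices on esxi-02 that could receive
# the rebuilt mirror after a single capacity device failure, per the two
# options above (same disk group, or another disk group on the same host).
# Disk group layout and free capacity figures are invented for this sketch;
# real VSAN placement weighs additional factors.

ESXI_02_DISK_GROUPS = {
    "disk-group-1": {"hdd-1": 400, "hdd-2": 250},   # free GB per capacity device
    "disk-group-2": {"hdd-3": 900, "hdd-4": 100},
}

def eligible_targets(failed_disk, mirror_size_gb):
    targets = []
    for group, disks in ESXI_02_DISK_GROUPS.items():
        for disk, free_gb in disks.items():
            if disk == failed_disk:
                continue                 # the failed device itself is excluded
            if free_gb < mirror_size_gb:
                continue                 # not enough spare capacity for the mirror
            targets.append((group, disk))
    return targets

# hdd-1 fails and the vmdk mirror is 300 GB: only hdd-3 in disk-group-2 qualifies.
print(eligible_targets("hdd-1", 300))    # [('disk-group-2', 'hdd-3')]
```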

Self Healing When SSD Fails

If there is a failure of the caching device on esxi-02 that supports the capacity devices containing the mirror copy of the vmdk, then VSAN could choose to self heal to:

  1. Other disks on other disk groups on the same host
  2. Other disks on other disk groups on other hosts.

The green disks in the diagram below are eligible targets for the new mirror copy of the vmdk:

[Figure: eligible self healing targets after a caching device failure on esxi-02]

Self Healing When a Host Fails

If there is a failure of a host (e.g., esxi-02) that holds a mirror of the vmdk, then VSAN cannot self heal until the host is repaired or replaced.

[Figure: 2-Node VSAN host failure on esxi-02]

Summary

VMware Virtual SAN leverages all the disks on all the hosts in the VSAN datastore to self heal.  Note that I have only discussed the self healing behavior of a single VM above; other VMs on other hosts may have data on the same failed disk(s), but their mirrors may be on different disks in the cluster, and VSAN might choose to self heal them to yet other disks.  The self healing workload is therefore a many-to-many operation, spread across all the disks in the VSAN datastore.

Self healing is enabled by default; its behavior depends on the software defined protection policy (the #FTT setting), and it can occur to disks in the same disk group, to other disk groups on the same host, or to other disks on other hosts. These availability and self healing properties make VSAN a robust storage solution for all data center applications.

Virtual SAN 6 Rack Awareness – Software Defined Self Healing with Failure Domains

I continue to think one of the hidden gem features of VMware Virtual SAN (VSAN) is its software defined self healing ability. I wrote about it a few months back here in Virtual SAN Software Defined Self Healing.

Since Virtual SAN is such a different way to do storage, it allows for some interesting configuration combinations. With vSphere 6, VMware will be introducing a new Virtual SAN feature called “Rack Awareness,” accomplished by creating multiple “Failure Domains” and placing the hosts in the same rack into the same Failure Domain. This “Rack Awareness” feature builds on the # Failures To Tolerate policy of Virtual SAN.

The rest of this post will look a lot like the previous post I did on self healing but will translate it for the Rack Awareness feature.

Minimum Rack Awareness Configuration

Let’s start with the smallest VSAN “Rack Awareness” configuration possible that provides redundancy: a 3-rack, 6-host (2 per rack) vSphere cluster with VSAN enabled and 1 SSD and 1 HDD per host. In VSAN, a disk group is built around an SSD, so the 1 HDD is placed into a disk group with the 1 SSD. The SSD performs the write and read caching for the HDDs in its disk group, and the HDD permanently stores the data.

Let’s start with a single VM with the default # Failures To Tolerate (#FTT) equal to 1. A VM has at least 3 objects (namespace, swap, vmdk). Each object has 3 components (data 1, data 2, witness) to satisfy #FTT=1. Let’s just focus on the vmdk object and say that the VM sits on host 1, with replicas/mirrors/copies (these terms can be used interchangeably) of its vmdk data on host 1 in rack 1 and host 2 in rack 2, and the witness on host 3 in rack 3. The rule in Virtual SAN is that each of these three components of an object (data 1, data 2, witness) must sit on a different host. With Rack Awareness, they must also sit on hosts in different racks.

[Figure: Rack Awareness component placement across three racks]
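To show that placement rule in code form, here is a small Python sketch (illustrative only, not the actual VSAN placement engine; the host and rack names are made up to mirror the example above) that validates a proposed component layout against the “different hosts, different racks” requirement:

```python
# Sketch of the placement rule described above: with #FTT=1, the three
# components of an object (data 1, data 2, witness) must sit on different
# hosts, and with Rack Awareness enabled they must also sit in different
# racks (failure domains). Names below mirror the example in the text.

PLACEMENT = {
    "data-1":  {"host": "host-1", "rack": "rack-1"},
    "data-2":  {"host": "host-2", "rack": "rack-2"},
    "witness": {"host": "host-3", "rack": "rack-3"},
}

def placement_valid(placement, rack_awareness=True):
    hosts = [p["host"] for p in placement.values()]
    racks = [p["rack"] for p in placement.values()]
    if len(set(hosts)) != len(hosts):
        return False                     # two components share a host
    if rack_awareness and len(set(racks)) != len(racks):
        return False                     # two components share a rack
    return True

print(placement_valid(PLACEMENT))        # True: three hosts across three racks
```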

OK, let’s start causing some trouble. With the default # Failures To Tolerate equal to 1, VM data on VSAN should be available if a single SSD, a single HDD, a single host, or an entire rack fails.


What is the RAW to Usable capacity in Virtual SAN (VSAN)?

I get asked this question a lot so in the spirit of this blog it was about time to write it up.

The only correct answer is “it depends”. Typically, the RAW to usable ratio is 2:1 (i.e. 50%). By default, 1TB RAW capacity equates to approximately 500GB usable capacity. Read on for more details.

In VSAN there are two choices that impact RAW to usable capacity. One is the protection level and the other is the Object Space Reservation (%). Let’s start with protection.

Virtual SAN (VSAN) does not use hardware RAID (see the disclaimer at the end). Thus, it does not suffer the capacity, performance, or management overhead penalties of hardware RAID. The raw capacity of the local disks on a host is presented to the ESXi hypervisor, and when VSAN is enabled in the cluster the local disks are put into a shared pool that is presented to the cluster as a VSAN Datastore. To protect VMs, VSAN implements software distributed RAID leveraging the disks in the VSAN Datastore. This is defined by setting policy. You can have different protection levels for different policies (Gold, Silver, Bronze), all satisfied by the same VSAN Datastore.

The VSAN protection policy setting is “Number of Failures to Tolerate” (#FTT) and can be set to 0, 1, 2, or 3. The default is #FTT=1, which means that using distributed software RAID there will be 2 (#FTT+1) copies of the data on two different hosts in the cluster. So if the VM is 100GB, it takes 200GB of VSAN capacity to satisfy the protection. This is analogous to RAID 1 on a storage array, but rather than writing to a disk and then to another disk in the same host, we write to another disk on another host in the cluster. With #FTT=1, VSAN can tolerate a single SSD failure, a single HDD failure, or a single host failure and maintain access to data. If #FTT is set to 3, there will be 4 copies of the VM data, so RAW to usable would be 25%. In addition, there is a small formatting overhead (a couple of MB) on each disk, but it is negligible in the grand scheme of things.

#FTT    # Copies (#FTT+1)    RAW-to-usable Capacity %
0       1                    100%
1       2                    50%
2       3                    33%
3       4                    25%
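For a quick sanity check of those numbers, here is a small back-of-the-envelope Python calculator (ignoring the small per-disk formatting overhead mentioned above) that reproduces the 100GB VM example and the table:

```python
# Back-of-the-envelope RAW-to-usable math for RAID 1 protection:
# number of copies = #FTT + 1, so usable capacity = raw / copies.
# Ignores the small per-disk formatting overhead noted in the text.

def copies(ftt):
    return ftt + 1

def raw_needed_gb(vm_size_gb, ftt):
    return vm_size_gb * copies(ftt)

def usable_pct(ftt):
    return 100 / copies(ftt)

print(raw_needed_gb(100, 1))             # 200 GB of raw capacity for a 100 GB VM at #FTT=1
for ftt in range(4):
    print(ftt, copies(ftt), f"{usable_pct(ftt):.0f}%")   # 100%, 50%, 33%, 25%
```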

Perhaps you create the following policies with the specified #FTT:

  • Bronze with #FTT=0 (thus no failure protection)
  • Silver policy with #FTT=1 (default software RAID 1 protection)
  • Gold policy with #FTT=2 (able to maintain availability in the event of a double disk drive failure, double SSD failure, or double host failure)
  • Platinum policy with #FTT=3 (4 copies of the data).

Your RAW to usable capacity will depend on how many VMs you place in the different policies and how much capacity each VM is allocated and consumes. This brings us to the Object Space Reservation (%) discussion.

In VSAN, different policies can have different Object Space Reservation (%) settings (the percentage that is fully provisioned up front) associated with them. By default, all VMs are thin provisioned, i.e., a 0% reservation. You can choose to reserve any percentage up to 100%. If you create a 500GB VM and put it into a policy with an Object Space Reservation of 50%, it will initially consume 250GB out of the VSAN Datastore. If you leave the default of 0% reservation, it will not consume any capacity out of the VSAN Datastore up front, but as data is written it will consume capacity per the protection level policy defined and described above.
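As a rough sketch of how the reservation and the protection policy combine (assuming, per the discussion above, that the reservation applies to the VM’s logical size and that the #FTT+1 copies then multiply the raw consumption):

```python
# Illustrative math for Object Space Reservation (%) using the 500 GB / 50%
# example above. The split between "logical reservation" and "raw footprint"
# follows the #FTT discussion earlier in the post; exact accounting is in the
# Design and Sizing Guide linked below.

def initial_reservation_gb(vm_size_gb, osr_pct):
    """Logical capacity reserved up front (0% = fully thin provisioned)."""
    return vm_size_gb * osr_pct / 100

def raw_footprint_gb(logical_gb, ftt):
    """Raw capacity consumed once the #FTT+1 protection copies are included."""
    return logical_gb * (ftt + 1)

reserved = initial_reservation_gb(500, 50)
print(reserved)                          # 250 GB reserved out of the VSAN Datastore
print(raw_footprint_gb(reserved, ftt=1)) # 500 GB of raw capacity at #FTT=1
```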

That ended up being a longer write up than I anticipated, but as you can see, it truly does depend. I suggest sticking to the rule of thumb of 50% RAW to usable. But if you are looking for exact RAW to usable capacity calculations, you can refer to the VMware Virtual SAN Design and Sizing Guide found here: https://blogs.vmware.com/vsphere/2014/03/vmware-virtual-san-design-sizing-guide.html
Also, you can check out Duncan Epping’s Virtual SAN Datastore Calculator: http://vmwa.re/vsancalc

Disclaimer at the end: ESXi hosts require IO Controllers to present local disk for use in VSAN. The compatible controllers are found on the VSAN HCL here: http://www.vmware.com/resources/compatibility/search.php?deviceCategory=vsan

These controllers work in one of two modes: passthrough or RAID 0. In passthrough mode, the RAW disks are presented to the ESXi hypervisor. In RAID 0 mode, each disk needs to be placed in its own RAID 0 disk group and made available as a local disk to the hypervisor. The exact RAID 0 configuration steps are dependent on the server and IO Controller vendor. Once each disk is placed in its own RAID 0 disk group, you will then need to log in via SSH to each of your ESXi hosts and run commands to ensure that the HDDs are seen as “local” disks by Virtual SAN and that the SSDs are seen as “local” and “SSD”.

I hope this is helpful. Of course, questions and feedback are welcome.