Replays of Virtual SAN Sessions at VMworld 2016 That You Didn’t Want to Miss

What a great week last week at VMworld 2016. I had many good meetings with customers, participated in 3 breakout sessions, met up with some old friends and met some new ones. If you missed VMworld, well, then you missed a bunch of great sessions. There’s no way you could have seen them all, so, VMware has made them available here: http://www.vmworld.com/en/sessions/2016.html.

I participated in two sessions:

The first one was a customer panel discussion on Tuesday afternoon. I need to thank Glenn Brown from Stanley Black & Decker, Mike Caruso from Synergent, Tom Cronin from M&T Bank, and Andrew Schilling from Baystate Health who all did a fantastic job representing themselves, their companies, and their use of Virtual SAN. We had great interaction from the audience with lots of good questions. For a replay of the session check this out:

Four Unique Enterprise Customers Deployment of VMware Virtual SAN [STO7560]
Glen Brown
, System Engineer, Stanley Black and Decker
Michael Caruso, AVP Corporate Information Systems, Synergent
Tom Cronin, Sr. Staff Specialist – Platforms Engineering Group, M&T Bank
Frank Gesino, Senior Technical Account Manager, VMware
Andrew Schilling, Team Leader – IT Infrastructure, Baystate Health Inc.
Tuesday, Aug 30, 5:00 p.m. – 6:00 p.m.

The other session I was involved in was on Wednesday and repeated on Thursday. I had the good fortune to present with two VSAN Product Managers who are responsible for making VSAN great. Vahid Fereydounkolahi is responsible for driving new features into the VSAN product and Rakesh Radhakrishnan is responsible for making sure all the vendor hardware components are properly qualified for VSAN and for looking out into the future of new technologies like NVMe and RDMA to adopt into VSAN. For a replay of the two sessions check these out:

Peter Keilty, Office of the CTO, Americas Field – Storage and Availability, VMware, Inc.
Rakesh Radhakrishnan, Product Management & Strategy Leader, VMware
Wednesday, Aug 31, 2:00 p.m. – 3:00 p.m.
Vahid Fereydounkolahi kicked this one off discussion VSAN features, capabilities, and how it works, I took over in the middle to discuss Day 2 operations, and Rakesh Radhakrishnan finished it off discussing the Ready Node program as well as current and future flash and IO technology that VSAN incorporates or will incorporate.
Virtual SAN Technical Deep Dive and What’s New [STO8246R]

Thursday, Sep 01, 10:30 a.m. – 11:30 a.m.
Vahid wasn’t able to make this time so I kicked things off talking about VSAN features, capabilities, how it works, and Day 2 operations, and Rakesh Radhakrishnan finished it off discussing the Ready Node program as well as current and future flash and IO technology that VSAN incorporates or will incorporate.
Virtual SAN Technical Deep Dive and What’s New [STO8246R]

In my previous blog post I highlighted the sessions you wouldn’t want to miss. So here, I’ll provide the links to those sessions. A few either haven’t been uploaded yet or won’t because of legal or future looking reasons:

Christos Karamanolis is literally the brains behind VSAN since its inception and our chief visionary for Storage. If you want the whole picture wrapped up in a 1 hour session, this is it.
An Industry Roadmap: From storage to data management [STO7903]
Christos Karamanolis, VMware Fellow – CTO of Storage and Availability, VMware
Wednesday, Aug 31, 4:00 p.m. – 5:00 p.m.

Continue reading “Replays of Virtual SAN Sessions at VMworld 2016 That You Didn’t Want to Miss”

2-Node Virtual SAN Software Defined Self Healing

I continue to think one of the hidden gem features of VMware Virtual SAN (VSAN) is its software defined self healing ability.  I recently received a request for a description of 2-Node self healing. I wrote about our self healing capabilities for 3-Node, 4-Node and more here. And I wrote about Virtual SAN 6 Rack Awareness Software Defined Self Healing with Failure Domains here. I suggest you check out both before reading the rest of this. I also suggest you check out these two posts on 2-Node VSAN for a description on how they work here and are licensed here.

For VSAN, protection levels can be defined through VMware’s Storage Policy Based Management (SPBM) which is built into vSphere and managed through vCenter.  VM objects can be assigned to different policy which dictates the protection level they receive on VSAN. With a 2-Node Virtual SAN there is only one option for protection, which is the default # Failures To Tolerate (#FTT) equal to 1 using RAID1 mirroring. In other words, each VM will write to both hosts, if one fails, the data exists on the other host and is accessible as long as the VSAN Witness VM is available.

Now that we support 2-Node VSAN, the smallest VSAN configuration possible is 2 physical nodes with 1 caching device (SSD, PCIe, or NVMe) and 1 capacity device (HDD, SSD, PCIe, or NVMe) each and one virtual node (VSAN Witness VM) to hold all the witness components. Let’s focus on a single VM with the default # Failures To Tolerate (#FTT) equal to 1.  A VM has at least 3 objects (namespace, swap, vmdk).  Each object has at least 3 components (data mirror 1, data mirror 2, witness) to satisfy #FTT=1.  Lets just focus on the vmdk object and say that the VM sits on host 1 with mirror components of its vmdk data on host 1 and 2 and the witness component on the virtual Witness VM (host 3).

01 - 2-Node VSAN min

OK, lets start causing some trouble.  With the default # Failures To Tolerate equal 1, VM data on VSAN should be available if a single caching device, a single capacity device, or an entire host fails.  If a single capacity device fails, lets say the one on esxi-02, no problem, another copy of the vmdk is available on esxi-01 and the witness is available on the Witness VM so all is good.  There is no outage, no downtime, VSAN has tolerated 1 failure causing loss of one mirror, and VSAN is doing its job per the defined policy and providing access to the remaining mirror copy of data.  Each object has more that 50% of its components available (one mirror and witness are 2 out of 3 i.e. 66% of the components available) so data will continue to be available unless there is a 2nd failure of either the caching device, capacity device, or esxi-01 host.  The situation is the same if the caching device on esxi-02 fails or the whole host esxi-02 fails. VM data on VSAN would still be available and accessible. If the VM happened to be running on esxi-02 then HA would fail it over to esxi-01 and data would be available. In this configuration, there is no automatic self healing because there’s no where to self heal to. Host esxi-02 would need to be repaired or replaced in order for self healing to kick in and get back to compliance with both mirrors and witness components available.

02 - 2-Node VSAN min

Self healing upon repair

How can we get back to the point where we are able to tolerate another failure?  We must repair or replace the failed caching device, capacity device, or failed host.  Once repaired or replaced, data will resync, and the VSAN Datastore will be back to compliance where it could then tolerate one failure.  With this minimum VSAN configuration, self healing happens only when the failed component is repaired or replaced.

03 - 2-Node VSAN min

2-Node VSAN Self Healing Within Hosts and Across Cluster

To get self healing within hosts and across the hosts in the cluster you must configure your hosts with more disks. Let’s investigate what happens when there are 2 SSD and 4 HDD per host and 4 hosts in a cluster and the policy is set to # Failures To Tolerate equal 1 using the RAID 1 (mirroring) protection method.

01~ - 2-Node VSAN.png

If one of the capacity devices on esxi-02 fails then VSAN could chose to self heal to:

  1. Other disks in the same disk group
  2. Other disks on other disk groups on the same host

The green disks in the diagram below are eligible targets for the new instant mirror copy of the vmdk:

02~ - 2-Node VSAN

This is not an all encompassing and thorough explanation of all the possible scenarios.  There are dependencies on how large the vmdk is, how much spare capacity is available on the disks, and other factors.  But, this should give you a good idea of how failures are tolerated and how self healing can kick in to get back to policy compliance.

Self Healing When SSD Fails

If there is a failure of the caching device on esxi-02 that supports the capacity devices that contain the mirror copy of the vmdk then VSAN could chose to self heal to:

  1. Other disks on other disk groups on the same host
  2. Other disks on other disk groups on other hosts.

The green disks in the diagram below are eligible targets for the new instant mirror of the vmdk:

03~ - 2-Node VSAN.png

Self Healing When a Host Fails

If there is a failure of a host (e.g. esxi-02) that supports mirror of the vmdk then VSAN cannot self heal until the host is repaired or replaced.

04~ - 2-Node VSAN

Summary

VMware Virtual SAN leverages all the disks on all the hosts in the VSAN datastore to self heal.  Note that I’ve only discussed above the self healing behavior of one VM but other VM’s on other hosts may have data on the same failed disk(s) but their mirror may be on different disks in the cluster and VSAN might choose to self heal to other different disks in the cluster.  Thus the self healing workload is a many-to-many operation and thus spread around all the disks in the VSAN datastore.

Self healing is enabled by default, behavior is dependent on the software defined protection policy (#FTT setting), and can occur to disks in the same disk group, to other disk groups on the same host, or to other disks on other hosts. The availability and self healing properties make VSAN a robust storage solution for all data center applications.

“Virtualization and Cloud Are Here to Stay” PC Connection podcast series – VMware Software Defined Storage and Virtual SAN

This is another fun short project I was fortunate enough to be involved in with a great VMware partner, PC Connection.

VMware Software Defined Storage and Virtual SAN

This is part of their “Virtualization and Cloud Are Here to Stay” podcast series.  Thanks to PC Connection for letting me be a part of it.

 

Quick discussion on VVols

One of the big topics at VMworld 2014 was VVols.  VMware announced it will be part of the next release of vSphere and almost every storage vendor on the planet is excited about the benefits that VVols bring.  I was working the VVol booth at VMworld and had the pleasure of being interviewed by VMworld TV to discuss the comparison between VSAN and VVols.  This was fun but unscripted and off the cuff so here it is:

VMworld TV Interview: Peter Keilty of VMware Discussed Virtual Volumes

What I’m trying to say is:

  • VSAN is the first supported storage solution takes advantage of VVols.
  • VVols, in vSphere.NEXT, will work in conjunction with VASA to allow all block and file based storage arrays to fully realize the benefits of Storage Policy Based Management (SPBM).
  • Each storage vendor can write a VASA/VVol provider that registers with vCenter to integrate with the vSphere API’s and promote their storage capabilities to vCenter. I expect just about every storage array vendor to do this.  I have seen VVol demonstrations by EMC, NetApp, Dell, HP, and IBM.
  • VVols eliminates the requirements of creating LUNs or Volumes on the arrays, instead, arrays present a pool or multiple pools of capacity in the form of storage containers that the hosts in the cluster see as datastores
  • Through SPBM, administrators can create different service levels in the form of policy that can be satisfied by the underlying storage provider container.
  • When VM’s get provisioned, they get assigned to a policy, and their objects (namespace, swap, vmdk’s, snap/clones) get placed as native objects into the container in the form of VVols.
  • You can even assign objects from the same VM to different policy to give them different service levels, all potentially satisfied by the same storage provider or perhaps different provider containers.  In other words, a vmdk for an OS image might want dedupe enabled but a vmdk for a database might not want dedupe but might want cache acceleration.  Different policy can be set and each object can be assigned to the policy that will deliver the desired service level.  The objects could be placed into the same storage array pool but taking advantage of different storage array features.  And these can be changed on the fly as needed.

Like all the storage vendors out there, I’m very excited about the benefits of VVols.  For a full description and deep dive check out this awesome VMworld session by Rawlinson Rivera (http://www.punchingclouds.com/) and Suzy Visvanathan:

Virtual Volumes Technical Deep Dive

Virtual SAN Software Defined Self Healing

I think one of the hidden gem features of VMware Virtual SAN (VSAN) is it’s software defined self healing ability.  On the surface this concept is simple.  The entire pool of disks in VSAN are used as hot spares.  In the event of a failure, data from the failed disks or hosts are found on other disks in the cluster and replicas (mirrors) are rebuilt onto other disks in the cluster to get back to having redundant copies for protection.  For VSAN, the protection level is defined through VMware’s Storage Policy Based Management (SPBM) which is built into vSphere and managed through vCenter.  OK, lets get into the details.

Lets start with the smallest VSAN configuration possible that provides redundancy, a 3 host vSphere cluster with VSAN enabled and 1 SSD and 1 HDD per host.  And, lets start with a single VM with the default # Failures To Tolerate (#FTT) equal to 1.  A VM has at least 3 objects (namespace, swap, vmdk).  Each object has 3 components (data 1, data 2, witness) to satisfy #FTT=1.  Lets just focus on the vmdk object and say that the VM sits on host 1 with copies of its vmdk data on host 1 and 2 and the witness on host 3.

Minimum VSAN configuration with VM policy of #FTT=1

OK, lets start causing some trouble.  With the default # Failures To Tolerate equal 1, VM data on VSAN should be available if a single SSD, a single HDD, or an entire host fails.

Continue reading “Virtual SAN Software Defined Self Healing”

Virtual SAN Disaster Recovery – vSphere Replication (available now) or Virtual RecoverPoint (coming soon), choose your protection!

I’m often asked how to protect Virtual SAN (VSAN). Its simple, any product focused on protecting a virtual machine (VM) will work for protecting VM’s sitting on a VSAN enabled vSphere cluster. VMware offers VDP/VDPA for backup & recovery and there are many other VMware partners with backup & recovery solutions focused on protecting VM’s. Backup & Recovery is a great way to protect data but some customers like the benefit of more granular recovery points that comes from data replication either locally or to a disaster recovery site.

To protect VSAN data in a primary site to a remote disaster recovery site VMware offers vSphere Replication (VR) to replicate the VM data sitting on a VSAN Datastore over the DR site. Of course Site Recovery Manager (SRM) is supported to automate failover, failback and testing. The VR/SRM combined solution can also be used for planned data center migrations. Here are a few great write-ups on the topics:

VMware Virtual SAN Interoperability: vSphere Replication and vCenter Site Recovery Manager

Virtual SAN Interoperability – Planned migration with vSphere Replication and SRM

VSAN and vSphere Replication Interop

One of the main benefits of VR is that it will work to replicate VM data on any storage to another site with hosts connected to any other storage. So, VSAN can be the source, the target, or both.

VSAN & VR

 

vSphere Replication can be set to asynchronously replicate every day, hour, or up to every 15 minutes. Thus providing a Recovery Point Objective (RPO) of up to 15 minutes. For many customers, this is “good enough”. For some customer workloads, asynchronous replication is not “good enough”. They need synchronous replication protection and there are several solutions in the market. One that I’ve been a big fan of for a long time is EMC’s RecoverPoint which has a great reputation for protecting enterprise mission critical data and applications.  Essentially it splits every write transaction, journals it, and synchronously makes a copy of it either locally or to a remote DR site without impacting application performance. Of course there are more details but this is essentially what it does which results in being able to recover back to any point in time. Often it’s labeled as “Tivo or DVR for the data center”. One other benefit of RecoverPoint is it can replicate data from any storage to any storage, as long as there is a splitter for the storage. EMC VNX and VMAX storage arrays have splitters built in.

The big news that just came out last week that peeked my interest is that EMC is now offering a Beta of a completely software based RecoverPoint solution that embeds the splitter into vSphere. This brings the RecoverPoint benefits to any VMware customer running VM’s on any storage: block, file, or of course even VSAN. The EMC initiative is call Project Mercury and for more information check out:

Summer Gift Part 1 – Project Mercury Beta Open!

I’m excited that VSAN customers will have a choice for data protection, asynchronously with 15 minute RPO using vSphere Replication or continuous, synchronous, and asynchronous with EMC’s Virtual RecoverPoint.

What does a 32 host Virtual SAN (VSAN) Cluster Look Like?

The big VMware Virtual SAN (VSAN) launch was today. Here are a couple of good summaries:

Cormac Hogan – Virtual SAN (VSAN) Announcement Review

Duncan Epping – VMware Virtual SAN launch and book pre-announcement!

The big news is that VSAN will support a full 32 hosts vSphere cluster. So what does that look like fully scaled up and out?

VSAN - 32 Hosts

By the way, for details on how VSAN scales up and out check: Is Virtual SAN (VSAN) Scale Up or Scale Out Storage…, Yes!.