Chapter 8 - Stretched Cluster

This chapter was developed to provide insights and additional information on a very specific type of vSAN configuration, namely stretched clusters. In this chapter we will describe some of the design considerations, operational procedures, and failure scenarios that relate to a stretched cluster configuration specifically. But why would anyone want a stretched cluster in the first place?

Stretched cluster configurations offer the ability to balance VMs between datacenters. The reason for doing so could be anything, be it disaster avoidance or, for instance, site maintenance, all of this with no downtime from a VM perspective since compute, storage, and network are available across both sites. On top of that, a stretched cluster also provides the ability to actively load balance resources between locations without any constraints.

What is a Stretched Cluster?

Before we get into it, let’s first discuss what defines a vSAN stretched cluster. When we talk about a vSAN stretched cluster, we refer to the configuration that is deployed when the stretched cluster workflow is completed in the Web Client. This workflow explicitly leverages a witness host, which can be physical or virtual, and needs to be deployed in a third site. During the workflow the vSAN cluster is set up across two active/active sites, with an identical number of ESXi hosts distributed evenly between the two sites, and as stated with a witness host residing at a third site. The data sites are connected via a high bandwidth/low latency link. The third site hosting the vSAN witness host is connected to both of the active/active data sites. The connectivity between the data sites and the witness site can be via lower bandwidth/higher latency links. Figure 8.1 shows what this looks like from a logical point of view.

Figure 8.1 - Stretched cluster scenario

Each site is configured as a vSAN fault domain; in other words, a site can be considered a fault domain. A maximum of three sites (two data, one witness) is supported.

The nomenclature used to describe a vSAN Stretched Cluster configuration is X+Y+Z, where X is the number of ESXi hosts at data site A, Y is the number of ESXi hosts at data site B, and Z is the number of witness hosts at site C. Data sites are where virtual machines are deployed. The minimum supported configuration is 1+1+1 (3 nodes). The maximum configuration at the time of writing is 15+15+1 (31 nodes).
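The X+Y+Z notation lends itself to a quick sanity check. The following is a minimal sketch, not a VMware tool: the function and class names are hypothetical, and the limits are simply the ones quoted above (1+1+1 minimum, 15+15+1 maximum, exactly one witness host).

```python
from typing import NamedTuple


class StretchedLayout(NamedTuple):
    site_a_hosts: int   # X: ESXi hosts at data site A
    site_b_hosts: int   # Y: ESXi hosts at data site B
    witness_hosts: int  # Z: witness hosts at site C


# Upper bound quoted in the text (15+15+1 at the time of writing).
MAX_HOSTS_PER_DATA_SITE = 15


def parse_layout(notation: str) -> StretchedLayout:
    """Parse 'X+Y+Z' and reject layouts outside the documented limits."""
    parts = notation.split("+")
    if len(parts) != 3:
        raise ValueError(f"expected X+Y+Z, got {notation!r}")
    x, y, z = (int(p) for p in parts)
    if z != 1:
        raise ValueError("a stretched cluster has exactly one witness host")
    for count in (x, y):
        if not 1 <= count <= MAX_HOSTS_PER_DATA_SITE:
            raise ValueError(
                f"data site host count {count} is outside 1..{MAX_HOSTS_PER_DATA_SITE}"
            )
    return StretchedLayout(x, y, z)


def total_nodes(layout: StretchedLayout) -> int:
    """Total node count, witness included."""
    return sum(layout)
```

For example, total_nodes(parse_layout("15+15+1")) gives the 31 nodes mentioned above. The sketch does not enforce X = Y.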

In vSAN stretched clusters, there is only one witness host in any configuration. For deployments that manage multiple stretched clusters, each cluster must have its own unique witness host.

When a VM is deployed on a vSAN stretched cluster it will have one copy of its data on site A, a second copy of its data on site B and witness components placed on the witness host in site C. This configuration is achieved through fault domains. In the event of a complete site failure, there will be a full copy of the VM data as well as greater than 50% of the components available. This will allow the VM to remain available on the vSAN datastore. If the VM needs to be restarted on the other data site, vSphere HA will handle this task.

Requirements and Constraints

vSAN stretched cluster configurations require vSphere 6.0 Update 1 (U1) at a minimum. This implies both vCenter Server 6.0 U1 and ESXi 6.0 U1. This version of vSphere includes vSAN version 6.1, which is the minimum version required for vSAN stretched cluster support. However, we strongly recommend implementing the latest available version of vSAN, which at the time of writing was vSAN 6.2.

From a licensing point of view things have changed dramatically over the last 12 months and especially the licensing requirements for stretched clustering have changed. With vSAN version 6.1 a new licensing variant was introduced called “Advanced.” This version included both all-flash and stretched cluster functionality. As of version 6.2 however, a new licensing version has been added called “Enterprise” and this version now includes stretched cluster and QoS (limits) functionality. The advanced license includes all-flash, deduplication/compression and RAID-5/6. Table 8.1 shows what is included in which license edition for completeness. Note that the Enterprise license needs to have been entered and assigned to all hosts in the cluster before the stretched cluster can be formed! Before making any procurement decision, please consult the latest VMware licensing information.

Feature                             Standard   Advanced   Enterprise
Storage policy-based management     X          X          X
Read/write caching                  X          X          X
Distributed switch                  X          X          X
vSAN snapshot and clones            X          X          X
Rack awareness                      X          X          X
Replication (5 minutes RPO)         X          X          X
Software checksum                   X          X          X
All-flash                           X          X          X
Deduplication and compression       -          X          X
RAID-5/6 (erasure coding)           -          X          X
Stretched cluster                   -          -          X
QoS (IOPS Limits)                   -          -          X

Table 8.1 - License editions

There are no limitations placed on the edition of vSphere used for vSAN. However, for vSAN Stretched Cluster functionality, vSphere DRS is very desirable. DRS will provide initial placement assistance, and can also help with locating VMs to their correct site when a site recovers after a failure. Otherwise the administrator will have to manually carry out these tasks. Note that DRS is now only available in Enterprise Plus edition of vSphere. (Before Q1 of 2016, DRS was also available in Enterprise. Since then VMware has announced that the Enterprise license edition is end of availability.)

When it comes to vSAN functionality, VMware supports stretched clusters in both hybrid and all-flash configurations. In terms of on-disk formats, the minimum level of on-disk format required is v2, which comes by default with vSAN 6.0. (vSAN 6.2 comes with v3.)

Both physical ESXi hosts and virtual appliances (nested ESXi host in a VM) are supported for the witness host. VMware is providing a pre-configured witness appliance for those customers who wish to use it. A witness host/VM cannot be shared between multiple vSAN stretched clusters.

The following is a list of products and features that are supported on vSAN but not on a stretched cluster implementation of vSAN, along with limits that differ between the two.

  • SMP-FT, the new Fault Tolerant VM mechanism introduced in vSphere 6.0, is supported on standard vSAN 6.1 deployments, but it is not supported on any stretched cluster deployment at this time, be it vSAN or vSphere Metro Storage Cluster (vMSC) based.
  • The maximum value for NumberOfFailuresToTolerate in vSAN Stretched Cluster is 1 whereas the maximum value for NumberOfFailuresToTolerate in standard vSAN is 3.
  • In vSAN stretched cluster, there is a limit of 3 for the number of fault domains. Standard vSAN can go much higher.
  • The fault tolerance method capability, introduced in vSAN 6.2, set to capacity (which allows for the use of RAID-5/6) is not supported in a stretched configuration. It must be left at the default setting of performance to use RAID-1.

Networking and Latency Requirements

When vSAN is deployed in a stretched cluster across multiple sites using fault domains, there are certain networking requirements that must be adhered to.

  • Between data sites both Layer 2 and Layer 3 are supported.
    • Layer 2 is recommended.
  • Between the data sites and the witness site Layer 3 is required.
    • This is to prevent I/O from being routed through a potentially low bandwidth witness site.
  • Multicast is required between data sites.
    • In the case of Layer 3, protocol-independent multicast (PIM) sparse mode is strongly recommended. Consult with your network vendor for multicast routing best practices and limitations.
  • Only unicast is required between data sites and the witness site.
    • Here the multicast requirement is removed to simplify L3 network configurations.
  • Maximum round trip latency between data sites is 5 ms.
  • Maximum round trip latency between data sites and the witness site is 200 ms.
  • A bandwidth of 10 Gbps between data sites is recommended.
  • A bandwidth of 100 Mbps between data sites and the witness site is recommended.
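The latency limits and bandwidth recommendations above can be captured in a small pre-flight check. This is a sketch only: the names and the shape of the measurement record are assumptions made for illustration, and the thresholds are the ones listed in the bullets (latency as a hard maximum, bandwidth as a recommendation).

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class LinkMeasurement:
    rtt_ms: float          # measured round trip latency in milliseconds
    bandwidth_mbps: float  # measured or provisioned bandwidth in Mbps


# Thresholds taken from the list above.
MAX_RTT_MS = {"data-data": 5.0, "data-witness": 200.0}
RECOMMENDED_MBPS = {"data-data": 10_000.0, "data-witness": 100.0}


def check_link(kind: str, m: LinkMeasurement) -> List[str]:
    """Return findings for one link; an empty list means nothing to report."""
    if kind not in MAX_RTT_MS:
        raise ValueError(f"unknown link kind {kind!r}")
    findings = []
    if m.rtt_ms > MAX_RTT_MS[kind]:
        findings.append(
            f"FAIL: {kind} RTT {m.rtt_ms} ms exceeds maximum {MAX_RTT_MS[kind]} ms"
        )
    if m.bandwidth_mbps < RECOMMENDED_MBPS[kind]:
        findings.append(
            f"WARN: {kind} bandwidth {m.bandwidth_mbps} Mbps is below the "
            f"recommended {RECOMMENDED_MBPS[kind]} Mbps"
        )
    return findings
```

Bandwidth is reported as a warning rather than a failure because the figures above are recommendations; actual requirements depend on the workload.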

Networking in any stretched vSphere deployment is always a hot topic. We expect this to be the same for vSAN stretched deployments. VMware has published excellent guides that hold a lot of detail around network bandwidth calculations and network topology considerations. The above bandwidth recommendations are exactly that, recommendations. Requirements for your environment can be determined by calculating the exact needs as explained in those guides.

New Concepts in vSAN Stretched Cluster

A common question is how stretched cluster differs from regular fault domains. Fault domains enable what might be termed “rack awareness” where the components of VMs could be distributed amongst multiple hosts in multiple racks, and should a rack failure event occur, the VM would continue to be available. These racks would typically be hosted in the same data center, and if there were a data center wide event, fault domains would not be able to assist with VM availability.

Stretched cluster is essentially building on the foundation of fault domains, and now provides what might be termed “data center awareness.” vSAN stretched cluster can now provide availability for VMs even if a data center suffers a catastrophic outage. This is achieved primarily through intelligent component placement of VM objects across data sites, site preference, read locality, and the witness host.

The witness host is an ESXi host (or virtual appliance) whose purpose it is to host the witness component of VM objects. The witness must have connection to both the master vSAN node and the backup vSAN node to join the cluster (the master and backup were discussed previously in Chapter 5, “Architectural Details”). In steady state operations, the master node resides in the “preferred site”; the backup node resides in the “secondary site.”

Note that the witness appliance ships with its own license so it does not consume any of your vSphere or vSAN licenses. Hence it is our recommendation to always use the appliance over a physical witness host. The witness appliance also has a different icon in the vSphere Web Client than regular ESXi hosts, allowing you to identify the witness appliance quickly as shown in Figure 8.2. This is only the case for the witness appliance, however. A physical witness host will show up in the client as a regular host and also requires a vSphere license.

Figure 8.2 - Witness appliance icon

Other new terms that will show up during the configuration of a stretched cluster, and were just mentioned, are “preferred site” and “secondary site.” The “preferred” site is the site that vSAN wishes to remain running when there is a network partition between the sites and the sites can no longer communicate. One might say that the “preferred site” is the site expected to have the most reliability.

Since VMs can run on either of the two sites, if network connectivity is lost between site 1 and site 2, but both still have connectivity to the witness, the preferred site is the one that survives and its vSAN components remain active, while the storage on the non-preferred site is marked as down and the vSAN components on that site are marked as absent. This also means that any VMs running in the secondary site in this situation will need to be restarted in the preferred site in order to be usable and useful again. vSphere HA, when enabled on the stretched cluster, will take care of this automatically.

In non-stretched vSAN clusters, a VM’s read operations are distributed across all replica copies of the data in the cluster. In the case of a policy setting of NumberOfFailuresToTolerate=1, which results in two copies of the data, 50% of the reads will come from replica 1 and 50% will come from replica 2. Similarly, in the case of a policy setting of NumberOfFailuresToTolerate=2 in non-stretched vSAN clusters, which results in three copies of the data, 33% of the reads will come from replica 1, 33% of the reads will come from replica 2 and 33% will come from replica 3.

However, we wish to avoid this situation with a stretched vSAN cluster, as we do not wish to read data over the intersite link, which could add unnecessary latency to the I/O and waste precious intersite link bandwidth. Since vSAN stretched cluster supports a maximum of NumberOfFailuresToTolerate=1, there will be two copies of the data (replica 1 and replica 2). Rather than doing 50% reads from site 1 and 50% reads from site 2 across the site link, the goal is to do 100% of the read I/O from the local site, wherever possible.

The distributed object manager (DOM) in vSAN is responsible for dealing with read locality. DOM is not only responsible for the creation of virtual machine storage objects in the vSAN cluster, but it is also responsible for providing distributed data access paths to these objects. There are three roles within DOM: client, owner, and component manager. There is a single DOM owner per object. The DOM owner coordinates access to the object, including reads, locking, and object configuration and reconfiguration. All object changes and writes also go through the owner. In vSAN stretched cluster, an enhancement to the DOM owner of an object means that it will now take into account the “fault domain” where the owner runs, and will read 100% from the replica that is in the same “fault domain.”
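The difference between the two read behaviors can be illustrated with a simplified model. This is not vSAN code: the function name and the dictionary-based representation are hypothetical, and the fallback used when no replica is local to the owner (spreading reads across whatever replicas exist) is an assumption of the sketch.

```python
from typing import Dict


def read_distribution(
    owner_fault_domain: str,
    replica_fault_domains: Dict[str, str],
    stretched: bool = True,
) -> Dict[str, float]:
    """Return the fraction of reads served by each replica.

    replica_fault_domains maps a replica name to the fault domain it lives in.
    """
    if not replica_fault_domains:
        raise ValueError("at least one replica is required")
    if stretched:
        # Read locality: only use replicas in the owner's fault domain.
        local = [r for r, fd in replica_fault_domains.items()
                 if fd == owner_fault_domain]
        # Assumption: fall back to all replicas if none is local.
        targets = local or list(replica_fault_domains)
    else:
        # Non-stretched behavior: reads are spread across all replicas.
        targets = list(replica_fault_domains)
    share = 1.0 / len(targets)
    return {r: (share if r in targets else 0.0) for r in replica_fault_domains}
```

With the owner in the preferred site and one replica per data site, the stretched case yields 100% of reads from the local replica, while the non-stretched case yields the 50/50 split described earlier.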

There is now another consideration with read locality for hybrid configurations. Administrators should avoid unnecessary vMotion of virtual machines between data sites. Since the read cache blocks are stored on one (local) site, if the VM moves around freely and ends up on the remote site, the cache will be cold on that site after the migration. Now there will be suboptimal performance until the cache is warmed again. To avoid this situation, soft (should) affinity rules (VM/Host rules) should be used to keep the virtual machine local to the same site/fault domain where possible. Note that this only applies to hybrid configurations, as all-flash configurations do not have a read cache.

Configuration of a Stretched Cluster

The installation of vSAN stretched cluster is almost identical to how fault domains were implemented in earlier vSAN versions, with a couple of additional steps. This part of the chapter will walk the reader through a stretched cluster configuration.

Before we get started with the actual configuration of a stretched cluster we will need to ensure the witness host is installed, configured, and accessible from both data sites. This will most likely involve the addition of static routes to the ESXi hosts and witness appliance, which will be covered shortly. When configuring your vSAN stretched cluster, only data hosts must be in the (vSAN) cluster object in vCenter Server. The witness host must remain outside of the cluster, and must not be added to the vCenter Server cluster at any point.

Note that the witness OVA must be deployed through a vCenter Server. In order to complete the deployment and configuration of the witness VM, it must be powered on the very first time through a vCenter Server as well. The witness OVA is also only supported with standard vSwitch (VSS) deployments.

The deployment of the witness host is pretty much straightforward and similar to the deployment of most virtual appliances as shown in Figure 8.3.

Figure 8.3 - Witness appliance deployment

The only real decision that needs to be made is with regard to the expected size of the stretched cluster configuration. There are three options offered. If you expect the number of VMs deployed on the vSAN stretched cluster to be 10 or fewer, select the Tiny configuration. If you expect to deploy more than 10 VMs, but fewer than 500 VMs, then the Medium (default option) should be chosen. For more than 500 VMs, choose the Large option. On selecting a particular configuration, the resources required by the appliance (CPU, memory, and disk) are displayed in the wizard as shown in Figure 8.4.
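The sizing guidance above reduces to a simple lookup. The sketch below uses a hypothetical helper name; note that the text does not say which size applies to exactly 500 VMs, so treating 500 as Medium is an assumption.

```python
def witness_size(expected_vms: int) -> str:
    """Map an expected VM count to a witness appliance configuration size."""
    if expected_vms < 0:
        raise ValueError("expected_vms must not be negative")
    if expected_vms <= 10:
        return "Tiny"
    # Assumption: the boundary value of exactly 500 VMs is treated as Medium.
    if expected_vms <= 500:
        return "Medium"
    return "Large"
```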

Figure 8.4 - Selection of configuration size

Next, the datastore where the witness appliance will be stored and the network that will be used for the witness appliance need to be selected. This network will be associated with both network interfaces (management and vSAN) at deployment, so later on the vSAN network configuration may require updating when the vSAN and management traffic are separated. Lastly, a root password will need to be provided and the appliance will be deployed.

If there are different network segments used for management and vSAN traffic (usually the case) then after deployment a configuration change should be made to the VM. This can be done through the Web Client as shown in Figure 8.5.

Figure 8.5 - Change of networks

At this point the witness appliance can be powered on and the console of the witness should be accessed to add the correct networking information, such as IP address and DNS, for the management network. This is identical to how one would add the management network information of a physical ESXi host via the DCUI. After this has been done the witness can be added to the vCenter inventory as a regular host.

Note: The “No datastores have been configured” message is because the nested ESXi host has no VMFS datastore. This can be safely ignored.

Once the witness appliance/nested ESXi host has been added to vCenter, the next step is to configure the vSAN network correctly on the witness. When the witness is selected in the vCenter inventory, navigate to Manage > Networking > Virtual Switches. The witness has a predefined port group called witnessPg. Do not remove this port group, as it has a special modification to make the MAC addresses on the network adapters match the nested ESXi MAC addresses. From this view, the VMkernel port to be used for vSAN traffic is visible. If there is no DHCP server on the vSAN network (which is likely), then the VMkernel adapter will not have a valid IP address, nor will it be tagged for vSAN traffic. In that case an IP address will need to be added and the VMkernel port will need to be tagged for vSAN traffic.

Last but not least, before we can configure the vSAN stretched cluster, we need to ensure that the vSAN network on the hosts residing in the data sites can reach the witness host’s vSAN network, and vice-versa. To address this, administrators must implement static routes. Static routes tell the TCP/IP stack to use a different path to reach a particular network. Now we can tell the TCP/IP stack on the data hosts to use a different network path (instead of the default gateway) to reach the vSAN network on the witness host. Similarly, we can tell the witness host to use an alternate path to reach the vSAN network on the data hosts rather than via the default gateway.

Note once again that in most situations the vSAN network is most likely a stretched L2 broadcast domain between the data sites, but L3 is required to reach the vSAN network of the witness appliance. Therefore static routes are needed between the data hosts and the witness host for the vSAN network, but may not be required for the data hosts on different sites to communicate to each other over the vSAN network.

The esxcli command used to add a static route is:

esxcli network ip route ipv4 add -n <remote network> -g <gateway>
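Because the same route has to be added on every data host (and the reverse route on the witness), it can help to generate the command strings from validated inputs. The following is a minimal sketch; the helper name is hypothetical and the addresses in the usage note are made-up examples. It only builds the command text shown above and does not run anything on a host.

```python
import ipaddress


def static_route_command(remote_network: str, gateway: str) -> str:
    """Build the esxcli static route command for an IPv4 network and gateway.

    Raises ValueError if the network has host bits set or an address is invalid.
    """
    net = ipaddress.ip_network(remote_network, strict=True)
    gw = ipaddress.ip_address(gateway)
    if net.version != 4 or gw.version != 4:
        raise ValueError("only IPv4 routes are handled in this sketch")
    return f"esxcli network ip route ipv4 add -n {net.with_prefixlen} -g {gw}"
```

For instance, static_route_command("192.168.15.0/24", "172.16.10.253") produces the command a data host would use to reach a witness vSAN network of 192.168.15.0/24 through a gateway at 172.16.10.253 (both values are illustrative).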

Use the vmkping -I <vmk> <ipaddress> command to check that the witness and physical hosts can communicate over the vSAN network. Now that the witness is up and accessible, forming a vSAN stretched cluster takes no more than a couple of minutes. The following are the steps that should be followed to configure a vSAN stretched cluster. This example is a 2+2+1 deployment, meaning two ESXi hosts at the preferred site, two ESXi hosts at the secondary site, and one witness host in a third location.

Configure Step 1a: Create a vSAN Stretched Cluster

In this example, there are four nodes available: esx01-sitea, esx02-sitea, esx01-siteb, and esx02-siteb as shown in Figure 8.6. All four hosts reside in a cluster called stretched-vsan. The fifth host witness-01, which is the witness host, is in its own datacenter and is not added to the cluster, but it has been added as an ESXi host to this vCenter Server.

Depending on how you are configuring your cluster, you can decide to either create the stretched cluster during the creation of the vSAN cluster itself, or do this after the fact in the fault domain view. Both workflows are similar and the result will be identical. Note that new vSAN 6.2 functionality like deduplication and compression can also be enabled in a stretched cluster, and checksums are fully supported. RAID-5 and RAID-6, however, are not supported in a stretched cluster configuration, as these would require a minimum of 4 and 6 fault domains respectively.

Figure 8.6 - Start of stretched cluster creation

Configure Step 1b: Create Stretched Cluster

If your vSAN cluster has already been formed, it is also possible to create the stretched cluster separately. To configure the stretched cluster and fault domains when a vSAN cluster already exists, navigate to the Manage > vSAN > Fault Domains view as shown in Figure 8.7, and click the “configure” button in the stretched cluster section, which begins the stretched cluster configuration.

Figure 8.7 - Configure vSAN stretched cluster

Depending on whether you create the vSAN cluster as part of the workflow, you may need to claim disks as well when the vSAN cluster is set up in manual mode.

Configure Step 2: Assign Hosts to Sites

At this point, hosts can now be assigned to stretch cluster sites as shown in Figure 8.8. Note that the names have been preassigned. The preferred site is the one that will run VMs in the event that there is a split-brain type scenario in the cluster. In this example, hosts esx01-sitea and esx02-sitea will remain in the preferred site, and hosts esx01-siteb and esx02-siteb will be assigned to the secondary site.

Figure 8.8 - Host selection and site placement

Configure Step 3: Select a Witness Host and Disk Group

The next step is to select the witness host. At this point, the host witness-01 is chosen. Note once again that this host does not reside in the cluster. It is outside of the cluster. In fact, in this setup as shown in Figure 8.9, it is in its own data center, but has been added to the same vCenter Server that is managing the stretched cluster.

Figure 8.9 - Witness host selection

When the witness is selected, a flash device and a magnetic disk need to be chosen to create a disk group. These are already available in the witness appliance (both are in fact VMDKs under the covers, since the appliance is a VM).

Configure Step 4: Verify the Configuration

Verify that the preferred and secondary fault domains have the desired hosts, and that the witness host is the desired witness host as shown in Figure 8.10, and click Finish to complete the configuration.

Figure 8.10 - Summary of stretched cluster configuration

When the stretched cluster has completed configuration, which can take a number of seconds, verify that the fault domain view is as expected.

Configure Step 5: Health Check the Stretched Cluster

Before doing anything else, use the vSAN health check to ensure that all the stretched cluster health checks have passed. These checks are only visible when the cluster has been configured as shown in Figure 8.11, and if there are any issues with the configuration, these checks should be of great assistance in locating them.

Figure 8.11 - Stretched cluster health

That may seem very easy from a vSAN perspective, but there are some considerations from a vSphere perspective to take into account. These are not required, but in most cases they are recommended to optimize for performance and availability. The vSAN stretched cluster guide outlines all vSphere recommendations in depth. Since our focus in this book is vSAN, we will not go into that level of detail, but instead refer you to the guide mentioned previously in this chapter. We will, however, list some of the key recommendations for each of the specific areas:

vSphere DRS:

  • Create a host group per data site, containing each of the hosts of that particular site.
  • Create VM groups per site, containing the VMs that should reside in a particular site.
  • Create a “should” soft rule for these groups to ensure that during “normal” operations these VMs reside in the correct site.

This will ensure that VMs do not freely roam around the stretched cluster, so that read locality is maintained and performance is not impacted by rewarming of the cache. It will also help from an operational perspective by providing insight into the impact of a full site failure, and it will allow you to distribute scale-out services such as Active Directory and DNS across both sites.

vSphere HA:

  • Enable vSphere HA admission control and set it to use the percentage based admission control policy and to 50% for both CPU and memory. This means that if there is a full site failure, one site has enough capacity to run all of the VMs.
  • Enable “vSphere HA should respect VM/host affinity rules.”
  • Make sure to specify additional isolation addresses, one in each site, using the advanced settings das.isolationAddress0 and das.isolationAddress1. This means that in the event of a site failure, the remaining site can still ping an isolation address.
  • Disable the default isolation address if it can’t be used to validate the state of the environment during a partition. Setting the advanced setting das.usedefaultisolationaddress to false does this.
  • Disable datastore heartbeating, as without traditional external storage there is no reason to have this.
  • Select “vSphere HA should respect VM/host affinity rules” so that in the case of a single host failure VMs are restarted within their site.

These settings will ensure that when a failure occurs sufficient resources are available to coordinate the failover and power-on the VMs (admission control). These VMs will be restarted within their respective sites as defined in the VM/host rules. In the case of an isolation event, all necessary precautions have been taken to ensure all of the hosts can validate for isolation locally.
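The 50% admission control figure follows from simple arithmetic when both data sites have equal capacity. The sketch below is illustrative only: the function names are hypothetical, capacity is expressed in a single arbitrary unit (for example GHz or GB), and the generalization to unequal sites (reserve the share of the larger site) is an assumption, not something stated in the text.

```python
from typing import Dict


def failover_reserve_percent(site_capacity: Dict[str, float]) -> float:
    """Percentage of cluster capacity to reserve to survive the loss of one site."""
    if len(site_capacity) != 2:
        raise ValueError("expected exactly two data sites")
    total = sum(site_capacity.values())
    if total <= 0:
        raise ValueError("total capacity must be positive")
    # Assumption: reserve the share of the largest site, the worst case to lose.
    return 100.0 * max(site_capacity.values()) / total


def fits_on_surviving_site(site_capacity: Dict[str, float], total_demand: float) -> bool:
    """True if either site alone could run the full VM demand."""
    return min(site_capacity.values()) >= total_demand
```

With two sites of 96 units each, failover_reserve_percent returns 50.0, which matches the recommendation above; the second helper checks the same condition from the demand side.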

Failure Scenarios

There are many different failures that can occur in a virtual datacenter. It is not our goal to describe each and every single one of them, as that would be a book by itself. In this section we want to describe some of the more common failures, and recovery of these failures, which are particular to the stretched cluster configuration.

In this example, there is a 1+1+1 stretched vSAN deployment. This means that there is a single data host at site 1, a single data host at site 2 and a witness host at a third site.

A single VM has also been deployed. When the physical disk placement is examined, we can see that the replicas are placed on the preferred and secondary data site respectively, and the witness component is placed on the witness site as shown in Figure 8.12.

Figure 8.12 - VM component placement

The next step is to introduce some failures and examine how vSAN handles such events. Before beginning these tests, please ensure that the vSAN health check plugin is working correctly, and that all vSAN health checks have passed.

Note: In a 1+1+1 configuration, a single host failure would be akin to a complete site failure.

The health check plugin should be referred to regularly during failure scenario testing. Note that as of version 6.1, alarms are raised for any health check that fails. Alarms may also be referenced at the cluster level throughout this testing.

Finally, when the term site is used in the failure scenarios, it implies a fault domain.

Single data host failure—Secondary site

The first test is to introduce a failure on a host on one of the data sites, either the “preferred” or the “secondary” site (see Figure 8.13). The sample virtual machine deployed for test purposes currently resides on the preferred site.

Figure 8.13 - Failure scenario—host failed secondary site

In the first part of this test, the secondary host has been rebooted, simulating a temporary outage.

There will be several power and HA events related to the secondary host visible in the vSphere Web Client UI. Change to the physical disk placement view of the virtual machine. After a few moments, the components that were on the secondary host will go “absent,” as shown in Figure 8.14.

Figure 8.14 - VM component absent

However the virtual machine continues to be accessible. This is because there is a full copy of the data available on the host on the preferred site, and there are more than 50% of the votes available. Open a console to the virtual machine and verify that it is still very much active and functioning. Since the ESXi host which holds the compute of the virtual machine is unaffected by this failure, there is no reason for vSphere HA to take action.

At this point, the vSAN health check plugin can be examined. There will be quite a number of failures, as shown in Figure 8.15, due to the fact that the secondary host is no longer available, as one might expect.

Figure 8.15 - Health check tests failed

When going through these tests yourself, please note that before starting a new test it is recommended to wait until the failed host has successfully rejoined the cluster. All “failed” health check tests should show OK before another test is started. Also confirm that there are no “absent” components on the VM’s objects, and that all components are once again active.

Single data host failure—Preferred site

This next test will not only check vSAN, but it will also verify vSphere HA functionality, and that the VM to host affinity rules which we recommended are working correctly. If each site has multiple hosts, then a host failure on the preferred site will allow vSphere HA to start the virtual machine on another host on the same site. In this test, the configuration is 1+1+1 so the virtual machine will have to be restarted on the secondary site. This will also verify that the VM to host affinity “should” rule is working (see Figure 8.16).

Figure 8.16 - Failure scenario—host failed preferred site

After the failure has occurred in the preferred site there will be a number of vSphere HA related events. Similar to the previous scenario, the components that were on the preferred host will show up as “absent.”

Since the host on which the virtual machine’s compute resides is no longer available, vSphere HA will restart the virtual machine on another host in the cluster. It is important to validate this has happened as it shows that the VM/host affinity rules are correctly configured as “should” rules and not as “must” rules. If “must” rules are configured then vSphere HA will not be able to restart the virtual machine on the other site, so it is important that this test behaves as expected. “Should” rules will allow vSphere HA to restart the virtual machine on hosts that are not in the VM/host affinity rules when no other hosts are available.
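The difference between “should” and “must” rules during an HA restart can be modeled in a few lines. This is a simplified illustration rather than vSphere HA logic: the function name is hypothetical, and the model ignores resource availability, admission control, and other placement factors.

```python
from typing import Iterable, List, Set


def restart_candidates(
    rule_hosts: Set[str],
    alive_hosts: Iterable[str],
    rule_type: str,
) -> List[str]:
    """Hosts on which a failed VM may be restarted under a VM/host rule.

    rule_type is "should" (soft rule) or "must" (hard rule).
    """
    if rule_type not in ("should", "must"):
        raise ValueError(f"unknown rule type {rule_type!r}")
    alive = list(alive_hosts)
    in_rule = [h for h in alive if h in rule_hosts]
    if in_rule:
        # Hosts that satisfy the rule are always used first.
        return in_rule
    # No host in the rule survived: a soft rule may be violated, a hard rule may not.
    return alive if rule_type == "should" else []
```

In the 1+1+1 test, the only host in the rule has failed, so a “should” rule returns the surviving host on the secondary site while a “must” rule returns no candidates at all, which is why the VM would not be restarted.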

Note that if there were more than one host on each site, then the virtual machine would be restarted on another host on the same site as a result of the “vSphere HA should respect VM/host affinity rules” setting. However, since this is a test on a 1+1+1 configuration, there are no additional hosts available on the preferred site. Therefore, the VM is restarted on a host on the secondary site after roughly 30 to 60 seconds.

Witness host failure—Witness site

A common question that is asked is what happens when the witness host has failed as shown in Figure 8.17? This should have no impact on the run state of the virtual machine since there is still a full copy of the data available and greater than 50% of the votes are also available, but the witness components residing on the witness host should show up as “absent.”

Figure 8.17 - Failure scenario—witness host failed

In our environment we’ve simply powered off the witness host to demonstrate the impact of a failure. After a short period of time, the witness component of the virtual machine appears as “absent” as shown in Figure 8.18.

Figure 8.18 - Witness component absent

However, the virtual machine is unaffected and continues to be available and accessible. The rule for vSAN virtual machine object accessibility is: at least one full copy of the data must be available, and more than 50% of the components that make up the object must be available. In this scenario both copies of the data are available, which is two out of three components and therefore more than 50%, leaving access to the VM intact.
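The accessibility rule can be expressed as a short check. The following is a minimal sketch, assuming each component carries exactly one vote (real objects may distribute votes differently); the `Component` class and `object_accessible` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Component:
    name: str
    is_data_copy: bool      # True for a full replica, False for a witness
    available: bool = True
    votes: int = 1          # assumption: one vote per component


def object_accessible(components: List[Component]) -> bool:
    """Apply the two-part rule: one full copy AND more than 50% of votes."""
    total_votes = sum(c.votes for c in components)
    available_votes = sum(c.votes for c in components if c.available)
    has_full_copy = any(c.is_data_copy and c.available for c in components)
    # Strict majority: exactly 50% is not enough.
    return has_full_copy and available_votes * 2 > total_votes


# Witness host powered off: both data copies remain, 2 of 3 votes.
witness_down = [
    Component("replica-preferred", is_data_copy=True),
    Component("replica-secondary", is_data_copy=True),
    Component("witness", is_data_copy=False, available=False),
]
print(object_accessible(witness_down))

# One data copy only, no witness: 1 of 3 votes.
only_one_copy = [
    Component("replica-preferred", is_data_copy=True, available=False),
    Component("replica-secondary", is_data_copy=True),
    Component("witness", is_data_copy=False, available=False),
]
print(object_accessible(only_one_copy))
```

The first case is the witness failure described here and evaluates to accessible; the second case has a full copy of the data but lacks the majority, so the object is not accessible.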

Network failure—Data Site to Data Site

The last failure scenario we want to describe is a site partition. If you are planning on testing this scenario, which we highly recommend, please ensure before conducting the tests that the host isolation response and host isolation addresses are configured correctly. At least one of the isolation addresses should be pingable over the vSAN network by each host in the cluster. The environment shown in Figure 8.19 depicts our configuration and the failure scenario.

Figure 8.19 - Witness component absent

This scenario is special because, when the intersite link fails, the “preferred” site forms a cluster with the witness, and the majority of components (data components and witness) will be available to this part of the cluster. The secondary site will also form its own cluster, but it will only have a single copy of the data and will not have access to the witness. This results in two components of the virtual machine object getting marked as absent on the secondary site (see Figure 8.20), since the host can no longer communicate with the other data site, where the other copy of the data resides, nor with the witness. This means that the VMs can only run on the preferred site, where the majority of the components are accessible.

Figure 8.20 - Two components absent on the secondary site
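The outcome of the partition can be worked through with a small sketch that evaluates each side of the split independently. It assumes one component with one vote per site, which matches the simple three-component object used in this test; the function name and the site labels are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def accessible_per_partition(
    placement: Dict[str, Tuple[bool, int]],
    partitions: List[Set[str]],
) -> List[bool]:
    """Evaluate object accessibility inside each network partition.

    placement:  site -> (holds_full_data_copy, votes) for one object.
    partitions: groups of sites that can still talk to each other.
    Returns one boolean per partition, in the same order.
    """
    total_votes = sum(votes for _, votes in placement.values())
    results = []
    for sites in partitions:
        votes_here = sum(placement[s][1] for s in sites)
        full_copy_here = any(placement[s][0] for s in sites)
        # Accessible only with a full copy and a strict majority of votes.
        results.append(full_copy_here and votes_here * 2 > total_votes)
    return results


# Assumed layout: one replica per data site, witness component at the third site.
placement = {
    "preferred": (True, 1),
    "secondary": (True, 1),
    "witness": (False, 1),
}

# Intersite link down: preferred site and witness stay together,
# the secondary site is on its own.
split = [{"preferred", "witness"}, {"secondary"}]
print(accessible_per_partition(placement, split))
```

The preferred site together with the witness holds two of three votes and a full copy, so the object is accessible there. The secondary site holds a full copy but only one of three votes, so the object is inaccessible on that side.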

From a vSphere HA perspective, since the host isolation address is on the vSAN network and site local, the hosts in both data sites should be able to reach the isolation address on their respective sites. Therefore, vSphere HA does not trigger a host isolation response. This means that the VMs that are running in the secondary site, which has lost access to the vSAN datastore, cannot write to disk but are still running from a compute perspective. It should be noted that during the recovery the host that has lost access to the disk components will instantly kill the VM instances. This does mean, however, that until the host has recovered, potentially two instances of the same VM can be accessed over the network, of which only one is capable of writing to disk.
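Why no isolation response is triggered can be seen in a simplified model of the isolation decision. This sketch assumes that a host only declares itself isolated when it neither sees traffic from other vSphere HA agents nor receives a reply from any configured isolation address; the function and parameter names are hypothetical, and the real HA state machine has more steps and timers.

```python
from typing import Iterable


def host_declares_isolation(
    sees_ha_agent_traffic: bool,
    isolation_address_replies: Iterable[bool],
) -> bool:
    """Simplified isolation check for a single host.

    sees_ha_agent_traffic:     True if other HA agents are still reachable.
    isolation_address_replies: one ping result per configured isolation address.
    """
    if sees_ha_agent_traffic:
        return False
    # One reachable isolation address is enough to rule out isolation.
    return not any(isolation_address_replies)


# Site partition: the secondary host has lost its HA peers,
# but its site-local isolation address still answers.
print(host_declares_isolation(False, [True]))

# True isolation: no peers and no isolation address reachable.
print(host_declares_isolation(False, [False, False]))
```

In the partition case the function returns `False`, so the isolation response is not triggered and the VMs keep running from a compute perspective, as described above.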

As of vSAN 6.2, a mechanism was introduced to avoid this situation. This feature will automatically kill the VMs that have lost access to all vSAN components on the secondary site. This ensures they can be safely restarted on the preferred site, and when the link recovers there will not be two instances of the same VM running, even for a brief second. If you want to disable this behavior, you can set the advanced host setting called vSAN.AutoTerminateGhostVm to 0. Note that before 6.2 these VMs needed to be killed manually, and VMware even provided a script to take care of this via VMware Knowledge Base Article 2135952.

On the preferred site, the impacted VMs will be restarted almost instantly. After the virtual machine has been restarted on a host in the preferred site, navigate to the policies view and click on physical disk placement. It should show that two out of the three components are available, and since there is a full copy of the data and more than 50% of the components are available, the VM is accessible.

Recovering from a complete site failure

The host failures described previously, although each involves only a single host, are also complete site failures, because each site in our 1+1+1 test configuration contains just one host. VMware has modified some of the vSAN behavior when a site failure occurs and subsequently recovers. In the event of a site failure, vSAN will now wait for some additional time for “all” hosts to become ready on the failed site before it starts to sync components. The main reason is that if vSAN did not wait and only a subset of the hosts had come up on the recovering site, it would start the rebuild process immediately. This may result in the transfer of a lot of data that already exists on nodes that become available slightly later, simply because of the boot sequence of the hosts.

It is recommended that when recovering from a failure, especially a site failure, all nodes in the site should be brought back online together to avoid costly resync and reconfiguration overheads. The reason behind this is that if the nodes are brought back up at approximately the same time, then vSAN will only need to synchronize the data that was written between the time when the failure occurred and the time when the site came back. If, instead, nodes are brought back up in a staggered fashion, objects might need to be reconfigured and thus a significantly higher amount of data will need to be transferred between sites. This is also the reason why VMware recommends setting DRS to partially automated mode rather than fully automated mode if there is a complete site failure. Administrators can then wait for the failed site to be completely remediated before allowing any VMs to migrate back to it via the affinity rules.
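The difference in transferred data can be illustrated with simple arithmetic. The sketch below is a rough model under stated assumptions: a delta resync copies only the data written during the outage, while a reconfigured object is rebuilt in full. The function name and all figures are hypothetical and are not measurements.

```python
def resync_transfer_gb(
    component_size_gb: float,
    written_during_outage_gb: float,
    rebuilt_from_scratch: bool,
) -> float:
    """Estimate intersite transfer for one component after a site recovers."""
    if rebuilt_from_scratch:
        # Object was reconfigured: the whole component is copied again.
        return component_size_gb
    # Delta resync: only the changed data, never more than the component itself.
    return min(written_during_outage_gb, component_size_gb)


# Hypothetical example: a 500 GB component with 20 GB written during the outage.
print(resync_transfer_gb(500, 20, rebuilt_from_scratch=False))
print(resync_transfer_gb(500, 20, rebuilt_from_scratch=True))
```

In this hypothetical case the delta resync moves 20 GB across the intersite link, while a rebuild triggered by a staggered recovery moves the full 500 GB.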


A vSAN stretched cluster architecture will allow you to deploy and migrate workloads across two locations without the need for complex storage configurations and operational processes. On top of that, it comes at a relatively low cost, which enables the majority of VMware users to deploy this configuration when there are dual datacenter requirements. As with any explicit architecture, there are various design and operational considerations. We would like to refer you to the official VMware documentation as the source of the most up-to-date and accurate information.
