Chapter 5 - Architectural Details

This chapter examines some of the underlying architectural details of vSAN. We have already touched on a number of these aspects of vSAN, including the use of flash devices for caching I/O, the role of VASA in surfacing vSAN capabilities, VM storage policies, witness disks, the desire for pass-through storage controllers, and so on.

This chapter covers these features in detail, in addition to the new architectural concepts and terminology introduced by vSAN. Although most vSphere administrators will never see many of these low-level constructs, it is useful to have a general understanding of the services that make up vSAN when troubleshooting or when analyzing log files. Before examining some of the lower-level details, there is one concept that we need to discuss first because it is at the core of vSAN: distributed RAID (Redundant Array of Inexpensive Disks).

Distributed RAID

vSAN is able to provide highly available and high-performing VMs through the use of distributed RAID, or, to put it another way, RAID over the network. From an availability perspective, distributed RAID simply implies that the vSAN environment can withstand the failure of one or more ESXi hosts (or components in a host, such as a disk drive) and continue to provide complete functionality for all your VMs. To ensure that VMs perform optimally, vSAN distributed RAID provides the ability to divide virtual disks across multiple physical disks and hosts.

A point to note, however, is that VM availability and performance is now defined on a per-VM basis through the use of storage policies. Actually, to be more accurate, it is defined on a per-virtual disk basis. Using a storage policy, administrators can now define how many host or disk failures a VM can tolerate in a vSAN cluster and across how many hosts and disks a virtual disk is deployed. If you explicitly choose not to set an availability requirement in the storage policy by setting number of failures to tolerate equal to zero, a host or disk failure can certainly impact your VM’s availability.

In earlier releases, vSAN used RAID-1 (synchronous mirroring) exclusively across hosts to meet the availability and reliability requirements of storage objects deployed on the system. The number of mirror copies (replicas) of the VM storage objects depends on the VM’s storage policy, in particular the number of failures to tolerate requirement. Depending on the VM storage policy, you could have up to three replicas of a VM disk (VMDK) across a vSAN cluster for availability. By default, vSAN always deploys VMs with number of failures to tolerate equal to 1; there is always a replica copy of the VM storage objects for every VM deployed on the vSAN datastore. This is the default policy associated with vSAN datastores. This can be changed based on the policy selected during VM provisioning.

In vSAN 6.2 two new RAID types are introduced. The first of these is RAID-5 and the second is RAID-6. These are created when the failure tolerance method capability setting is set to capacity in the VM storage policy rather than performance (which is the default). The purpose of introducing these additional distributed RAID types is to save on capacity usage. Both RAID-5 and RAID-6 use a distributed parity mechanism rather than mirrors to protect the data. With RAID-5, the data is distributed across three disks on three ESXi hosts, and then the parity of this data is calculated and stored on a fourth disk on a fourth ESXi host. The parity is not always stored on the same disk or on the same host. It is distributed, as shown in Figure 5.1.

Figure 5.1 - RAID-5 deployment with distributed parity

A RAID-5 configuration can tolerate one host failure. RAID-6 is designed to tolerate two host failures. In a RAID-6 configuration, data is distributed across four disks on four ESXi hosts, and when the parity is calculated, it is stored on two additional disks on two additional ESXi hosts. Therefore, if you wish to utilize a RAID-6 configuration, a total of six ESXi hosts are required. Once again the parity is distributed, as shown in Figure 5.2.

Figure 5.2 - RAID-6 deployment with distributed parity

The space savings can be calculated as follows. If you deploy a 100 GB VMDK object and wish to tolerate one failure using a RAID-1 configuration, a total of 200 GB of capacity would be consumed on the vSAN datastore. With RAID-5, a total of 133.33 GB would be consumed. Similarly, if you deploy the same 100 GB VMDK object and wish to tolerate two failures using a RAID-1 configuration, a total of 300 GB of capacity would be consumed on the vSAN datastore. With RAID-6, a total of 150 GB would be consumed.
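
The arithmetic behind these figures can be captured in a few lines of code. The following Python sketch is purely illustrative (the function names are ours, not part of any vSAN API) and simply reproduces the capacity overheads described above.

    # Capacity consumed on the vSAN datastore for a single VMDK object,
    # ignoring witness metadata and any object space reservation settings.
    def raid1_consumed(vmdk_gb, failures_to_tolerate):
        # RAID-1 keeps one full replica per failure to tolerate, plus the original.
        return vmdk_gb * (failures_to_tolerate + 1)

    def raid5_consumed(vmdk_gb):
        # RAID-5 (FTT=1): 3 data segments + 1 parity segment -> 4/3 overhead.
        return vmdk_gb * 4 / 3

    def raid6_consumed(vmdk_gb):
        # RAID-6 (FTT=2): 4 data segments + 2 parity segments -> 6/4 overhead.
        return vmdk_gb * 6 / 4

    print(raid1_consumed(100, 1))  # 200 GB
    print(raid5_consumed(100))     # ~133.33 GB
    print(raid1_consumed(100, 2))  # 300 GB
    print(raid6_consumed(100))     # 150 GB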

As discussed in Chapter 4, “VM Storage Policies on vSAN,” administrators may now choose between performance and capacity. If performance is the absolute end goal for administrators, then RAID-1 (which is still the default) is the failure tolerance method that should be used. If administrators do not need maximum performance, and are more concerned with capacity usage, then RAID-5/6 as the failure tolerance method may be used.

Depending on the number of disk stripes per object policy setting, a VM disk object may be “striped” across a number of disk spindles to achieve a desired performance. Performance of VM storage objects can be improved via a RAID-0 configuration; however, a stripe configuration does not always result in an improvement in performance. The section “Stripe Width Policy Setting,” later in this chapter, explains the reasons for this as well as when it is useful to increase the stripe width of a VMDK in the VM storage policy.

Objects and Components

It is important to understand the concept that the vSAN datastore is an object storage system and that VMs are now made up of a number of different storage objects. This is a new concept for vSphere administrators as traditionally a VM is made up of a set of files on a LUN or volume.

We have not spoken in great detail about objects and components so far, so before we go into detail about the various types of objects, let’s start with the definitions and concepts of an object and a component on vSAN.

An object is an individual storage block device, compatible with SCSI semantics, that resides on the vSAN datastore. It may be created on demand and at any size, although VMDKs were limited to 2 TB–512 bytes in the initial vSAN 5.5 release. Since the release of vSAN 6.0, VMDKs of up to 62 TB are supported, in line with VMFS (Virtual Machine File System) and network file system (NFS) datastores. Objects now replace logical unit numbers (LUNs) as the main unit of storage on vSAN. In vSAN, the objects that make up a virtual machine are the VMDKs, the VM home namespace, and the VM swap. Of course, if a snapshot is taken of the virtual machine, a delta disk object is also created. If the snapshot includes the memory of the virtual machine, this is also instantiated as an object, so a snapshot could be made up of either one or two objects, depending on the snapshot type. Each “object” in vSAN has its own RAID tree that turns the requirements into an actual layout on physical devices. When a VM storage policy is selected during VM deployment, the requirements around availability and performance in the policy apply to the VM objects.

Components are leaves of the object’s RAID trees—that is, a “piece” of an object that is stored on a particular “cache device + capacity device” combination (in a physical disk group). A component gets transparent caching/buffering from the cache device (which is always flash), with its data “at rest” on a capacity device (which could be flash in all-flash vSAN configurations or magnetic disk in hybrid vSAN configurations).

A VM can have five different types of objects on a vSAN datastore as follows, keeping in mind that each VM may have multiples of some of these objects associated with it:

  • The VM home or “namespace directory”
  • A swap object (if the VM is powered on)
  • Virtual disks/VMDKs
  • Delta disks (each an object) created for snapshots
  • Snapshot memory (each an object) optionally created for snapshots

Of the five objects, the VM namespace may need a little further explanation. All VM files, excluding VMDKs, deltas (snapshots), memory (snapshots), and swap, reside in an area called the VM home namespace on vSAN. The typical files found in the VM home namespace are the .vmx file, the .log files, the .vmdk descriptor files, the snapshot delta descriptor files, and everything else one would expect to find in a VM home directory.

Each storage object is deployed on vSAN as a RAID tree, and each leaf of the tree is said to be a component. For instance, if I chose to deploy a VMDK with a stripe width of 2, but did not wish to tolerate any failures (for whatever reason), a RAID-0 stripe would be configured across a minimum of two disks for this VMDK. The VMDK would be the object, and each of the stripes would be a component of that object.

Similarly, if I specified that my VMDK should be able to tolerate at least one failure in the cluster (host, disk, or network), and left all the other policy settings at their defaults, a RAID-1 mirror of the VMDK object would be created with one replica component on one host and another replica component on another host in my vSAN cluster. Finally, if my policy included a requirement for both striping and availability, my striped components would be mirrored across hosts, giving me a RAID 0+1 configuration. This would result in four components making up my single object, two striped components in each replica.
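
To make the RAID tree concept a little more concrete, the following sketch models the example just described (stripe width of 2, number of failures to tolerate of 1) as a simple nested structure. The host names are hypothetical and the layout is only an illustration; the actual internal representation used by vSAN is not exposed.

    # A RAID-1 mirror of two RAID-0 stripes: one object, four components.
    vmdk_object = {
        "type": "RAID-1",
        "children": [
            {"type": "RAID-0",
             "components": ["stripe-1a on esxi-01", "stripe-1b on esxi-01"]},
            {"type": "RAID-0",
             "components": ["stripe-2a on esxi-02", "stripe-2b on esxi-03"]},
        ],
    }

    # Count the leaves (components) of the tree.
    component_count = sum(len(replica["components"])
                          for replica in vmdk_object["children"])
    print(component_count)   # 4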

Note that delta disks are created when a snapshot is taken of a VM. A delta disk inherits the same policy as the parent disk (stripe width, replicas, and so on).

The swap object is created only when the VM is powered on.

There is another component called the witness. The witness component is very important and special. Although it does not directly contribute toward VM storage, it is nonetheless an important component required to determine a quorum for a VM’s storage objects in the event of a failure in the cluster. We will return to the witness component shortly, but for the moment let’s concentrate on VM storage objects.

Component Limits

One major limit applies in relation to components in vSAN. It is important to understand this because it is a hard limit and essentially determines the number of VMs you can run on a single host and in your cluster. The limitation of the original vSAN 5.5 release was as follows:

  • Maximum number of components per host limit: 3,000

In vSAN 6.0, the number of components per host limit was increased, as a new on-disk format was introduced.

  • Maximum number of components per host limit: 9,000

Components per host include components from powered-off VMs, unregistered VMs, and templates. vSAN distributes components across the various hosts in the cluster and will always try to achieve an even distribution of components for balance. However, some hosts may have more components than others, which is why VMware recommends, as a best practice, that hosts participating in a vSAN cluster be similarly or identically configured. Components are a significant sizing consideration when designing and deploying vSAN clusters, as discussed in further detail in Chapter 9, “Designing a vSAN Cluster.”
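
When sizing against this per-host component limit, a rough back-of-the-envelope estimate of the component count can be useful. The following Python sketch makes several simplifying assumptions (RAID-1 only, a single witness per object, no splitting of large components, powered-on VMs) and is not an official sizing formula; Chapter 9 covers sizing in more detail.

    def components_per_vm(vmdks, ftt=1, stripe_width=1, snapshots=0):
        # The VM home namespace and VM swap always use stripe width 1 (see later
        # sections), and each RAID-1 object is assumed to carry one witness.
        fixed_objects = 2 * ((ftt + 1) * 1 + 1)
        # VMDKs and snapshot deltas honor the stripe width set in the policy.
        disk_objects = (vmdks + snapshots) * ((ftt + 1) * stripe_width + 1)
        return fixed_objects + disk_objects

    # Example: 100 powered-on VMs, each with 2 VMDKs, FTT=1, stripe width 1.
    print(100 * components_per_vm(vmdks=2))   # 1200 components across the cluster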

The vSphere Web Client enables administrators to interrogate objects and components of the VM home namespace and the VMDKs of a VM. Figure 5.3 provides an example of one such layout. The VM has one hard disk, which is mirrored across two different hosts, as you can see in the “hosts” column, where it shows the location of the components.

Figure 5.3 - Physical disk placement

Virtual Machine Storage Objects

As stated earlier, the five storage objects are VM home namespace, VM Swap, VMDK, delta disks and snapshot memory, as illustrated in Figure 5.4. We will ignore Snapshot Memory for the moment, and discuss the other objects in a little more detail.

Figure 5.4 - VM storage objects

We will now look at how characteristics defined in the VM storage policy impact these storage objects. Note that not all the VM storage objects implement these policies.

Namespace

Virtual machines use the namespace object as their VM home, and use it to store all of the virtual machine files that are not dedicated objects in their own right. So, for example, this includes, but is not limited to, the following:

  • The .vmx, .vmdk (the descriptor portion), .log files that the VMX uses.
  • Digest files for CBRC (content-based read cache) for VMware Horizon View. This feature is referred to as the View Storage Accelerator. Virtual desktop infrastructure (VDI) is a significant use case for vSAN.
  • vSphere Replication and Site Recovery Manager files.
  • Guest customization files.
  • Files created by other solutions.

These are unique objects and there is one per VM. vSAN leverages VMFS as the file system within the namespace object to store all the files of the VM. This is a fully fledged, vanilla VMFS. That is, it includes the cluster capabilities, so that we can support all the solutions that use locks on VMFS (e.g., vMotion, vSphere High Availability [HA]). This appears as an auto-mounted subdirectory when you examine the ESXi hosts’ file systems. However, although it is a vanilla VMFS being used, it does not have the same limitations as VMFS has in other environments, because the number of hosts connected to these VM home namespace VMFS volumes is at most two (e.g., during a vMotion), whereas in traditional environments the same VMFS volume could be shared between dozens of hosts. In other words, vSAN leverages these vanilla VMFS volumes in a completely different way, allowing for greater scale and better performance.

For the VM home, a special VM storage policy is used. For the most part, the VM home storage object does not inherit all of the same policy requirements as the VMDKs. If you think about it, why would you want to give something like the VM home namespace a percentage of flash read cache (hybrid only) or a stripe width? You wouldn’t, which is why the VM home namespace does not have these settings applied even when they are in the policy associated with the virtual machine. The VM home namespace does, however, inherit the number of failures to tolerate setting. This allows the VM to survive multiple hardware failures in the cluster. With the release of vSAN 6.2, it also inherits the failure tolerance method policy setting, which means that the VM home namespace could be deployed as a RAID-5 or RAID-6 configuration, not just a RAID-1 configuration as was the case in prior versions of vSAN.

So, because high performance is not a major requirement for the VM home namespace storage object, vSAN overwrites the inherited policy settings so that stripe width is always set to 1 and read cache reservation is always set to 0%. It also has object space reservation set to 0% so that it is always thinly provisioned. This avoids the VM home namespace object consuming unnecessary resources, and makes these resources available to objects that might need them, such as VMDKs.
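
The behavior just described (inherit the availability settings, override the performance and capacity settings) can be summarized in a short sketch. The dictionary keys below are made up for illustration and do not correspond to real SPBM capability names.

    def namespace_effective_policy(vm_policy):
        """Return the policy the VM home namespace object actually uses."""
        effective = dict(vm_policy)                    # inherit everything by default
        effective["stripe_width"] = 1                  # performance settings overridden
        effective["read_cache_reservation_pct"] = 0
        effective["object_space_reservation_pct"] = 0  # always thinly provisioned
        # failures_to_tolerate, failure_tolerance_method and force_provisioning
        # pass through unchanged from the VM's assigned policy.
        return effective

    print(namespace_effective_policy({"failures_to_tolerate": 2,
                                      "failure_tolerance_method": "RAID-5/6",
                                      "stripe_width": 4,
                                      "read_cache_reservation_pct": 10,
                                      "object_space_reservation_pct": 100,
                                      "force_provisioning": True}))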

One other important note is that if force provisioning is set in the policy, the VM home namespace object also inherits that, meaning that the VM will be deployed even if the full complement of resources is not available.

In vSAN 6.2, a new performance service was introduced. This service aggregates performance information from all of the ESXi hosts in the cluster and stores the metrics in a stats database on the vSAN datastore. The object in which the “stats DB” is stored is also a namespace object. Therefore, the use of namespace objects is not limited to virtual machines, although this is the most common use.

Virtual Machine Swap

The VM swap object also has its own special policy settings. For the VM swap object, the policy always has number of failures to tolerate set to 1. The main reason for this is that swap does not need to persist when a virtual machine is restarted; if HA restarts the virtual machine on another host elsewhere in the cluster, a new swap object is simply created. There is therefore no need to add additional protection beyond tolerating one failure.

By default, swap objects are provisioned 100% up front, without the need to set object space reservation to 100% in the policy. This means, in terms of admission control, vSAN will not deploy the VM unless there is enough disk space to accommodate the full size of the VM swap object. In vSAN 6.2, a new advanced host option SwapThickProvisionDisabled has been created to allow the VM swap option to be provisioned as a thin object. If this advanced setting is set to true, the VM swap objects will be thinly provisioned.
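
The admission control behavior for the swap object can be thought of along the following lines. The advanced option name is real, but the function below is only an illustration of the decision, not of how vSAN implements it, and it assumes the classic vSphere swap sizing of configured memory minus memory reservation and that both replicas of the FTT=1 swap object are reserved up front.

    def swap_space_required_gb(vm_memory_gb, vm_reservation_gb=0,
                               swap_thick_provision_disabled=False):
        # Classic vSphere swap sizing: configured memory minus memory reservation.
        swap_size = vm_memory_gb - vm_reservation_gb
        if swap_thick_provision_disabled:
            return 0              # thin swap object: no up-front space reservation
        return swap_size * 2      # assumption: both FTT=1 replicas reserved up front

    print(swap_space_required_gb(16))                                      # 32
    print(swap_space_required_gb(16, swap_thick_provision_disabled=True))  # 0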

VMDKs and Deltas

As you have seen, the VM home namespace and VM swap have their own default policies when a VM is deployed and do not adhere to all of the capabilities set in the policy. Therefore, it is only the VMDKs and their snapshot files (delta disks) that obey all the capabilities that are set in the VM storage policies.

Because vSAN objects may be made up of multiple components, each VMDK and delta has its own RAID tree configuration when deployed on vSAN.

Witnesses and Replicas

As part of the RAID-1 tree, each object usually has multiple replicas, which as we have seen, could be made up of one or more components. We mentioned that when we create VM storage objects, one or more witness components may also get created. Witnesses are part of each and every object in the RAID-1 tree. They are components that make up a leaf of the RAID-1 tree, but they contain only metadata. They are there to be tiebreakers and are used for quorum determination in the event of failures in the vSAN cluster.

Let’s take the easiest case to explain their purpose: Suppose, for example, that we have deployed a VM that has a stripe width setting of 1 and it also has number of failures to tolerate setting of 1. We do not wish to use RAID-5 or RAID-6 in this example. In this case, two replica copies of the VM need to be created. Effectively, this is a RAID-1 with two replicas; however, with two replicas, there is no way to differentiate between a network partition and a host failure. Therefore, a third entity called the witness is added to the configuration. For an object on vSAN to be available, two conditions have to be met:

  • The RAID tree must allow for data access. (With a RAID-1 configuration, at least one full replica needs to be intact. With a RAID-0 configuration, all stripes need to be intact.) With the new RAID-5 and RAID-6 configurations, three out of four RAID-5 components must still be available, and four out of the six RAID-6 components must still be available.
  • In the earlier versions of vSAN, the rule was that there must be more than 50% of all components available. With the introduction of votes associated with components in vSAN 6.0, this rule changed to having more than 50% of the votes available.

In the preceding example, only when there is access to one replica copy and a witness, or indeed two replica copies (and no witness), would you be able to access the object. That way, at most one part of the cluster can ever access an object in the event of a network partition.
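
The two availability conditions above can be sketched as a small function. The sketch uses the post-vSAN 6.0 vote-based rule and treats data accessibility as a separate check; the field names are illustrative only.

    def object_available(components, data_accessible):
        """components: list of dicts with 'votes' (int) and 'alive' (bool).

        An object is available only if the surviving components hold more
        than 50% of all votes AND the RAID tree still allows data access
        (e.g., at least one intact RAID-1 replica).
        """
        total_votes = sum(c["votes"] for c in components)
        live_votes = sum(c["votes"] for c in components if c["alive"])
        return live_votes * 2 > total_votes and data_accessible

    # FTT=1 RAID-1: two replicas plus a witness, one vote each.
    components = [{"votes": 1, "alive": True},    # replica on host A
                  {"votes": 1, "alive": False},   # replica on host B (failed)
                  {"votes": 1, "alive": True}]    # witness on host C
    print(object_available(components, data_accessible=True))   # True: 2 of 3 votes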

A common question is whether the witness consumes any space on the vSAN datastore. In the original vSAN version 5.5, which uses VMFS as the on-disk format, a witness consumes about 2 MB of space for metadata on the vSAN datastore. With the release of vSAN version 6.0, which uses vSANFS as the on-disk format, a witness consumes about 4 MB of space for metadata on the vSAN datastore. Although insignificant to most, it could be something to consider when running through design, sizing, and scaling exercises when planning to deploy many VMs with many VMDKs on vSAN.

Object Layout

The next question people usually ask is how objects are laid out in a vSAN environment. As mentioned, the VM home namespace for storing the VM configuration files is formatted with VMFS. All other VM disk objects (whether VMDKs or snapshots) are instantiated as distributed storage objects in their own right.

Although vSAN takes care of object placement to meet the number of failures to tolerate and failure tolerance method requirements, and an administrator should not have to worry about these placement decisions, we understand that with a new solution you may want a better understanding of the physical placement of components and objects. VMware expected that administrators would have this desire; therefore, the vSphere user interface allows vSphere administrators to interrogate the layout of a VM object and see where each of the components (stripes, replicas, witnesses) that make up a storage object resides, as shown in Figure 5.5.

Figure 5.5 - RAID-1, RAID-0, and witnesses

vSAN will never let components of different replicas (mirrors) share the same host for availability purposes.

Note that in versions of vSAN prior to vSAN 6.2, we do not see the VM swap file objects. This is because the swap file UUID (universally unique identifier) was not available through the VIM (virtual infrastructure management) application programming interface (API), so neither the Ruby vSphere Console (also known as RVC, covered later in Chapter 10, “Troubleshooting, Monitoring, and Performance”) nor the vSphere Web Client could show this information. However, there is a method to retrieve swap file information in previous versions of vSAN, as demonstrated shortly. Snapshot/delta disk objects are not visible in the vSphere user interface (UI) either in versions of vSAN prior to 6.2, but these objects implicitly inherit the same policy settings as the VMDK base disk against which the snapshot is taken.

In vSAN 6.2, both the VM swap objects and the snapshot deltas are visible via the vSphere Web Client. If administrators navigate to the vSAN cluster > Monitor tab and select Virtual Disks, the objects are listed there. We have been talking about the concept of VM storage policies for a while now; let’s now consider this further.

Default VM Storage Policy

VMware encourages administrators to create their own policies with vSAN and not rely on the default policy settings. However, if you decide to deploy a VM on a vSAN datastore without selecting a policy, a default policy is applied. The default policy, called the vSAN default storage policy, has been created with very specific characteristics to prevent administrators from unintentionally putting VMs and the associated data at risk when, for whatever reason, a policy is not selected. We saw this happen fairly often in the early versions of vSAN when administrators created VMs in a hurry and simply forgot to select a policy. However, we do need to stress that VMware strongly encourages administrators to create their own VM storage policies, even when the requirements are the same as those in the default policy. For one thing, it enables the administrator to do meaningful compliance reporting.

The default policy can be observed from the vSphere web client. Let’s inspect it in Figure 5.6:

Figure 5.6 - vSAN default storage policy

From this, we can deduce that the storage objects will always be deployed with number of failures to tolerate set to 1. What is missing from the view in Figure 5.6 is the failure tolerance method policy setting, which by default is set to RAID-1 (mirroring) - Performance. This means that neither RAID-5 nor RAID-6 configurations are used; in other words, this storage object will be deployed as a RAID-1 mirror. If the failure tolerance method policy is set to RAID-5/6 (erasure coding) - Capacity, then RAID-5 or RAID-6 is implemented, based on the value of number of failures to tolerate. This relationship is discussed in detail in Chapter 4. Figure 5.7 shows the various settings that the failure tolerance method can take in a policy.

Figure 5.7 - Fault tolerance method
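
The relationship between number of failures to tolerate, the failure tolerance method, and the resulting layout (covered in detail in Chapter 4) can be summarized in a small lookup. The minimum host counts below exclude any additional hosts you might want for rebuild capacity, and the policy strings are shortened forms of the UI labels.

    def layout_for_policy(failures_to_tolerate, failure_tolerance_method):
        """Return (RAID level, minimum hosts) for a given policy combination."""
        if failure_tolerance_method == "RAID-1 (mirroring) - Performance":
            # n+1 replicas plus witness hosts for quorum.
            return "RAID-1", 2 * failures_to_tolerate + 1
        if failures_to_tolerate == 1:
            return "RAID-5", 4      # 3 data + 1 parity
        if failures_to_tolerate == 2:
            return "RAID-6", 6      # 4 data + 2 parity
        raise ValueError("RAID-5/6 (erasure coding) supports FTT of 1 or 2 only")

    print(layout_for_policy(1, "RAID-1 (mirroring) - Performance"))      # ('RAID-1', 3)
    print(layout_for_policy(2, "RAID-5/6 (erasure coding) - Capacity"))  # ('RAID-6', 6)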

Another point to make about the default policy is that checksum will be enabled; in other words, the disable checksum capability will be set to false.

A final point is related to force provisioning. By default, force provisioning is set to No; it is only explicitly enabled for the VM swap object. If we wanted to, for instance, bootstrap a vCenter Server onto a single-host vSAN cluster, that would not be possible with the current policy settings. The creation of the VM would fail because the default policy has number of failures to tolerate set to 1, and with only a single host available in the vSAN cluster, vSAN cannot adhere to this requirement. To allow this to work, a change to the default policy is needed: force provisioning will need to be set to “Yes.”

We will shortly take a look at what an administrator can define in a policy rather than relying on the default policy.

vSAN Software Components

This section briefly outlines some of the software components that make up the distributed software layer.

Much of this information will not be of particular use to vSphere administrators on a day-to-day basis. All of this complexity is hidden away by how VMware has implemented the installation and configuration of vSAN in a few simple mouse clicks. However, we did want to highlight some of the major components behind the scenes for you because, as mentioned in the introduction, you may see messages related to these components appearing in the vSphere UI and the VMkernel logs from time to time, and we want to provide you with some background on the function of these components. Also, when you begin to use the RVC in Chapter 10, a number of the outputs will refer to these software components, which is another reason why we are including this brief outline.

The vSAN architecture consists of four major components, as illustrated in Figure 5.8 and described in more depth in the sections that follow.

Figure 5.8 - vSAN software components

Component Management

The vSAN local log structured object manager (LSOM) works at the physical disk level. It is the LSOM that provides for the storage of VM storage object components on the local disks of the ESXi hosts, and it includes both the read caching (for hybrid configurations) and write buffering (for both hybrid and all-flash configuration) for these objects. When we talk in terms of components, we are talking about one of the striped components that make up a RAID-0 configuration, or one of the replicas that makes up a RAID-1 configuration. It could also be a data or parity segment for RAID-5 or RAID-6 configurations. Therefore, LSOM works with the magnetic disks, solid-state disks (SSDs) and flash devices on the ESXi hosts.

Another way of describing the LSOM is to state that it is responsible for providing persistence of storage for the vSAN cluster. By this, we mean that it stores the components that make up VM storage objects as well as any configuration information and the VM storage policy.

LSOM reports events for these devices, for example, if a device has become unhealthy. The LSOM is also responsible for retrying I/O if transient device errors occur.

LSOM also aids in the recovery of objects. On every ESXi host boot, LSOM performs an SSD log recovery. This entails a read of the entire log that ensures that the in-memory state is up to date and correct. This means that a reboot of an ESXi host that is participating in a vSAN cluster can take longer than an ESXi host that is not participating in a vSAN cluster.

Data Paths for Objects

The distributed object manager (DOM) provides distributed data access paths to objects built from local (LSOM) components. The DOM is responsible for the creation of reliable, fault-tolerant VM storage objects from local components across multiple ESXi hosts in the vSAN cluster. It does this by implementing distributed RAID types for objects.

DOM is also responsible for handling different types of failures, such as I/O failures on a device or an inability to contact a host. In the event of an unexpected host failure, during recovery DOM must resynchronize all the components that make up every object. Components publish a bytesToSync value periodically to show the progress of a synchronization operation. This can be monitored via the vSphere Web Client UI when recovery operations are taking place.

Object Ownership

We discuss object owners from time to time in this chapter. Let’s elaborate a little more about what object ownership is. For every storage object in the cluster, vSAN elects an owner for the object. The owner can be considered the storage head responsible for coordinating (internally within vSAN) who can do I/O to the object. The owner basically is the entity that ensures consistent data on the distributed object by performing a transaction for every operation that modifies the data/metadata of the object.

As an analogy, in NFS configurations, consider the concept of NFS server and NFS client. Only certain clients can communicate successfully with the server. In this case, the vSAN storage object owner can be considered along the same lines as an NFS server, determining which clients can do I/O and which clients cannot. The final part of object ownership is the concept of a component manager. The component manager can be thought of as the network front end of LSOM (in other words, how a storage object in vSAN can be accessed).

An object owner communicates to the component manager to find the leaves on the RAID tree that contain the components of the storage object. Typically, there is only one client accessing the object. However, in the case of a vMotion operation, multiple clients may be accessing the same object. In the vast majority of cases, the owner entity and client co-reside on the same node in the vSAN cluster.

Placement and Migration for Objects

The cluster level object manager (CLOM) is responsible for ensuring that an object has a configuration that matches its policy (i.e., the requested stripe width is implemented or that there are a sufficient number of mirrors/replicas in place to meet the availability requirement of the VM). Effectively, CLOM takes the policy assigned to an object and applies a variety of heuristics to find a configuration in the current cluster that will meet that policy. It does this while also load balancing the resource utilization across all the nodes in the vSAN cluster.

DOM then applies a configuration as dictated by CLOM. CLOM distributes components across the various ESXi hosts in the cluster. CLOM tries to create some sort of balance, but it is not unusual for some hosts to have more components, capacity used/reserved, or flash read cache used/reserved than others.

Each node in a vSAN cluster runs an instance of CLOM, called clomd. Each instance of CLOM is responsible for the configurations and policy compliance of the objects owned by the DOM on the ESXi host where it runs, so it needs to communicate with the cluster monitoring, membership, and directory service (CMMDS) to be aware of ownership transitions. CLOM only communicates with entities on the node where it runs. It does not use the network.

Cluster Monitoring, Membership, and Directory Services

The purpose of cluster monitoring, membership, and directory services (CMMDS) is to discover, establish, and maintain a cluster of networked node members. It manages the physical cluster resources inventory of items such as hosts, devices, and networks and stores object metadata information such as policies, distributed RAID configuration, and so on in an in-memory database. The object metadata is always also persisted on disk. It is also responsible for the detection of failures in nodes and network paths.

Other software components browse the directory and subscribe to updates to learn of changes in cluster topology and object configuration. For instance, DOM can use the content of the directory to determine the nodes storing the components of an object and the paths by which those nodes are reachable.

Note that CMMDS forms a cluster (and elects a master) only if there is multicast network connectivity between the hosts.

CMMDS is used to elect “owners” for objects. The owner of an object will perform all the RAID tasks for a particular object, as discussed earlier.

Host Roles (Master, Slave, Agent)

When a vSAN cluster is formed, you may notice through esxcli commands that each ESXi host in a vSAN cluster has a particular role. These roles are for the vSAN clustering service only. The clustering service (CMMDS) is responsible for maintaining an updated directory of disks, disk groups, and objects that resides on each ESXi host in the vSAN cluster. This has nothing to do with managing objects in the cluster or doing I/O to an object, by the way; it is simply to allow nodes in the cluster to keep track of one another. The clustering service is based on a master (with a backup) and agents, where all nodes send updates to the master and the master then redistributes them to the agents, using a reliable ordered multicast protocol that is specific to vSAN. This is the reason why the vSAN network must be able to handle multicast traffic, as discussed in the earlier chapters of this book. Roles are applied during cluster discovery, at which time the ESXi hosts participating in the vSAN cluster elect the master. A vSphere administrator has no control over which role a cluster member takes.

A common question is why a backup role is needed. The reason is that if the ESXi host currently holding the master role suffers a catastrophic failure and there is no backup, all ESXi hosts must reconcile their entire view of the directory with the newly elected master. This would mean that all the nodes in the cluster might be sending their entire directory contents, from their respective views of the cluster, to the new master. Having a backup negates the requirement to send all of this information over the network, and thus speeds up the process of electing a new master node.

In the case of vSAN stretched clusters, a configuration introduced in vSAN 6.1 that allows nodes in a vSAN cluster to be geographically dispersed across different sites, the master node will reside on one site whilst the backup node will reside on the other site.

An important point to make is that, to a user or even a vSphere administrator, the ESXi node that is elected to the master role has no special features or other visible differences. Because the master is automatically elected, even after failures, and because the node has no user-visible difference in abilities, it does not matter whether operations are performed on the master node or on any other node.

Reliable Datagram Transport

The reliable datagram transport (RDT) is the communication mechanism within vSAN. It uses TCP (Transmission Control Protocol) at the transport layer. It creates and tears down TCP connections (sockets) on demand.

RDT is built on top of the vSAN clustering service. The clustering service uses heartbeats to determine link state. If a link failure is detected, RDT will drop connections on the path and choose a different healthy path.

When an operation needs to be performed on a vSAN object, DOM uses RDT to talk to the owner of the vSAN object. Because the RDT promises reliable delivery, users of the RDT can rely on it to retry requests after path or node failures, which may result in a change of object ownership and hence a new path to the owner of the object. The CMMDS (via its heart beating and monitoring functions) and the RDT are responsible for handling timeouts and path failures.

On-Disk Formats

Before looking at the various I/O-related flows, let’s briefly discuss the on-disk formats used by vSAN for the different types of devices used in a vSAN configuration.

Cache Devices

VMware uses its own proprietary on-disk format for the flash devices used in the cache layer by vSAN. In hybrid configurations, which have both a read cache and a write buffer, the read cache portion of the flash device has its own on-disk format, and there is also a log-structured format for the write buffer portion of the flash device. In the case of all-flash configurations, there is only a write buffer; there is no read cache. Both formats are specially designed to boost the endurance of the flash device beyond the basic functionality provided by the flash device firmware.

Capacity Devices

It may come as a surprise to some, but in the original vSAN 5.5 release, VMware used the Virtual Machine File System (VMFS) as the on-disk format for vSAN. However, this was not the traditional VMFS. Instead, there is a new format unique to vSAN called VMFS local (VMFS-L). VMFS-L is the on-disk file system format of the local storage on each ESXi host in vSAN. The standard VMFS file system is specifically designed to work in clustered environments where many hosts are sharing a datastore. It was not designed with single-host/local disk environments in mind, and certainly not distributed datastores. VMFS-L was introduced for use cases like distributed storage. Primarily, the clustered on-disk locking and associated heartbeats on VMFS were removed. These are necessary only when many hosts share the file system. They are unnecessary when only a single host is using it. Now instead of placing a SCSI reservation on the volume to place a lock on the metadata, a new lock manager is implemented that avoids using SCSI reservation completely. VMFS-L does not require on-disk heartbeating either. Now it simply updates an in-memory copy of the heartbeat (because no other host needs to know about the lock). Tests have shown that VMFS-L can provision disks in about half the time of standard VMFS with these changes incorporated.

In vSAN 6.0, a new on-disk format was introduced. This was based on the VirstoFS, a high performance, sparse filesystem from a company called Virsto that VMware acquired. This is referred to as the v2 format, and improved the performance of snapshots (through a new vsanSparse format) and clones on vSAN 6.0. Customers could upgrade from VMFS-L (v1) to vSANFS (v2) through a seamless rolling upgrade process, where the content of each host’s disk group was evacuated elsewhere in the cluster, the disk group on the host was removed and recreated with the new v2 on-disk format, and this process was repeated until all disk groups were upgraded.

In vSAN 6.2, to accommodate new features and functionality such as deduplication, compression and checksum, another on-disk format (v3) is introduced. This continues to be based on vSANFS, but has some additional features for the new functionality.

vSAN I/O Flow

In the next couple of paragraphs, we will trace the I/O flow on both a read and a write operation from an application within a guest OS when the VM is deployed on a vSAN datastore. We will look at a read operation when the stripe width value is set to 2, and we will look at a write operation when the number of failures to tolerate is set to 1 using RAID-1. This will give you an understanding of the underlying I/O flow, and this can be leveraged to get an understanding of the I/O flows when other capability values are specified. We will also discuss the destaging to the capacity layer, as this is where deduplication and compression comes in to play. Before we do, let’s first look at the role of flash in the I/O path.

Caching Algorithms

There are different caching algorithms in place for the hybrid configurations and the all-flash configurations. In a nutshell, the caching algorithm on hybrid configurations is concerned with optimally destaging blocks from the cache tier to the capacity tier, whilst the caching algorithm on all-flash configurations is concerned with ensuring that hot blocks (data that is live) are held in the caching tier while cold blocks (data that is not being accessed) are held in the capacity tier.

The Role of the Cache Layer

As mentioned in the previous section, SSDs (and by that we also mean flash devices) have two purposes in vSAN when they are used in the caching layer of hybrid configurations; they act as both a read cache and a write buffer. This dramatically improves the performance of VMs. In some respects, vSAN can be compared to a number of “hybrid” storage solutions on the market, which also use a combination of SSD and magnetic disks to increase the performance of the I/O, but which have the ability to scale out capacity based on low-cost SATA or SAS magnetic disk drives.

There is no read cache in all-flash vSAN configurations; the caching tier acts as a write buffer only.

Purpose of Read Cache

The purpose of the read cache in hybrid configurations is to maintain a list of commonly accessed disk blocks by VMs. This reduces the I/O read latency in the event of a cache hit; that is, the disk block is in cache and does not have to be retrieved from magnetic disk. The actual block that is being read by the application running in the VM may not be on the same ESXi host where the VM is running. In this case, DOM picks a mirror for a given read (based on offset) and sends it to the correct component. This is then sent to LSOM to find whether the block is in the cache. If it transpires that there is a cache miss, the data are retrieved directly from magnetic disk in the capacity tier, but of course this will incur a latency penalty and could also impact the number of input/output operations per second (IOPS) achievable by vSAN. This is the purpose of having a read cache on hybrid vSAN configurations, as it reduces the number of IOPS that need to be sent to magnetic disks. The goal is to have a minimum read cache hit rate of 90%. vSAN also has a read ahead cache optimization where 1 MB of data around the data block being read is also brought into cache.

vSAN always tries to make sure that it sends a given read request to the same mirror so that the block only gets cached once in the cluster; in other words, it is cached only on one cache device, and that cache device is on the ESXi host that contains the mirror where the read requests are sent. Because cache space is relatively expensive, this mechanism optimizes how much cache you require for vSAN. Correctly sizing vSAN cache has a very significant impact on performance in steady state.

Why is there no read cache in All-Flash vSAN configurations?

In all-flash vSAN configurations, the capacity layer is also flash, so if there is a read cache miss, fetching the data block from the capacity tier is not nearly as expensive as fetching a data block from the capacity tier in a hybrid solution that uses spinning disk. Instead, it is actually a very quick (typically sub-millisecond) operation. Therefore, it is not necessary to have a flash-based read cache in all-flash vSAN configurations, since the capacity tier can handle reads effectively. By not implementing a read cache, we also free up the cache tier for more writes, boosting overall performance.

Purpose of Write Cache

The write cache behaves as a write-back buffer in both all-flash and hybrid vSAN configurations. Writes are acknowledged when they enter the prepare stage on the flash devices used in the cache tier. The fact that we can use flash devices for writes in hybrid configurations significantly reduces the latency of write operations, since the writes do not have to be destaged to the capacity tier before they are acknowledged.

Because the writes go to the cache tier flash devices, we must ensure that there is a copy of the data block elsewhere in the vSAN cluster. All VMs deployed on vSAN have an availability policy setting that ensures at least one additional copy of the virtual machine data is available (unless, of course, administrators explicitly override the default policy and choose a failures to tolerate setting of 0). This availability policy includes the write cache contents. Once a write is initiated by the application running inside of the guest OS, the write is sent to all replicas in parallel. Writes are buffered in the cache tier flash device associated with the disk group where the components of the VMDK storage object reside.

This means that in the event of a host failure, we also have a copy of the in-cache data and so no corruption will happen to the data; the virtual machine will simply reuse the replicated copy of the cache as well as the replicated disk data.

Note that all-flash vSAN configurations continue to use the cache tier as a write buffer, and all virtual machine writes land first on this cache device, just as in hybrid configurations. The major algorithm change here is how the write cache is used. The write cache is now used to hold “hot” blocks of data (data that is in a state of change). Only when the blocks become “cold” (no longer updated/written) are they moved to the capacity tier.

Anatomy of a vSAN Read on Hybrid vSAN

For an object placed on a vSAN datastore, when using a RAID-1 configuration, it is possible that there are multiple replicas when the number of failures to tolerate value is set to a value greater than 0 in the VM storage policy. Reads may now be spread across the replicas. Different reads may be sent to different replicas according to their logical block address (LBA) on disk. This is to ensure that vSAN does not necessarily consume more read cache than necessary, and avoids caching the same data in multiple locations.
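
The exact read-balancing algorithm is internal to vSAN, but the idea of deterministically spreading reads across replicas by logical block address can be sketched as follows. The chunk size and modulo scheme here are assumptions chosen purely to illustrate why a given block always lands on, and is therefore only cached by, the same mirror.

    CHUNK_SIZE = 1 * 1024 * 1024   # assume reads are distributed in 1 MB regions

    def replica_for_read(lba_offset_bytes, num_replicas):
        """Pick a replica so the same block always hits the same mirror."""
        return (lba_offset_bytes // CHUNK_SIZE) % num_replicas

    # Two reads of the same offset always go to the same replica, so the block
    # is only ever cached on one host's cache device.
    print(replica_for_read(0, 2), replica_for_read(0, 2))   # 0 0
    print(replica_for_read(5 * 1024 * 1024, 2))             # 1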

Take the example of an application in the VM issuing a read request. The cluster service (CMMDS) is first consulted to determine the owner of the data. The owner, using the logical block address (LBA), determines which component will service the request and sends the request there. The request then goes to LSOM to determine whether the block is in read cache. If the block is present in read cache, the read is serviced from that read cache. If a read cache miss occurs, and the block is not in cache, the next step is to read the data from the capacity tier, which on hybrid vSAN configurations is made up of magnetic disks.

As mentioned, the owner of the object splits the reads across the components that make up that object, so that a given block is cached on at most one node, maximizing the effectiveness of the cache tier. In many cases, the data may have to be transferred over the network if it resides on the storage of a different ESXi host. Once the data is retrieved, it is returned to the requesting ESXi host and the read is served up to the application.

Figure 5.9 gives an idea of the steps involved in a read operation on hybrid vSAN. In this particular example, the stripe width setting is 2, and the VM’s storage object is striped across disks that reside on different hosts. (Each stripe is therefore a component, to use the correct vSAN terminology.) Note that Stripe-1a and Stripe-1b reside on the same host, while Stripe-2a and Stripe-2b reside on different hosts. In this scenario, our read needs to come from Stripe-2b. If the owner does not have the block that the application within the VM wants to read, the read will go over the 10 GbE network to retrieve the data block.

Figure 5.9 - vSAN I/O flow: Failures to tolerate = 1 + stripe width = 2

Anatomy of a vSAN Read on all-flash vSAN

Since there is no read cache in all-flash vSAN clusters, the I/O flow is subtly different when compared to a read operation on hybrid configurations. On an all-flash vSAN, when a read is issued, the write buffer is first checked to see whether the block is present (i.e., is it a hot block?). (The same check is done on hybrid configurations.) If the block being read is in the write buffer, it will be fetched from there. If the requested block is not in the write buffer, the block is fetched from the capacity tier. Remember that the capacity tier is also flash in an all-flash vSAN, so the latency overhead of first checking the cache tier and then having to retrieve the block from the capacity tier is minimal. This is the reason why there is no read cache in all-flash vSAN configurations, and the cache tier is totally dedicated as a write buffer. By not implementing a read cache, as mentioned earlier, we free up the cache tier for more writes, boosting overall IOPS performance.

Anatomy of a vSAN Write on Hybrid vSAN

Now that we know how a read works, let’s take a look at a write operation. When a new VM is deployed, its components are stored on multiple hosts. vSAN does not have the notion of data locality and as such it could be possible that your VM runs on ESXi-01 from a CPU and memory perspective, while the components of the VM are stored on both ESXi-02 and ESXi-03, as shown in Figure 5.10.

Figure 5.10 - vSAN I/O flow: Write acknowledgment

When an application within a VM issues a write operation, the owner of the object clones the write operation. The write is sent to the write cache (SSD flash device) on ESXi-02 and to the write cache (SSD flash device) on ESXi-03 in parallel. The write is acknowledged when it reaches the cache and the prepare operation on the SSD is completed. The owner waits for the ACK from both hosts and then completes the I/O. Later, the writes will be destaged as part of a batch commit to magnetic disk. Destaging happens independently on each host; in other words, ESXi-02 may destage writes at a different time than ESXi-03. This is not coordinated, because it depends on various factors such as how fast the write buffer is filling up, how much capacity is left, and where the data is stored on the magnetic disks.
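
A simplified model of this write path is shown below, with the replica hosts and acknowledgment handling reduced to plain function calls. The real implementation is asynchronous and flows through DOM, LSOM, and RDT; this is only a sketch of the ordering guarantees.

    def write_to_replica(host, block):
        """Stand-in for sending the write to a host's cache-tier flash device.
        Returns True once the prepare operation completes on that host's cache."""
        # ... network send + prepare on the write buffer would happen here ...
        return True

    def vsan_write(replica_hosts, block):
        # The object owner clones the write and sends it to all replicas in parallel.
        acks = [write_to_replica(host, block) for host in replica_hosts]
        # The guest's I/O completes only when every replica has acknowledged the
        # write into its cache tier; destaging to the capacity tier happens later,
        # independently on each host.
        return all(acks)

    print(vsan_write(["esxi-02", "esxi-03"], b"data block"))   # True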

Anatomy of a vSAN Write on all-flash vSAN

A write operation on an all-flash vSAN is very similar to how writes are done in hybrid vSAN configurations. In hybrid configurations, only 30% of the cache tier is dedicated to the write buffer, and the other 70% is assigned to the read cache. Since there is no read cache in all-flash configurations, the full 100% of the cache tier is assigned to the write buffer (up to a maximum of 600 GB in the current version of vSAN). However, the authors understand that this limitation should be increased in the near future.

The role of the cache tier is also subtly different between hybrid and all-flash. As we have seen, the write buffer in hybrid vSAN improves performance because writes do not need to go immediately to the capacity tier made up of magnetic disks, thus improving latency. In all-flash vSAN, the purpose of the write buffer is endurance. A design goal of all-flash vSAN is to place high-endurance flash devices in the cache tier so that they can handle the bulk of the I/O. This allows the capacity tier to use lower-specification flash devices, which do not need to handle the same amount of writes as the cache tier.

Having said that, write operations on all-flash are still very similar to hybrid, insofar as the write is acknowledged only when the block being written is in the write buffer of all of the replicas.

Retiring Writes to Capacity tier on Hybrid vSAN

Writes across the virtual disks of applications and guest OSs running inside a VM deployed on vSAN accumulate in the cache tier over time. On hybrid vSAN configurations, that is, vSAN configurations that use flash devices for the cache tier and magnetic disks for the capacity tier, vSAN implements an elevator algorithm that periodically flushes the data in the write buffer to magnetic disk in address order. The write buffer of the SSD is split into a number of “buckets.” Data blocks, as they are written, are assigned to buckets in increasing LBA (logical block address) order. When destaging occurs, perhaps due to resource constraints, the data in the oldest bucket is destaged first.

vSAN enables write buffering on the magnetic disks to maximize performance. The magnetic disk write buffers are flushed before discarding writes from SSD, however. As mentioned earlier, when destaging writes, vSAN considers the location of the I/O. The data accumulated on a per bucket basis provides a sequential (proximal) workload for the magnetic disk. In other words, the LBAs in close proximity to one another on the magnetic disk are destaged together for improved performance. In fact, this proximal mechanism also provides improved throughput on the capacity tier flash devices for all-flash vSAN configurations.

The heuristics used for this are sophisticated and take into account many parameters such as rate of incoming I/O, queues, disk utilization, and optimal batching. This is a self-tuning algorithm that decides how often writes on the SSD destage to magnetic disk.
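
Conceptually, the bucketed, LBA-ordered destaging described above can be sketched as follows. The bucket span and the trigger for destaging are arbitrary simplifications, not the actual self-tuning heuristics.

    from collections import OrderedDict

    BUCKET_SPAN = 256 * 1024 * 1024        # assumed LBA range covered by one bucket
    write_buffer = OrderedDict()           # bucket id -> {lba: data}, oldest bucket first
    capacity_tier = {}                     # stands in for the magnetic disk

    def buffer_write(lba, data):
        bucket = lba // BUCKET_SPAN
        write_buffer.setdefault(bucket, {})[lba] = data

    def destage_oldest_bucket():
        # Flush the oldest bucket in ascending LBA order, giving the magnetic
        # disk a largely sequential (proximal) workload.
        if not write_buffer:
            return
        _bucket, blocks = write_buffer.popitem(last=False)
        for lba in sorted(blocks):
            capacity_tier[lba] = blocks[lba]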

Deduplication and Compression

vSAN 6.2 introduces two new data reduction features, deduplication and compression. When enabled at the cluster level, vSAN will aim to deduplicate each block and compress the result before destaging the block to the capacity layer. This feature is only available for all-flash vSAN. Compression and deduplication cannot be enabled separately; they are either disabled or enabled together. Deduplication and compression work at the disk group level. In other words, only objects deployed on the same disk group can contribute toward space savings. If components from different but identical VMs are deployed to different disk groups, there will not be any deduplication of identical blocks of data. However, deduplication and compression is a cluster-wide feature—it is either on or off. You cannot choose which virtual machines, or which disk groups, to enable it on.

For components deployed on the same disk group with deduplication and compression enabled, deduplication is done at a 4 KB block level. A disk group will only store one copy of a given 4 KB block, and all duplicate blocks will be eliminated, as shown in Figure 5.11.

Figure 5.11 - Deduplicating blocks

The process of deduplication is done when the block is being destaged from the cache tier to the capacity tier. To track the deduplicated blocks, hash tables are used. The deduplicated data and hash table metadata are spread across the capacity devices that make up the disk group.

Deduplication does not differentiate between the components in the disk group. It may deduplicate blocks in the VM home namespace, VM swap, VMDK object or snapshot delta object.

If a disk group begins to fill up capacity wise, vSAN examines the footprint of the deduplicated components, and moves the ones which will make the most significant difference to the capacity used in the disk group.

Please note however that if deduplication and compression are enabled on a disk group, a single device failure will make the entire disk group appear unhealthy.

Once the block has been deduplicated, vSAN looks to compress that 4 KB block down to 2 KB or less. If vSAN can compress a block down to <=2 KB it keeps the compressed copy. Otherwise the uncompressed block is kept.

The process for this is relatively straightforward, as shown in Figure 5.12. At step 1, the VM writes data to vSAN and it lands on the caching tier. When the data becomes cold and needs to be destaged, vSAN reads the block into memory (step 2). It then computes the hashes, eliminates the duplicates, and compresses the remaining blocks before writing them to the capacity tier (step 3).

Figure 5.12 - Deduplication and compression process

For those interested, vSAN currently uses SHA1 for the deduplication hash and uses LZ4 for the compression. Of course, this may change in future releases.
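
Putting the destage-time pipeline together, the whole flow can be sketched in a few lines. SHA-1 and the 2 KB compression threshold come from the description above; everything else (the data structures, and zlib standing in for LZ4) is illustrative only.

    import hashlib
    import zlib   # stands in for LZ4, which Python's standard library does not include

    hash_table = {}          # digest -> location of the stored block
    capacity_tier = []       # stored (possibly compressed) 4 KB blocks

    def destage_block(block_4k):
        digest = hashlib.sha1(block_4k).hexdigest()
        if digest in hash_table:
            return hash_table[digest]            # duplicate: reference existing copy
        compressed = zlib.compress(block_4k)
        # Keep the compressed copy only if it fits in 2 KB or less.
        stored = compressed if len(compressed) <= 2048 else block_4k
        capacity_tier.append(stored)
        hash_table[digest] = len(capacity_tier) - 1
        return hash_table[digest]

    # Two identical 4 KB blocks consume the space of one.
    destage_block(b"\x00" * 4096)
    destage_block(b"\x00" * 4096)
    print(len(capacity_tier))    # 1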

Data Locality

A question that usually comes up at this point is this: What about data locality? Is cache (for instance) kept local to the VM? Do the VM cache and the VMDK storage object need to travel with the VM each time vSphere Distributed Resource Scheduler (DRS) migrates a VM due to a compute imbalance? Prior to vSAN 6.2 the answer for a standard vSAN deployment was no; vSAN does not have the concept of data locality. However, with vSAN 6.2 some changes have been introduced to the architecture. There are different layers of cache; let’s list them first to explain where “locality” applies and where it does not.

  • Flash based write cache
  • Flash based read cache (with hybrid)
  • Memory based read cache

For flash-based caches there is no locality principle. The reason for this is straightforward: Considering that a read I/O is at most one network hop away, and that the latency incurred on 10 GbE is minimal compared to, for instance, flash latency and even kernel latency, the cost of moving data around simply does not outweigh the benefits. This is especially true when you consider that, by default, vSphere DRS runs at least once every 5 minutes, which can result in VMs being migrated to a different host every 5 minutes. Considering the cost of flash and the size of flash-based caches, moving data around in these flash-based cache tiers is simply not cost effective. vSAN instead focuses on load balancing storage resources across the cluster in the most efficient and optimal way, because this is more beneficial and cost-effective for vSAN.

Having said that, as of 6.2 vSAN also has a small in-memory read cache. Small in this case means 0.4% of a host's memory capacity, up to a maximum of 1 GB per host. This in-memory cache is a client-side cache, meaning that the blocks of a VM are cached in memory on the host where the VM is located. When the VM migrates, the cache will need to be warmed up again because the client-side cache is invalidated. Note, though, that in most cases hot data already resides in the flash read cache or the write cache layer, so the performance impact is low.
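As a quick illustration of the sizing rule just mentioned (0.4% of host memory, capped at 1 GB), the following sketch shows the arithmetic; the function name is our own.

def in_memory_read_cache_gb(host_memory_gb: float) -> float:
    """Client-side in-memory read cache: 0.4% of host memory, capped at 1 GB per host."""
    return min(host_memory_gb * 0.004, 1.0)

print(in_memory_read_cache_gb(128))  # 0.512 GB
print(in_memory_read_cache_gb(512))  # 1.0 GB - the cap applies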

If there is a specific requirement to provide an additional form of data locality, however, it is good to know that vSAN integrates with CBRC (in memory read cache for VMware View), and this can be enabled without the need to make any changes to your vSAN configuration. Note that CBRC does not need a specific object or component created on the vSAN datastore; the CBRC digests are stored in the VM home namespace object.

Data Locality in vSAN Stretched Clusters

There is a caveat to this treatment of data locality, and that is vSAN stretched cluster deployments. vSAN stretched clusters were introduced in vSAN 6.1. These clusters allow hosts in a vSAN cluster to be deployed at different, geographically dispersed sites. In a vSAN stretched cluster, one mirror of the data is located at site 1 and the other mirror is located at site 2 (vSAN stretched cluster only supports RAID-1 currently). Previously we mentioned that vSAN implements a sort of round-robin policy when it comes to reading from mirrors. This would not be suitable for vSAN stretched clusters, as 50% of the reads would need to traverse the link to the remote site. Since VMware supports latency of up to 5 ms between the sites, this would have an adverse effect on the performance of the virtual machine. Rather than continuing to read in a round-robin, block offset fashion, vSAN now has the smarts to figure out which site a VM is running on in a stretched cluster configuration, and changes its read algorithm to do 100% of its reads from the mirror/replica on the local site. This means that no reads are done across the link during steady-state operations. It also means that all of the caching is done on the local site, or even on the local host using the in-memory cache. This avoids incurring any additional latency, as reads do not have to traverse the inter-site link. Note that this is not read locality on a per-host basis; it is read locality on a per-site basis. On the same site, the VM's compute could be on any of the ESXi hosts while its local data object could be on any other ESXi host.

Storage Policy-Based Management

In the introduction of the book, you learned that storage policy-based management (SPBM) plays a major role in the policies and automation for VMware's software-defined storage vision. Chapter 4 covered some of the basics of SPBM in combination with vSAN and showed that by using SPBM administrators can specify a set of requirements for a virtual machine. More specifically, it defines a set of requirements for the application running in the virtual machine. This set of requirements is pushed down to the storage layer, and the storage layer checks whether the storage objects for this virtual machine can be instantiated with this set of requirements. For instance, are there enough physical disks available to meet the stripe width if this is a requirement placed in the policy? Are there enough hosts in the cluster to meet the number of failures to tolerate if this is a requirement placed in the policy? If the requirements can be met, the vSAN datastore is said to be a matching resource, and is highlighted as such in the provisioning wizard. Later, when the virtual machine is deployed, administrators can check whether the virtual machine is compliant from a storage perspective in its own summary window. If the vSAN datastore is overcommitted, or cannot meet the striping performance requirement, it is not shown as a matching resource in the deployment wizard. If the virtual machine is still deployed to the vSAN datastore even though it shows up as not matching (perhaps through force provisioning), the VM will be displayed as noncompliant in its summary window. The VM may also show up as noncompliant if a failure has occurred in the cluster and the policy requirements, such as number of failures to tolerate, can no longer be met.

To summarize, SPBM provides an automated policy-driven mechanism for selecting an appropriate datastore for virtual machines in a traditional environment based on the requirements placed in the VM’s storage policy. Within a vSAN-enabled environment, SPBM determines how virtual machines are provisioned and laid out.

Let’s now take a closer look at the concept of SPBM for vSAN-based infrastructures.

vSAN Capabilities

This section examines the vSAN capabilities that can be placed in a VM storage policy. These capabilities, which are surfaced by the VASA provider for the vSAN datastore when the cluster is configured successfully, highlight the availability and performance policies that can be set on a per-object basis. We explicitly did not say per virtual machine, as policies can be set at a per-virtual-disk granularity.

If you overcommit on the capabilities (i.e., put a capability in the policy that cannot be met by the vSAN datastore), the vSAN datastore will no longer appear as a matching resource during provisioning, and the VM will also show noncompliant in its summary tab even if you successfully deploy.

While on the subject of vSAN capabilities, let’s revisit them and discuss them in more detail.

Number of Failures to Tolerate Policy (FTT) Setting

In Chapter 4, we looked at which of the VM storage policy settings affect VM storage objects. With that in mind, let’s look at number of failures to tolerate (FTT for short) in greater detail; this is probably the most used capability vSAN has to offer.

This capability sets a requirement on the storage object to tolerate at least n failures. In this case, n refers to the number of concurrent host, network, or disk failures that may occur in the cluster while still ensuring the availability of the VM’s storage objects, and thus allowing the virtual machine to continue to run or be restarted by vSphere HA depending on the type of failure that occurred. If this property is populated in the VM storage policy, and RAID-1 is the chosen configuration over RAID-5 and RAID-6, then the storage objects must contain at least n+1 replicas.

Note that a virtual machine will remain accessible on a vSAN datastore only as long as its storage objects remain available. To recap what has been discussed before, a virtual machine deployed on the vSAN datastore will have a number of storage objects, such as the VM home namespace, VM swap, VMDKs, snapshot deltas, and snapshot memory. For each of these objects to remain accessible, more than 50% of the components (votes) that make up the object must be available.

Let’s take a simple policy example to clarify. If you deploy a virtual machine that has number of failures to tolerate = 1 as the only policy setting (implying a RAID-1 configuration), you may see the VMDK storage object deployed, as shown in Figure 5.13.

Figure 5.13 - Simple physical disk placement example with failures to tolerate = 1

It is important to understand that storage objects are made up of components. There are two components making up the RAID-1 mirrored storage object: one on host 17 and the other on host 19. These are the mirror replicas of the data that make failure tolerance possible. But what is this witness component on host 18? Well, remember that more than 50% of the components (votes) must remain available. In this example, if you didn't have a witness and host 17 failed, you would lose one component (50%). Even though you would still have a valid replica available, you would no longer have more than 50% of the components (votes) available. This is the reason for the witness disk. The witness is also used to determine who is still in the cluster in the case of split-brain scenarios.

The witness itself is essentially a piece of metadata; it is only about 4 MB in size, and so it doesn’t consume a lot of space. As you create storage objects with more and more components, additional witnesses may get created, as shown in Figure 5.14. This is entirely dependent on the RAID-1 configuration and how vSAN decides to place components.

Figure 5.14 - Additional witnesses may get created

Now that you understand the concept of the witness, the next question is this: How many hosts do you need in the vSAN cluster to accommodate n number of failures to tolerate with RAID-1? Table 5.1 outlines this solution.

Number of Failures to Tolerate (n) | Number of RAID-1 Replicas (n + 1) | Number of Hosts Required in vSAN Cluster (2n + 1)
--- | --- | ---
1 | 2 | 3
2 | 3 | 5
3 | 4 | 7

Table 5.1 - Relationship Between Failures to Tolerate, Replicas, and Hosts Required
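The relationship in Table 5.1 can be expressed as a simple calculation. The helper below is only a convenience sketch (the function name is ours), but it reproduces the table:

def raid1_requirements(ftt: int) -> tuple:
    """RAID-1: an object needs ftt + 1 replicas and 2 * ftt + 1 hosts (witnesses included)."""
    return (ftt + 1, 2 * ftt + 1)

for ftt in (1, 2, 3):
    replicas, hosts = raid1_requirements(ftt)
    print(ftt, replicas, hosts)  # matches the rows of Table 5.1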

If you try to specify a number of failures to tolerate value that is greater than that which can be accommodated by the vSAN cluster, you will not be allowed to do so. Figure 5.15 depicts an example of trying to set number of failures to tolerate to 2 in a three-node cluster.

Figure 5.15 - Failures to tolerate requires a specific number of hosts

Note that the other RAID types introduced in vSAN 6.2, namely RAID-5 and RAID-6, have a pre-defined number of hosts and a pre-defined number of failures to tolerate, as shown in Table 5.2.

RAID Configuration | Number of Failures to Tolerate (n) | Number of Hosts Required in vSAN Cluster
--- | --- | ---
RAID-5 | 1 | 4
RAID-6 | 2 | 6

Table 5.2 - Relationship Between Failures to Tolerate and Hosts Required with RAID-5 and RAID-6

Best Practice for Number of Failures to Tolerate

The recommended practice (with RAID-1 configurations) for number of failures to tolerate is 1, unless you have a pressing concern to provide additional availability to allow VMs to tolerate more than one failure. Note that increasing number of failures to tolerate would require additional disk capacity to be available for the creation of the extra replicas.

vSAN has multiple management workflows to warn/protect against accidental decommissioning of hosts that could result in vSAN’s being unable to meet the number of failures to tolerate policy of given VMs. This includes a noncompliant state being shown in the VM summary tab.

Then the question arises: What is the minimal number of hosts for a vSAN cluster? In vSAN 6.1, a new 2-node configuration for remote office/branch office solutions was introduced, but this also required a witness appliance. If we omit the 2-node configuration for the moment, customers would for the most part require three ESXi hosts to deploy vSAN. However, what about scenarios where you need to do maintenance and want to maintain the same level of availability during maintenance hours?

To comply with a number of failures to tolerate = 1 policy, you need three hosts at a minimum at all times. Even if one host fails, you can still access your data, because with three hosts, two mirror copies, and a witness, you will still have more than 50% of your components (votes) available. But what happens when you place one of those hosts in maintenance mode, as shown in Figure 5.16?

Figure 5.16 - vSAN: Minimum number of hosts

As long as both remaining hosts keep functioning as expected, all VMs will continue to run. If another host fails or is placed into maintenance mode, you have a challenge: at this point, fewer than 50% of the components of your VM's objects remain available. As a result, VMs cannot be restarted (nor do any I/O) because the object will not have an owner.

Stripe Width Policy Setting

The second most popular capability is definitely number of disk stripes per object. We will refer to this as stripe width to keep things simple. The first thing to discuss is when striping can be beneficial to virtual machines in a vSAN environment. The reason we bring this up is that you need to be aware that all I/O in vSAN goes to the cache tier, which is made up of flash. To be more precise, all writes go to the write buffer on flash, and on hybrid configurations, vSAN tries to service all reads from the read cache on the cache tier, too. Some reads may have to be serviced by magnetic disk if the data are not in cache (read cache miss), however.

So, where can a stripe width increase help?

  • It may be possible to improve write performance of a VM’s storage objects if they are striped across different disk groups or indeed striped to a capacity device located on another host.
  • When there are read cache misses in hybrid configurations.
  • Possible performance improvements during destaging of blocks from cache tier flash devices to magnetic disk in hybrid configurations.

Let’s elaborate on these a bit more.

Performance: Writes

Because all writes go to the cache tier (write buffer), the value of an increased stripe width may or may not improve performance for your virtual machine I/O. This is because there is no guarantee that the new stripe will use a different cache tier device; the new stripe may be placed on a capacity device in the same disk group, and thus the new stripe will use the same cache tier. This is applicable to both all-flash and hybrid configurations.

There are three different scenarios for stripes:

  • Striping across hosts: Improved performance with different cache tier flash devices
  • Striping across disk groups: Improved performance with different cache tier flash devices
  • Striping in the same disk group: No improved performance (using same cache tier flash device)

At the moment, vSAN does not have data gravity/locality of reference outside of vSAN stretched clusters, so it is not possible to stipulate where a particular component belonging to a storage object should be placed. This is left up to the vSAN component placement algorithms, which try to place storage components down on disk in a balanced fashion across all nodes in the cluster.

Therefore, in conclusion, increasing the stripe width may not result in any improved write performance for VM I/O, but it does allow for the potential of improved performance.

Performance: Read Cache Misses in Hybrid Configurations

Let’s look at the next reason for increasing the stripe width policy setting; this is probably the primary reason for doing so. In situations where the data set of a virtual machine is too big or the workload is so random that the read cache miss rates can overwhelm the throughput of a single magnetic disk in a hybrid configuration, it can be beneficial to ensure that multiple magnetic disks are used when reading. This can be achieved by setting a stripe width.

From a read perspective, an increased stripe width helps when you are experiencing many read cache misses. If you consider the example of a virtual machine consuming 2,000 read operations per second and experiencing a hit rate of 90%, there are 200 read operations that need to be serviced from magnetic disk. In this case, a single magnetic disk that can provide 150 IOPS cannot service all of those read operations, so an increase in stripe width would help on this occasion to meet the VM I/O requirements.
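The arithmetic behind this example is worth spelling out. The sketch below uses our own naming, and the 150 IOPS per magnetic disk figure is simply the assumption used in the text; it estimates how many spindles are needed to absorb the read cache misses:

import math

def spindles_for_misses(read_iops: int, hit_rate: float, disk_iops: int = 150) -> int:
    """Magnetic disks needed to service the read cache misses in a hybrid configuration."""
    misses = read_iops * (1 - hit_rate)
    return math.ceil(misses / disk_iops)

print(spindles_for_misses(2000, 0.90))  # 2 -> a stripe width of 2 helps in this example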

How can you tell whether you have read cache misses? In early versions of vSAN, this information was not available within the vSphere Web Client. However, the RVC vSAN observer tool provides a lot of detailed information, including read cache hit rate, as shown in Figure 5.17. In this case, the read cache hit rate is 100%, meaning that there is no point in increasing the stripe width, because all I/O is being served by flash.

Figure 5.17 - 100%, no evictions

In vSAN 6.2, a new performance service was introduced that placed information regarding the performance of the vSAN cluster in the vSphere web client. Now you no longer need to launch tools such as RVC to examine this information. In Figure 5.18, there is a 0% read cache hit rate. Either this is a very idle system, or it is an all-flash vSAN that does not use read cache. In fact, it is both.

Figure 5.18 - Performance service

To summarize, if read cache misses are occurring in your vSAN, increasing the stripe width may improve the performance for I/O that needs to be serviced from magnetic disk, if a single spindle is not enough to handle the requests. RVC, vSAN observer, and the new performance service are discussed in detail in Chapter 10.

Performance: Cache Tier Destaging

The final reason for increasing the stripe width relates to destaging blocks from cache to magnetic disk. There is an important consideration regarding cache destaging: What sorts of workloads are running on vSAN?

By way of an example, if you are doing virtual desktop infrastructure (VDI) deployment and you have hundreds of virtual machines, then it is likely that vSAN needs to destage to all parts of the capacity tier. Changing the stripe width on the policy may not help since it is likely that all of the magnetic disks are already in use through destaging.

In a situation where there are 99 virtual machines doing mostly reads and almost no writes, there will be very little destaging. So increasing the stripe width will have little benefit. However, if there are few virtual machines doing a very large number of writes and thus incurring a lot of destaging, then a performance improvement might be expected during destage if those few virtual machines are configured with a higher stripe width.

How can you tell whether your cache tier has lots of blocks to be destaged? In earlier versions of vSAN, administrators needed to use the vSAN observer tool, available with vCenter Server 5.5 U1 and later. The vSAN observer tool is part of RVC, and the view shown in Figure 5.19 is taken from the vSAN disks (deep-dive): Device-level stats option. Once again, Chapter 10 covers both RVC and the vSAN observer in detail.

Figure 5.19 - vSAN observer write buffer fill info

With the availability of vSAN 6.2 and the new performance service, this information is now readily available in the vSphere Web Client. Figure 5.20, taken once again from the disk group view, shows the write buffer free percentage.

Figure 5.20 - Write buffer free percentage via the performance service

RAID-0 Used When No Striping Is Specified in the Policy

Those who have been looking at the Web Client regularly, where you can see the placement of components, may have noticed that vSAN appears to create a RAID-0 for your VMDK even when you did not explicitly ask it to. Or perhaps you requested a stripe width of two in your policy and then observed what appears to be a stripe width of three being created. vSAN will split objects as it sees fit when there are space constraints. This is not striping per se, since components can end up on the same physical capacity device. Instead, we can refer to it as chunking. vSAN will use this chunking method on certain occasions. The first of these is when a virtual disk (VMDK) is larger than any single chunk of free space, that is, larger than any capacity device. Essentially, vSAN hides the fact that there are small capacity devices in the hosts; administrators can still create very large VMDKs. Therefore, it is not uncommon to see large VMDKs split into multiple components, even when no stripe width is specified in the VM storage policy.

There is another occasion where this chunking may occur. By default, an object will also be split if its size is greater than 255 GB (the maximum component size). An object might appear to be made up of multiple 255 GB chunks even though striping may not have been a policy requirement. It can even be split before it reaches 255 GB when free disk space makes vSAN think that there is a benefit in doing so. Note that just because there is a standard split at 255 GB, it doesn't mean all new chunks will go onto different magnetic disks. In fact, since this is not striping per se, multiple chunks may or may not be placed on the same physical capacity device, depending on overall balance and free capacity.

If the capacity devices are smaller than 255 GB, then this chunking threshold can be modified via the advanced parameter vSAN.ClomMaxComponentSizeGB. (Consult the official vSAN documentation or VMware support before changing this ESXi advanced setting.)
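As a rough illustration of the chunking behavior, the sketch below (our own naming, and deliberately ignoring replicas, witnesses, and the free-space-driven splits mentioned above) estimates into how many chunks an object is divided:

import math

def estimated_chunks(vmdk_size_gb: float, max_component_gb: int = 255) -> int:
    """Minimum number of chunks for an object, given the 255 GB maximum component size."""
    return max(1, math.ceil(vmdk_size_gb / max_component_gb))

print(estimated_chunks(150))  # 1
print(estimated_chunks(600))  # 3 chunks, which may appear as a RAID-0 in the UI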

Let’s look at what some of our tests have shown as that may make things a bit clearer for you.

Test 1

On a hybrid configuration, we created a 150 GB VM on a vSAN datastore (which had 136 GB magnetic disks as its capacity devices) with a policy of number of failures to tolerate (FTT) = 1. We got a simple RAID-1 for our VMDK with two components, each replica having just one component (so no RAID-0). Now this is because the VM is deployed on the vSAN datastore as thin by default, so even though we created a 150 GB VM, because it is thin it can sit on a single 136 GB magnetic disk, as demonstrated in Figure 5.21.

Figure 5.21 - Simple physical disk placement: Single witness component

Test 2

On the same cluster, we created a 150 GB VM, with a policy of FTT = 1 and object space reservation (OSR) = 100%. Now we get another RAID-1 of our VMDK, but each replica is made up of a RAID-0 with two components. OSR essentially specifies a thickness for the VM. Because we are guaranteeing space, our VM needs to span at least two magnetic disks, and therefore a stripe is being used.

Test 3

We created a 300 GB VM, with a policy of FTT = 1, OSR = 100%, and stripe width (SW) = 2. We got another RAID-1 of our VMDK as before, but now each replica is made up of a RAID-0 with three components. Here, even with a SW = 2 setting, our VMDK requirement is still too large to span two magnetic disks. A third capacity device is required in this case, as shown in Figure 5.22.

We can conclude that multiple components in a RAID-0 configuration are used for VMDKs that are larger than a single magnetic disk, even if a stripe width is not specified in the policy.

Figure 5.22 - More complex deployment: Multiple witness components needed

Stripe Width Maximum

In vSAN, the maximum stripe width that can be defined is 12. This can be striping across magnetic disks in the same host, or across hosts. Remember that when you specify a stripe width and failures to tolerate (FTT) value, and the failure tolerance method is left at the default of performance for RAID-1, there have to be at least stripe width (SW) × (FTT + 1) capacity devices before vSAN is able to satisfy the policy requirement, ignoring for the moment that additional devices will be needed to host the witness component(s). This means that the larger the FTT and SW values, the more complex the placement of the object and its associated components will become. The number of disk stripes per object setting in the VM storage policy means "stripe across at least this number of magnetic disks per mirror." vSAN may, when it sees fit, use additional stripes.
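The SW × (FTT + 1) rule is easy to sanity check with a quick calculation. This sketch (the function name is ours) ignores the extra devices needed for witnesses, as the text does:

def min_capacity_devices(stripe_width: int, ftt: int) -> int:
    """Minimum capacity devices for a RAID-1 object: stripe width x (FTT + 1)."""
    return stripe_width * (ftt + 1)

print(min_capacity_devices(2, 1))   # 4 capacity devices
print(min_capacity_devices(12, 3))  # 48 capacity devices at the policy maximums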

Figure 5.23 shows a screenshot taken from a VM storage policy screen, with the information icon selected for further details. The reference to HDD in the help screen stands for hard disk drive, what we have been calling magnetic disks in this book. Of course, HDDs are only used in hybrid configurations, so this screenshot is taken from vSAN 5.5, before support for all-flash was introduced. However, the guidance is the same for both models. Note that RAID-5 and RAID-6 objects can also implement a RAID-0 stripe. Each chunk of the RAID-5 or RAID-6 object may be striped across multiple capacity tier devices, if a stripe width is specified in the VM storage policy and failure tolerance method is set to capacity.

Figure 5.23 - Number of disk stripes per object

Stripe Width Configuration Error

You may ask yourself what happens if a vSphere administrator requests the vSAN cluster to meet a stripe width policy setting that is not available or achievable. Figure 5.24, taken from the first release of vSAN, shows the resulting error. Basically, the deployment of the VM fails, stating that there were not sufficient capacity devices found to meet the requirements of the defined policy. A similar error is displayed in all versions of vSAN.

Figure 5.24 - Task failed: Only found 0 such disks

Stripe Width Chunk Size

A question that often arises after the stripe width discussion is whether there is a specific segment size. In other words, when the stripe width is defined using the VM storage policies, in which increments do the components grow? vSAN uses a stripe segment size of 1 MB. As depicted in Figure 5.25, "1 MB stripe segment 1" will go to ESXi-02, and when the next 1 MB is written, "1 MB stripe segment 2" will go to ESXi-03, and so on.

Figure 5.25 - vSAN I/O flow: 1 MB increment striping
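A minimal sketch of this 1 MB round-robin placement is shown below. It is illustrative only; the mapping function and the host assignments in the comments are our assumptions, and the real placement is decided by vSAN:

STRIPE_SEGMENT = 1 * 1024 * 1024  # vSAN stripes in 1 MB segments

def stripe_component_for_offset(offset_bytes: int, stripe_width: int) -> int:
    """Which stripe component (0-based) a logical offset would land on."""
    return (offset_bytes // STRIPE_SEGMENT) % stripe_width

print(stripe_component_for_offset(0, 2))                # 0 -> first segment (e.g., ESXi-02)
print(stripe_component_for_offset(1 * 1024 * 1024, 2))  # 1 -> second segment (e.g., ESXi-03)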

Stripe Width Best Practice

After reading this section, you should more clearly understand that increasing the stripe width could potentially complicate placement. vSAN has a lot of logic built in to handle smart placement of objects. We recommend not increasing the stripe width unless you have identified a pressing performance issue such as read cache misses or during destaging.

Remember that all I/O should go to the flash layer first. All writes certainly go to flash first, and in the case of hybrid configurations, are later destaged to magnetic disks. For reads, the operation is first attempted on the flash layer. If a read cache miss occurs, the block is read from magnetic disk in hybrid configurations and from flash capacity devices in all-flash configurations. Therefore, if all of your reads are being satisfied by the cache tier, there is no point in increasing the stripe width as it does not give any benefits. Doing the math correctly beforehand is much more effective, leading to proper sizing of the flash-based cache tier, rather than trying to increase the stripe width after the VM has been deployed!

Flash Read Cache Reservation Policy Setting

This policy setting, applicable to hybrid configurations only, is the amount of flash capacity reserved on the cache tier as read cache for the storage object. It is specified as a percentage of the logical size of the storage object, up to four decimal places. This fine-grained unit size is needed so that administrators can express sub-1% units. Take the example of a 1 TB VMDK. If we limited the read cache reservation to 1% increments, this would mean cache reservations in increments of 10 GB, which in most cases is far too much for a single VM.
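The following sketch (names are ours) shows why the four-decimal-place granularity matters for a large object such as the 1 TB VMDK in the example:

def read_cache_reservation_gb(object_size_gb: float, reservation_pct: float) -> float:
    """Reserved flash read cache, as a percentage of the object's logical size."""
    return object_size_gb * (reservation_pct / 100.0)

print(read_cache_reservation_gb(1024, 1.0))     # 10.24 GB - usually far too much for one VM
print(read_cache_reservation_gb(1024, 0.0488))  # ~0.5 GB - only possible with sub-1% values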

Note that you do not have to set a reservation to get cache. The reservation should be left at 0, meaning the read cache is shared fairly amongst all virtual machines, unless you are trying to solve a real performance problem for a particular VM or group of VMs.

Note that there is no proportional share mechanism for this resource, unlike the share mechanisms vSphere administrators will be familiar with from other vSphere features.

Object Space Reservation Policy Setting

All objects deployed on vSAN are thin provisioned by default. The object space reservation (OSR) capability defines the percentage of the logical size of the storage object that is reserved during initialization. To be clear, the OSR is the amount of space to reserve, specified as a percentage of the total object address space. You can use this property to specify what is akin to a thick-provisioned storage object, although it is not quite the same.

If OSR is set to 100%, all the storage capacity requirements of the VM are reserved up front. Note that although the object itself will still be thin provisioned, the space it can claim is reserved explicitly for this object so that the vSAN datastore cannot run out of disk space for this VM.

Note that if deduplication and compression are enabled as data reduction features in vSAN 6.2, OSR may only be set to 0% or 100%. Values between 1% and 99% cannot be used.
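As a rough sizing sketch, the raw capacity reserved up front for a RAID-1 object can be estimated as follows. This assumes the reservation applies to every replica, which is our reading of the behavior, and the function name is ours:

def reserved_raw_capacity_gb(object_size_gb: float, osr_pct: int, ftt: int) -> float:
    """Raw vSAN capacity reserved up front for a RAID-1 object with the given OSR and FTT."""
    return object_size_gb * (osr_pct / 100.0) * (ftt + 1)

print(reserved_raw_capacity_gb(100, 100, 1))  # 200 GB reserved across the two replicas
print(reserved_raw_capacity_gb(100, 0, 1))    # 0 GB - thin; space is claimed only as data is written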

You may remember that certain storage objects (notably VM namespace and VM Swap) do not adhere to certain policy settings. One of these is object space reservation. Let’s take a look at these two special objects.

VM Home Namespace Revisited

The VM namespace on vSAN is a 255 GB thin object. The namespace is a per-VM object. As you can imagine, if we allocated policy settings to the VM home namespace, such as proportional capacity and flash read cache reservation, much of the magnetic disk and flash resources could be wasted. To that end, the VM home namespace has its own special policy, as follows:

  • Number of disk stripes per object: 1
  • Number of failures to tolerate: <as-per-policy>
  • Flash read cache reservation: 0%
  • Force provisioning: Off
  • Object space reservation: 0% (thin)
  • Failure tolerance method: <as-per-policy>
  • Checksum disabled: <as-per-policy>
  • IOPS limit for object: <as-per-policy>

The number of failures to tolerate policy setting is the only setting inherited from the VM storage policy. So, if a customer creates a VM storage policy with an FTT setting, the VM home namespace will be deployed with the ability to tolerate the number of failures specified in that policy.

VM Swap Revisited

The VM swap follows much the same conventions as the VM home namespace. It has the same default policy as the VM home namespace, which is 1 disk stripe, and 0% read cache reservation. However, it does have a 100% object space reservation.

Note that the VM swap has an FTT setting of 1. It does not inherit the FTT from the VM storage policy.

  • Number of disk stripes per object: 1
  • Number of failures to tolerate: 1
  • Flash read cache reservation: 0%
  • Force provisioning: On
  • Object space reservation: 100%
  • Failure tolerance method: Performance
  • Checksum disabled: No (i.e., checksum is enabled)
  • IOPS limit for object: Disabled

There is one additional point worth highlighting, and that is that VM swap has force provisioning set to 1. This means that if some of the policy requirements cannot be met, such as number of failures to tolerate, the VM swap object is still created.

The VM swap object is also not limited to 255 GB in the way the VM home namespace object is.

The default values for VM swap are not overridden by VM storage policy entries either. However, as highlighted previously, the advanced parameter SwapThickProvisionDisabled can be used to make the VM swap object deployed as thin rather than fully reserved (thick).

The swap object is always deployed as a RAID-1 configuration. It is never deployed with RAID-5 or RAID-6, even if failure tolerance method is set to capacity in the policy associated with the VM. Having said that, this behavior could be overridden by modifying the default policy for swap via esxcli commands, although we don't expect many readers will want to do this. If you really have a desire to do something like this for the VM swap, refer to the official VMware documentation.

How to Examine the VM Swap Storage Object

As we have seen, the VM swap is one of the objects that make up the set of VM objects, along with the VM home namespace, VMDKs, snapshot delta and snapshot memory. Unfortunately in versions of vSAN prior to 6.2, you will not see the VM swap file represented in the list of VM objects in the vSphere web client. This leads inevitably to the question regarding how you go about checking and verifying the policy and resource consumption of a VM’s swap object.

This is in fact quite tricky, because even if you try to use the RVC command, vsan.vm_object_info, you only get information about VM home namespace, VMDKs, deltas and snapshot memory. Chapter 10 covers RVC commands in extensive detail. Again, there is no information displayed for the VM swap. To get information about the VM swap, you first of all have to retrieve the UUID information from the VM’s swap descriptor file. One way of doing this is to SSH on to an ESXi host that is participating in the vSAN cluster and use the cat command in the ESXi shell to display the contents of the VM swap descriptor file. You are looking for the objectID entry. Here is an example:

# cat Auto-Perf-Tool-0b43d77a.vswp
# Object DescriptorFile
version = "1"

objectID = "vsan://4f386056-1874-0983-b93e-ecf4bbd58600"
object.class = "vmswap"

Once you have the descriptor, this can then be used in RVC to display information about the actual swap object. The command to do this is vsan.object_info. This RVC command takes two arguments. The first argument shown in Example 5.1 is the cluster, and the second argument is the UUID:

Example 5.1 Using vsan.object_info to Display Information about the Swap Object

/vcsa-05/Datacenter/computers> ls 
0 vSAN (cluster): cpu 134 GHz, memory 306 GB 
/vcsa-05/Datacenter/computers> vsan.object_info 0 4f386056-1874-0983-b93e-ecf4bbd58600

DOM Object: 4f386056-1874-0983-b93e-ecf4bbd58600 
(v4, owner: esxi-b-pref.rainpole.com, policy: hostFailuresToTolerate = 1, forceProvisioning = 1, proportionalCapacity = 100)

RAID_1

Component: 4f386056-4f12-6a83-25fd-ecf4bbd58600 (state: ACTIVE (5), host: esxi-b-pref.rainpole.com, md: naa.500a07510f86d686, ssd: t10.ATA__Micron_P420m2DMTFDGAR1T4MAX__, votes: 1, usage: 8.0 GB)

Component: 4f386056-f50d-6b83-7493-ecf4bbd58600 (state: ACTIVE (5), host: esxi-a-pref.rainpole.com, md: naa.500a07510f86d693, ssd: t10.ATA__Micron_P420m2DMTFDGAR1T4MAX__, votes: 1, usage: 8.0 GB)

Witness: 4f386056-4fda-6b83-b6e1-ecf4bbd58600 (state: ACTIVE (5), host: esxi-c-scnd.rainpole.com, md: naa.500a07510f86d6cf, ssd: t10.ATA__Micron_P420m2DMTFDGAR1T4MAX__, votes: 1, usage: 0.0 GB)

Extended attributes:

Address space: 8589934592B (8.00 GB)

Object class: vmswap

Object path: /vmfs/volumes/vsan:5204d7ce46d435a6-81ba22b8f602a826/07386056-e16e-d37d-6536-ecf4bbd58600/Auto-Perf-Tool-0b43d77a.vswp

Object capabilities: NONE

Now that we have the VM swap object info, we can see a number of things:

  • hostFailuresToTolerate for VM Swap is set to 1. This gives us a RAID-1 (mirror) configuration for VM Swap.
  • forceProvisioning for VM Swap is set to 1. This means that even if the current policy cannot be met we should always provision the VM swap object.
  • proportionalCapacity for VM swap is set to 100 (i.e., 100%). This means that the space needed for swap is indeed fully reserved.

What can be deduced from this is that, from a space utilization standpoint, the VM swap of a VM deployed on vSAN will consume (configured memory − memory reservation) × (FTT + 1) of space on disk, where FTT for swap is always 1. In most environments, this basically means that on disk you will consume twice the provisioned VM memory by default, because the majority of customers do not set reservations.
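Expressed as a quick calculation (the names are ours; remember that VM swap always uses FTT = 1 regardless of the policy):

def vm_swap_on_disk_gb(configured_mem_gb: float, mem_reservation_gb: float = 0.0) -> float:
    """Raw capacity consumed by the VM swap object: (memory - reservation) x (FTT + 1), FTT = 1."""
    swap_ftt = 1
    return (configured_mem_gb - mem_reservation_gb) * (swap_ftt + 1)

print(vm_swap_on_disk_gb(8))     # 16 GB on disk for an 8 GB VM with no memory reservation
print(vm_swap_on_disk_gb(8, 4))  # 8 GB when half of the memory is reserved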

Of course, the plan is to eventually make this information more easily accessible, but for now, this method should allow you to gather this information should you need it. The VM swap (.vswp) is an important consideration when sizing your vSAN storage, so make sure to account for it. Chapter 9 provides a formula to do so. The new capacity views in vSAN 6.2 can also provide insight into how much space is being consumed by the various objects that make up a virtual machine deployed on vSAN, including the VM swap objects.

Delta Disk / Snapshot Caveat

For the most part, a delta VMDK (or snapshot, as it’s often referred to) will always inherit the policy associated with the base disk. In vSAN, a vSphere administrator can also specify a VM storage policy for a linked clone. In the case of linked clones, the policy is applied just to the linked clone (top-level delta disk), not the base disk. This is not visible through the UI, however. Both VMware Horizon View and VMware vCloud Director use this capability through the vSphere API.

Now that you know that you can reserve space and disks are thin provisioned, you are probably wondering where you can find out how much space a VM consumes and how much is reserved.

Verifying How Much Space Is Actually Consumed

When you select the vSAN datastore in the UI, then the monitor tab, and then click storage reports, you can get a nice view of how much space is being consumed by each of the VMs, as illustrated in Figure 5.26. Note that the default view does not automatically show these columns. You have to add these columns to the view via the user interface.

Figure 5.26 - How much space is being consumed on the vSAN datastore

There are a few interesting pieces of output here with regard to object space reservation (OSR). As stated, all VMs deployed to vSAN are thin in nature. In the example in Figure 5.26, we deployed a VM called 150 GB VM that did not use OSR. You can see that the size of the virtual disk for this VM is 0 bytes.

In a second example, we deployed a VM called 150 GB-OSR-VM, which has 100% OSR. In this example, you can see that the virtual disk size is 150 GB since vSAN has reserved (but not consumed) 100% of the space required by this object.

Force Provisioning Policy Setting

We have already mentioned this capability various times: force provisioning. If this parameter is set to a non-zero value, the object will be provisioned even if the policy specified in the VM storage policy is not satisfied by vSAN. However, if there is not enough space in the cluster to satisfy the reservation requirements of at least one replica, the provisioning will fail even if force provisioning is turned on!

Now that we know what the various capabilities do, let's take a look at how vSAN leverages them in failure scenarios.

Witnesses and Replicas: Failure Scenarios

Failure scenarios are often a hot topic of discussion when it comes to vSAN. What should one configure, and how do we expect vSAN to respond? This section runs through some simple scenarios to demonstrate what you can expect of vSAN in certain situations.

The following examples use a four-host vSAN cluster and a RAID-1 mirroring configuration. We will examine various number of failures to tolerate and stripe width settings and discuss the behavior in the event of a host failure. You should understand that the examples shown here are for illustrative purposes only. They simply explain some of the decisions that vSAN might make when it comes to object placement. vSAN may choose any configuration as long as it satisfies the requirements (i.e., number of failures to tolerate and stripe width). For example, with higher numbers of failures to tolerate and larger stripe widths, vSAN has placement choices that use more or fewer witnesses and more or fewer hosts than shown in the examples that follow.

Example 1: Number of Failures to Tolerate = 1 and Stripe Width = 1

In this first example, the stripe width is set to 1. Therefore, there is no striping per se, simply a single component per replica. However, the requirement is that we must tolerate a single disk or host failure, so we must instantiate a replica (a RAID-1 mirror of the component). A witness is also required in this configuration to avoid a split-brain situation. A split-brain occurs when the hosts holding the two replicas continue to operate but can no longer communicate with each other. Whichever of the hosts can communicate with the witness is the host that has the valid copy of the data in that scenario. Data placement in these configurations may look like Figure 5.27.

Figure 5.27 - Number of failures to tolerate = 1

In Figure 5.27, the data remains accessible in the event of a host or disk failure. If ESXi-04 has a failure, ESXi-02 and ESXi-03 continue to provide access to the data as a quorum continues to exist. However, if ESXi-03 and ESXi-04 both suffer failures, there is no longer a quorum, so data becomes inaccessible. Note that in this scenario the VM is running, from a compute perspective, on ESXi-01, while the components of the objects are stored on ESXi-02/03/04.

Example 2: Number of Failures to Tolerate = 1 and Stripe Width = 2

Turning to another example, this time the stripe width is increased to 2. This means that each component must be striped across two spindles at minimum; however, vSAN may decide to stripe across magnetic disks on the same host or across magnetic disks on different hosts. Figure 5.28 shows one possible distribution of storage objects.

Figure 5.28 - Number of failures to tolerate = 1 and stripe width = 2

As you can see, vSAN in this example has chosen to keep the components for the first stripe (RAID-0) on ESXi-01 but has placed the components for the second stripe across ESXi-02 and ESXi-03. Once again, with number of failures to tolerate set to 1, we mirror using RAID-1. In this configuration, a witness is also used. Why might a witness be required in this example? Consider the case where ESXi-01 has a failure. This impacts both the components on ESXi-01. Now we have two components failed and two components still working on ESXi-02 and ESXi-03. In this case, we still require a witness to attain quorum.

Note that if one component in each of the RAID-0 configurations fails, the data becomes inaccessible because both sides of the RAID-1 are impacted. Therefore, a disk failure in ESXi-01 and a disk failure in ESXi-02 can make the VM inaccessible until the disk faults are rectified. Because a witness contains no data, it cannot help in these situations. Note, however, that this is more than one failure, and our policy is set to tolerate only one failure.

Example 3: Number of Failures to Tolerate = 2 and Stripe Width = 2

In this last example, the number of failures to tolerate is set to 2, meaning another replica is required. And because each replica is made up of two striped components, an additional two components must be deployed on the vSAN datastore. Again, a possible deployment might look like Figure 5.29.

Figure 5.29 - Number of failures to tolerate = 2 and stripe width = 2

The components per stripe width have been explained previously and should be clear. Similarly, the fact that there is now a third RAID-0 replica configured should also be self-explanatory at this stage. But what about the fact that there are now three witnesses? Well, consider the situation where both ESXi-02 and ESXi-05 suffer a failure. In that case, four data components are lost, leaving only the two components of the remaining replica. Together with the three witnesses, the surviving components number five out of nine, which is still a majority. This is why there are three witnesses in this configuration: losing two hosts still allows the data to remain accessible! Note that this behavior is what one would expect with the initial release of vSAN. vSAN 6.0 introduced the concept of votes, where components may carry more than one vote. With this voting mechanism, it is possible that fewer witness components are needed. However, the scenario described above still holds true, with witness components replaced by component votes.
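The quorum rule used in this example can be captured in one line. The sketch below (names are ours) simply checks whether more than 50% of the votes survive, using the nine components of Example 3:

def object_accessible(surviving_votes: int, total_votes: int) -> bool:
    """An object remains accessible only while more than 50% of its votes are available."""
    return surviving_votes * 2 > total_votes

# Example 3: 6 data components + 3 witnesses = 9 votes in total.
print(object_accessible(5, 9))  # True  - two hosts lost; one replica plus the witnesses remain
print(object_accessible(4, 9))  # False - a further failure drops the object below quorum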

Data Integrity through Checksum

A new enhancement in vSAN 6.2 is the introduction of software checksum. While certain vSAN ready nodes provided support for checksum via specific hardware devices in previous versions of vSAN, 6.2 is the first release to provide it in software. This requires on-disk format V3, the new on-disk format introduced with vSAN 6.2.

This feature is policy driven on a per VM/VMDK basis, but is enabled by default. Administrators will have to create a specific policy that disables checksum (set Disable object checksum to Yes) if they do not want their objects to leverage the feature. Such a policy setting is shown in Figure 5.30.

Figure 5.30 - Checksum policy setting

Checksum is available on both hybrid and all-flash configurations. The checksum data goes all the way through the vSAN I/O stack. For each 4 KB block of data, a 5-byte checksum is created and stored separately from the data. This occurs before any data are written to persistent storage; in other words, the checksum is calculated before writing the block to the caching layer.

If a checksum error is discovered in the I/O path, the checksum error is automatically repaired. A message stating that a checksum error has been detected and corrected is logged in the VMkernel log.

Checksum also includes a data scrubber mechanism, which validates the data and checksums once every 7 days. This protects your vSAN environment against data corruption as a result of, for instance, bit rot. The checksum mechanism also leverages hardware-accelerated instructions (Intel CRC32C), which makes the process extremely fast.
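To illustrate the idea (not the actual vSAN implementation), the sketch below computes a per-4 KB-block checksum and re-verifies it, the way a scrubber would. Python's standard library has no CRC32C, so zlib.crc32 stands in for the hardware-accelerated CRC32C used by vSAN, and the function names are ours:

import zlib

BLOCK = 4 * 1024  # checksums are computed per 4 KB block before data reaches persistent storage

def checksum_blocks(data: bytes) -> list:
    """Per-block checksums; zlib.crc32 stands in for hardware-accelerated CRC32C."""
    return [zlib.crc32(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]

def verify(data: bytes, stored: list) -> bool:
    """Scrubber-style check: recompute the checksums and compare with the stored values."""
    return checksum_blocks(data) == stored

payload = b"x" * (8 * 1024)
sums = checksum_blocks(payload)
print(verify(payload, sums))              # True
print(verify(payload[:-1] + b"y", sums))  # False -> in vSAN this would trigger an automatic repair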

Recovery from Failure

When a failure has been detected, vSAN will determine which objects had components on the failed device. These failing components will then be marked as either degraded or absent, but the point is that I/O flow is renewed instantaneously to the remaining components in the object. Depending on the type of failure, vSAN will take immediate action or wait for some period of time (60 minutes). The distinction depends on whether vSAN knows what has happened to the device. For instance, when a host fails, vSAN typically does not know why this happened, or even what has happened exactly. Is it a host failure or a network failure? Is it transient or permanent? It may be something as simple as a reboot of the host in question. Should this occur, the affected components are said to be in an "absent" state and the repair delay timer starts counting down. If a device such as a disk or SSD reports a permanent error, it is marked as degraded and it is re-protected immediately by vSAN (replacement components are built and synchronized).
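The decision just described can be summarized in a few lines of pseudologic. This is a simplification with our own naming, not vSAN code:

REPAIR_DELAY_MINUTES = 60  # default repair delay before absent components are rebuilt

def handle_component_failure(permanent_error_reported: bool) -> str:
    """Degraded components are rebuilt immediately; absent components wait for the timer."""
    if permanent_error_reported:
        return "mark DEGRADED: rebuild replacement components immediately"
    return "mark ABSENT: wait " + str(REPAIR_DELAY_MINUTES) + " minutes before rebuilding"

print(handle_component_failure(True))   # e.g., a disk or SSD reporting a permanent error
print(handle_component_failure(False))  # e.g., a host reboot or a network outage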

In this case, let’s say that we have suffered a permanent host failure in this scenario.

As soon as vSAN realizes the component is absent, a timer of 60 minutes will start. If the component comes back within those 60 minutes, vSAN will synchronize the replicas. If the component doesn’t come back, vSAN will create a new replica, as demonstrated in Figure 5.31.

Figure 5.31 - Host failure: 60-minute delay

Note that you can decrease this time-out value by changing the advanced setting called vSAN.ClomRepairDelay on each of your ESXi hosts in the Advanced Settings section. Caution should be exercised, however, because if it is set to a value that is too low and you do a simple maintenance task such as rebooting a host, you may find that vSAN starts rebuilding new components before the host has completed its reboot cycle. This adds unnecessary overhead to vSAN and could have an impact on the overall performance of the cluster.

If you want to change this advanced setting, we highly recommend ensuring consistency across all ESXi hosts in the cluster by scripting the required change and monitoring on a regular basis for a consistent implementation, to avoid inconsistent behavior. The vSAN health check will also verify that the value is set consistently across all hosts in the cluster. (Consult the official vSAN documentation or VMware support before changing this ESXi advanced setting.)

As mentioned, in some scenarios vSAN responds to a failure immediately. This depends on the type of failure and a good example is a magnetic disk or flash device failure. In many cases, the controller or device itself will be able to indicate what has happened and will essentially tell vSAN that it is unlikely that the device will return within a reasonable amount of time. vSAN will then respond by marking all impacted components (VMDK in Figure 5.32) as “degraded,” and vSAN immediately creates a new mirror copy.

Figure 5.32 - Disk failure: Instant mirror copy

Of course, before it will create this mirror vSAN will validate whether sufficient resources exist to store this new copy.

If the absent components recover after the creation of the new replica has already started, vSAN will continue to complete the creation of that replica. Once this is complete and the new replica is healthy, the old components are discarded. All of this falls under the concept of reconfiguration.

Reconfiguration can take place on vSAN for a number of reasons. First, a user might choose to change an object’s policy and the current configuration might not conform to the new policy, so a new configuration must be computed and applied to the object. Second, a disk or node in the cluster might fail. If an object loses one of the components in its configuration, it may no longer comply with its policy.

Reconfiguration is probably the most resource-intensive task because a lot of data will need to be transferred in most scenarios. To ensure that regular VM I/O is not impacted by reconfiguration tasks, vSAN has the ability to throttle the reconfiguration task to the extent that it does not impact the performance of VMs.

Problematic Device Handling

vSAN 6.1 introduced a new feature called problematic device handling. The significant driving factor behind this feature was the need to deal with situations where either an SSD or a magnetic disk drive is misbehaving. In particular, we needed a way to handle a drive that is constantly reporting transient errors but not actually failing. In situations like this, the drive may introduce poor performance to the cluster overall. The objective of this feature is to have a mechanism that monitors for these misbehaving drives and isolates them so that they do not impact the overall cluster.

The feature monitors vSAN, looking for sustained periods of high latency on the SSD or the magnetic disk drives. If a sustained period of high latency is observed, vSAN will unmount the disk group on which the disk resides. The components in the disk group will be marked as permanent error, and the components will be rebuilt elsewhere in the cluster. What this means is that the performance of the virtual machines remains consistent and will not be impacted by this one misbehaving drive.

An enhancement to this feature was added in vSAN 6.2. In this release, there are regular attempts over a period of time to remount caching and capacity tier disks marked under permanent error. This will only succeed if the condition that caused the initial failure is no longer present. If successful, the physical disk does not need to be replaced, although the components must be resynced. If unsuccessful, the disk continues to be marked as permanent error.

What About Stretching vSAN?

With the release of vSAN 6.1, VMware introduced support for vSAN stretched clusters. Using RAID-1 constructs, VMs can be deployed on a vSAN stretched cluster with one replica on one site and the other replica on the other site. If a site fails, a full copy of the data still exists, and vSphere HA can handle the restarting of the virtual machines on the surviving site.

Let’s take a look at what this scenario would look like for vSAN in a stretched environment when FTT = 1 has been defined. The scenario shown in Figure 5.33 is what you want vSAN to do when it comes to placement of components.

Figure 5.33 - vSAN stretched cluster

A number of enhancements were essential to get this stretched cluster functionality.

Object placement: Through the use of fault domains, introduced with vSAN 6.0, the ability to deploy one copy of the data on site A and another copy of the data on site B can now be achieved.

Witness placement: We need to have a third site that functions as a tiebreaker when there is a partition/isolation event. To coincide with the vSAN 6.1 release, a witness appliance was also created, which is essentially a stripped-down ESXi host running in a VM.

Support: VMware has done substantial testing to qualify bandwidth and latency requirements for vSAN stretched clusters. There was also considerable testing done to verify that vSAN stretched clusters could work over L2 and/or L3 networks.

Summary

vSAN has a unique architecture that is future-proof and at the same time extensible. It is designed to handle extreme I/O load and cope with different failure scenarios. Key, however, is policy-based management. Your decision-making during the creation of policies will determine how flexible, performant, and resilient your workloads and vSAN datastore will be.
