Chapter 10 - Troubleshooting, Monitoring, and Performance

This chapter discusses the extensive suite of tools available to you to monitor and troubleshoot a vSAN environment, and also how to use these tools to investigate and quickly remedy issues on vSAN.

vSAN can leverage already-existing vSphere tools as well as some built-in tools specific to vSAN. This chapter covers the following tools:

  • Health check - A built-in feature that runs a series of tests on your vSAN cluster and reports any anomalies.
  • ESXCLI - The command-line interface (CLI) of the ESXi host.
  • Ruby vSphere console (RVC) - A generic tool for managing vCenter Server instances, but that has also been extended to support vSAN.
  • Performance service - A new feature available in vSAN 6.2 that provides detailed performance metrics on all aspects of vSAN.
  • vSAN observer - A web-based performance utility that leverages RVC.
  • ESXTOP - ESXi host performance monitoring tool.

It should also be noted that on versions prior to vSAN 6.2 and the introduction of the performance service, traditional monitoring utilities, such as the vSphere web client, can still be used for vSAN, including performance views of individual VMs and their respective VMDKs.

Health Check

We already introduced the health check in Chapter 7, “Management and Maintenance.” In this chapter we will delve into it in far greater detail. This utility should be an administrator’s starting point for any troubleshooting activity and most monitoring activities. The health check examines all aspects of a vSAN configuration, and reports back on any configuration issues or failures.

In the initial release of health check, which was made available for vSAN 6.0, administrators had to download the components from VMware and manually install them; first on vCenter Server, and then on the ESXi hosts that formed the vSAN cluster. However, since the release of vSphere 6.0 U1, which includes vSAN 6.1, the components required for the health check are pre-installed and always on. Administrators do not even need to enable the health check.

The vSAN health check is supported on both the Windows version of vCenter Server as well as the Linux/Appliance version.

All of the health checks that are available via the vSphere web client UI are also available via RVC, the Ruby vSphere console. We will discuss RVC in greater detail later on in this chapter.

Ask VMware

Another really useful aspect of the health check is the fact that every test has an “Ask VMware” link. For those of you not familiar with Ask VMware, these links take administrators directly to a VMware knowledge base article detailing the purpose of the test, reasons why it might fail, and what can be done to remediate the situation. If any of the tests fail, administrators should always click on the Ask VMware button and read the associated KB article. In many cases, steps toward finding a resolution are offered. In other cases, administrators are urged to contact VMware support for further assistance. In Figure 10.1 a complete list of health checks is shown, as well as the location of the Ask VMware button. This can be clicked at any time to learn more about the actual health check.

Figure 10.1 - Ask VMware

Health Check Categories

In vSAN 6.2, there are a total of seven health check test categories by default. These are:

  • vSAN HCL health
  • Cluster health
  • Network health
  • Data health
  • Limits health
  • Physical disk health
  • vSAN performance service

If vSAN is deployed as a stretched cluster, there is an additional set of health checks associated with that configuration. Let’s look at these health checks in more detail.

vSAN HCL Health

The vSAN HCL health (HCL is short for hardware compatibility list) verifies that the storage controller hardware and driver version are on the HCL and are supported for this version of vSAN. If the controller or driver is not on the HCL, or is not supported for this version of vSAN (namely the ESXi version on which vSAN is running), then the health check displays a warning.

Another check verifies that the vSAN HCL DB is up-to-date. In other words, it confirms that the checks you are running are performed against a valid, up-to-date version of the HCL database. In Figure 10.2 we can see a warning being displayed because the HCL database is not up to date.

Figure 10.2 - Warning: vSAN HCL DB up-to-date

Since the HCL is updated frequently, administrators should keep the local copy of the HCL database used by these checks up-to-date. This can be done online (if your vCenter Server has access to VMware.com), or alternatively, if your vCenter Server is not online, you can download an HCL DB file and update it manually.

To update the version of the HCL DB online, simply click on the “Upload from file” or “Get latest version online” as shown in the health check test in Figure 10.2. An alternate method is to navigate to the vSAN cluster object in the vCenter Server inventory, select Manage, select Health and Performance, and then click on the “Get latest version online” button in the HCL database section. The “Last updated” field should now change to “Today,” as per Figure 10.3.

Figure 10.3 - Last updated: Today

Once you have the latest version of the HCL DB, the test associated with the up-to-date HCL DB should now pass. You should also retest the health check and see if this update to the HCL DB has addressed any warnings associated with the storage controller hardware support and/or driver version.

Figure 10.4 - vSAN HCL DB up-to-date passed

Note that additional health checks may be displayed in the HCL health if there are issues communicating with one or more hosts in the cluster from vCenter Server.

Cluster Health

The cluster health has a number of different tests associated with it. First, it checks to make sure that the health check service is installed across all the hosts in the cluster. Second, it verifies that all hosts are running an up-to-date version, and finally that the health check service is working successfully.

There is also a health check to ensure that a number of advanced parameters relating to vSAN are consistently set across all the hosts in the vSAN cluster. This will avoid any issues arising with having a subset of hosts using one value and another set of hosts using another value for a particular advanced setting. Note that this test does not validate if the value is a “good” value or if it is even the “default” value. It simply verifies that all ESXi hosts in the vSAN cluster have the same value.
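
If you want to spot-check one of these settings from the command line, you can query it on each host with ESXCLI. The following is a minimal sketch; the /VSAN/ClomRepairDelay option (the 60-minute repair delay discussed later in this chapter) is used purely as an illustration, and the exact set of advanced parameters examined by the health check varies between vSAN versions.

# Query one vSAN advanced setting on this host; repeat on every host
# in the cluster and compare the values returned.
~ # esxcli system settings advanced list -o /VSAN/ClomRepairDelay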

One final check that is worth highlighting is the CLOMD live-ness test. CLOM, the cluster level object manager, runs a daemon called clomd on each ESXi host in the cluster. CLOM is responsible for creating, repairing, and migrating objects, and is critical for handling various workflows and failure handling in vSAN. If the clomd on any host is not responding for some reason, this test will fail.
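
If you suspect a clomd liveness problem on a particular host, you can also check the daemon directly from the ESXi shell. This is a simple sketch using the standard clomd init script; restarting the daemon should only be done under the guidance of VMware support.

# Check whether the clomd daemon is running on this host
~ # /etc/init.d/clomd status

# Confirm that the clomd process is present
~ # ps | grep clomd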

A number of additional cluster checks were introduced with vSAN 6.2 to handle space efficiency. In this context, space efficiency refers to the deduplication and compression features introduced in vSAN 6.2. These checks basically ensure that all hosts and all disk groups in the cluster are configured correctly with regard to space efficiency and highlight any errors discovered across the cluster. Figure 10.5 shows the complete list of cluster health tests.

Figure 10.5 - Cluster health

Network Health

The network section of the health checks contains the most tests. It looks at all aspects of the network configuration, such as ensuring each host in the vSAN cluster has a VMkernel interface configured for vSAN traffic, that all hosts can successfully ping one another on the vSAN network interface, and that all interfaces have matching multicast configurations.

There are also network checks that ensure all of the ESXi hosts in the vSAN cluster are attached to vCenter Server, that none of the hosts have connectivity issues, and that every host in the vSphere cluster is participating in the vSAN cluster.

This is also the first health check to visit if there is a network partition. This health check will tell you which hosts are in which partitions, or if there is a complete network partition and every host is isolated from every other host (the latter is usually an indication of a multicast issue on the network). Figure 10.6 shows the complete list of network health checks.

Figure 10.6 - Network health

Data Health

Data health contains one test—vSAN object health. This test checks that all of the objects deployed on the vSAN datastore are healthy and highlights any unhealthy objects. There are a number of reasons for an unhealthy object, ranging from an object having reduced availability because components are being rebuilt or are waiting to be rebuilt, to an object being completely inaccessible, possibly due to multiple failures in the cluster. The test provides an object health overview showing the various states of the objects. If vSAN is waiting for the 60 minute CLOMD timer to expire before rebuilding absent components, an administrator can override this and initiate an immediate rebuild (e.g., in the case where a host may have failed and is not coming back soon). It can also show administrators if rebuilding of components is already in progress.

Limits Health

Limits health checks a number of vSAN cluster limits. In the “current cluster situation” test, it checks the component limit, which is currently 9,000 components per host. It also checks the disk space utilization and the read cache reservation, and if any of these exceed their threshold, a warning is thrown.

An additional limit check examines the impact a host failure will have on the limits in the cluster. If any of the limits are exceeded when a host failure is taken into account, another warning is raised. This is in some ways similar to admission control in vSphere HA, in which it assists administrators in monitoring whether or not there are enough resources for vSAN to self-heal in the case of a failure.

Physical Disk Health

Physical disk health contains a significant number of checks that examine multiple aspects of the vSAN storage. The overall disks health check looks for multiple issues on the physical disk drives, including surface issues, controller issues, driver issues, and issues with the I/O stack. This test will also fail if any errors are encountered in this health check category, such as metadata health, congestion, software state, or disk capacity problems. If this test fails, administrators need to look at what other tests have failed to determine the root cause.

The component limit health test verifies that the upper limit on the number of components per disk has not been exceeded. This is quite a large figure, in the region of 50,000, but this health check test will highlight if any disk is reaching saturation from a component count perspective.

Congestion is another interesting test for administrators. This can be an indicator that vSAN might be running at reduced performance. Reasons for congestion are varied, and usually require further investigation. Some examples are an undersized vSAN cluster for the workload running on it, bad hardware/driver/firmware on the storage controller or even software problems.

Disk capacity reports warnings if physical disk usage is starting to become an issue. If physical disk usage is below 80%, the test reports OK (green). If usage is between 80% and 95%, the health check shows a warning (yellow). If usage is above 95%, an alert (red) is thrown by the health check. The 80% threshold is also the point at which automatic rebalancing starts to occur.

One last set of tests to mention in the physical disk health is the memory pool tests. Although unlikely to occur, these check to ensure that the heaps and slabs used by vSAN are not running low. Should the health check tests show warnings, the advice is to contact VMware support to determine the reason why. Running low on memory pools can lead to performance issues or even operational failures.

Figure 10.7 shows a complete list of physical disk health checks. Other health checks might appear if there is difficulty communicating with a particular disk.

Figure 10.7 - Physical disk health

vSAN Stretched Cluster

vSAN 6.1 introduced a set of health checks for the vSAN stretched cluster use case. These look at all the configuration aspects of a stretched cluster and provide guidance if any of the health checks fail. This is an excellent starting point for anyone considering deploying vSAN in a stretched cluster configuration. Further information on vSAN stretched cluster may be found in Chapter 8, “Interoperability.”

vSAN Performance Service

The final health check relates to a new feature of vSAN 6.2, namely the performance service. If the performance service has not been configured, this test will throw a warning stating that the Stats DB Object is not found, as per Figure 10.8.

Figure 10.8 - Performance service health when disabled

When the performance service has been enabled (we will see how to do this shortly) the vSAN performance service health check now has a number of additional health checks that hopefully all show OK/passed status, as per Figure 10.9. These tests ensure that all hosts are contributing data to the performance service, and that the Stats DB Object (which is in fact a VM home namespace object on the vSAN datastore) is healthy.

Figure 10.9 - Performance service health when enabled

Proactive Health Checks

Before leaving health checks, there is another feature of the health checks that is worth mentioning. This is of course the set of proactive health checks that are also included with the health check feature. These proactive health checks will run a set of tests that give you peace of mind that everything is functioning as expected in your vSAN cluster. Note that VMware does not recommend running these tests in production. These tests are primarily used to ensure everything is working as expected prior to placing vSAN into production, or during a proof-of-concept. There are three proactive tests:

  • VM creation test
  • Multicast performance test
  • Storage performance test

To access the proactive tests, select the cluster object in the vCenter Server inventory, and then select the Monitor tab, then vSAN and finally proactive tests. This will display the list of proactive tests that can then be selected and started by clicking on the start icon on the UI. When a test is selected, the start icon turns green, which means you can now start the test. Let’s look at each of the tests in more detail.

Figure 10.10 - Proactive tests

VM Creation Test

The VM creation test takes about 20 to 40 seconds to run, depending on the size of the cluster. The test creates a virtual machine on each host in the cluster, and then deletes the virtual machine afterwards. The task console can be monitored to track the creation and deletion of virtual machines. This is a very useful test to verify that virtual machines can indeed be created on the vSAN datastore before starting the deployment of actual production-ready virtual machines. If there is an issue with deploying a virtual machine during the VM creation proactive test, the test will either fail immediately or it will time out within 3 minutes. This test uses the default VM storage policy for the VM. It verifies a number of aspects of vSAN operations. It verifies that the network is configured properly on all the hosts, that the vSAN stack is operational on each host, and that the creation, deletion, and ability to do I/O to vSAN objects is all functioning. The test results will report which hosts failed and which hosts passed the test, and further diagnosis can then be carried out on the problematic host or hosts. The following is the result of a successful run of the VM creation test on a 4-node vSAN cluster.

Figure 10.11 - VM creation test

Multicast Performance Tests

This next test verifies not only that there is network connectivity between all of the hosts in the vSAN cluster, but also that the network has sufficient speed and performance to allow the vSAN cluster to function as expected. The test selects one host to run the tests from, and then reports on the bandwidth (MB/s) achieved. In this 4-node cluster, we can see that vSAN was able to achieve over 80 MB/s on the 10 Gb network. This meets vSAN’s network requirements, so the test has succeeded.

Figure 10.12 - Multicast performance test

If the multicast performance test fails, one should examine the network configuration. If some of the hosts succeeded, and others failed, see if there is a configuration pattern that is shared by the hosts that succeeded and the hosts that failed. Are they on different switches for example? Perhaps the link/pipe between switches is not configured correctly, or does not have enough bandwidth. Multicast is an important component of vSAN, and administrators need to ensure it is configured correctly and functioning as expected before placing the vSAN into production.

Storage Performance Tests

It is possible that administrators will make significant use of this last proactive test. The nice thing about the storage performance test is that administrators can control the length of time that this test should run for. This means that administrators can run this overnight when vSAN is first deployed, and in fact there are different workloads that can be chosen for this exact purpose.

Figure 10.13 - Storage performance test workloads

If the test runs overnight, and everything passes, then you have peace of mind that all of the storage components (controller hardware, driver, firmware, cache tier devices, and capacity tier devices) are functioning correctly. If the test fails, then you have also managed to catch the issue before vSAN has been placed into production. As well as selecting a length of time for the test to run, and the type of performance check to run, administrators can also choose a particular VM storage policy. In previous versions of vSAN, it was always a RAID-1 configuration that was tested. With the release of vSAN 6.2, RAID-5 or RAID-6 policies can also now be selected.

It should now be obvious that the health check is the best tool to use when monitoring or troubleshooting a vSAN cluster. This should be the first tool to use when it comes to analyzing anomalies on the vSAN cluster. The integration with Ask VMware, and the multiple KB articles available to describe each test, are extremely useful and should empower administrators to root cause many issues on the cluster.

However, there are other tools available for monitoring and troubleshooting vSAN alongside the health check. In the initial release of vSAN, these were the only tools available, as the health check was only made available in vSAN 6.0. We will examine these additional tools next. We will provide a few examples of how, when, and where to use them. This is not intended to be an extensive troubleshooting guide, but should provide pointers to which options exist.

ESXCLI

ESXi 5.5 U1 introduced a new ESXi CLI (ESXCLI) namespace: esxcli vsan. This has a selection of additional namespaces that can be used for examining, monitoring, and configuring the vSAN cluster, as demonstrated in the following example:

~ # esxcli vsan
 esxcli vsan
 Usage: esxcli vsan {cmd} [cmd options]

Available Namespaces:
 cluster         Commands for vSAN host cluster configuration
 datastore       Commands for vSAN datastore configuration
 network         Commands for vSAN host network configuration
 storage         Commands for vSAN physical storage configuration       
 faultdomain     Commands for vSAN fault domain configuration         
 maintenancemode Commands for vSAN maintenance mode operation        
 policy          Commands for vSAN storage policy configuration
 trace           Commands for vSAN trace configuration

The sections that follow take a look at some of the options available.

esxcli vsan datastore

The esxcli vsan datastore namespace provides commands for vSAN datastore configuration. There is very little that can be done here other than get and set the name of the vSAN datastore. By default, the vSAN datastore name is vsanDatastore. If you do plan on changing the vsanDatastore name, do this at the cluster level via the vSphere web client. It is highly recommended that if you are managing multiple vSAN clusters from the same vCenter Server that the vSAN datastores are given unique, easily identifiable names, as shown in the next example.

~ # esxcli vsan datastore          
 Usage: esxcli vsan datastore {cmd} [cmd options]

Available Namespaces:          
 name    Commands for configuring vSAN datastore name.

~ # esxcli vsan datastore name          
 Usage: esxcli vsan datastore name {cmd} [cmd options]          
 Available Commands:          
 get     Get vSAN datastore name.          
 set     Configure vSAN datastore name. 

~ # esxcli vsan datastore name get          
Name: vsanDatastore

esxcli vsan network

This namespace provides commands for vSAN network configuration. It is somewhat more useful than the previous datastore namespace as it allows you to list the current configuration, clear the current configuration, restore the vSAN network configuration (this is used during the boot process by ESXi, and is not meant for customer invocation), as well as remove an interface from the vSAN network configuration, as shown in the next example.

~ # esxcli vsan network          
 Usage: esxcli vsan network {cmd} [cmd options]

Available Namespaces:          
 Ip      Commands for configuring IP network for vSAN.          
 ipv4    Compatibility alias for "ip"

Available Commands:          
 clear   Clear the vSAN network configuration.          
 list    List the network configuration currently in use by vSAN.          
 remove  Remove an interface from the vSAN network configuration.          
 restore Restore the persisted vSAN network configuration.

~ # esxcli vsan network list          
Interface          
  VmkNic Name: vmk2          
  IP Protocol: IP          
  Interface UUID: a6997656-d456-5d5f-091a-ecf4bbd69680          
  Agent Group Multicast Address: 224.2.3.4          
  Agent Group IPv6 Multicast Address: ff19::2:3:4          
  Agent Group Multicast Port: 23451          
  Master Group Multicast Address: 224.1.2.3          
  Master Group IPv6 Multicast Address: ff19::1:2:3          
  Master Group Multicast Port: 12345          
  Host Unicast Channel Bound Port: 12321          
  Multicast TTL: 5

What is interesting to view here is the multicast information. If you cast your mind back to the requirements in Chapter 2, “vSAN Prerequisites and Requirements for Deployment,” you might remember that there is a requirement to allow multicast traffic between ESXi hosts participating in the vSAN cluster.

Another interesting point to note is that the initial release of vSAN supported IPv4 only. vSAN 6.2 introduces support for IPv6. However, it is the multicast details that are of most interest. The Agent Group Multicast Port corresponds to the cmmds port that is opened on the ESXi firewall when vSAN is enabled. The agent group address, 224.2.3.4, is used for agent communication, whereas the master group address, 224.1.2.3, is used for master/backup communication. esxcli vsan network list is a useful command to view the network configuration and status should a network partition occur.

Additional commands that can be useful for troubleshooting networking problems are as follows (example invocations are shown after the list):

  • esxcli network diag ping: Tests the responsiveness of a VMkernel port
  • esxcli network ip neighbor list: Displays address resolution protocol (ARP) cache entries for all other vSAN nodes on the network
  • esxcli network ip connection list: Displays the user datagram protocol (UDP) connection information
  • tcpdump-uw: Sniffs the network traffic
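
As a rough sketch, here is how these commands might be used to verify vSAN connectivity between two hosts. The vmk2 interface and port 23451 are taken from the esxcli vsan network list output shown earlier; substitute the VMkernel interface, remote IP address, and port used in your own environment.

# Ping another host's vSAN VMkernel interface using the local vSAN vmknic
~ # esxcli network diag ping -I vmk2 -H <remote-vsan-vmk-ip>

# Check that ARP entries exist for the other vSAN nodes
~ # esxcli network ip neighbor list

# Capture vSAN CMMDS traffic on the vSAN vmknic (Ctrl+C to stop)
~ # tcpdump-uw -i vmk2 udp port 23451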

esxcli vsan storage

This namespace is used for storage configuration, and includes options on how vSAN should claim disks, as well as the ability to add and remove physical disks to vSAN.

The command esxcli vsan storage automode allows you to get or set the autoclaim option.
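
The following sketch shows how the autoclaim setting can be viewed and changed; note that in most environments disk claiming is better managed from the vSphere web client.

# Display whether vSAN automatically claims empty local disks
~ # esxcli vsan storage automode get

# Disable automatic disk claiming on this host
~ # esxcli vsan storage automode set --enabled false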

To display the capacity tier and cache tier devices that have been claimed and are in use by vSAN from a particular ESXi host, you may use the list option. In this particular configuration, which is an all-flash configuration, SSDs are used for the capacity tier devices and a Micron PCI-E device is used for the cache tier. All devices have a true flag against the field Used by this host, indicating that they have been claimed by vSAN, and the Is SSD field indicates the type of device (true for flash devices).
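
A minimal sketch of the list option is shown below; the grep simply trims the output down to the fields mentioned above, and the exact field names can vary slightly between vSAN versions.

# Show the disks claimed by vSAN on this host, trimmed to a few key fields
~ # esxcli vsan storage list | grep -E "Device:|Is SSD:|Used by this host:"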

If you want to use ESXCLI to add new disks to a disk group on vSAN, you can use the add option. There is a different option to choose depending on whether the disk is a magnetic disk or an SSD (-d|--disks or -s|--ssd, respectively). Note that only disks that are empty and have no partition information can be added to vSAN.
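
The following sketch shows the add option with placeholder device identifiers; substitute the NAA identifiers of your own empty devices. Whether a capacity device is placed into an existing disk group or requires a new cache device first depends on your current disk group layout.

# Add a flash device as the cache device of a new disk group
~ # esxcli vsan storage add -s naa.<cache-device-id>

# Add an empty magnetic disk as a capacity device
~ # esxcli vsan storage add -d naa.<capacity-device-id>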

There is also a remove option that allows you to remove magnetic disks and SSDs from disk groups on vSAN. It should go without saying that you need to be very careful with this command and removing disks from a disk group on vSAN should be considered a maintenance task. The remove option removes all the partition information (and thus all vSAN information) from the disk supplied as an argument to the command. Note that when a cache tier device is removed from a disk group, the whole disk group becomes unavailable.

If you have disks that were once used by vSAN and you now want to repurpose these disks for some other use (Virtual Machine File System [VMFS], Raw Device Mappings [RDM], or in the case of SSDs, vFRC [vSphere Flash Read Cache]), you can use the remove option to clean up any vSAN partition information left behind on the disk.
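
A sketch of the remove option follows; once again the NAA identifiers are placeholders, and evacuating or protecting the data on the affected disk group beforehand is your responsibility.

# Remove a single capacity (magnetic) disk from its disk group
~ # esxcli vsan storage remove -d naa.<capacity-device-id>

# Remove a cache (SSD) device; this removes the entire disk group
~ # esxcli vsan storage remove -s naa.<cache-device-id>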

Additional useful commands for looking at disks and controllers include the following:

  • esxcli storage core adapter list: Displays the driver and adapter description, which can be useful to check that your adapter is on the hardware compatibility list (HCL)
  • esxcfg-info -s | grep "==+SCSI Interface" -A 18: Displays lots of information, but most importantly shows the queue depth of the device, which is very important for performance
  • esxcli storage core device smart get -d XXX: Displays SMART statistics about your drive, especially SSDs. Very useful command to display Wear-Leveling information, and overall health of your SSD
  • esxcli storage core device stats get: Displays overall disk statistics

esxcli vsan cluster

The esxcli vsan cluster command allows the ESXi host on which the command is run to get vSAN cluster information, as well as allow the ESXi host to leave or join a vSAN cluster. This can be very helpful in a scenario where vCenter Server is unavailable and a particular host needs to be removed from the vSAN cluster. The restore functionality is not intended for customer invocation and is used by ESXi during the boot process to restore the active cluster configuration from configuration file.

The get option to this command is useful for gathering information about the local ESXi host’s (node’s) health, as well as its role in the cluster.
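
The following sketch shows the most common invocations; the sub-cluster UUID used with the join command is a placeholder and can be obtained from esxcli vsan cluster get on a host that is already a member of the cluster.

# Show this host's vSAN cluster membership, health, and role (master/backup/agent)
~ # esxcli vsan cluster get

# Remove this host from the vSAN cluster
~ # esxcli vsan cluster leave

# Join this host to an existing vSAN cluster using its sub-cluster UUID
~ # esxcli vsan cluster join -u <sub-cluster-uuid>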

esxcli vsan faultdomain

Fault domains were introduced in vSAN 6.0, and allow vSAN to be rack aware. What this means is that components belonging to objects that are part of the same virtual machine can be placed not just in different hosts, but in different racks. This means that should an entire rack fail (e.g., power failure), there is still a full set of virtual machine components available so the VM remains accessible.

~# esxcli vsan faultdomain          
 Usage: esxcli vsan faultdomain {cmd} [cmd options]          
 Available Commands:          
 get Get the fault domain name for this host.          
 reset Reset Host fault domain to default value          
 set Set the fault domain for this host

Fault domains are also used for vSAN stretched cluster and 2-node configurations. These were already discussed earlier in the book. It is unlikely that you will need to use this esxcli vsan faultdomain namespace for such a configuration. VMware recommends using the vSphere web client UI for all fault domain, stretched cluster, and 2-node (ROBO) configurations.

esxcli vsan maintenancemode

maintenancemode is an interesting command option. You might think this would allow you to enter and exit maintenance mode, but it doesn’t. All this option allows you to do is to cancel an in-progress vSAN maintenance mode operation. This could still prove very useful, though, especially when you have decided to place a host in maintenance mode and selected the Full Data Migration option and want to stop this data migration process (which can take a very long time) and instead use the Ensure Access option. Note that you can place a node in maintenance mode by leveraging esxcli system maintenanceMode set -e true -m noAction, where -m specifies whether components need to be moved or not.

~ # esxcli vsan maintenancemode          
 Usage: esxcli vsan maintenancemode {cmd} [cmd options]          
 Available Commands:          
 cancel Cancel an in-progress vSAN maintenance mode operation.

esxcli vsan policy

Virtual machine (VM) storage policies are something that we have covered in great detail in previous chapters of this book. vSAN associates a default storage policy with a VM’s storage objects, and the esxcli vsan policy namespace is one way to examine this default storage policy and modify it using esxcli vsan policy setdefault.
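
The following is a minimal sketch of examining and changing the default policy from the command line. The policy expression shown (a hostFailuresToTolerate value of 1 for the vdisk policy class) is only an illustration of the expected format; in practice, VM storage policies should be managed through SPBM in the vSphere web client.

# Display the default policy for each vSAN policy class
~ # esxcli vsan policy getdefault

# Set the default policy for virtual disk objects to tolerate one failure
~ # esxcli vsan policy setdefault -c vdisk -p "((\"hostFailuresToTolerate\" i1))"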

esxcli vsan trace

esxcli vsan trace is a troubleshooting and diagnostic utility and should not be used without the guidance of VMware global support services (GSS). It is designed to capture internal diagnostics from vSAN for further analysis.

Additional Non-ESXCLI Commands for Troubleshooting vSAN

In addition to the esxcli vsan namespace commands, there are a few additional CLI commands found on an ESXi host that may prove useful for monitoring and troubleshooting.

osfs-ls

osfs-ls is more of a troubleshooting command than anything else. It is useful for displaying the contents of the vSAN datastore. The command is not in your search path, but can be found in /usr/lib/vmware/osfs/bin/. This command can, for instance, list the contents of a VM folder on the vSAN datastore. This can prove useful if the datastore file view is not working correctly from the vSphere web client, or it is reporting inaccurate information for some reason or other.
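
The following sketch lists the root of the vSAN datastore and then the contents of one VM folder; the folder name is a placeholder.

# List the top-level directories (VM folders) on the vSAN datastore
~ # /usr/lib/vmware/osfs/bin/osfs-ls /vmfs/volumes/vsanDatastore/

# List the contents of a particular VM's folder
~ # /usr/lib/vmware/osfs/bin/osfs-ls /vmfs/volumes/vsanDatastore/<vm-folder-name>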

cmmds-tool

cmmds-tool is another useful troubleshooting command from the ESXi host and can be used to display lots of vSAN information. It can be used to display information such as configuration, metadata, and state about the cluster, hosts in the cluster, and VM storage objects. Many other high-level diagnostic tools leverage information obtained via cmmds-tool. As you can imagine, it has a number of options.

The find option may be the most useful, especially when you want to discover information about the actual storage objects backing a VM. You can, for instance, see the health of a specific object.
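
As a sketch, the following uses the find option to dump the CMMDS entry for a single object; the object UUID is a placeholder (it can be found in the VM's object layout in RVC or the vSphere web client), and the entry returned includes the object's health and state.

# Show the CMMDS directory entry for one DOM object in JSON format
~ # cmmds-tool find -t DOM_OBJECT -u <object-uuid> -f json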

vdq

The vdq command serves two purposes and is really a great troubleshooting tool to have on the ESXi host. The first option (vdq -q) to this command tells you whether disks on your ESXi host are eligible for vSAN, and if not, what the reason is for the disk being ineligible.

The second option (vdq -i -H) comes into play once vSAN has been enabled: you can use it to display disk mapping information, which essentially shows which SSD or flash devices and magnetic disks are grouped together in a disk group.
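
Both invocations are summarized in this short sketch:

# Check whether each local disk is eligible for vSAN (and why not, if ineligible)
~ # vdq -q

# Once vSAN is enabled, show the disk-to-disk-group mappings in human-readable form
~ # vdq -i -H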

Although some of the commands shown in this section may prove useful to examine and monitor vSAN on an ESXi host basis, administrators ideally need something whereby they can examine the cluster as a whole. VMware recognized this very early on in the development of vSAN, and so introduced extensions to the Ruby vSphere console (RVC) to allow a cluster-wide view of vSAN. The next topic delves into RVC.

Ruby vSphere Console

The previous section looked at ESXi host-centric commands for vSAN. These might be of some use when troubleshooting vSAN, but with large clusters, administrators may find themselves having to run the same set of commands over and over again on the different hosts in the cluster. In this next section, we cover a tool that enables you to take a cluster-centric view of vSAN. Since version 5.5 U1, VMware vCenter Server contains a new component called the Ruby vSphere console (RVC). The RVC is also included on the VMware vCenter Virtual Appliance (VCVA). As mentioned in the introduction, RVC is a programmable interface that allows administrators to query the status of vCenter Server, clusters, hosts, storage, and networking. For vSAN, there are quite a number of programmable extensions to display a considerable amount of information that you need to know about a vSAN cluster. This section covers those vSAN extensions in RVC.

You can connect RVC to any vCenter Server. On the VCVA, you log in via Secure Shell (SSH) and run rvc <user>@<vc-ip>.

For Windows-based Virtual Center environments, you need to open a command shell and navigate to c:\Program Files\VMware\Infrastructure\VirtualCenter Server\support\rvc. Here, you will find an rvc.bat file that you may need to edit to add appropriate credentials for your vCenter Server (by default, Administrator@localhost). Once those credentials have been set appropriately, simply run the rvc.bat file, type your password, and you are connected.

After you log in, you will see a virtual file system, with the vCenter Server instance at the root. You can now begin to use navigation commands such as cd and ls, as well as tab completion to navigate the file system. The structure of the file system mimics the inventory items tree views that you find in the vSphere client. Therefore, you can run cd <vCenter Server>, followed by cd <datacenter>. You can use ~ to refer to your current datacenter, and all clusters are in the “computers” folder under your datacenter. Note that when you navigate to a folder/directory, the contents are listed with numeric values. These numeric values may also be used as shortcuts. For example, in the vCenter folder in the next example there is only one datacenter, and it has a numeric value of 0 associated with it. We can then cd to 0, instead of typing out the full name of the datacenter:

> ls
 0 /
 1 mia-cg07-vc01/

cd 1
  /mia-cg07-vc01> ls
  0 mia-cg07-dc01 (datacenter)
  /mia-cg07-vc01> cd 0
  /mia-cg07-vc01/mia-cg07-dc01> ls
  0 storage/
  1 computers [host]/
  2 networks [network]/
  3 datastores [datastore]/
  4 vms [vm]/

vSAN Commands

If you want to learn about any command, run <command> --help. You can also use help and help <command-namespace> (like help vm, help vm.ip) to learn more about commands. Below is a condensed example output of help vsan.

help vsan          
 Namespaces:          
 health          
 perf          
 sizing          
 stretchedcluster          
 vsanmgmt

Commands:      
 apply_license_to_cluster: Apply license to vSAN      
 check_limits: Gathers (and checks) counters against limits      
 check_state: Checks state of VMs and vSAN objects      
 clear_disks_cache: Clear cached disks information      
 cluster_change_autoclaim: Enable/Disable autoclaim on a vSAN cluster      
 cluster_change_checksum: Enable/Disable vSAN checksum enforcement on a cluster      
 cluster_info: Print vSAN config info about a cluster or hosts      
 cluster_set_default_policy: Set default policy on a cluster      
 cmmds_find: CMMDS Find      
 disable_vsan_on_cluster: Disable vSAN on a cluster...

All the commands shown here must be prefixed by vsan. Therefore, to run enable_vsan_on_cluster, you must use vsan.enable_vsan_on_cluster. Remember that there is command completion, so you only need to type the first couple of characters of each command and then use the Tab key to complete the command (or display which commands match what you have typed so far).

Of course, there is another set of commands in RVC that is also of interest to vSAN administrators. These are the SPBM (storage policy based management) commands and are used for all things related to VM storage policies. The next example shows a list of SPBM commands in RVC:

> help spbm          
 Commands:          
 check_compliance: Check compliance          
 device_add_disk: Add a hard drive to a virtual machine          
 device_change_storage_profile: Change storage profile of a virtual disk          
 namespace_change_storage_profile: Change storage profile of VM namespace          
 profile_apply: Apply a VM Storage Profile. Pushed profile content to Storage system          
 profile_create: Create a VM Storage Profile          
 profile_delete: Delete a VM Storage Profile          
 profile_modify: Create a VM Storage Profile          
 vm_change_storage_profile: Change storage profile of VM namespace and its disks

 To see commands in a namespace: help namespace_name          
 To see detailed help for a command: help namespace_name.command_name

To get even more help on a particular command, simply precede the full command with help. Before we look at other troubleshooting tools, let’s look at a simple example of enabling and disabling vSAN on a cluster.

enable_vsan_on_cluster and disable_vsan_on_cluster

These commands do exactly what they say—enabling and disabling vSAN on the cluster. The only other way to do this is via the vSphere web client. You cannot do this via ESXCLI. With ESXCLI, you could get the ESXi host to join or leave the cluster, scaling it in or scaling it out, but there was no way to enable or disable the vSAN service on the cluster. The following commands (where 0 is a numeric representation of the cluster as shown in the ls output), allow you to do this:

/vcsa-05/Datacenter/computers>ls
0 Cluster (cluster): cpu 134 GHz, memory 305 GB          

/vcsa-05/Datacenter/computers>vsan.disable_vsan_on_cluster 0
ReconfigureComputeResource Cluster: success          
   esxi-b-pref.rainpole.com: running []          
   esxi-c-scnd.rainpole.com: running []          
   esxi-d-scnd.rainpole.com: running []          
   esxi-a-pref.rainpole.com: running []

If the command succeeds, the running status changes to success:

 esxi-b-pref.rainpole.com: success          
 esxi-c-scnd.rainpole.com: success          
 esxi-d-scnd.rainpole.com: success          
 esxi-a-pref.rainpole.com: success

Re-enabling vSAN is just as straightforward:

/vcsa-05/Datacenter/computers>vsan.enable_vsan_on_cluster 0
ReconfigureComputeResource Cluster: success          
   esxi-b-pref.rainpole.com: success          
   esxi-c-scnd.rainpole.com: success          
   esxi-d-scnd.rainpole.com: success          
   esxi-a-pref.rainpole.com: success

As already mentioned, there are many more options and commands provided by RVC. Explore these at your own convenience.

Troubleshooting vSAN on the ESXi Host

So far, we have looked at a lot of tools that are cluster centric, such as the RVC tool. Although we did discuss some ESXCLI commands available to administrators on an ESXi host, you also need to know where to look to find error messages and log files. This small section highlights which log files to monitor, as well as some other utilities you might need to use when troubleshooting vSAN.

Log Files

You can find the vSAN log files in the ESXi host locations outlined in Table 10.1.

Log File Description                                    Log File Location
CLOM (cluster level object manager) logs                /var/log/clomd.log
OSFS (presents vSAN object storage as a file system)    /var/log/osfsd.log
vCenter/ESXi communications                             /var/log/hostd.log
vSAN vendor provider                                    /var/log/vsanvpd.log
ESXi log                                                /var/log/vmkernel.log

Table 10.1 - vSAN Log File Locations

You may also find references to the major software components of vSAN, such as LSOM, RDT, DOM, and PLOG. GSS recommends searching the VMkernel log files for entries containing these keywords when troubleshooting vSAN issues. If you are unfamiliar with these software components, revisit Chapter 5. It provides detailed information about the role played by each of the components.
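
A simple sketch of this approach from the ESXi shell is shown below; the component names are used as literal search strings, and the exact log message format varies between vSAN releases.

# Search the current VMkernel log for entries from the main vSAN components
~ # grep -E "LSOM|RDT|DOM|PLOG" /var/log/vmkernel.log | tail -n 50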

vSAN Traces

We briefly touched on the trace utility back in the ESXCLI section of this chapter. vSAN uses a compressed binary trace format for logging multiple messages per I/O. The traces go into /var/log/vsantraces/. These traces are not human readable and must be extracted before they can be viewed. To decode the vSAN traces into human-readable “log messages,” you can run the following commands on the ESXi host:

cd /var/log/vsantraces/
zcat <file>.gz | /usr/lib/vmware/vsan/bin/vsanTraceReader.py > <file>.txt

When this command is run, <file>.txt will contain a human-readable form of the trace.

For hosts with less than 512 GB of memory, booting the ESXi image from USB/SD devices is supported. For hosts with a memory configuration larger than 512 GB, ESXi needs to be installed on a local disk or a SATADOM device. The reason for this is that in the event of a critical failure, ESXi is set up to dump its memory state. With memory sizes above 512 GB, there is not enough space on SD/USB devices to capture this state. Therefore, a device with a larger capacity is needed. VMware global support services and the VMware engineering teams use this core dump information and relevant vSAN traces for root cause analysis. When installing ESXi on USB or SD, note that you should use a device that has a minimum capacity of 8 GB.

vSAN VMkernel Modules and Drivers

ESXi comes with all of the components required to build a vSAN cluster. No additional VIBs or software components need to be added to the host to successfully create a vSAN cluster and build a scaled-out vSAN datastore.

When vSAN is successfully configured, you will observe new VMkernel modules loaded for the purposes of implementing vSAN. The VMkernel module names are vsan, rdt, plog, and lsomcommon.
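
You can confirm that these modules are loaded from the ESXi shell; a minimal sketch is shown below. Module names can vary slightly between vSAN releases, as discussed in the paragraphs that follow.

# List the loaded VMkernel modules related to vSAN
~ # esxcli system module list | grep -iE "vsan|rdt|plog|lsom"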

When a VM does a write operation, the write goes to SSD, and then vSAN regularly flushes (or evicts) the SSD contents to magnetic disks. The plog module implements the vSAN elevator algorithm. It looks at the physical layout of the magnetic disk and decides when to flush SSD contents to it.

The vsan module can be thought of as the module for both the LSOM and DOM components. Because these components are heavily intertwined, the lsomcommon module contains shared code for them, although they are now separated in vSAN 6.0 and later, and LSOM has its own module.

The rdt module is the reliable datagram transport module responsible for cross-cluster vSAN communication.

Performance Monitoring

One of the most important aspects of managing storage is to be able to monitor and troubleshoot performance issues. vSAN is no different. In this section, we share with you various tools that are at a vSphere administrator’s disposal for monitoring and troubleshooting vSAN performance-related issues.

Introducing the Performance Service

In vSAN 6.2, the issue of being unable to determine the performance of the different parts of vSAN was addressed. With the introduction of the new vSAN 6.2 performance service, administrators can now drill down into the various parts of a vSAN cluster and examine the underlying performance. Administrators can now determine how a host, disk group, individual disk, or virtual machine is performing on a vSAN cluster.

Enabling the performance service

Enabling the performance service is quite straightforward. Navigate to the vSAN cluster object in the vCenter Server inventory, click the Manage tab and click on the Health and Performance section. The performance service will be disabled by default, as per Figure 10.14.

Figure 10.14 - Performance service is turned off

To enable the performance service, click on the “Edit” button. When the performance service is enabled, a VM home namespace object, with 255 GB of capacity for the storing of metrics, is created on the vSAN datastore. This requires a policy, but it will choose the vSAN datastore default policy automatically. Administrators can choose a different policy if they wish. Once enabled, the performance service will look similar to Figure 10.15.

Figure 10.15 - Performance service is turned on

Now that the performance service is enabled, it may be used for examining vSAN performance.

Using the vSAN performance service

Once the vSAN performance service has been enabled, multiple performance views are now available to the administrators. The views are as follows:

  • Cluster Performance
    • Virtual Machine Consumption
    • vSAN Backend
  • Host Performance
    • Virtual Machine Consumption
    • vSAN Backend
  • Disk Performance
    • Disk Group
    • Disk
  • Virtual Machine
    • Virtual Machine Performance
    • Virtual Disk Performance

To monitor the vSAN performance for any of the above, select the appropriate inventory object (cluster, host, VM), then select the Monitor tab and then select Performance. The performance views will then be available for selection. Figure 10.16 shows an example of the cluster performance view, albeit taken from a very idle, pre-production vSAN system.

Figure 10.16 - Cluster Performance—Virtual Machine Consumption

Performance Service Metrics

Each of these views has a set of metrics that administrators will find useful. These are IOPS, throughput, outstanding IO, latency, and congestion. These can be examined both from a front-end perspective (VM workload perspective) and from a vSAN backend perspective. As previously mentioned, administrators can drill down into the performance from a cluster level, host level, disk group level, disk level, or even a VM level.

When examining the vSAN back-end metrics on a per host level, some additional counters are available to look at resync/rebuild activity, including the number of resync IOPS and resync throughput.

Further granularity of performance metrics is available in the disk groups and disk views. In the disk group view, read cache hit rate (for hybrid systems) is available, as well as information related to delayed IO. Delayed IO is a good indicator of whether the amount of outstanding IO is correctly configured, especially for benchmarking. Outstanding IO should be tuned to keep the pipeline full at all times and the device continuously busy. You do not want to go over this and have I/O delayed waiting to get into the queue, as this introduces latency. If the delayed IO percentage is over 0%, then there is too much outstanding IO in the queue. Delayed IO average latency tells us how much latency this delayed IO is adding.

Just like the health check service, the performance service also has the “Ask VMware” capability. Should you need further information on any of the metrics or counters, simply click the “Ask VMware” link and this should provide information related to the metric.

ESXTOP Performance Counters for vSAN

In the initial release of vSAN, there were no vSAN datastore–specific performance counters in esxtop. In vSAN 6.0 and later, specific counters related to vSAN were added to esxtop. However, aside from the vSAN specific counters, esxtop can still be a very useful tool when you want to examine VM activity, VMDK performance, host status, memory usage, adapter queue depth and of course, disk activity on an ESXi host basis. Esxtop is quite easy to use; at a shell prompt on an ESXi host, simply type esxtop.

Figure 10.17 shows some sample esxtop output from the initial release.

Figure 10.17 - esxtop output

Typing the character h while esxtop is running displays help. The following display options are available:

c: CPU
i: Interrupt
m: Memory
n: Network
d: Disk adapter
u: Disk device
v: Disk VM
p: Power management
x: vSAN

The vSAN view displays three roles: client, owner, and component manager. These roles were covered in Chapter 5. I/O statistics related to each role may be observed on a per host basis with esxtop on vSAN 6.0 or later. But what about a deeper view? In vSAN 6.2, the new performance service can also give you this information. However, we also have another tool that provides vSAN-centric performance statistics, namely the vSAN observer tool. We will examine this shortly.

vSphere Web Client Performance Counters for vSAN

Prior to vSAN 6.2, the vSphere client does not have any specific performance counters for the vSAN datastore. If you navigate to the vSAN cluster object in the vCenter Server inventory, select the Monitor tab, and then select the Performance view, there is an option to change the chart options. You will notice that once again nothing specific for the vSAN datastore is available.

However, the performance views available in the vSphere client for both the VMs and the VMDKs work perfectly, even when the VM is deployed on the vSAN datastore. Figure 10.18 shows performance information, in this case highlighting read latency and write latency of VMDKs.

Figure 10.18 - vSphere web client performance view

As mentioned in the esxtop section, apart from the new vSAN 6.2 performance service, the vSAN observer tool is what we can use to display information about vSAN performance. In versions of vSAN prior to 6.2, this is really the only tool available for performance monitoring of the underlying vSAN layers. We shall look at this tool next.

vSAN Observer

The vSphere web client in vSphere 5.5 U1 ships with a number of built-in vSAN management functions. For example, you will find vSAN datastore as well as VM-level performance statistics in the vSphere web client. If you require in-depth vSAN performance, however, down to the physical disk layers, understanding the cache hit rates, reasons for observed latencies, and so on, the vSphere web client will not deliver this level of detail in vSphere 5.5 U1. In vSAN 6.2, the new vSAN performance service should now more than fill this gap. However the performance service was not available in previous vSAN versions. That is where the vSAN observer comes in.

The vSAN observer is part of the Ruby vSphere console (RVC), an interactive command-line shell for vSphere management as we have already seen, and RVC is part of both the Windows vCenter Server and VCVA (appliance version of vCenter Server) in vSphere 5.5 U1.

Let’s talk a bit about requirements and how to deploy the vSAN observer before delving into what it can do for you.

vSAN Observer Requirements

vSAN observer is a performance tool that has been specifically written to display vSAN performance information, and the primary use case is advanced performance troubleshooting by VMware GSS. It requires a modern web browser and a working Internet connection (because certain open source software components need to be downloaded for it to work). It also requires vCenter Server 5.5 U1 or later, either the Linux appliance version (VCVA) or the Windows version. The Ruby vSphere console (RVC) is preinstalled in vCenter Server 5.5 U1.

You have two deployment options:

  • You can use RVC inside your production vCenter Server that is managing your vSAN clusters.
  • You can deploy an additional vCenter Server, just to get RVC and the vSAN observer tool.

In a lab environment, the former is likely more convenient. Administrators need to be aware that the vSAN Observer opens up an unencrypted and unhardened HTTP server. Doing this on your production vCenter Server may very well be against your security policies, which is why VMware created the standalone option. In such a case, deploying an additional server to run RVC may be a better option for you.

Running the vsan.observer command with the name of your cluster as an argument will launch vSAN observer. This command will gather statistics from vCenter Server and vSAN at a fixed interval. The interval defaults to 60 seconds for statistics collection, but you may specify a smaller or larger interval via the --interval parameter. It will currently collect information for a 2-hour period.

Typically, you want to run the command with the --run-webserver option, which opens an unencrypted HTTP web server on port 8010. You can change the port number with the --port option. Because we have already looked at the steps to launch RVC on a Windows version of vCenter Server earlier in this chapter, let’s now look at the steps required to get RVC up and running (and thus launch the vSAN Observer) on the vCenter Server Linux appliance (VCVA):

  1. Open an SSH session to your vCenter Server Appliance: ssh root@<name or ip of your VCVA>
  2. Open RVC using your root account and the vCenter name, in my case: rvc root@localhost
  3. Now cd into your vCenter object (you can do an ls to see what the names are of your objects on any level), and if you press the <tab> key it will be completed with your datacenter object: cd localhost/<Name-of-your-datacenter>/
  4. Now do a cd again. The first object is computers and the second is your cluster. In my case that looks like the following: cd computers/<Name-of-your-vSAN-cluster>/
  5. Now you can start the vSAN observer using the following command: vsan.observer . --run-webserver --force
  6. Now you can see the observer querying stats every 60 seconds, and as mentioned you can stop this by pressing Ctrl+C. The collection will stop automatically after a period of 2 hours.

After completing these preparation steps, you can now examine the in-depth vSAN performance data. Begin by opening a web browser and pointing it at http://<rvc-vc-ip>:<observer-port>. The <rvc-vc-ip> is the IP address of the host running RVC, not the IP address of the vCenter Server that you are monitoring (although they could be the same). The port defaults to 8010, but you may have changed it via the --port option. We recommend using Google Chrome, but any modern browser should work. Internet Explorer 8 is not considered a modern browser, but may still work to some extent. Older versions of IE will definitely give you problems.

Figure 10.19 shows what the vSAN observer landing page looks like.

Figure 10.19 - vSAN Observer: vSAN Client view

vSAN Observer will run until you tell it to stop via Ctrl+C. Note that it will keep the entire history of your observer session in memory until you press Ctrl+C, meaning if you run it for many hours it will use multiple gigabytes of RAM. This is another reason why you may prefer to run the vSAN observer on a dedicated vCenter Server.

Examining vSAN Observer Performance Data

When you first open up vSAN observer, the primary indicator of an abnormal condition is a red underline beneath any graph that is outside normal operating boundaries. Graphs in vSAN observer typically show green for normal state, or gray if there is no information or not enough information yet available. Red is your indicator to start investigating, and is displayed when 20% of the samples taken during the sampling period are outside the configured threshold.

The vSAN Observer UI is organized by subsystem. You should start with the vSAN Client view, which gives you an overview of what level of service the VMs are getting from vSAN. Every host in a vSAN cluster (and hence every “vSAN Client”) is consuming storage distributed across all other hosts in the cluster, so seeing a performance issue on the vSAN client on host A may in fact be due to overloaded disks on host B.

The “vSAN disks” view allows you to look at vSAN from that perspective, checking how nodes that contribute storage to the vSAN datastore are doing in terms of servicing I/O from their local disks. You can then further drill down into a deep-dive of the vSAN disks layer on a per-host basis, seeing how vSAN splits I/O among SSDs and HDDs.

Figure 10.20 shows the vSAN disks view. As you can see, there is a lot of information displayed here. This view shows everything from latency, IOPS, bandwidth, congestion, and outstanding I/O, to a standard deviation on latency, which is how much latency has deviated from the average. Once again, you are looking for charts that are underlined in red; these highlight a metric that is outside the norm. That is where investigation into disk-related issues should begin. For latency, the threshold level is set to 30 ms. Bandwidth is measured in kilobytes per second (KB/sec). Congestion is a measurement of 1 to 255; 1 means there is no congestion, 255 means it is fully congested. The threshold value for congestion is set to 75.

Figure 10.20 - vSAN Observer: vSAN disks

vSAN shares compute resources with the rest of ESXi; that is, vSAN is consuming a slice of the same CPU and memory resources that the VMs running on a given host are also consuming. vSAN has been designed to consume no more than 10% of CPU resources. You can inspect the vSAN PCPU (physical CPU) and memory consumption in dedicated tabs in the observer, which may also be useful in detecting performance bottlenecks due to CPU or memory limits.

Figure 10.21 shows the memory consumption of not just the vSAN components, but also other consumers of memory of the various ESXi hosts participating in the vSAN cluster.

Figure 10.21 - vSAN Observer: Memory

Because vSAN is a type of storage that is managed in a very VM-centric way (per-virtual disk using VM storage policies), you can also look at performance on a per-VM or even per-virtual disk level in the vSAN observer. Start in the VMs tab and select the VM that you want to get more detail on.

Figure 10.22 shows the VM Home space from one VM. For each component that makes up the object, information such as latency, IOPS, read cache (RC) hit rate, and cache evictions is displayed. This is excellent information for determining whether any issues exist with any of the components that make up the storage object of a particular VM that may be exhibiting performance issues. Evictions are a reference to the fact that entries in the cache are being flushed out of cache to magnetic disks. High values here could suggest contention for cache resources, possibly implying that flash has not been sized correctly. Read cache hit rate is also an interesting graph, because anything below 100% implies that we have had a read cache miss and have had to go to magnetic disk to retrieve a data block, which will increase latency.

Figure 10.22 - vSAN Observer: VMs view

Last but not least, there are tabs for auxiliary information (on cluster balance, distribution of objects, significant cluster events, and so on). Every time you switch tabs, the graphs update automatically and reflect the latest information gathered by RVC in the background.

Most tabs contain information about how to read the information presented in the graphs. However, a lot of them require familiarity with storage performance. Having said that, the more you use the vSAN Observer tool, and the more familiar you get with how your environment should be running in steady state, the more useful this tool will become when you need to troubleshoot issues that occur outside the norm.

As you can imagine, we have only scratched the surface of what you can do with vSAN observer.

Summary

As you can clearly see, an extensive suite of tools is available for troubleshooting and monitoring a vSAN deployment. We heard from a lot of VMware customers that they no longer wished for their storage to be a “black box” where visibility into performance was next to impossible. With this extensive suite of CLI and UI tools, customers can drill down into the lowest levels of vSAN behavior.
