Chapter 9 - Designing a vSAN Cluster
This chapter walks you through all the steps required to enable you to design your perfect vSAN cluster. We will leverage all the insights provided throughout the various chapters to ensure that your vSAN cluster will meet your technical and business requirements. We do want to point out before we go through the various exercises that the VMware compatibility guide (VCG) contains a list of predefined configurations that are called vSAN ready nodes. We recommend using these vSAN ready nodes as a starting point for your design. It is less error prone than configuring your own server and selecting each hardware component manually. At the time of this writing, there are 13 different server vendors in the VMware compatibility guide (http://vmwa.re/vsanhcl).
Before running through the design exercises, we would like to discuss some constraints of vSAN. First, however, let’s look at the different ready node profiles.
Ready Node Profiles
With the release of vSAN 6.1 the ready node program was also overhauled. The old profile names “high–medium–low” were deprecated and a more extensive list of different profiles and configurations was introduced. Table 9.1 lists the ready node configurations that were part of the VCG at the time of writing and the different configuration items with each model. In the table below, the hybrid configurations are represented with HY models, and the all-flash configurations are represented with AF models.
Table 9.1 - vSAN Ready Node Models
|Model||CPU/Memory||Storage Capacity||Storage Performance||VMs per Node|
|HY-2||1 x 6 core / 32 GB||2 TB||4K IOPS||Up to 20|
|HY-4||2 x 8 core / 128 GB||4 TB||10K IOPS||Up to 30|
|HY-6||2 x 10 core / 256 GB||8 TB||20K IOPS||Up to 50|
|HY-8||2 x 12 core / 384 GB||12 TB||40K IOPS||Up to 100|
|AF-4||2 x 10 core / 128 GB||4 TB||25K IOPS||Up to 30|
|AF-6||2 x 12 core / 256 GB||8 TB||50K IOPS||Up to 60|
|AF-8||2 x 12 core / 384 GB||12 TB||80K IOPS||Up to 120|
In the previous version of the ready node program HY-2, HY-6, and HY-8 already existed but, as stated earlier, were called low, medium, and high. HY-4, AF-6, and AF-8 were introduced with vSAN 6.1; AF-4 was introduced with vSAN 6.2. HY-4 was introduced explicitly to bridge the gap between the low (HY-2) and medium (HY-6) models and, in our experience, this is what most customers who are looking to implement a hybrid vSAN solution start with.
When the total amount of required compute and storage capacity is known, you can easily determine which ready node (from your preferred vendor) is the closest match via the VMware compatibility guide. In Figure 9.1 we have selected ESXi 6.0 Update 1 as the version of vSAN that the ready node needs to support, Fujitsu as our preferred server brand, and 8 TB as the total amount of raw storage capacity per host. As shown, the HY-6 and AF-6 models are displayed as potential matches.
Figure 9.1 - Ready node configuration based on storage capacity selected
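The same lookup can be approximated offline. The following is a minimal sketch, assuming only the model names and storage capacities from Table 9.1; the function name `match_by_capacity` is hypothetical and is not part of any VMware tool.

```python
# Subset of Table 9.1: (model, raw storage capacity per node in TB).
READY_NODES = [
    ("HY-2", 2), ("HY-4", 4), ("HY-6", 8), ("HY-8", 12),
    ("AF-4", 4), ("AF-6", 8), ("AF-8", 12),
]


def match_by_capacity(capacity_tb):
    """Return the ready node models whose raw capacity equals capacity_tb."""
    return [model for model, tb in READY_NODES if tb == capacity_tb]


# 8 TB of raw capacity per host, as selected in Figure 9.1.
print(match_by_capacity(8))
```

The real compatibility guide also filters on vendor and vSAN version, which this sketch leaves out.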
As with any platform, vSAN has some constraints that need to be taken into consideration when designing an environment. Some constraints are straightforward, others are less obvious. The following provides a summary of the vSAN 6.2 constraints:
- Maximum of 64 hosts per cluster
- Maximum of 200 virtual machines (VMs) per host
- Maximum of 6400 VMs per cluster
- Maximum of 5 disk groups per host
- Maximum of 7 disks per disk group
- One cache flash device per disk group
- Maximum of 9,000 components per host
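As an illustration, these limits can be captured in a small design check. This is a sketch only: the constants come from the list above, while the function name `check_design` and its parameters are hypothetical, and the per-host VM count is assumed to be an even spread across hosts.

```python
import math

# vSAN 6.2 limits, taken from the list above.
MAX_HOSTS_PER_CLUSTER = 64
MAX_VMS_PER_HOST = 200
MAX_VMS_PER_CLUSTER = 6400
MAX_DISK_GROUPS_PER_HOST = 5
MAX_DISKS_PER_DISK_GROUP = 7
MAX_COMPONENTS_PER_HOST = 9000


def check_design(hosts, vms, disk_groups_per_host,
                 disks_per_disk_group, components_per_host):
    """Return a list of violated constraints (empty if the design fits)."""
    violations = []
    if hosts > MAX_HOSTS_PER_CLUSTER:
        violations.append("hosts per cluster")
    if vms > MAX_VMS_PER_CLUSTER:
        violations.append("VMs per cluster")
    # Assumption: VMs are spread evenly across the hosts.
    if math.ceil(vms / hosts) > MAX_VMS_PER_HOST:
        violations.append("VMs per host")
    if disk_groups_per_host > MAX_DISK_GROUPS_PER_HOST:
        violations.append("disk groups per host")
    if disks_per_disk_group > MAX_DISKS_PER_DISK_GROUP:
        violations.append("disks per disk group")
    if components_per_host > MAX_COMPONENTS_PER_HOST:
        violations.append("components per host")
    return violations


print(check_design(hosts=10, vms=1000, disk_groups_per_host=1,
                   disks_per_disk_group=4, components_per_host=3000))
```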
Some of you may wonder what has happened to the “maximum number of vSphere HA protected VMs on a vSAN datastore” item that was originally on this list. This is no longer a concern. vSphere HA had a limitation originally with regards to the number of VMs it could track per datastore in the “power-on list.” Changing the design of vSphere HA and accommodating for solutions like vSAN where a single datastore can hold thousands of VMs has removed this limitation. This file no longer has any form of limitation and as such the maximum number of vSphere HA protected VMs per datastore is equal to the maximum number of VMs per datastore.
As said, the majority of these are straightforward. There are two worth explaining further:
- Number of hosts in a cluster
- The maximum number of components
Although 64 hosts per cluster is a hard limit, we have yet to encounter a customer who experiences this as an actual limit. In all conversations the authors have had with customers, the cluster boundary is treated as the fault domain boundary, and this is also our general recommendation. Most customers end up with cluster sizes between 3 and 24 hosts, with 8-host and 12-host clusters being most common. What is worth mentioning is that in order to scale beyond 32 hosts, up to 64 hosts, an advanced setting needs to be set on each of the ESXi hosts in the cluster. This option is disabled by default because, in order to scale up to 64 hosts, vSAN requires an additional ~200 MB of memory per host. Note that a reboot is required for this setting to take effect. For more details, refer to VMware Knowledge Base article 2110081 (http://kb.vmware.com/kb/2110081). The commands are run as follows on each of the hosts in your cluster:
esxcli system settings advanced list -o /CMMDS/goto11
esxcli system settings advanced set -o /Net/TcpipHeapMax -i 1024
esxcli system settings advanced set -o /CMMDS/clientLimit -i 65
The second constraint that is worth discussing is the maximum number of components per host. The maximum number of components in the initial release of vSAN was 3,000 per host; this has been increased to 9,000 per host with vSAN 6.0, and remains the same for vSAN 6.1 and 6.2. Various types of objects may be found on the vSAN datastore that contain one or more components:
- VM namespace
- VM swap
- VM disk
- Virtual disk snapshot
- Snapshot memory
As you are no doubt aware at this stage in the book, each VM has a namespace, a swap file, and typically a disk. It is important to understand that the number of failures to tolerate (FTT) plays a critical role as well when it comes to the number of components. The higher you configure FTT, the more components certain objects will have. For example, when FTT is set to 1 and the failure tolerance method is set to RAID-1, your disk object will have two mirrors (in other words, two components) and typically a witness, resulting in three components in total. The same applies to the stripe width: if it is increased to a value larger than 1, the number of components also goes up. If a virtual machine disk (VMDK) object is striped across two disks, each mirror will consist of two components. For RAID-5, an object will have four components and for RAID-6, an object will have six components.
On top of that, the maximum size of a component is 255 GB, meaning that if you have a 500 GB virtual disk object, the object is configured with one 255 GB component and one 245 GB component. This is important to realize when it comes to scaling and sizing.
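To make the effect on the component count concrete, here is a rough sketch for RAID-1 objects. It assumes that each mirror is split into whichever is larger, the stripe width or the number of 255 GB chunks needed, and it counts data components only (witnesses come on top of that). The function name is hypothetical and the model is a simplification of what vSAN actually does.

```python
import math

MAX_COMPONENT_GB = 255  # maximum component size mentioned above


def raid1_data_components(size_gb, ftt=1, stripe_width=1):
    """Estimate the data components of a RAID-1 object (witnesses excluded)."""
    # Components per mirror: driven by stripe width or by the 255 GB limit.
    per_mirror = max(stripe_width, math.ceil(size_gb / MAX_COMPONENT_GB))
    # RAID-1 keeps FTT + 1 mirrors of the object.
    return (ftt + 1) * per_mirror


print(raid1_data_components(50))                   # small disk, FTT=1
print(raid1_data_components(500))                  # 500 GB disk, FTT=1
print(raid1_data_components(100, stripe_width=2))  # striped across two disks
```

Multiplying an estimate like this by the number of objects per VM and the number of VMs per host gives a first indication of how close a design gets to the per-host component limit.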
That said, since the vSAN 6.0 release increased the limit from 3,000 to 9,000, we have not encountered any customers hitting this constraint or even perceiving it to be a constraint.
Cache to Capacity Ratio
Since the very first version of vSAN, VMware has recommended sizing the cache tier at 10% of the anticipated consumed capacity, before taking NumberOfFailuresToTolerate into account. What does this mean?
Let’s run through a short scenario as that will make it clear instantly. Let’s assume we have the following in our environment:
- 100 VMs
- 50 GB per VM
- FTT = 1
- Failure tolerance method = RAID-1
The math is simple on this one. The required space for the capacity tier and the cache tier will be as follows (not taking any overhead and slack space into account for now):
- Capacity tier: 100 VMs * 50 GB * 2 replicas = 10,000 GB
- Cache tier: 10% of (100 VMs * 50 GB) = 10% of 5,000 GB = 500 GB
This is the total requirement for the complete cluster by the way. In the case of a 4-node cluster this means that the required amount of cache capacity per host is only 125 GB while the capacity tier would require 2,500 GB.
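The calculation above can be written down as a short sketch. The function name and parameter names are hypothetical; the 10% ratio and the RAID-1 replica count (FTT + 1) come from the text, and overhead and slack space are ignored, as in the example.

```python
def cache_and_capacity(vms, gb_per_vm, ftt, hosts, cache_percent=10):
    """Return cluster-wide and per-host cache and capacity needs in GB."""
    consumed_gb = vms * gb_per_vm
    # RAID-1 stores FTT + 1 full replicas in the capacity tier.
    capacity_gb = consumed_gb * (ftt + 1)
    # Cache is sized on consumed capacity before FTT is applied.
    cache_gb = consumed_gb * cache_percent / 100
    return {
        "capacity_gb": capacity_gb,
        "cache_gb": cache_gb,
        "capacity_gb_per_host": capacity_gb / hosts,
        "cache_gb_per_host": cache_gb / hosts,
    }


print(cache_and_capacity(vms=100, gb_per_vm=50, ftt=1, hosts=4))
```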
Designing for Performance
One critical aspect when designing a vSAN infrastructure is, of course, the performance aspect. During the various scenarios described in the examples that follow, our focus is on capacity sizing and partly performance. Performance is mainly a consideration in hybrid configurations. We are, however, slowly seeing a change in terms of all-flash adoption. The prices of flash have come down dramatically (and will continue to do so). With the introduction of RAID-5/6 and deduplication and compression in vSAN 6.2 we expect that, from a total cost of ownership perspective, all-flash configurations will make more sense for the majority of you.
As you have learned throughout the book by this point, vSAN leans heavily on flash devices to provide the required performance capabilities. Flash is leveraged both as a read cache and as a write buffer for hybrid configurations, and just as a write buffer for all-flash configurations. Therefore, incorrectly sizing your flash capacity can have a great impact on the performance of your workload. As an example, the difference between a 30 GB read cache and a 150 GB read cache for 30 VMs is huge. Having ~1 GB or 5 GB per VM available does make a difference in terms of reducing the need to resort to magnetic disk in the case of a hybrid configuration. The same applies to the write buffer: whether you have 600 GB at your disposal to collect writes and destage them when needed, or a mere 100 GB, will have an impact on destaging frequency, which in turn can impact the capacity tier from an endurance point of view. Not only will the size of the flash device make a difference to your VMs, so will the type of flash used!
At the end of the day, all the best practices mentioned are recommendations, and these will apply to the majority of environments; however, your environment may differ. You may have more demanding applications. What truly matters is the aggregate active (hot) data set of the applications executing in the cluster. In practice, you will have to estimate that or leverage the vSAN assessment tool, which is available for free and can be requested online by filling out a simple form (http://vmwa.re/vsanass). The result of this assessment will tell you what the active working set is in your environment and, for instance, which VMs are recommended for vSAN. It is possible to run different analyses, for instance based on all-flash versus hybrid vSAN, as shown in Figure 9.2.
Figure 9.2 - vSAN assessment tooling
Chapter 2, “vSAN Prerequisites and Requirements for Deployment,” listed the categories of flash devices that VMware uses to provide an idea around potential performance that can be achieved when using such a device.
The list of the designated flash device classes specified within the VMware compatibility guide (VCG) is as follows:
- Class A: 2,500–5,000 writes per second (no longer on the VCG)
- Class B: 5,000–10,000 writes per second
- Class C: 10,000–20,000 writes per second
- Class D: 20,000–30,000 writes per second
- Class E: 30,000–100,000 writes per second
- Class F: 100,000+ writes per second
Just to demonstrate the difference between the various devices, we will list the theoretical performance capabilities of some of the devices in the described classes:
- Intel P3700, 800 GB, 90k Random Write IOPS, 460k Random Read IOPS, using 4K blocks (http://www.intel.com/content/www/us/en/solid-state-drives/intel-ssd-dc-family-for-pcie.html)
- Intel S3610, 1.2 TB, 28k Random Write IOPS, 84k Random Read IOPS, using 4K blocks (http://www.intel.com/content/www/us/en/solid-state-drives/solid-state-drives-dc-s3610-series.html)
- Micron P320h, 700 GB, 145k Random Write IOPS, 415k Random Read IOPS, using 4K blocks (https://www.micron.com/products/solid-state-storage/product-lines/p320h#/)
- Micron M500DC, 800 GB, 24k Random Write IOPS, 65k Random Read IOPS, using 4K blocks (https://www.micron.com/products/solid-state-storage/product-lines/m500dc)
Now if you have 400 VMs on four hosts, using Intel P3700 NVMe devices versus regular Intel S3610 devices could make a substantial difference. Just imagine you have 2,000 VMs on 16 hosts. Of course, the price tags of these two devices also vary significantly, but performance is a consideration that needs to be taken into account. Whether the device is used for caching or for capacity will also make a difference. Generally speaking, high-performance, write-optimized drives like the Intel P3700 and the Micron P420m are used for the caching tier, while devices like the Intel S3610 and the Micron M500DC are used for the capacity tier.
Impact of the Disk Controller
A question that often arises is what the impact of the disk controller queue depth is on performance of your vSAN environment. Considering the different layers of queuing involved, it probably makes the most sense to show the picture from VM down to the device, as illustrated in Figure 9.3.
Figure 9.3 shows that there are six different layers at which some form of queuing is done, although in reality there are even more buffers and queues (but we tried to keep it reasonably simple and depict the layers that potentially could be a bottleneck). Within the guest, the vSCSI adapter has a queue. Then the next layer is vSAN, which, of course, has its own queue and manages the I/O. Next the I/O flows through the multi-pathing layer to the various devices on the host. On the next level, a disk controller has a queue; potentially (depending on the controller used), each disk controller port has a queue. Last but not least, of course, each device (i.e., disk) will have a queue.
If you look closely at Figure 9.3, you see that I/O of many VMs will all flow through the same disk controller and that this I/O will go to or come from one or multiple devices (usually multiple devices). This also implies that the first real potential “choking point” is the queue depth of the disk controller.
Figure 9.3 - Different queuing layers
Assume you have four SATA disks, each of which has a queue depth of 32. Combined, this means that you can handle 128 I/Os in parallel. Now what if your disk controller can handle only 64? This will result in 64 I/Os being held back by the VMkernel/vSAN. As you can see, it would be beneficial in this scenario to ensure that your disk controller queue can hold the same number of I/Os (or more) as your device queues can hold, allowing vSAN to shape the queue any way it prefers without being constrained by the disk controller itself.
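The arithmetic behind this example is simple enough to sketch. The function name is hypothetical, and the model only compares the sum of the device queues with the controller queue; it ignores all other queuing layers shown in Figure 9.3.

```python
def held_back_ios(device_count, device_queue_depth, controller_queue_depth):
    """I/Os the devices could accept in parallel but the controller cannot."""
    device_total = device_count * device_queue_depth
    return max(0, device_total - controller_queue_depth)


# Four SATA disks (queue depth 32 each) behind a controller with depth 64.
print(held_back_ios(4, 32, 64))
# The same disks behind a controller with a queue depth of 895.
print(held_back_ios(4, 32, 895))
```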
When it comes to disk controllers, a huge difference exists in maximum queue depth between vendors, and even between models of the same vendor. Just for educational purposes, Table 9.2 lists a few disk controllers and their queue depth to show what the impact can be when making an “uneducated” decision.
Table 9.2 - Disk Controller Queue Depth
|Manufacturer||Disk Controller||Queue Depth|
|Dell||PERC H730 Adapter||895|
|HP||Smart Array P440ar||1,011|
|Intel||Integrated RAID RS3PC||600|
For vSAN, it is recommended to ensure that the disk controller has a queue depth of at least 256. Do note that all controllers on the VCG have a queue depth higher than 256, as that is one of the criteria before they are accepted to the list. Now the disk controller is just one part of the equation, because there is also the device queue. We have looked at various controllers and devices, and here are the typical standards you will find:
Max RAID Device Queue Depth (default=128)
Max SATA Device Queue Depth (default=32)
Max SAS Device Queue Depth (default=254)
As you can see, the controller in this case has three different queue depths depending on the type of device used. When a RAID configuration is created, the queue depth will be 128. When a SAS drive is directly attached, often referred to as pass-through, the queue depth will be 254. The one that stands out the most is the queue depth of the SATA device; this is by default only 32, and you can imagine this can once again become a choking point. This is one of the reasons there is a very limited number of SATA drives on the VMware compatibility guide for vSAN. Up to vSphere 5.5 Update 3 there are still some devices listed, but these have been removed with the release of vSAN 6.0. Fortunately, the shallow queue depth of SATA (and the lack of SATA drives on the VCG) can easily be overcome by using NL-SAS (nearline serial attached SCSI) drives instead, which have a much deeper queue.
You can validate the queue depth of your controller using esxcfg-info -s | grep "==+SCSI Interface" -A 18, which displays a lot of information related to your SCSI interfaces, including the queue depth, as shown in the following output of this command (truncated for readability).
==+SCSI Interface :
Note that even the firmware and driver used can have an impact on the queue depth of your controller and devices. We highly recommend using the driver listed on the vSAN Compatibility Guide (http://vmwa.re/vsanhcl) shown on the details page of a specific disk controller, as shown in Figure 9.4. In some cases, it has been witnessed that the queue depth increased from 25 to 600 after a driver update. As you can imagine, this can greatly impact performance.
Figure 9.4 - Device driver details
The question now that usually arises is this: What about NL-SAS versus SATA drives? Because NL-SAS drives are essentially SATA drives with a SAS connector, what are the benefits? NL-SAS drives come with the following benefits:
- Dual ports allowing redundant paths
- Ability to connect a device to multiple computers
- Full SCSI command set
- Faster interface compared to SATA, up to 20%, no STP (Serial ATA Tunneling Protocol) overhead
- Deeper command queue (depth)
From a cost perspective, the difference between NL-SAS and SATA for most vendors is negligible. For a 4 TB drive, the cost difference on different Web sites at the time of this writing was on average $30. Considering the benefits, it is highly recommended to use NL-SAS over SATA for vSAN.
After reading this section, it should be clear that the disk controller is a critical component of any vSAN design because it can have an impact on how vSAN performs (and not just the disk controller, but even the firmware and device driver used). When deploying vSAN, it is highly recommended to ensure the driver installed is the version recommended in the vSAN compatibility guide, or upgrade to the latest version as needed. The easiest way to do this is using the vSAN health check, which we will be covering in Chapter 10. The vSAN health check will check the supportability of the disk controller and driver.
vSAN Performance Capabilities
It is difficult to predict what your performance will be because every workload and every combination of hardware will provide different results. After the initial vSAN launch, VMware announced the results of multiple performance tests (http://blogs.vmware.com/vsphere/2014/03/supercharge-virtual-san-cluster-2-million-iops.html). The results were impressive, to say the least, but were only the beginning. With the 6.1 release, performance of hybrid had doubled and so had the scale, allowing for 8 million IOPS per cluster. The introduction of all-flash, however, completely changed the game. This allowed vSAN to reach 45k IOPS per disk group (and remember, you can have 5 disk groups per host), and it also introduced sub-millisecond latency. (Just for completeness' sake, theoretically it would be possible to design a vSAN cluster that could deliver over 16 million IOPS with sub-millisecond latency using an all-flash configuration.)
Do note that these performance numbers should not be used as a guarantee for what you can achieve in your environment. These are theoretical tests that are not necessarily (and most likely not) representative of the I/O patterns you will see in your own environment (and so results will vary). Nevertheless, it does prove that vSAN is capable of delivering a high-performance environment. At the time of writing the latest performance document available is for vSAN 6.0, which can be found here: http://www.vmware.com/files/pdf/products/vsan/VMware-Virtual-San6-Scalability-Performance-Paper.pdf. We highly recommend however to search for the latest version as we are certain that there will be an updated version with the 6.2 release of vSAN.
One thing that stands out though when reading these types of papers is that all performance tests and reference architectures by VMware that are publicly available have been done with 10 GbE networking configurations. For our design scenarios, we will use 10 GbE as the golden standard because it is heavily recommended by VMware and increases throughput and lowers latency. The only configuration where this does not apply is ROBO (remote office/branch office). This 2-node vSAN configuration is typically deployed using 1 GbE since the number of VMs running is typically relatively low (up to 20 in total). Different configuration options for networking, including the use of Network I/O Control, are discussed in Chapter 3, “vSAN Installation and Configuration.”
Now that we have looked at the constraints and some of the performance aspects, let’s take a look at how we would design a vSAN cluster from a capacity point of view.
Design and Sizing Tools
Before we look at our first design, we would like to point you to various tools that can help with designing and sizing your vSAN infrastructure. Our preferred tool is the vSAN TCO and sizing calculator. It has been around for a while and can be found at https://vsantco.vmware.com/. The user interface is straightforward, as shown in Figure 9.5.
Figure 9.5 - Official VMware vSAN sizing calculator
This official VMware vSAN sizing calculator enables you to create a design based on certain parameters such as the following:
- Number of VMs
- Size of VM disks (VMDKs)
- Number of VMDKs
- Number of snapshots
- Read versus write I/O ratio
We highly recommend that you use this tool to validate your design scenarios and decisions to ensure optimal performance and availability for your workloads, with http://vsantco.vmware.com being the preferred solution as the official VMware-supported sizing calculator.
Scenario 1: Server Virtualization—Hybrid
When designing a vSAN environment, it is of great importance to understand the requirements of your VMs. The various examples here show what the impact can be of certain decisions you will need to make during this journey. As with any design, it starts with gathering requirements. For this scenario, we work with the following parameters, which are based on averages provided by the (fictional) customer:
- 1.5 vCPU per VM on average
- 5 GB average VM memory size
- 70 GB average VM disk size
- 70% anticipated VM disk consumption
- Failure tolerance method: Performance (RAID-1)
- Number of failures to tolerate (FTT): 1
In this environment, which is a greenfield deployment for a service provider, it is expected that each cluster will be capable of running 1,000 virtual machines. Although a vSAN cluster is capable of running more than 1,000 VMs, we want to make sure that we keep the failure domain relatively small and also provide the ability to scale out the cluster when required.
This means that our vSAN infrastructure should be able to provide the following:
- 1000 × 1.5 vCPU = 1500 vCPUs
- 1000 × 5 GB = 5000 GB of memory
- 1000 × 70% of 70 GB = 49,000 GB net disk space
We will take a vCPU-to-core ratio of 8:1 into consideration. Considering we will need to be able to run 1500 vCPUs, we will divide that by 8, resulting in the need for 188 cores.
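The compute and net storage requirements derived above can be reproduced with a few lines. This is a sketch using the scenario's averages; the function and parameter names are hypothetical.

```python
import math


def compute_requirements(vms, vcpu_per_vm, mem_gb_per_vm, disk_gb_per_vm,
                         consumed_percent, vcpus_per_core):
    """Return aggregate vCPU, memory, net disk, and core requirements."""
    vcpus = vms * vcpu_per_vm
    return {
        "vcpus": vcpus,
        "memory_gb": vms * mem_gb_per_vm,
        # Only the anticipated consumed part of each disk is counted.
        "net_disk_gb": vms * disk_gb_per_vm * consumed_percent / 100,
        # Round up: a fraction of a core still requires a whole core.
        "cores": math.ceil(vcpus / vcpus_per_core),
    }


print(compute_requirements(vms=1000, vcpu_per_vm=1.5, mem_gb_per_vm=5,
                           disk_gb_per_vm=70, consumed_percent=70,
                           vcpus_per_core=8))
```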
Now let’s look at the storage requirements a bit more in depth. Before we can calculate the requirements, we need to know what level of resiliency will be taken into account for these VMs. For our calculations, we will take a failure tolerance method of RAID-1 (mirroring) and a number of failures to tolerate setting of 1 into account. We also add an additional 30% disk space to cater for metadata and the occasional snapshot. If more disk space is required for snapshots in your environment, do not forget to factor this in when you run through this exercise.
The formula we will use taking the preceding parameters into account looks like the following:
(((Number of VMs * Avg VM size) + (Number of VMs * Avg mem size)) * (FTT + 1)) + 30% slack space
We have included the average memory size because each VM will create a swap file on disk that is equal to the size of the VM memory configuration. Using our industry standard averages mentioned previously, this results in the following:
((1,000 * 49) + (1,000 * 5)) * 2 = (49,000 + 5,000) * 2 = 108,000 GB + 30% = 140,400 GB
Divide the outcome by 1,024 and round it up; a total combined storage capacity of 138 TB is the outcome. Now that we know we need 138 TB of disk capacity, 5000 GB of memory and 188 cores, let’s explore the hardware configuration.
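For readers who want to rerun this with their own numbers, the formula translates into the following sketch. The function name is hypothetical; the 30% slack space and the (FTT + 1) multiplier for RAID-1 come from the text.

```python
import math


def raw_capacity_gb(vms, consumed_gb_per_vm, mem_gb_per_vm, ftt,
                    slack_percent=30):
    """Raw capacity in GB: (disk + swap) * (FTT + 1), plus slack space."""
    base_gb = (vms * consumed_gb_per_vm + vms * mem_gb_per_vm) * (ftt + 1)
    return base_gb * (100 + slack_percent) / 100


total_gb = raw_capacity_gb(vms=1000, consumed_gb_per_vm=49,
                           mem_gb_per_vm=5, ftt=1)
total_tb = math.ceil(total_gb / 1024)  # divide by 1,024 and round up
print(total_gb, total_tb)
```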
Determining Your Host Configuration
We will start by exploring the most common scenario, a 2U host. For this scenario, we have decided to select the Dell R730XD, which is depicted in Figure 9.6. This server has been optimized for storage capacity and holds up to 28 devices in a 2U system, including 4 NVMe devices. The Dell R730XD is a dual-socket server and can hold up to 1.5 TB of memory. Each socket can be configured with anything ranging from a 4-core to an 18-core CPU. You can find more details on the Dell R730XD at http://www.dell.com/us/business/p/poweredge-r730xd/pd.
Figure 9.6 - Dell R730XD server
Our requirements for the environment are as follows:
- Three hosts minimum (according to vSAN)
- 138 TB of raw disk capacity
- 188 CPU cores
- 5000 GB of memory
- Minimal overcommitment in failure scenarios (N+1 for high availability [HA])
As stated, the Dell R730XD can hold at most 18 cores per socket, resulting in a total of 36 cores per host in a dual-socket configuration. This means that from a CPU perspective, we will need roughly six hosts (188 cores / 36 cores per host). However, because we have a requirement to avoid overcommitment in a failure scenario, we need a minimum of seven hosts. From a memory perspective, each host today (using 32 GB DIMMs) can be provisioned with 768 GB of memory. Considering we require 5,000 GB of memory, we would also need seven hosts when it comes to memory. Because we want to have the ability to scale up as needed, have some headroom, and select an optimal configuration from a cost perspective, we have decided on 512 GB per host, leading to a minimum of 10 hosts per cluster. This also seems to be the most cost-effective configuration from a CPU point of view, as cheaper 10-core or 12-core CPUs can be used instead of top-of-the-line 18-core CPUs.
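The host counts in this paragraph follow from a simple round-up division, sketched below. The helper name `hosts_needed` is hypothetical, and the optional spare host models the N+1 requirement.

```python
import math


def hosts_needed(required, per_host, spare_hosts=0):
    """Hosts required to deliver a resource, plus optional spare hosts."""
    return math.ceil(required / per_host) + spare_hosts


print(hosts_needed(188, 36))                 # CPU, 2 x 18 cores per host
print(hosts_needed(188, 36, spare_hosts=1))  # CPU with N+1
print(hosts_needed(5000, 768))               # memory, 768 GB per host
print(hosts_needed(5000, 512))               # memory, 512 GB per host
```

Running the same helper with 10-core or 12-core CPUs (20 or 24 cores per host) shows that ten hosts also cover the 188-core requirement.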
Of course, price is always a consideration, and we urge you to compare prices based on CPU, memory, and disk configurations. We will not discuss pricing in-depth as this changes so fast, and by the time we are done with this chapter they will have changed again. In this scenario, we have come to the conclusion that 10 hosts was optimal from a CPU and memory point of view, and we will cater for the disk design based on this outcome.
Storage sizing is a bit more delicate. Let’s look at our options here. We know we need 138 TB of storage, and we know that we need ten hosts from a compute point of view. Considering we have the option to go with either 3.5-inch drives or 2.5-inch drives, we will have a maximum of 240 × 2.5-inch drive slots or 120 × 3.5-inch drive slots total combined for our 10 hosts at our disposal. As each group of disks (seven disks at most in a disk group) will require one flash device, this should be taken into account when deciding the type of disk.
One critical factor to consider is the number of IOPS provided by both magnetic disks and flash devices. Typical 3.5-inch 7200 RPM NL-SAS drives can deliver roughly 80 IOPS, whereas a 2.5-inch 10K RPM SAS drive can deliver 150 IOPS. Do note that drives with more platters and heads may deliver slightly higher IOPS; in some cases NL-SAS drives have been observed to deliver up to 225 IOPS. (IOPS numbers are taken from http://www.techrepublic.com/blog/the-enterprise-cloud/calculate-iops-in-a-storage-array/.) From a capacity perspective, NL-SAS drives range from 1 TB up to 6 TB, whereas SAS drives are limited today to 1.2 TB (with vSAN at least). We know we require 138 TB, so let’s do some quick math to show the potential impact a decision like this can have. We will use the most commonly used disk types for both SAS and NL-SAS to demonstrate the potential impact of going with one over the other:
- 138 TB/4 TB = ~35 NL-SAS magnetic disks = 2,800 IOPS from NL-SAS-based magnetic disks.
- 138 TB/1.2 TB = ~115 SAS magnetic disks = 17,250 IOPS from SAS-based magnetic disks.
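The comparison can be reproduced with the sketch below. The function name is hypothetical; the per-disk IOPS figures (80 for NL-SAS, 150 for SAS) are the rough averages quoted above, not measured values.

```python
import math


def disks_and_iops(required_tb, disk_tb, iops_per_disk):
    """Return (number of disks, aggregate magnetic disk IOPS)."""
    # round() guards against floating-point noise before rounding up.
    disks = math.ceil(round(required_tb / disk_tb, 6))
    return disks, disks * iops_per_disk


print(disks_and_iops(138, 4, 80))     # 4 TB NL-SAS drives
print(disks_and_iops(138, 1.2, 150))  # 1.2 TB SAS drives
```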
As you can see, a huge difference exists between the two examples provided. Although vSAN has been designed to leverage your flash device as the primary provider of performance, it is an important design consideration because these IOPS are used when data needs to be destaged to magnetic disks or when a read cache miss occurs and the data block has to be retrieved from magnetic disk. In this scenario though, we have decided to use NL-SAS drives as it is more cost effective and will compensate for the loss in performance with a larger flash device.
Let’s do the math. In total, as described earlier, we need 35 × 4 TB NL-SAS drives across 10 hosts. This results in a total of 3.5 drives per host, which we will round up to 4.
To ensure that our vSAN cluster provides an optimal user experience, we will use the rule of thumb of 10% flash of anticipated consumed VM disk capacity. In our scenario, we will have a maximum of 1000 VMs. These VMs have a virtual disk capacity of 70 GB, of which it is anticipated that 70% will be actually consumed. This results in the following recommendation from a flash capacity point of view:
10% of (1000 VMs * (70% anticipated consumed of 70 GB)) = 4900 GB
Considering we will require 10 hosts, this means it is recommended to have 4900 GB/10 = ~490 GB of flash capacity in each host. Note that in our configuration we only have one disk group per host. If we had two disk groups, we would need to take this into consideration, as each disk group also needs its own flash device. To ensure we can compensate for the lower number of IOPS the capacity tier can deliver, we have decided to use an 800 GB SSD, which will provide us with an additional 310 GB of cache capacity.
Using the VMware compatibility guide, we have determined that, for our Dell configuration, we could leverage the following flash devices when taking into account a performance requirement of a minimum of 20,000 writes per second (Class D and higher qualify for this). The options we have at our disposal currently are as follows:
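The flash sizing rule of thumb used here is sketched below. The function name is hypothetical; the 10% ratio, the 70% anticipated consumption, and the 800 GB device size are taken from this scenario.

```python
def flash_per_host_gb(vms, disk_gb_per_vm, consumed_percent, hosts,
                      flash_percent=10):
    """Recommended flash per host: a percentage of consumed VM disk capacity."""
    consumed_gb = vms * disk_gb_per_vm * consumed_percent / 100
    total_flash_gb = consumed_gb * flash_percent / 100
    return total_flash_gb / hosts


recommended = flash_per_host_gb(vms=1000, disk_gb_per_vm=70,
                                consumed_percent=70, hosts=10)
selected_ssd_gb = 800  # device size chosen in this scenario
print(recommended, selected_ssd_gb - recommended)
```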
- 800 GB SSD SAS Mix Use 12 Gbps
- 800 GB SSD SAS Read Intensive 12 Gbps
- 800 GB SSD SATA Mix Use 6 Gbps
- 800 GB SSD SATA Read intensive 6 Gbps
Note that there is a big price difference in some of these cases, and in our scenario the decision was made to go with the 800 GB SSD SATA Mix Use 6 Gbps flash devices.
The final outstanding item is the disk controller. VMware recommends using a pass-through controller, and Dell offers two variants for the R730XD: the H330 and the H730. The H330 is a standard pass-through controller. The H730 offers advanced functionality such as caching, self-encrypting drives, and other functionality. At the time of writing only the H730 controller is supported with vSAN and as such that controller is selected.
The final configuration is ten Dell R730XD (3.5-inch drive slot) hosts consisting of the following:
- Dual-socket—12-core E5-2670
- 512 GB memory
- Disk controller: Dell H730
- 4 × 4 TB NL-SAS 7200 RPM
- 1 × 800 GB SSD SATA (Mix Use)
Scenario 2—Server Virtualization—All-flash
In this scenario, we take a slightly different approach than we did in scenario 1. We will use a different host hardware form factor and different sizing and scaling inputs. We will also go for an all-flash configuration, as there are some interesting design considerations around that as well. For this scenario, we will work with the following parameters:
- 2 vCPU per VM on average
- 8 GB average VM memory size
- 20% memory over commitment
- 150 GB average VM disk size
- 100% anticipated consumed VM disk capacity
- Fault tolerance method: Capacity (RAID-5/6)
- Number of failures to tolerate (FTT) = 2
In the environment, we currently have 200 VMs and need to be able to grow to 300 VMs in the upcoming 12 months. Therefore, our vSAN infrastructure should be able to provide the following:
- 300 × 2 vCPU = 600 vCPUs
- 300 × 8 GB = 2,400 GB of memory – 20% = 1920 GB
- 300 × 150 GB = 45,000 GB disk space
In our sizing exercise, we look at two different server platforms. Our VMs are not CPU intensive but rather memory intensive. We take a vCPU-to-core ratio of 8:1 into consideration. Considering that we will need to be able to run 600 vCPUs, we divide that by 8, resulting in the need for a minimum of 75 cores. From a memory perspective 1920 GB is required, and from a storage point of view we need 45 TB, but that is without taking FTT into account and without any potential savings from deduplication and compression. Before we go through the full design, let's take a look at storage first and the impact of certain decisions.
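The cluster-wide requirements for this scenario follow directly from the parameters listed earlier. The sketch below reproduces that arithmetic; the function name `aggregate_requirements` and the dictionary keys are hypothetical and chosen only for illustration.

```python
import math


def aggregate_requirements(vm_count, vcpus_per_vm, mem_gb_per_vm,
                           mem_overcommit, disk_gb_per_vm, vcpus_per_core):
    """Return the cluster-wide resource requirements as a dictionary."""
    vcpus = vm_count * vcpus_per_vm
    return {
        "vcpus": vcpus,
        # Round up: a fractional core still means one more physical core.
        "cores": math.ceil(vcpus / vcpus_per_core),
        # Memory over commitment reduces the physical memory needed.
        "memory_gb": vm_count * mem_gb_per_vm * (1 - mem_overcommit),
        # Raw VM disk space, before FTT overhead or data reduction.
        "disk_gb": vm_count * disk_gb_per_vm,
    }


req = aggregate_requirements(
    vm_count=300, vcpus_per_vm=2, mem_gb_per_vm=8,
    mem_overcommit=0.20, disk_gb_per_vm=150, vcpus_per_core=8,
)
for name, value in req.items():
    print(f"{name}: {value:.0f}")
```

For 300 VMs this yields 600 vCPUs, 75 cores, 1920 GB of memory, and 45,000 GB of disk space, matching the figures above.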
For storage the requirements are fairly straightforward. If we were going to implement RAID-1 (mirroring), this would mean that 3× the storage capacity would be required. In other words, each 10 GB of VM capacity requires 30 GB of disk capacity. This is because with RAID-1 and FTT=2 you need three full copies of the data in order to withstand two failures. However, as of vSAN 6.2 it is possible to set a failure tolerance method (FTM), where you have the option to select RAID-1 or RAID-5/6. The overhead for RAID-5/6 is significantly lower than in a RAID-1 configuration. The overhead is 1.5× for FTT=2 (RAID-6) and 1.33× for FTT=1 (RAID-5) instead of the 3× overhead that RAID-1 has with FTT=2. In this scenario that means the following:
45,000 GB with FTT=2 and FTM=RAID6 = 1.5 * 45,000 = 67,500 GB
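To compare the failure tolerance methods side by side, the multipliers can be captured in a small lookup. This is an illustrative sketch; `capacity_multiplier` is a hypothetical helper, and the multipliers are the ones given in the text (RAID-1 stores FTT + 1 full copies, RAID-5 is 1.33×, RAID-6 is 1.5×).

```python
def capacity_multiplier(ftm, ftt):
    """Return the raw-to-usable capacity multiplier for a policy."""
    if ftm == "RAID-1":
        # Mirroring keeps FTT + 1 full copies of the data.
        return ftt + 1
    if ftm == "RAID-5/6":
        if ftt == 1:
            return 4 / 3  # RAID-5: roughly 1.33x
        if ftt == 2:
            return 1.5    # RAID-6
    raise ValueError(f"unsupported combination: {ftm}, FTT={ftt}")


vm_capacity_gb = 45_000
raid6_gb = vm_capacity_gb * capacity_multiplier("RAID-5/6", 2)
raid1_gb = vm_capacity_gb * capacity_multiplier("RAID-1", 2)

print(f"RAID-6, FTT=2: {raid6_gb:,.0f} GB")
print(f"RAID-1, FTT=2: {raid1_gb:,.0f} GB")
```

For this scenario RAID-6 requires 67,500 GB, half of the 135,000 GB that RAID-1 would need for the same FTT=2 protection.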
There is a second option to optimize for capacity that can be used on top of RAID-5/6 on all-flash vSAN. This is deduplication and compression. Note that deduplication and compression are enabled or disabled at the same time and the domain for this is the disk group. Results of deduplication and compression will vary per workload type. For a database workload a 2:1 reduction in capacity is a realistic number, whereas virtual desktops are closer to 8:1. What impact does that have in this environment?
In this environment we will be running a mixed server workload ranging from print servers to database servers and anything in between. From a deduplication point of view, this means it will not be as efficient as it would be with 300 similar VMs. As such we take a 4:1 data reduction into account. There are some other considerations we need to take into account as well. For instance, we want to be able to rebuild data in the case of a full host failure, even during maintenance. Our requirements are:
- FTT=2 and FTM=RAID-6: 45,000 GB * 1.5 = 67,500 GB.
- With a 4:1 data reduction total required capacity is 16,875 GB.
- 10% cache to capacity ratio = 10% of 45,000 GB = 4500 GB.
- Minimum of six hosts to support RAID-6.
- Additional host to allow for full recovery and re-protection (self-healing) after a failure, which means seven hosts minimum.
- Additional host to allow for recovery and re-protection during maintenance, which means eight hosts minimum.
- Assuming that one host failure and one host in maintenance mode should not impact recovery, we divide 16,875 GB by 6, which means we need 2,812.50 GB per host, including in each of the two additional hosts.
- For caching, applying the same assumption, we need 4,500 GB / 6 per host, which is 750 GB.
- CPU requirement for the environment is 75 cores, divided by six hosts (assuming one host failure and maintenance mode not impacting performance) means we need 12.5 cores per host.
- Memory requirement is 1,920 GB, divided by six hosts is 320 GB of memory per host.
So let’s list the outcome to have it crystal clear. Each host should have the following resources:
- CPU: 12.5 cores per host
- Memory: 320 GB per host
- Capacity: 2,812.50 GB
- Cache: 750 GB
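The per-host figures in this list can be reproduced with the sketch below. It assumes, as the requirements do, that resources are divided over the six hosts that remain when one host has failed and another is in maintenance mode. The function and parameter names are hypothetical.

```python
def per_host_requirements(vm_capacity_gb, raid_multiplier, reduction_ratio,
                          cache_ratio, cores, memory_gb, sizing_hosts):
    """Divide cluster-wide requirements over the hosts used for sizing."""
    raw_gb = vm_capacity_gb * raid_multiplier   # after FTT/FTM overhead
    reduced_gb = raw_gb / reduction_ratio       # after dedupe and compression
    # The cache is sized on VM capacity, since data reduction happens
    # only when data is destaged to the capacity tier.
    cache_gb = vm_capacity_gb * cache_ratio
    return {
        "capacity_gb": reduced_gb / sizing_hosts,
        "cache_gb": cache_gb / sizing_hosts,
        "cores": cores / sizing_hosts,
        "memory_gb": memory_gb / sizing_hosts,
    }


min_hosts_raid6 = 6                  # minimum for RAID-6
total_hosts = min_hosts_raid6 + 2    # one spare for failure, one for maintenance

host = per_host_requirements(
    vm_capacity_gb=45_000, raid_multiplier=1.5, reduction_ratio=4,
    cache_ratio=0.10, cores=75, memory_gb=1_920,
    sizing_hosts=min_hosts_raid6,
)
print(f"Hosts to purchase: {total_hosts}")
for name, value in host.items():
    print(f"{name}: {value:,.2f}")
```

Running this gives eight hosts in total, each needing 2,812.50 GB of capacity, 750 GB of cache, 12.5 cores, and 320 GB of memory.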
One thing that stands out immediately is the relatively high amount of cache compared to the amount required for the capacity tier. The reason for this is that deduplication and compression take place when destaging data from the caching tier to the capacity tier, so the caching tier does not benefit from the reduction. Also note that the caching tier is only used as a write buffer in an all-flash configuration, which means that endurance (wear and tear) is a factor to take into consideration. As such, we still recommend the 10% rule.
For this configuration the Supermicro 2U 4-Node TwinPro2 server was selected. This platform provides four nodes in a 2U chassis. We need eight nodes in total to meet the requirements around availability, recoverability and performance. The model is the SuperServer 2028TP-HC0TR which is shown in Figure 9.7.
Figure 9.7 - Supermicro SuperServer
We will need two units and have decided to use the following configuration which is a great balance between cost and performance:
- Dual six-core Intel E5-2620 v3 2.4 GHz
- 384 GB of memory per host
- SLC SATADOM for booting ESXi
- LSI 3008 SAS controller
- Caching tier: 1 x Intel S3710 – 800 GB
- Capacity tier: 4 x Micron M510DC – 960 GB
We want to leverage as much cache capacity as possible, and as such we have purposely included a larger device than vSAN can utilize at the time of writing. vSAN 6.2 and earlier releases have a limit of 600 GB of cache capacity that they can utilize. This also means that at the time of maintenance, or during a failure, it may seem that we are slightly undersized. However, as already mentioned, in an all-flash configuration the most critical aspect of the write cache is the endurance. Additional capacity beyond 600 GB will be used by the device for endurance and garbage collection, and as such will be very welcome for the caching tier in an all-flash configuration such as this. Another option would be to create a smaller disk group. However, it should be noted that a smaller disk group could lead to lower deduplication efficiency, and smaller write cache devices with no spare capacity may also have a shorter life span as a result of lower endurance.
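The effect of the 600 GB limit can be illustrated with a short calculation. This sketch assumes the limit applies per cache device and that each host has the single 800 GB cache device from the configuration above; `usable_cache_gb` is a hypothetical helper.

```python
CACHE_LIMIT_GB = 600  # usable cache per device in vSAN 6.2 and earlier


def usable_cache_gb(device_gb, hosts, limit_gb=CACHE_LIMIT_GB):
    """Cluster-wide cache vSAN can use, with one cache device per host."""
    return min(device_gb, limit_gb) * hosts


required_cache_gb = 4_500  # 10% of 45,000 GB
results = {}
for hosts in (8, 7, 6):
    usable = usable_cache_gb(device_gb=800, hosts=hosts)
    results[hosts] = usable
    status = "meets" if usable >= required_cache_gb else "is below"
    print(f"{hosts} hosts: {usable:,} GB usable, {status} the 10% target")
```

With all eight hosts available, 4,800 GB is usable and the 4,500 GB target is met. With seven or six hosts the usable cache drops to 4,200 GB and 3,600 GB, which is the slight undersizing during maintenance or a failure that the text refers to.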
As this chapter has demonstrated, it is really important to select your hardware carefully. Selection of a disk type, for instance, can impact your potential performance. Other components, such as disk controllers, can have an impact on the operational effort involved with vSAN.
Think about your configuration before you make a purchase decision. As mentioned earlier, VMware offers a great alternative in the vSAN ready node program. These servers are optimized and configured for vSAN, taking away much of the complexity that comes with hardware selection.