In the past I’ve done quite a bit of work around VMware’s vSphere Metro Storage Cluster (vMSC from here on, so I don’t have to write it out constantly — it's hard enough to say!), even before it had a name and was officially supported. It’s a great solution where high availability becomes the de facto replacement for what most organisations do for disaster recovery. One of the big limitations, however, is that it’s restricted to two sites, which means for best-practice protection you still need a third site for disaster recovery. Adding cold (or even warm) disaster recovery to a traditional vMSC just adds more complications, restrictions and overheads. I’m not going to spend too much time highlighting the problems with “other” solutions; instead I’ll spend some time showing how Hedvig handles vMSC, referencing the VMware paper.
I break down the main concepts of VMware vSphere Metro Storage Cluster in this 12-minute whiteboard video.
To get into things a bit deeper, I must first mention that there are some critical criteria from VMware which are relevant to the storage infrastructure and need to be met before VMware will even consider the solution workable. In reality there is a little flexibility on what technically works, but please stick to the VMware guidelines (page 4 of the VMware paper):
- Storage connectivity using Fibre Channel, iSCSI, NFS, and FCoE is supported.
- The maximum supported network latency between sites for the VMware ESXi management networks is 10ms round-trip time (RTT).
- vSphere vMotion, and vSphere Storage vMotion, supports a maximum of 150ms latency as of vSphere 6.0, but this is not intended for stretched clustering usage.
- The maximum supported latency for synchronous storage replication links is 10ms RTT. Refer to documentation from the storage vendor because the maximum tolerated latency is lower in most cases. The most commonly supported maximum RTT is 5ms.
- The ESXi vSphere vMotion network has a redundant network link minimum of 250Mbps.
The key point on vMSC, according to the VMware paper, is that “A vSphere Metro Storage Cluster requires what is in effect a single storage subsystem that spans both sites.” This is an easy one to address with Hedvig: we are a single cluster with a single storage construct that can span multiple locations (rack and/or data centre). Data is replicated according to the given policy (an example relevant to vMSC might be two copies in the local data centre and two copies in the remote data centre).
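As a minimal sketch of what such a policy could look like (the class, names and structure here are purely illustrative assumptions, not Hedvig's actual configuration interface):

```python
# Hypothetical sketch of a vMSC-style replication policy.
# The names and structure are illustrative only, not Hedvig's real API.
from dataclasses import dataclass


@dataclass
class ReplicationPolicy:
    name: str
    copies_per_site: dict  # site name -> number of synchronous copies

    @property
    def replication_factor(self) -> int:
        # Total copies held across the whole (multi-site) cluster.
        return sum(self.copies_per_site.values())


# Example relevant to vMSC: two copies local, two copies remote.
policy = ReplicationPolicy(
    name="vmsc-stretch",
    copies_per_site={"dc1": 2, "dc2": 2},
)
print(policy.replication_factor)  # 4
```

The point of the single-policy model is that the placement across sites is just another attribute of the policy, not a separate replication product bolted on.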
The next key point in the VMware paper is that “The storage subsystem for a vMSC must be able to be read from and written to in both locations simultaneously.” So how does Hedvig handle this? Turns out, very well.
First, let's wind back to understand how Hedvig provides storage for VMware (as usual, the NFS way is a little cooler). Hedvig runs a component called the Storage Proxy on each of the ESXi hosts. This acts as the proxy from virtual machines to the back-end storage, as well as providing a front-end cache for reads (and some writes, but more on that later). A datastore is presented up to the VMware cluster as normal and VMs are created as normal. Behind the scenes, the Hedvig cluster carves the datastore up into what we call containers (you can think of these as similar to a shard in NoSQL, or a block in a traditional RAID-based system). The Storage Proxy understands the layout of the cluster, so requests to a certain area of the datastore are directed to the correct area of the Hedvig cluster, minimising I/O latency. The datastore within VMware is configured on an internal network with the same IP address across all hosts. This means that all systems see the same datastore, but the control plane for it is distributed (add more nodes, get more cache and storage processing power). Now, the cool bit about NFS is that we don’t carve this out as a single monolithic datastore. What we actually do is create a child vDisk (the Hedvig object of storage allocation) for each VMDK within the main vDisk for the datastore. This means that a VMDK is a fully understood and controllable entity within the Hedvig storage cluster.
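To make the container idea concrete, here is a toy sketch of how a datastore could be carved into fixed-size containers and each request routed to the node owning that container. The container size, node names and modulo placement are all assumptions for illustration, not Hedvig's actual internals:

```python
# Toy model: carve a vDisk into fixed-size containers (shards) and
# route each I/O to the node that owns the container it falls in.
# Sizes, node names and the placement scheme are illustrative only.
CONTAINER_SIZE = 16 * 1024**3  # assume 16 GiB containers

nodes = ["dc1-node1", "dc1-node2", "dc2-node1", "dc2-node2"]


def container_for_offset(offset: int) -> int:
    """Map a byte offset within the vDisk to its container index."""
    return offset // CONTAINER_SIZE


def node_for_container(container: int) -> str:
    """Pick the node owning a container (simple modulo placement here)."""
    return nodes[container % len(nodes)]


# A read ~40 GiB into the datastore lands in container 2.
offset = 40 * 1024**3
c = container_for_offset(offset)
print(c, node_for_container(c))  # 2 dc2-node1
```

Because the Storage Proxy holds this layout map, it can send each request straight to the owning node rather than bouncing I/O around the cluster.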
So, back to vMSC: how does the above satisfy the requirement to “be able to read from and write to both locations simultaneously”? This is the magic of the Storage Proxy on the ESXi hosts. All reads and writes are proxied locally through the Storage Proxy, which is then intelligently able to read from and write to the most optimal part of the cluster for that given workload. A VM is always reading and writing locally, but from a storage-system perspective this is a full scale-out cluster spanning multiple locations. As a VM moves across the cluster, it still always accesses its data locally, regardless of where it is physically located. Move a VM from site 1 to site 2 and the data access layer is unchanged, as the data is still just accessed locally (even moving to site 3, 4, 5 or even 6!). The Storage Proxy still finds the most optimal access path, so we minimise unnecessary cross-site traffic and network tromboning at the storage layer.
Uniform versus nonuniform vMSC configurations
Hedvig is absolutely a uniform configuration. As data is always routed locally through the Storage Proxy, access is always uniform. We strongly recommend having an even number of copies in each data centre so that you aren’t waiting for remote writes to be confirmed, and so we can lose either site and still be fully functional without any performance degradation or rebuild waits. But given the VMware limit of a 10ms network, there isn’t a huge penalty if you choose to go with three copies, or push things to four or more data centres. I agree 10ms write latency is a bit much, but it is tolerable for most applications, and read latency is always going to be local anyway (and, as hinted before, we can offset some of our writes too).
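A toy model makes the placement advice above easier to see. The assumption here (suggested by the text, not a statement of Hedvig internals) is that a write acknowledges once a fixed number of synchronous copies land, so if the local site holds that many copies, the write never waits on the cross-site link:

```python
# Toy write-latency model for copy placement. The sync_acks behaviour
# is an assumption for illustration, not documented Hedvig behaviour.
LOCAL_WRITE_MS = 0.5       # assumed intra-site write latency
CROSS_SITE_RTT_MS = 10.0   # VMware's maximum supported RTT


def ack_latency_ms(local_copies: int, sync_acks: int = 2) -> float:
    """Estimate write-ack latency given copies held in the local site."""
    if local_copies >= sync_acks:
        # Enough local copies: the ack never crosses the WAN.
        return LOCAL_WRITE_MS
    # Short on local copies: a remote copy is on the ack path.
    return LOCAL_WRITE_MS + CROSS_SITE_RTT_MS


print(ack_latency_ms(2))  # 0.5  -> even 2+2 placement acks locally
print(ack_latency_ms(1))  # 10.5 -> odd placement can pay the RTT
```

This is why an even split per data centre keeps write latency local, while a three-copy policy can push some writes onto the 10ms link — tolerable, but worth designing around.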
Permanent device loss and all paths down scenarios
This is an interesting concept with Hedvig, as theoretically the chance of either happening is incredibly remote. As all storage access from a VM’s perspective is to a local datastore (via the Storage Proxy), there are no paths that can go down. The Storage Proxy is a clustered system with two VMs running in lockstep, so the chance of device loss is practically negligible. APD should never happen, and PDL would suggest that the host itself has failed and we would want to fail over from that ESXi host to another one. The location of the failover is irrelevant, as PDL does not suggest the cluster is gone, only that a local Storage Proxy has no storage to present (most likely because it has been disconnected from the Hedvig cluster for some reason). As the Hedvig cluster is multiple nodes, it is fundamental to the design that an entire Hedvig site cannot disappear in isolation from the applications accessing it from that same site. If an entire data centre goes offline, we leverage the VMware HA configuration to recover those VMs to the remaining data centres.
vSphere DRS
This is important not because of anything at the storage layer, but as part of your HA application design. Again, as the storage is always accessed locally, there are no DRS configuration considerations that will impact the Hedvig cluster or availability; we don't have a write preference as many other solutions do. The only reason to consider DRS rules is to keep clustered applications within their own sites. As discussed at great length elsewhere, vMSC leverages HA to recover from a site failure, which always results in a minute or two (or more) of outage while things are recovered. If you have clustered virtual machines (Exchange, SQL, web farms, etc.), you want to make sure site affinity is configured to prevent all your clustered components sitting in the same site and failing together during a site failure. This is just good practice and, as I say, independent of the Hedvig configuration.
vSphere Storage DRS
This is not really specific to vMSC, but there is a general best practice for Hedvig here. Unless you have multiple Hedvig clusters, Storage DRS is irrelevant for us. We use a single scale-out storage cluster at the back end, and load is evenly distributed across it, so using Storage DRS to balance I/O is a redundant operation. The only thing that may have some benefit is to vMotion a machine from one host to another if you have excessive I/O on a single Storage Proxy, but as all Storage Proxies present the same datastores, Storage DRS won’t understand this or know how to balance it across the cluster. So there is a little manual design needed to make sure high-I/O and latency-sensitive machines are not hosted on the same physical host.
Failure scenario #1: split brain (storage or data centre partitioning)
Once again, this is much more of an issue on the application side and needs careful consideration there. As Hedvig writes are local to the VM, we can happily survive a split-brain scenario: we have the confidence that a Storage Proxy will not attempt to write across the data centres to a remote location, potentially causing out-of-sync writes while the data centres are split. There is some consideration on the application side, but this is important to design for anyway in terms of distributed writes. Traditional SQL databases have a write master and generally use a witness or majority set (an odd number of nodes) to recover from a split brain. NoSQL-style databases have had to handle this in their design, so long as you have designed correctly (search for the CAP theorem for more details — a great read if you struggle to sleep).
Failure scenario #2: full storage failure and permanent device loss
As previously mentioned, the chances here are so remote as to be totally negligible for Hedvig, due to the scale-out design of the cluster. A full storage failure would generally imply that a full data centre has failed, or that something much more critical or substantial has gone wrong (such as a network outage). A local network outage could force the Hedvig Storage Proxy to start writing across to the other data centre, so while nothing on the application side has actually failed, some writes may see increased latency.
Failure scenario #3: 3+ site vMSC
While this isn’t part of the VMware design for vMSC, it is absolutely possible with Hedvig, and it can reduce the reserved capacity while increasing the redundancy. The main overhead of a 2-site vMSC is that, as a best practice, you need to reserve 50% of capacity in the compute tier so that you can fail an entire site and recover onto the second. Additionally, if you have any site maintenance or outages, you are immediately at risk of additional failures leading to outages. Finally, two locations are not sufficient for most data protection policies, so you still need a third copy replicated elsewhere, usually using a completely different policy and management tier from the vMSC protection policy.
With Hedvig it is a single policy to have a replication factor of 3 or more, and from the same system you can configure snapshots for an additional layer of data protection. Couple this with the Storage Proxy providing local reads and proxied writes, and you can happily scale across multiple sites. By having more than two sites you minimise the compute failover capacity required: at 3 sites this is 33%, at 4 it is 25%, and so on.
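The failover capacity figures above follow directly from spreading the load over N equally sized sites, since each site only has to absorb 1/N of the total when one fails:

```python
# Reserved compute capacity needed to survive one full site failure,
# assuming N equally sized sites sharing the load evenly.
def reserve_fraction(sites: int) -> float:
    return 1 / sites


for n in (2, 3, 4, 6):
    print(n, f"{reserve_fraction(n):.0%}")
# 2 50%
# 3 33%
# 4 25%
# 6 17%
```

So every extra site shrinks the idle headroom you have to carry, which is the opposite of the 2-site model where half your compute sits in reserve.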
Additionally, due to the way the replication factor is handled, this is actually a lot more optimal than a traditional synchronous array replication model. With a traditional synchronous array you need your normal data protection policy at both site 1 and site 2, effectively doubling your storage overheads, whereas Hedvig lets you share this protection overhead across multiple sites. Hedvig can also be scaled in asymmetrical capacities without any real considerations: you can have a bias towards a single site that is heavier than the other one or two without a complicated storage setup. This cannot be done with most synchronously replicated arrays, as they are generally a one-to-one replication (after RAID overheads at each site).
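Rough numbers illustrate the overhead argument. This sketch deliberately simplifies (it ignores RAID specifics, metadata and snapshots, and assumes mirrored local protection on the traditional array), so treat the figures as indicative only:

```python
# Rough raw-capacity comparison: traditional synchronous pair vs a
# single multi-site cluster with a shared replication factor.
# Simplified illustration only: ignores RAID geometry, metadata, etc.
def sync_pair_raw_tb(usable_tb: float, local_protection: int = 2) -> float:
    # Traditional pair: a fully protected copy at each of two sites
    # (local_protection=2 assumes mirrored/RAID-10-style protection).
    return usable_tb * local_protection * 2


def shared_rf_raw_tb(usable_tb: float, replication_factor: int = 3) -> float:
    # Single cluster: one policy, copies shared across all sites.
    return usable_tb * replication_factor


print(sync_pair_raw_tb(100))   # 400.0 TB raw for 100 TB usable
print(shared_rf_raw_tb(100))   # 300.0 TB raw for 100 TB usable
```

Even with an extra site in play, sharing the protection copies across sites can come out ahead of duplicating a fully protected array at each end of a synchronous pair.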
Once again, with Hedvig it is all about choice rather than compromise. We allow you to bend and break the traditional limitations and restrictions. 3-site HA sounds pretty awesome to me, and it removes some pretty costly DR licences (*cough* VMware SRM *cough* Oracle Data Guard *cough*). Don't forget you can also apply the same Hedvig stretched-cluster logic to other hypervisors that may not traditionally have supported a stretched cluster.
I love this solution, please feel free to contact me or the Hedvig team if you want a more detailed overview or a workshop to fit it to your specific requirements. Request a demo or learn more by clicking below.