Run baby run! High Availability for business critical applications in virtualized environments

Most of you are on a journey to a software-defined data center. Some of you have used virtualization to consolidate infrastructure and reduce capital expenses. Some of you may be virtualizing (or starting to think about virtualizing) business applications to take advantage of the agility and flexibility that virtualization brings. Naturally, if you have reached that part of the journey, one thing you may worry about a lot is system and application availability.

The good news is that VMware is no stranger to HA. VMware vSphere includes a feature named vSphere HA (formerly VMware HA) that protects VMs against hardware failures. Two or more ESXi hosts can form an HA cluster. vSphere HA provides the following benefits.

  1. Decent protection against hardware failures (ESXi host failures). When a host fails, the virtual machines on that host can be restarted on another ESXi host sharing the same datastore.
  2. Limited protection against guest OS failures. VMware Tools running in the guest operating system sends heartbeats to vSphere HA. If the heartbeats stop (e.g. the guest operating system is hung), vSphere HA can restart the VM on the same or on a different ESXi host.
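
To make the heartbeat idea in item 2 concrete, here is a minimal sketch of timeout-based failure detection. It is a toy model, not VMware's implementation: the class name, the `timeout_s` parameter and the 30-second value are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class GuestHeartbeatMonitor:
    """Toy model of heartbeat-based guest monitoring (hypothetical, not VMware code)."""

    timeout_s: float               # assumed: silence tolerated before declaring failure
    last_heartbeat_s: float = 0.0  # timestamp of the most recent heartbeat

    def record_heartbeat(self, now_s: float) -> None:
        """Called whenever the guest agent reports in."""
        self.last_heartbeat_s = now_s

    def is_failed(self, now_s: float) -> bool:
        """The guest is presumed hung once heartbeats have been silent too long."""
        return (now_s - self.last_heartbeat_s) > self.timeout_s


monitor = GuestHeartbeatMonitor(timeout_s=30.0)
monitor.record_heartbeat(now_s=100.0)
print(monitor.is_failed(now_s=120.0))  # False: only 20 s of silence
print(monitor.is_failed(now_s=140.0))  # True: 40 s of silence, restart the VM
```

Note what the monitor cannot see: as long as the guest agent keeps sending heartbeats, a failed application inside the guest goes undetected.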

Whether vSphere HA is sufficient depends on the service level agreement (SLA) between the IT department and the business unit. For most development/test workloads, vSphere HA is good enough, as services can be resumed in less than 10 minutes. The main bottleneck here is the time it takes to reboot the guest operating system.
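
To see how a restart of up to 10 minutes stacks up against an SLA, a back-of-the-envelope calculation helps. The sketch below assumes the SLA is expressed as a yearly availability percentage and that every outage costs the full 10 minutes; the function names and the availability levels shown are illustrative, not taken from any particular SLA.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def downtime_budget_minutes(availability_pct: float) -> float:
    """Minutes of downtime per year that a given availability target allows."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def restarts_within_budget(availability_pct: float,
                           minutes_per_restart: float = 10.0) -> int:
    """How many restart-style outages fit into the yearly downtime budget."""
    return int(downtime_budget_minutes(availability_pct) // minutes_per_restart)


for pct in (99.9, 99.99, 99.999):
    print(f"{pct}%: {downtime_budget_minutes(pct):.1f} min/year, "
          f"{restarts_within_budget(pct)} ten-minute restarts")
```

Under these assumptions, a 99.9% target leaves room for about 52 such restarts a year and 99.99% for about 5, while a 99.999% target (roughly 5 minutes a year) cannot absorb even one.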

Another solution is vSphere Fault Tolerance (vSphere FT). It creates and maintains an additional copy of the VM being protected. It provides continuous availability by ensuring that the states of the primary and secondary VMs are identical at any point in the instruction execution of the virtual machine. However, vSphere FT is not for everyone. Although its protection against hardware failures is impeccable, its protection against OS and application misbehavior is extremely limited. The cost of operating two virtual machines (and the related storage), along with other limitations such as the lack of support for vStorage APIs, makes vSphere FT suitable for only a narrow set of use cases.

Both vSphere HA and vSphere FT lack something quite important when it comes to protecting business critical workloads: application awareness. Let us say that you are running an instance of Oracle with a few databases inside a virtual machine. What happens if an Oracle instance fails? What happens if an instance loses access to the underlying storage? Neither vSphere HA nor vSphere FT detects it, so downtime is incurred. Downtime = Lost revenue.

There is another weakness in the vSphere HA and vSphere FT solutions. They do not protect applications against planned downtime. When you need to patch, upgrade or perform any other maintenance task on components within the guest (operating system binaries, application binaries etc.), you must shut down the application, which may be costly for tier 1 business critical applications.

| Scenario | vSphere HA | vSphere FT |
|---|---|---|
| Detect host failure | VMs are restarted on another host (recovery time = restart time) | The VM executing instructions in lockstep on the surviving host takes over (recovery time is near zero) |
| Detect VM failure (VM not sending heartbeats, OS hung) | VM is restarted | No protection likely, as both VMs are in lockstep |
| Detect application failure | No protection | No protection |
| Compatibility with vMotion | Yes | Yes |
| Compatibility with vStorage APIs for Data Protection (VADP) | Yes | No (in-guest backup agent required) |
| Avoiding planned downtime (patching, upgrades etc.) | Planned downtime cannot be avoided | Planned downtime cannot be avoided |

Symantec has solutions to tackle these types of scenarios. One was jointly developed with VMware. The second is a time-tested solution that was ported to support the vSphere platform. Let us look at each of them in another blog post.