What’s up with VADP backups and VDDK on vSphere 5.1?

VMware vSphere 5.1 has been in the market for more than a few months now and the interest in the new capabilities is high. Because of this the market saw many backup vendors rush to announce support for vSphere 5.1 in their VADP (vStorage APIs for Data Protection) integration. Everything looked clean and shiny and new.

On November 21, Symantec made an interesting announcement1. In a nutshell, the statement was that support for vSphere 5.1 would be delayed in its NetBackup and Backup Exec products. It was because they discovered issues while testing the VADP 5.1 API for integration. The API in the current form may introduce risk in performing consistent backups and ensuring reliable restores. All vendors receive the same API, not all vendors perform the same level of testing.

In order to explain the intricacies, first we need to take a quick look at how a backup product is integrated with VMware vSphere. With each release of vSphere, VMware publishes a set of APIs known as VMware APIs for Data Protection or VADP. One of the key components of VADP is Virtual Disk Development kit aka VDDK. This is the component through which third party code receives authenticated access to vSphere Datastores and virtual machine disk files. VMware makes this component available to its technology partners. Partners (backup product vendors in this case) ship this along with their product that has calls to vStorage APIs.

With each version of vSphere, an equivalent version of VDDK is released. The VDDK is generally backward compatible to one or more earlier versions of vSphere. For example, VDDK 5.1 supports2 vSphere 5.1, 5.0 and 4.1. VDDK 5.0 supports3 vSphere 5.0, 4.1, 4.0 and VI 3.5. Since the updated VDDK is required to understand the modified data structures in a new version of vSphere, lower versions of VDDK are in general not supported for accessing a higher version of vSphere. For example, VMware historically and currently (as of today) does not support the use of VDDK 5.0 to access datastores in vSphere 5.1.  VMware documents supported versions of vSphere for each of its VDDK versions in release notes.

The key to remember is the statement in bold face above. VMware does not support any violated combinations because of the risks and uncertainties. The partners are expected to ship the correct version of VDDK when they announce the availability of support for a given vSphere release.

What Symantec announced and VMware confirmed4 is that VDDK 5.1 has issues and hence the support for vSphere 5.1 in its products will be delayed. This makes sense since VDDK 5.1 is the only version currently allowed to access vSphere 5.1. The face-saving reactions from other vendors to this announcement revealed some of the dirty games and ugly truths to come out in the area of VADP/VDDK integration.

 

  1. Vendors were claiming support for vSphere 5.1 but still shipping VDDK 5.0 with their products. This is currently not supported by VMware because of the uncertainties.  This may change but at the time vendors claiming support, they were taking risks that typically are not acceptable in field of data protection business.
  2. Vendors were mucking with API calls and silently killing hung processes. That may work for an isolated or random hang. But will not work when there are repeatable hang situations like those observed in VDDK 5.1. Plus, there are performance and reliability concerns in abruptly ending sessions with vSphere.
  3. Most vendors weren’t testing all the edge cases and never realized the problems in VDDK 5.1, thus prematurely announcing support for 5.1

 

If your backup vendor currently supports vSphere 5.1, be sure to ask what their situation is.

Sources and references:

1. Quality wins every time: vSphere 5.1 support update, Symantec official blog.

2. VDDK 5.1 Release Notes, VMware Support resources

3. VDDK 5.0 Release Notes, VMware Support resources

4. Third-party backup software using VDDK 5.1 may encounter backup/restore failures, VMware Support KB

vSphere changed block tracking: A powerful weapon for backup applications to shrink backup window

Changed block tracking is not a new technology. Those who have used Storage Foundation for Oracle would know that VERITAS file system (VxFS) provides no-data check points which can be used by backup applications to identify and backup just the changed blocks from the file systems where database files are housed.  This integration was in NetBackup since version 4.5 that was released 10 years ago! It is still used by Fortune 500 companies to protect mission critical Oracle databases that would otherwise require a large backup window with traditional RMAN streaming backups.

VMware introduced change block tracking (CBT) since vSphere 4.0 and is available for virtual machines version 7 or higher. NetBackup 7.0 added support for CBT right away. Backing up VMware vSphere environments got faster. When a VM has CBT turned on, it can track changes to virtual machine disk (VMDKs) sectors.  Its impact on VM performance is marginal. Backup applications with VADP (vStorage APIs for Data Protection) support can use an API (named QueryChangedDiskAreas) to identify and copy changed blocks from a particular point in time. This time point is identified using an argument named ChangeId in the API call.

VMware has made this quite easy for backup vendors to implement. Powerful weapons can be dangerous when not used with utmost care. An unfortunate problem in Avamar’s implementation of CBT came to light recently. I am not picking on Avamar developers here, it is not possible to predict all the edge cases during development and they are working hard to fix this data loss situation. As an engineer myself, I truly empathize with Avamar developers for getting themselves into this unfortunate situation. This blog is a humble attempt to explain what had happened as I got a few questions from the field seeking input on the use of CBT after the EMC reported issues in Avamar.

As we know, VADP lets you query the changed disk areas to get all the changes in a VMDK since a point in time corresponding to a previous snapshot. Once the changed blocks are identified, those blocks are transferred to the backup storage. The way the changed blocks are used by the backup application to create the recovery point (i.e. backup image) varies from vendor to vendor.

No matter how the recovery point is synthesized, the backup application must make sure that the changed blocks are accurately associated with the correct VMDK because a VM can have many disks. As you can imagine if the blocks were associated with the wrong disk in backup image; the image is not an accurate representation of source. The recovery from this backup image will fail or will result in corrupt data on source.

The correct way to identify VMDK is using their UUIDs which are always unique. Using positional identifies like controller-target-LUN at the VM level are not reliable as those numbers could change when some of the VMDK are removed or new ones are added to a VM. This is an example of disk re-order problem. This re-order can also happen for non-user initiated operations. In Avamar’s case, the problem was that the changed blocks belonging one VMDK was getting associated with a different VMDK in backup storage on account of VMDK re-ordering. Thus the resulting backup image (recovery point) generated did not represent the actual state of VMDK being protected.

To make the unfortunate matter worse, there was a cascading effect. It appears that Avamar’s implementation of generating a recovery point is to use the previous backup as the base. If disk re-order happened after nth backup, all backups after nth backup are affected on account of the cascading effect because new backups are inheriting the base from corrupted image.

This sounds scary. That is how I started getting questions on reliability of CBT for backups from the field. Symantec supports CBT in both Backup Exec and NetBackup. Are Symantec customers safe?

Yes, Symantec customers using NetBackup and Backup Exec are safe.

How do Symantec NetBackup and Backup Exec handle re-ordering? Block level tracking and associated risks were well thought out during the implementation. Implementation for block level tracking is not something new for Symantec engineering because such situations were accounted for in the design for implementing VxFS’s no-data check point block level tracking several years ago.

There are multiple layers of resiliency built-in Symantec’s implementation of CBT support. I shall share oversimplified explanations for two of those relevant in ensuring data integrity that are relevant here.

Using UUID to accurately associate ChangeId to correct VMDK: We already touched on this. UUID is always unique and using it to associate the previous point in time for VMDK is safe. Even when VMDKs get re-ordered in a VM, UUID stays the same. Thus both NetBackup and Backup Exec always associate the changed blocks to the correct VM disk.

Superior architecture that eliminates the ‘cascading-effect’:  Generating a corrupted recovery point is bad. What is worse is to use it as the base for newer recovery points. The corruption goes on and hurt the business if left unnoticed for long time. NetBackup and Backup Exec never directly inject changed blocks to an existing backup to create a new recovery point. The changed blocks are referenced separately in the backup storage. During a restore, NetBackup recreates the point in time during run-time. This is the reason NetBackup and Backup Exec are able to support block level incremental backups even to tape media! Thus a corrupted backup (should that ever happen) never ‘propagates’ corruption to future backups.

Introduction to VMware vStorage APIs for Data Protection aka VADP

6. Getting to know NetBackup for VMware vSphere

Note: This is an extended version of my blog in VMware Communities: Where do I download VADP? 

Now that we talked about NetBackup master servers and media servers, it is time to get into learning how NetBackup Client on VMware backup host (sometimes known as VMware proxy host) protects the entire vSphere infrastructure. In order to get there, we first need a primer on vStorage APIs for Data Protection (VADP) from VMware.  We will use two blogs to cover this topic.

Believe it or not, this question of what VADP really is comes up quite often in VMware Communities, especially in Backup & Recovery Discussions

Backup is like an insurance policy. You don’t want to pay for it, but not having it is the recipe for sleepless nights. You need to protect data on your virtual machines to guard against hardware failures and user errors. You may also have regulatory and compliance requirements to protect data for longer term.

With modern day hardware and cutting edge hypervisors like that from VMware, you can protect data just by running a backup agent within the guest operating system. In fact, for certain workloads; this is still the recommended way.

VMware had made data protection easy for administrators. That is what vStorage APIs for Data Protection (VADP) is. It available since vSphere 4.0 release. It is a set of APIs (Application Programming Interfaces) made available by VMware for independent backup software vendors. These APIs make it possible for backup software vendors to embed the intelligence needed to protect virtual machines without the need to install a backup agent within the guest operating system. Through these APIs, the backup software can create snapshots of virtual machines and copy those to backup storage.

Okay, now let us come to the point. As a VMware administrator what do I need to do to make use of VADP? Where do I download VADP? The answer is…Ensure that you are NOT using hosts with free ESXi licenses.

  1. Ensure that you are NOT using hosts with free ESXi licenses
  2. Choose a backup product that has support for VADP

The first one is easy to determine. If you are not paying anything to VMware, the chances are that you are using free ESXi. In that case, the only way to protect data in VMs is to run a backup agent within the VM. No VADP benefits.

Choosing a backup product that supports VADP can be tricky. If your organization is migrating to a virtualized environment, see what backup product is currently in use for protecting physical infrastructure. Most of the leading backup vendors have added support for VADP. Symantec NetBackup, Symantec Backup Exec, IBM TSM, EMC NetWorker, CommVault Simpana are examples.

If you are not currently invested in a backup product (say, you are working for a start-up), there are a number of things you need to consider. VMware has a free product called VMware Data Recovery (VDR) that supports VADP. It is an easy to use virtual appliance with which you can schedule backups and store it in deduplicated storage. There are also point products (Quest vRanger, Veeam Backup & Replication etc.) which provide additional features. All these products are good for managing and storing backups of virtual machines on disk for shorter retention periods. However, if your business requirements need long term retention, you would need another backup product to protect the backup repositories of these VM only solutions which can be a challenge. Moreover, it is less unlikely to see businesses that are 100% virtualized. You are likely to have those NAS devices for file serving, desktops and laptops for end users and so on.  Hence a backup product that supports both physical systems and VADP are ideal in most solutions.

Although VADP support is available from many backup vendors, the devil is in the details. Not all solutions use VADP the same way. Furthermore, many vendors add innovative features on top of VADP to make things better.  We will cover this next.

Back to NetBackup 101 for VMware Professionals main page

Next: Coming Soon!

Turning cheap disk storage into an intelligent deduplication pool in NetBackup

5. NetBackup Intelligent Deduplication Pool

Deduplication for backups does not need an introduction. In fact, deduplication is what made disk storage a viable alternative for tapes. Deduplication storage is available from several vendors in the form of pre-packaged storage and software. Most of the backup vendors also provide some level of data reduction using deduplication or deduplication-like features.

Often we hear that backups of virtual environments are ideal for deduplication. While I agree with this statement, several articles tend to give the wrong perception when it comes to why it is a good idea.

The general wisdom goes like this. As there are many instances of guest operating systems, there are many duplicate files and hence deduplication is recommended.  A vendor may use this reasoning to sell you the deduplication appliance or to differentiate their backup product from others. This is short-sighted view. First of all, multiple instances of the same version of operating system are possible even when your environment is not virtualized; hence that argument is weak. Secondly, operating system files contribute less than 10% of your data in most virtual machines hosting production applications. Hence if a vendor tells you that you need to group virtual machines from the same template to be on a backup job to make use of ‘deduplication’; what they provide is not true deduplication. Typically such techniques involve simply using the block level tracking provided by vStorage APIs for Data Protection (vADP) combined with excessive compression. Data reduction does not go beyond a given backup job.

Behold NetBackup Intelligent Deduplication. We talked about NetBackup media servers before. Attach cheap disk storage of your choice and turn on NetBackup Intelligent Deduplication by running a wizard. Your storage transforms into a powerful deduplication pool that deduplicates inline across multiple backup jobs. You can deduplicate at the target (i.e. the media server) or you can let it deduplicate at the source, if you have configured a dedicated VMware backup host.

Why is this referred to as an intelligent deduplication pool? When backup streams arrive, the deduplication engine sees the actual objects (files, database objects, application objects etc.) through a set of patent pending technologies referred to as Symantec V-Ray. Thus it deduplicates blocks after accurately identifying exact object boundaries. Compare this to third party target deduplication devices where the backup stream is blindly chopped to guess the boundaries and identify duplicate segments.

   The other aspect of NetBackup Intelligent Deduplication pool is its scale-out capability.  The ability to grow storage and processing capacity independently as your environment grows.  The storage capacity can be grown from 1TB to 32TB  thereby letting you protect 100s of terabytes of backup images. In addition you can add additional media servers to do dedupe processing on behalf of the media server hosting the deduplication storage. The scale out capability can also be established by simply adding additional VMware backup hosts. The global deduplication occurs across multiple backup jobs, multiple VMware backup hosts and multiple media servers. It is scale out in multiple dimensions! A typical NetBackup environment can protect multiple vSphere environments and deduplicate across virtual machines in all of them.

Back to NetBackup 101 for VMware Professionals main page

NetBackup master server and VMware vCenter

3. The control and command center

vCenter server is the center of an enterprise vSphere environment. Although the ESXi hosts and virtual machines will continue to function even when vCenter server is down, enterprise data centers and cloud providers cannot afford such a downtime. Without vCenter server crucial operations like vMotion, VMware HA, VMware FT, DRS etc will cease to function. A number of third party applications count on plug-ins in vCenter server. A number monitoring and notification functions are governed by vCenter. Hence larger enterprises and cloud providers deploy vCenter on highly redundant systems. Some use high availability clustering solutions line Microsoft Cluster Server or VERITAS cluster server. Some deploy vCenter on a virtual machine protected by VMware HA that is run by another vCenter server.

NetBackup master server plays a similar role. It is the center of NetBackup domain. If this system goes down, you cannot do backups or restores. Unlike vCenter server which runs on Windows (and now on Linux) you can deploy master server on a variety of operating systems like Windows, enterprise flavours of Linux, AIX, HP-UX and Solaris. NetBackup includes cluster agents for Microsoft Cluster Server, VERITAS Cluster Server, IBM HACMP, HP-UX Service Guard and Sun/Oracle Cluster for free. If you have any of these HA solutions, NetBackup lets you install master server with an easy to deploy cluster installation wizard.

An enterprise vCenter server uses a database management system, usually Microsoft SQL Server, for storing its objects. NetBackup comes with Sybase ASA which is embedded in the product. This is a highly scalable application database. No need to provide a separate database management system.

In addition to Sybase ASA database, NetBackup also stores backup image metadata (index) and other configurations in binary and ASCII formats. The entire collection of Sybase ASA databases, binary image indexes and ASCII based databases is referred to as NetBackup Catalog.  NetBackup does provide you a specific kind of backup policy called Catalog backup policy to copy the entire catalog to secondary storage devices (disk or tape) easily. Thus even if you lose your master server, you can perform a catalog recovery to rebuild the master server.

In VMware, you might have dealt with vCenter Server HeartBeat. This feature provides you that capability to replicate vCenter configuration to a remote site so that you could start the replicated vCenter server at that site in case of primary site loss. NetBackup goes a bit further. Unlike vCenter HeartBeat which has Active-Passive architecture, NetBackup provides A.I.R (Auto Image Replication). When you turn on A.I.R for your backups, NetBackup appends the catalog information for the backup in the backup image itself. The images are replicated using storage device’s native replication engine. At the remote site you can have a fully functional master server (which is serving to media servers and clients locally). The device on the remote master server domain which receives A.I.R images can automatically notify the remote master. The remote master now imports the image catalog info from storage. Unlike traditional import processes where the entire image needs to be scanned for recreating the catalog remotely, this optimized import finishes in a matter of seconds (even if the backup image was several terabytes) because the catalog info is embedded within the image for quick retrieval. The result is Active-Active NetBackup domains at both sites. They could replicate in both directions and also act as the DR domain for each other. You can have many NetBackup domains replicating to a central site (fan-in), one domain replicating to multiple domains (fan-out) or a combination of both. This is why NetBackup is the data protection platform that cloud pilots need to master. It is evolving to serve clouds which typically span multiple sites.

vCenter integrates with Active Directory to provide role based access control. Similarly NetBackup provides NetBackup Access Control that can be integrated with Active Directory, LDAP, NIS and NIS+. NetBackup also features audit trails so that you can track users’ activities.

One thing that really makes NetBackup stand out from the point solutions like vRanger, Veeam etc is the ability for virtual machine owners (the application administrators) to self serve their recovery needs. For example, the Exchange administrator can use NetBackup GUI on client, authenticate himself/herself and browse Exchange objects in backups. NetBackup presents the objects directly from its catalog or live browse interface depending on the type of object being requested.  The user simply selects the object needed and initiates the restore. NetBackup does the rest. There are no complex ticket systems where the application owner makes a request to backup administrator. No need to mount an entire VM on your precious ESX resources in production just to retrieve a 1k object. No need to learn how to manipulate objects (for example, the need to manually run application tools to copy objects from a temporary VM) and face the risks associated with user errors. All the user interfaces directly connect to master server, it figures out what to restore and starts the job on a media server.

Well, so NetBackup is an enterprise platform that makes a traditional VM administrator to a cloud pilot of the future. It is nice to see that NetBackup has support for various hardware and operating systems. Is there a way to deploy a NetBackup domain without building a master server on my own? The answer is indeed yes!! NetBackup 5200 series appliances are available for you for this purpose. These appliances are built on Symantec hardware and can be deployed in a matter of minutes. Everything you need to create a NetBackup master server and/or media server is available in these appliances.

Next: Coming Soon!

Back to NetBackup 101 for VMware Professionals main page

Deduplication for dollar zero?

One of the data protection experts asked me a question after reading my blog on Deduplication Dilemma: Veeam or Data Domain.

I am paraphrasing his question as our conversation was limited to 140 characters at time through Twitter.

“Have you seen this best practice blog on Veeam with Exagrid? Here is the blog.  It says not to do reverse incremental backups. The test Mr. Attlia ran was incomplete. The Veeam deduplication at the first pass is poor, but after that it is worth it, right?”

These are all great questions. I thought of dissecting each aspect and share it here. Before I do that I want to make it clear that deduplication devices are fantastic for use in backups. These work great with backup applications that really offer the ability to restore individual objects. If the backup application ‘knows’ how to retrieve specific objects from backup storage, target deduplication adds a lot of value.  That is why NetBackup, Backup Exec, TSM, NetWorker and the like play well with target deduplication appliances. Veeam, on the other hand, simply mounts the VMDK file from backup store and asks the application administrator to fish for the item he/she is looking for. This is where Veeam falls apart if you try to deploy it in medium to large environments. Although target deduplication appliances are disk based, they are optimized more for sequential access as backup jobs mostly follow sequential I/O pattern. When you perform random I/O on these devices (as it happens when a VM is directly run from it), there is a limit to which those devices can perform.

Exagrid: a great company helping out customers

Exgrid has an advantage here. It has flexibility to keep the most recent backup in hydrated form (Exagrid uses post-process deduplication) which works well with Veeam if you employ reverse incremental backups. In reverse incremental backups, the most recent backup is always a full backup. You can eliminate the performance issues inherent in mounting the image on an ESX host when the image is being served in hydrated form. This is good from the recovery performance perspective.  However, Exagrid recommends not turning on reverse incremental method because it burdens the appliance during backups. This is another dilemma; you have to pick backup performance or recovery performance (RTO), not both.

Let me reiterate this. The problem is not with Exagrid in this case. They are sincerely trying to help customers who happened to choose Veeam. Exagrid is doing the right thing; you want to find methods to help out customers in achieving ROI no matter what backup solution they ended up choosing. I take my hat off at Exagrid in respect.

Now let us take a closer look at other recommendations from Exagrid to alleviate the pain points with Veeam.

Turn off compression in Veeam and Optimize for Local target:  Note that Exagrid suggested turning off compression and choosing Optimize for Local target option. These settings have the effect of eliminating most of what Veeam’s deduplication offers. By choosing those options, you let the real guy (Exagrid appliance) do the work.

Weren’t Mr Attila’s tests incomplete?

Mr. Attila stopped tests after the initial backup. The advantage of deduplication is visible only on subsequent backups. Hence his tests weren’t complete. However, as I stated in the blog; that test simply triggered my own research. I wasn’t basing my opinions just on Mr. Attila’s tests. I should have mentioned this in the earlier blog, but it was already becoming too big.

As I mentioned in the blog earlier, Veeam deduplication capabilities are limited. Quoting Exagrid this time: “Once the ExaGrid has all the data, it can look at the entire job at a more granular level and compress and dedupe within jobs and also across jobs! Veeam can’t do this because it has data constantly streaming into it from the SAN or ESX host, so it’s much harder to get a “big picture” of all the data.”   

If Veeam’s deduplication is the only thing you have, the problem is not just limited to the initial backup. Here are a few other reasons why a target deduplication is important when using Veeam.

  1. The deduplication is limited to a job. Veeam’s manual recommends putting VMs created from the same template into a single job to achieve that dedupe rate. It is true that VMs created from the same template have a lot of redundant OS files and whitespace so the dedupe rate will be good at the beginning. But these are just the skins or shells of you enterprise production data. The real meat is the actual data which is less likely to be the same across multiple VMs. We are better of giving that task to the real deduplication engines!
  2. Let us say you have a job with 20 production VMs. You are going to install something new on one of the VM, so you prefer to do a one-time backup before making any changes. Veeam requires you to create a new job to do this. This is not only inconvenient, but now you lose the advantage of incremental backup. You have to stream the entire VM again. Can we afford this in a production environment?
  3. Veeam incremental backups are heavily dependent on vCenter server. If you move a VM from one vCenter to another or if you had to rebuild your vCenter (Veeam cannot protect an enterprise grade vCenter running on a physical system, but let us not go there for now), you need to start seeding full backups for all your VMs. For example, if you want to migrate from a traditional vCenter server running 4.x to a vCSA 5.0, expect to reseed all the backups again.

My point is that Veeam deduplication is not something you can count on to protect a medium to large environment with these limitations. It has the price of $0 for a reason.

NetBackup and Backup Exec let you take advantage of target deduplication appliances to the fullest potential. As these platforms tracks which image has the objects the application administrator is looking for, they can simply retrieve those objects alone from backup storage. The application administrator can self-serve their needs, no need for  20th century ticket system! The journey to the Cloud starts with empowering users to self-serve their needs from the Cloud.