What’s up with VADP backups and VDDK on vSphere 5.1?

VMware vSphere 5.1 has been in the market for more than a few months now, and interest in its new capabilities is high. Because of this, many backup vendors rushed to announce support for vSphere 5.1 in their VADP (vStorage APIs for Data Protection) integration. Everything looked clean, shiny and new.

On November 21, Symantec made an interesting announcement [1]. In a nutshell, the statement was that support for vSphere 5.1 would be delayed in its NetBackup and Backup Exec products, because issues were discovered while testing the VADP 5.1 API for integration. The API in its current form may introduce risk in performing consistent backups and ensuring reliable restores. All vendors receive the same API; not all vendors perform the same level of testing.

In order to explain the intricacies, we first need to take a quick look at how a backup product integrates with VMware vSphere. With each release of vSphere, VMware publishes a set of APIs known as the vStorage APIs for Data Protection, or VADP. One of the key components of VADP is the Virtual Disk Development Kit (VDDK). This is the component through which third-party code receives authenticated access to vSphere datastores and virtual machine disk files. VMware makes this component available to its technology partners. Partners (backup product vendors, in this case) ship it along with the product that makes vStorage API calls.
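To make the moving parts concrete, here is a minimal sketch of the control-plane side of such an integration, written with the open-source pyVmomi Python bindings. The host name, credentials and VM name are placeholders, and the actual reading of disk data happens inside VDDK (a C library shipped to partners), which is only indicated in comments.

    # Minimal sketch of the control-plane steps a VADP-integrated backup performs.
    # Host name, credentials and VM name are illustrative placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect

    ctx = ssl._create_unverified_context()   # lab use only; verify certificates in production
    si = SmartConnect(host="vcenter.example.com", user="backup-svc",
                      pwd="********", sslContext=ctx)
    try:
        content = si.RetrieveContent()
        vm = content.searchIndex.FindByDnsName(None, "app01.example.com", True)

        # 1. Take a quiesced snapshot so the VMDKs are stable while they are read.
        vm.CreateSnapshot_Task(name="backup-helper", description="temporary backup snapshot",
                               memory=False, quiesce=True)

        # 2. Hand the snapshot's disk paths to VDDK (VixDiskLib) for the actual data movement;
        #    that part lives in the C library VMware provides to partners.

        # 3. Remove the helper snapshot once the data has been copied off, e.g.:
        #    vm.snapshot.currentSnapshot.RemoveSnapshot_Task(removeChildren=False)
    finally:
        Disconnect(si)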

With each version of vSphere, an equivalent version of VDDK is released. The VDDK is generally backward compatible with one or more earlier versions of vSphere. For example, VDDK 5.1 supports [2] vSphere 5.1, 5.0 and 4.1, while VDDK 5.0 supports [3] vSphere 5.0, 4.1, 4.0 and VI 3.5. Since the updated VDDK is required to understand the modified data structures in a new version of vSphere, lower versions of VDDK are in general not supported for accessing a higher version of vSphere. For example, VMware historically and currently (as of today) does not support the use of VDDK 5.0 to access datastores in vSphere 5.1. VMware documents the supported versions of vSphere for each VDDK version in its release notes.

The key point to remember is that last statement: VMware does not support such mismatched combinations because of the risks and uncertainties involved. Partners are expected to ship the correct version of VDDK when they announce availability of support for a given vSphere release.

What Symantec announced, and VMware confirmed [4], is that VDDK 5.1 has issues and hence support for vSphere 5.1 in its products will be delayed. This makes sense, since VDDK 5.1 is the only version currently allowed to access vSphere 5.1. The face-saving reactions from other vendors to this announcement revealed some of the dirty games and ugly truths in the area of VADP/VDDK integration.


  1. Vendors were claiming support for vSphere 5.1 but still shipping VDDK 5.0 with their products. This is currently not supported by VMware because of the uncertainties. That may change, but at the time these vendors claimed support, they were taking risks that are typically not acceptable in the data protection business.
  2. Vendors were mucking with API calls and silently killing hung processes. That may work for an isolated or random hang, but it will not work when there are repeatable hang situations like those observed in VDDK 5.1. There are also performance and reliability concerns with abruptly ending sessions with vSphere.
  3. Most vendors were not testing all the edge cases and never noticed the problems in VDDK 5.1, thus prematurely announcing support for 5.1.


If your backup vendor currently claims support for vSphere 5.1, be sure to ask which VDDK version they ship and how they are handling these issues.

Sources and references:

1. Quality wins every time: vSphere 5.1 support update, Symantec official blog.

2. VDDK 5.1 Release Notes, VMware Support resources

3. VDDK 5.0 Release Notes, VMware Support resources

4. Third-party backup software using VDDK 5.1 may encounter backup/restore failures, VMware Support KB

vSphere changed block tracking: A powerful weapon for backup applications to shrink the backup window

Changed block tracking is not a new technology. Those who have used Storage Foundation for Oracle would know that the VERITAS File System (VxFS) provides no-data checkpoints, which backup applications can use to identify and back up just the changed blocks from the file systems where database files are housed. This integration has been in NetBackup since version 4.5, released 10 years ago! It is still used by Fortune 500 companies to protect mission-critical Oracle databases that would otherwise require a large backup window with traditional RMAN streaming backups.

VMware introduced changed block tracking (CBT) in vSphere 4.0, and it is available for virtual machines at hardware version 7 or higher. NetBackup 7.0 added support for CBT right away, and backing up VMware vSphere environments got faster. When a VM has CBT turned on, vSphere can track changes to virtual machine disk (VMDK) sectors, with only a marginal impact on VM performance. Backup applications with VADP (vStorage APIs for Data Protection) support can use an API (named QueryChangedDiskAreas) to identify and copy changed blocks from a particular point in time. That point in time is identified using an argument named ChangeId in the API call.
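For illustration, here is a minimal sketch (again using the pyVmomi Python bindings) of how a backup application might walk the changed areas of one disk. The snapshot object, the disk's device key and capacity, and the ChangeId saved from the previous backup are assumed to come from the backup application's own catalog.

    # Hedged sketch: enumerate the extents of one VMDK that changed since the
    # previous backup. 'snap', 'disk_key', 'disk_capacity' and 'prev_change_id'
    # are assumed to be tracked by the backup application; a changeId of "*"
    # asks for all allocated areas (an initial full backup).
    changed_extents = []
    offset = 0
    while offset < disk_capacity:
        info = vm.QueryChangedDiskAreas(snapshot=snap,
                                        deviceKey=disk_key,
                                        startOffset=offset,
                                        changeId=prev_change_id)
        changed_extents.extend((area.start, area.length) for area in info.changedArea)
        offset = info.startOffset + info.length

    # Only the (offset, length) ranges in changed_extents need to be read (via VDDK)
    # and copied to backup storage for this incremental.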

VMware has made this quite easy for backup vendors to implement. But powerful weapons can be dangerous when not used with utmost care. An unfortunate problem in Avamar's implementation of CBT came to light recently. I am not picking on Avamar developers here; it is not possible to predict all the edge cases during development, and they are working hard to fix this data-loss situation. As an engineer myself, I truly empathize with the Avamar developers for getting into this unfortunate situation. This blog is a humble attempt to explain what happened, as I received a few questions from the field seeking input on the use of CBT after EMC reported the issues in Avamar.

As we know, VADP lets you query the changed disk areas to get all the changes in a VMDK since a point in time corresponding to a previous snapshot. Once the changed blocks are identified, those blocks are transferred to the backup storage. The way the changed blocks are used by the backup application to create the recovery point (i.e. backup image) varies from vendor to vendor.

No matter how the recovery point is synthesized, the backup application must make sure that the changed blocks are accurately associated with the correct VMDK, because a VM can have many disks. As you can imagine, if the blocks are associated with the wrong disk in the backup image, the image is not an accurate representation of the source. A recovery from this backup image will fail or will result in corrupt data on the source.

The correct way to identify VMDKs is by their UUIDs, which are always unique. Positional identifiers like controller-target-LUN at the VM level are not reliable, as those numbers can change when some VMDKs are removed or new ones are added to a VM. This is the disk re-order problem, and the re-order can also happen from operations not initiated by the user. In Avamar's case, the problem was that changed blocks belonging to one VMDK were getting associated with a different VMDK in backup storage on account of VMDK re-ordering. Thus the resulting backup image (recovery point) did not represent the actual state of the VMDK being protected.
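As a rough illustration of the difference, the pyVmomi sketch below keys each virtual disk by its backing UUID rather than by its bus position; the 'saved_change_ids' catalog is a hypothetical stand-in for wherever a backup product records the ChangeId from the previous run.

    # Hedged sketch: associate CBT state with disks by UUID, not by SCSI position.
    from pyVmomi import vim

    def disks_by_uuid(vm):
        disks = {}
        for dev in vm.config.hardware.device:
            if isinstance(dev, vim.vm.device.VirtualDisk):
                # backing.uuid stays constant for the life of the VMDK, while
                # controllerKey/unitNumber can shift when disks are added or removed.
                disks[dev.backing.uuid] = dev
        return disks

    for disk_uuid, disk in disks_by_uuid(vm).items():
        prev_change_id = saved_change_ids.get(disk_uuid, "*")   # hypothetical catalog
        # ... QueryChangedDiskAreas(..., deviceKey=disk.key, changeId=prev_change_id),
        # then record the new ChangeId under the same UUID after the backup completes.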

To make the unfortunate matter worse, there was a cascading effect. It appears that Avamar generates a recovery point by using the previous backup as the base. If a disk re-order happened after the nth backup, all backups after the nth are affected, because each new backup inherits its base from a corrupted image.

This sounds scary, and that is how I started getting questions from the field on the reliability of CBT for backups. Symantec supports CBT in both Backup Exec and NetBackup. Are Symantec customers safe?

Yes, Symantec customers using NetBackup and Backup Exec are safe.

How do Symantec NetBackup and Backup Exec handle re-ordering? Block-level tracking and its associated risks were well thought out during implementation. Block-level tracking is not new to Symantec engineering, because such situations were accounted for in the design of VxFS's no-data checkpoint block-level tracking several years ago.

There are multiple layers of resiliency built into Symantec's implementation of CBT support. I shall share oversimplified explanations of two of them that are relevant to ensuring data integrity here.

Using UUIDs to accurately associate a ChangeId with the correct VMDK: We already touched on this. A UUID is always unique, and using it to associate the previous point in time with a VMDK is safe. Even when VMDKs get re-ordered in a VM, the UUID stays the same. Thus both NetBackup and Backup Exec always associate the changed blocks with the correct virtual disk.

Superior architecture that eliminates the cascading effect: Generating a corrupted recovery point is bad. What is worse is using it as the base for newer recovery points; the corruption goes on hurting the business if left unnoticed for a long time. NetBackup and Backup Exec never directly inject changed blocks into an existing backup to create a new recovery point. The changed blocks are referenced separately in backup storage, and during a restore, NetBackup recreates the point in time at run time. This is the reason NetBackup and Backup Exec are able to support block-level incremental backups even to tape media! Thus a corrupted backup (should that ever happen) never propagates corruption to future backups.
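To make the 'no cascading' idea concrete, here is a deliberately oversimplified sketch; it is not NetBackup or Backup Exec internals, just an illustration of the principle that every incremental is stored as its own set of extents and the recovery point is assembled only at restore time, so no backup ever rewrites an earlier one.

    # Oversimplified illustration of non-cascading synthesis (not vendor internals):
    # the full backup and each incremental are immutable, and the requested point
    # in time is assembled at restore time by replaying extent maps over a copy
    # of the full backup.
    def assemble_recovery_point(full_backup, incrementals, upto):
        """full_backup: dict of offset -> block bytes from the full backup.
        incrementals: list of dicts of changed extents, oldest first.
        upto: number of incrementals to apply for the desired point in time."""
        image = dict(full_backup)            # the stored full is never modified
        for extent_map in incrementals[:upto]:
            image.update(extent_map)         # later changes win for each offset
        return image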

No point in locking the door when walls have fallen

Security for information assets is crucial for business continuity. Corporate information in the wrong hands can compromise the survival of an organization, which is why security must be considered wherever that information lives. The last stop for the protection and recovery of this information is backup. It protects against the loss of information from hardware failures, logical corruptions and human errors. Unfortunately, the security of backup information is often overlooked. As a leader in backup and security, Symantec advocates a holistic approach to managing and securing corporate information.

Let us consider a few of the key components for secure enterprise backup solutions.

  • The solution must securely transfer data from source to backup storage.
  • The solution must store the backup image securely.
  • It should offer built-in authentication and authorization to control access. You don’t want an intruder or a client system masquerading as a production client and retrieving data from backups.
  • It must protect itself completely. If the system hosting the solution suffers a hardware failure, it must be able to recover image metadata, security certificates, access control rules and other important data so that the solution once again controls access to backup images.

While there are more considerations that could be listed, these underscore the broader requirements for an enterprise backup solution. Backup is much more than moving data as quickly as possible from source to backup storage. Other functions need to exist to ensure the security of backup images and to prevent the production data embedded in a backup image from falling into the wrong hands.

Moving data to the cloud is a great use case to consider in more detail. In an era where organizations are looking to cut operating costs by moving information to the cloud, attention must be paid to the protection of user data end to end, regardless of whether it sits on primary storage on a VM hosted for the user or is sent to backup storage. Choosing a backup solution that fails to deliver security for backup images may result in corporate embarrassment, liability and loss of business.

I came across an interesting post from Mike Beevor, who works for Veeam as a Systems Engineer. You can read the details here. Mike put together a nice article consolidating the scripted responses needed when an IT security team is evaluating the risks of using Veeam for VM backups in a secured environment. Unfortunately, he appears to have left one huge hole unattended. If a tornado knocks down the walls, is there any point in putting locks on the rest of the doors? Let me explain what I mean.

Locking the front door when walls have fallen! Illustration by Scott G.

The failure point in question is the Veeam Backup File (VBK). When production VMs with precious data are backed up by Veeam, it stores the virtual machine files in a container file with the extension .vbk. This file is kept on a plain file system on a backup server with direct-attached storage, a SAN-attached array, or a NAS device.

Most production VMs have one VMDK file for the operating system and one or more VMDK files for data. A utility was originally developed to give users a way to import a VM from a VBK backup file directly into vSphere. Veeam created this process because the backup solution does not offer a good way to protect itself (see the fourth requirement above: protect the backup and recovery system). A person who gets a copy of a VBK file can import the VM it contains onto his or her own ESXi host, detach the data disk and mount it on his or her own VM to get access to production data! Veeam does not provide any sort of encryption for VBK files. Unfortunately, the only way to recover individual objects from a Veeam backup is to run the entire VM from backup storage.

The lack of security for the container file makes it easy for anyone to retrieve data. Users of Veeam are already concerned about this weakness. Posts in the Veeam forums related to this issue seem to be conveniently moderated out. Here is an example of a post which appeared in a Google search before it was deleted ahead of the Veeam 6 launch. Eventually a modified response appeared stating that Veeam will consider this as a future feature.


Customers requesting enhanced security for VBK, moderated thread reappeared recently

The user is asking for enhanced VBK security through the use of password protection.  It is a step in the right direction. Hopefully, Veeam will work on this soon.

As you can imagine, this is a huge security hole, especially as users virtualize more mission-critical applications. Unfortunately, Veeam 6 makes this problem even worse: these VBK files are now scattered across multiple repository hosts, thereby increasing the chances of exposure.

What to do if you are currently using Veeam for backups?

  • Enable file system level encryption on all repositories if overall performance after encryption is acceptable.
  • When using a NFS/CIFS based deduplication device for storage (e.g. Data Domain), enable encryption within the device.
  • Make sure that the NFS/CIFS shares are exposed only to proxy servers and Veeam servers. NAS devices’ default export policies generally give read-only access to ‘world’. In the case of VBK files, this read-only access is enough to compromise the production data in backups.
  • Harden passwords on all Veeam backup servers, repository servers and proxy servers.  Do not use the same password on all repository servers.
  • Talk to your security team and leverage their investments. For example, the security team may have the Symantec Critical System Protection suite. Install CSP agents on backup servers and repository servers to provide non-signature-based host intrusion prevention. It protects against zero-day attacks using granular OS hardening policies along with application, user and device controls.
  • Consider switching to a backup solution that offers encryption for both data in flight and data at rest. In fact, you may already be using another backup solution to back up Veeam backup files. Most backup applications offer VADP integration with VMware. Did you know that Backup Exec is #1 in backing up Veeam? Veeam’s Doug Hazelman admitted this to Curtis Preston.

Veeam has done similar things in the past to conveniently hide the root of the problem with deflection techniques. Remember DCIG’s Jerome Wendt, who uncovered the real motive behind SureBackup? Learn more about his discovery here.

Naturally, the next question is how a leader like Symantec provides security for backups.  Let us use Symantec’s Backup Exec as an example.


  • When using a deduplication folder (Backup Exec’s built-in deduplication), it is not possible for an intruder to identify specific backup images even if he gains access to the file system where the folder resides. If he decides to steal the entire folder, he cannot import the images in that folder onto an alternate system without the credentials needed to access the deduplication folder.
  • When sending backups to tape, software and hardware encryption (T10 encryption standard) are supported. Thus you do not have to worry about information getting leaked even if tapes are stolen.
  • Backup Exec uses security certificates between clients and media servers. It is not possible for an intruder to masquerade as a client and request a restore of production data.
  • Self-protection: Backup Exec not only protects production data, it also has the capability to protect itself against hardware failures or human errors.


Deduplication for dollar zero?

One of the data protection experts asked me a question after reading my blog on Deduplication Dilemma: Veeam or Data Domain.

I am paraphrasing his question, as our conversation was limited to 140 characters at a time on Twitter.

“Have you seen this best practice blog on Veeam with Exagrid? Here is the blog. It says not to do reverse incremental backups. The test Mr. Attila ran was incomplete. The Veeam deduplication at the first pass is poor, but after that it is worth it, right?”

These are all great questions. I thought I would dissect each aspect and share it here. Before I do that, I want to make it clear that deduplication devices are fantastic for use in backups. They work great with backup applications that really offer the ability to restore individual objects. If the backup application ‘knows’ how to retrieve specific objects from backup storage, target deduplication adds a lot of value. That is why NetBackup, Backup Exec, TSM, NetWorker and the like play well with target deduplication appliances. Veeam, on the other hand, simply mounts the VMDK file from the backup store and asks the application administrator to fish for the item he or she is looking for. This is where Veeam falls apart if you try to deploy it in medium to large environments. Although target deduplication appliances are disk based, they are optimized more for sequential access, as backup jobs mostly follow a sequential I/O pattern. When you perform random I/O on these devices (as happens when a VM is run directly from them), there is a limit to how well they can perform.

Exagrid: a great company helping out customers

Exagrid has an advantage here. It has the flexibility to keep the most recent backup in hydrated form (Exagrid uses post-process deduplication), which works well with Veeam if you employ reverse incremental backups. In reverse incremental backups, the most recent backup is always a full backup. When the image is served in hydrated form, you eliminate the performance issues inherent in mounting the image on an ESX host, which is good from the recovery performance perspective. However, Exagrid recommends not turning on the reverse incremental method because it burdens the appliance during backups. This is another dilemma: you have to pick backup performance or recovery performance (RTO), not both.

Let me reiterate this: the problem is not with Exagrid in this case. They are sincerely trying to help customers who happened to choose Veeam. Exagrid is doing the right thing; you want to find methods to help customers achieve ROI no matter what backup solution they ended up choosing. I take my hat off to Exagrid in respect.

Now let us take a closer look at other recommendations from Exagrid to alleviate the pain points with Veeam.

Turn off compression in Veeam and Optimize for Local target: Note that Exagrid suggested turning off compression and choosing the ‘Optimize for local target’ option. These settings have the effect of eliminating most of what Veeam’s deduplication offers. By choosing those options, you let the real guy (the Exagrid appliance) do the work.

Weren’t Mr. Attila’s tests incomplete?

Mr. Attila stopped testing after the initial backup. The advantage of deduplication is visible only on subsequent backups, hence his tests weren’t complete. However, as I stated in that blog, his test simply triggered my own research; I wasn’t basing my opinions just on Mr. Attila’s tests. I should have mentioned this in the earlier blog, but it was already becoming too long.

As I mentioned in the earlier blog, Veeam’s deduplication capabilities are limited. Quoting Exagrid this time: “Once the ExaGrid has all the data, it can look at the entire job at a more granular level and compress and dedupe within jobs and also across jobs! Veeam can’t do this because it has data constantly streaming into it from the SAN or ESX host, so it’s much harder to get a “big picture” of all the data.”

If Veeam’s deduplication is the only thing you have, the problem is not limited to the initial backup. Here are a few other reasons why a target deduplication appliance is important when using Veeam.

  1. The deduplication is limited to a job. Veeam’s manual recommends putting VMs created from the same template into a single job to achieve a good dedupe rate. It is true that VMs created from the same template have a lot of redundant OS files and whitespace, so the dedupe rate will be good at the beginning. But those are just the skins or shells of your enterprise production data. The real meat is the actual data, which is far less likely to be the same across multiple VMs. We are better off giving that task to the real deduplication engines!
  2. Let us say you have a job with 20 production VMs. You are going to install something new on one of the VMs, so you prefer to do a one-time backup before making any changes. Veeam requires you to create a new job to do this. This is not only inconvenient, but you also lose the advantage of incremental backup: you have to stream the entire VM again. Can we afford this in a production environment?
  3. Veeam incremental backups are heavily dependent on the vCenter server. If you move a VM from one vCenter to another, or if you have to rebuild your vCenter (Veeam cannot protect an enterprise-grade vCenter running on a physical system, but let us not go there for now), you need to start seeding full backups for all your VMs again. For example, if you want to migrate from a traditional vCenter Server running 4.x to a vCSA 5.0, expect to reseed all the backups.

My point is that, with these limitations, Veeam deduplication is not something you can count on to protect a medium-to-large environment. It carries a price of $0 for a reason.

NetBackup and Backup Exec let you take advantage of target deduplication appliances to their fullest potential. Because these platforms track which image holds the objects the application administrator is looking for, they can retrieve just those objects from backup storage. The application administrator can self-serve, with no need for a 20th-century ticket system! The journey to the cloud starts with empowering users to self-serve their needs from the cloud.