What’s up with VADP backups and VDDK on vSphere 5.1?

VMware vSphere 5.1 has been on the market for more than a few months now, and interest in its new capabilities is high. As a result, many backup vendors rushed to announce support for vSphere 5.1 in their VADP (vStorage APIs for Data Protection) integration. Everything looked clean and shiny and new.

On November 21, Symantec made an interesting announcement [1]. In a nutshell, support for vSphere 5.1 in its NetBackup and Backup Exec products would be delayed because Symantec had discovered issues while testing the VADP 5.1 API for integration. The API in its current form may introduce risk in performing consistent backups and ensuring reliable restores. All vendors receive the same API, but not all vendors perform the same level of testing.

To explain the intricacies, we first need to take a quick look at how a backup product integrates with VMware vSphere. With each release of vSphere, VMware publishes a set of APIs known as the vStorage APIs for Data Protection, or VADP. One of the key components of VADP is the Virtual Disk Development Kit (VDDK). This is the component through which third-party code receives authenticated access to vSphere datastores and virtual machine disk files. VMware makes this component available to its technology partners. Partners (backup product vendors in this case) ship it along with their products, which make calls to the vStorage APIs.

With each version of vSphere, an equivalent version of VDDK is released. The VDDK is generally backward compatible with one or more earlier versions of vSphere. For example, VDDK 5.1 supports [2] vSphere 5.1, 5.0 and 4.1. VDDK 5.0 supports [3] vSphere 5.0, 4.1, 4.0 and VI 3.5. Since the updated VDDK is required to understand the modified data structures in a new version of vSphere, lower versions of VDDK are in general not supported for accessing a higher version of vSphere. For example, VMware historically and currently (as of today) does not support the use of VDDK 5.0 to access datastores in vSphere 5.1. VMware documents the supported versions of vSphere for each of its VDDK versions in the release notes.

The key thing to remember is the statement above: VMware does not support using an older VDDK against a newer version of vSphere. Such combinations are unsupported because of the risks and uncertainties. Partners are expected to ship the correct version of VDDK when they announce support for a given vSphere release.
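The support rule can be sketched as a simple lookup. This is only an illustration: the table holds just the two VDDK versions and the vSphere releases quoted above, and the function name `is_supported` is my own, not anything from VMware.

```python
# Supported vSphere releases per VDDK version, limited to the pairs
# quoted in this post from the VDDK 5.1 and 5.0 release notes.
VDDK_SUPPORT = {
    "5.1": {"5.1", "5.0", "4.1"},
    "5.0": {"5.0", "4.1", "4.0", "3.5"},
}


def is_supported(vddk_version: str, vsphere_version: str) -> bool:
    """Return True only if the pairing appears in the table above."""
    return vsphere_version in VDDK_SUPPORT.get(vddk_version, set())


print(is_supported("5.1", "5.1"))  # True
print(is_supported("5.0", "5.1"))  # False: older VDDK against newer vSphere
```

A vendor shipping VDDK 5.0 while claiming vSphere 5.1 support falls into the second case.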

What Symantec announced, and VMware confirmed [4], is that VDDK 5.1 has issues and hence support for vSphere 5.1 in Symantec's products will be delayed. This makes sense since VDDK 5.1 is the only version currently allowed to access vSphere 5.1. The face-saving reactions from other vendors to this announcement brought to light some of the dirty games and ugly truths in the area of VADP/VDDK integration.

  1. Vendors were claiming support for vSphere 5.1 but still shipping VDDK 5.0 with their products. This is currently not supported by VMware because of the uncertainties. This may change, but at the time vendors were claiming support, they were taking risks that typically are not acceptable in the data protection business.
  2. Vendors were mucking with API calls and silently killing hung processes. That may work for an isolated or random hang, but it will not work when there are repeatable hang situations like those observed in VDDK 5.1. Plus, there are performance and reliability concerns in abruptly ending sessions with vSphere.
  3. Most vendors weren’t testing all the edge cases and never realized the problems in VDDK 5.1, thus prematurely announcing support for 5.1.

If your backup vendor currently supports vSphere 5.1, be sure to ask what their situation is.

Sources and references:

1. Quality wins every time: vSphere 5.1 support update, Symantec official blog.

2. VDDK 5.1 Release Notes, VMware Support resources.

3. VDDK 5.0 Release Notes, VMware Support resources.

4. Third-party backup software using VDDK 5.1 may encounter backup/restore failures, VMware Support KB.

Dear EMC Avamar, please stop leeching from enterprise vSphere environments

VMware introduced the vStorage APIs for Data Protection (VADP) so that backup products can do centralized, efficient, off-host, LAN-free backup of vSphere virtual machines.

In the physical world, most systems have plenty of resources, often underutilized. Running a backup agent on such a system wasn’t a primary concern for most workloads. The era of virtualization changed things drastically. Server consolidation via virtualization allowed organizations to get the most out of their hardware investment. That means backup agents no longer have the luxury of simply taking resources from production workloads, as the underlying ESXi infrastructure is optimized and right-sized to keep line-of-business applications running smoothly.

VMware solved the backup agent problem from the early days of ESX/ESXi hosts. The SAN transport method for virtual machine backup was born during the old VCB (VMware Consolidated Backup) days and further enhanced in VADP (vStorage APIs for Data Protection). The idea is simple. Let the snapshots of virtual machines be presented to a workhorse backup host and allow that system to do the heavy lifting of processing and moving data to backup storage. The CPU, memory and I/O resources on ESX/ESXi hosts are not used during backups. Thus the production virtual machines are not starved of hypervisor resources during backups.

For non-SAN environments like NFS-based datastores, the same dedicated host can use Network Block Device (NBD) transport to stream data through the management network. Although it is not as efficient as SAN transport, it still offloads most of the backup processing to the dedicated physical host.

Dedicating one or more workhorse backup systems to do backups was not practical for small business environments and remote offices. To accommodate that business need, VMware allowed virtual machines to act as backup proxy hosts for smaller deployments. This is how hotadd transport was introduced.

Thus your backup strategy should be to use a dedicated physical workhorse backup system to offload all or part of the backup processing using SAN or NBD transport. For really small environments, a virtual machine with NBD or hotadd transport would suffice.
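Here is that decision written out as a tiny sketch. The function and parameter names are hypothetical, and it deliberately simplifies the virtual proxy case to hotadd; real products expose this as a configuration setting, not code.

```python
def pick_backup_transport(proxy_is_virtual: bool, datastore_on_san: bool) -> str:
    """Pick a transport following the strategy described in this post.

    Simplification: a virtual proxy always gets hotadd, even though the
    post notes NBD is also possible for really small environments.
    """
    if proxy_is_virtual:
        return "hotadd"
    # A dedicated physical backup host reads from the SAN when it can,
    # and falls back to NBD for non-SAN (for example NFS) datastores.
    return "san" if datastore_on_san else "nbd"


print(pick_backup_transport(proxy_is_virtual=False, datastore_on_san=True))   # san
print(pick_backup_transport(proxy_is_virtual=False, datastore_on_san=False))  # nbd
print(pick_backup_transport(proxy_is_virtual=True, datastore_on_san=False))   # hotadd
```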

Somehow EMC missed this memo. Ironically, EMC had been the proponent of running Avamar agent inside the guest instead of adopting VMware’s VADP. The argument was that the source side deduplication at Avamar agent minimizes the amount of data to be moved across the wire. While that is indeed true, EMC conveniently forgot to mention that CPU intensive deduplication within the backup agent would indeed leech ESXi resources away from production workloads!

Then EMC conceded and announced VADP support. But the saga continues. What EMC provided is hotadd support for VADP. That means you allocate multiple proxy virtual machines even in the case of enterprise vSphere environments. Some of the best practice documents for Avamar suggest deploying a backup proxy host for every 20 virtual machines. A typical vSphere environment in an enterprise would have 1000 to 3000 virtual machines. That translates to 50 to 150 proxy hosts! These systems are literally the leeches in a vSphere environment, draining resources that belong to production applications.
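The arithmetic is easy to check. A quick sketch, using the one-proxy-per-20-VMs ratio quoted above (the function name is mine):

```python
import math

VMS_PER_PROXY = 20  # ratio quoted above from Avamar best practice documents


def proxies_needed(vm_count: int, vms_per_proxy: int = VMS_PER_PROXY) -> int:
    """Number of proxy VMs needed, rounding up to a whole proxy."""
    return math.ceil(vm_count / vms_per_proxy)


for vms in (1000, 3000):
    print(vms, "VMs ->", proxies_needed(vms), "proxy VMs")
# 1000 VMs -> 50 proxy VMs
# 3000 VMs -> 150 proxy VMs
```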

The giant tower of energy-consuming nodes in the Avamar grid is not even lifting a finger in processing backups! It is merely a storage system. The real workhorses are the ESXi hosts giving up CPU, memory and I/O resources to Avamar proxy hosts to generate and deduplicate the backup stream.

The story does not change even if you replace the Avamar Datastore with a Data Domain device. In that case, the DD Boost agents running on Avamar proxy hosts are draining resources from ESXi to reduce data at the source and send deduplicated data to the Data Domain system.

EMC BRS should seriously look at the way Avamar proxy hosts, with or without DD Boost, are leeching resources from precious production workloads. The method used by Avamar is recommended only for SMB and remote office environments. Take the hint from VMware engineering as to why Avamar technology was borrowed to provide a solution for SMB customers in the VMware Data Protection (VDP) product. You can’t chop down a tree with a penknife!

The best example of effectively using VADP for enterprise vSphere is the NetBackup 5220. EMC BRS could learn a lesson or two from how Symantec integrates with VMware in a much better way. This appliance is a complete backup system with intelligent deduplication and VADP support built right in for VMware backups. It does the heavy lifting so that production workloads are unaffected by backups.

How about recovery? For thick provisioned disks, SAN transport is indeed the fastest. For thin provisioned disks, NBD performs much better. The good news on the Symantec NetBackup 5220 is that the user can control the transport method for restores as well. You might have done the backup using SAN transport, but you can do the restore using NBD if you are restoring thin provisioned virtual machines. For Avamar, hotadd is the be-all and end-all. NBD on a virtual proxy isn’t useful, so it is a moot point when the product offers only a virtual machine proxy for VADP.

The question is…

Dear EMC Avamar, when will you offer an enterprise grade VADP based backup for your customers? They deserve enterprise grade protection for the investment they have made in large Avamar Datastores and Data Domain devices.

VMware announces vSphere Data Protection (VDP), what is in it for you?

vSphere Data Protection (VDP) is VMware’s new virtual backup appliance for SMB, available in VMware vSphere 5.1. It replaces the older VMware Data Recovery (vDR) product. There has been a fair amount of confusion around this announcement, partly due to the way EMC, VMware’s parent company, made some press releases.

Is VDP the same as EMC’s Avamar Virtual Edition (Avamar VE)?

No, it is not. VDP is a product from VMware. The only technology VMware has used from Avamar is its deduplication engine. The older vDR had limited dedupe capabilities, as they mainly came from changed block tracking (CBT) in the vStorage APIs for Data Protection (VADP). With Avamar’s technology, VDP now provides variable block based deduplication.

I heard that I can upgrade from VDP to EMC Avamar if I need to grow beyond 2TB, is that true?

No, VDP is not a ‘lite’ version of Avamar. It is a different product altogether.

What are my options if I need to grow beyond 2TB?

You could add additional VDP appliances. Up to 10 VDP appliances are supported under one vCenter server. However, these are separate islands of storage. The appliances do not provide global deduplication across these storage pools.

Having said that, you are more likely to hit other limitations in VDP before hitting the 2TB limit. Note that the Avamar based deduplication engine is suitable only for SMBs who can afford to have blackout windows and maintenance windows in their backup solution. These are the periods of time when housekeeping work is being done by the dedupe engine and the system is not available for running backup jobs.

Only 8 virtual machines can be backed up concurrently, which might increase backup windows. There is no SAN transport capability to offload backup tasks from production ESXi hosts. There is no good way to make additional copies for redundancy or extended retention, such as replication to a remote location or the cloud. VMware has made it clear that VDP is truly for SMBs and encourages customers to look at enterprise class backup solutions from partners for larger environments.
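To get a feel for these limits, here is a rough back-of-the-envelope sketch. The three constants are the figures quoted in this post; the function names are mine, and the model ignores dedupe ratios and blackout windows entirely.

```python
import math

MAX_TB_PER_APPLIANCE = 2         # capacity limit per VDP appliance
MAX_APPLIANCES_PER_VCENTER = 10  # appliances supported under one vCenter
MAX_CONCURRENT_VMS = 8           # virtual machines backed up at once


def appliances_needed(backup_tb: float) -> int:
    """Appliances required to hold backup_tb of backup data."""
    return math.ceil(backup_tb / MAX_TB_PER_APPLIANCE)


def fits_one_vcenter(backup_tb: float) -> bool:
    """True if the required appliances stay within the vCenter limit."""
    return appliances_needed(backup_tb) <= MAX_APPLIANCES_PER_VCENTER


def backup_waves(vm_count: int) -> int:
    """Sequential batches needed when only 8 VMs run concurrently."""
    return math.ceil(vm_count / MAX_CONCURRENT_VMS)


print(appliances_needed(5))   # 3 separate islands of storage
print(fits_one_vcenter(25))   # False: would need 13 appliances
print(backup_waves(100))      # 13 batches of up to 8 VMs each
```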

Why would EMC let VMware use its Avamar technology at no additional cost to customers? Is EMC trying to promote its products?

Just as Windows, UNIX and Linux operating environments provide basic utilities for backups, VMware has always provided a basic backup solution with its offerings. In the days of the ESX service console, the Linux based console provided tools like tar and cpio. With ESXi, where the service console is no more, vDR was brought to the table. vDR had its limitations. The choice then was to keep innovating on vDR or to license a relatively mature technology. As its parent company had a solution, VMware went the route of taking the Avamar dedupe engine for storage and building its own capabilities for scheduling backups and managing recovery points.

EMC’s Avamar is a popular product in small environments. Although EMC has been trying hard to make Avamar enterprise ready, its deduplication engine has significant limitations. It requires blackout and maintenance windows. With larger capacities, the duration of these windows also increases. With the acquisition of Data Domain, EMC is now focusing more on using its DD Boost technology for distributing the deduplication workload. In fact, EMC recommends the use of Data Domain Boost with Avamar (instead of using Avamar’s dedupe engine) for larger workloads. I believe it was a good decision to support VMware’s SMB market with a technology that was meant for SMB in the first place. I think the Avamar dedupe engine’s days are numbered as a technology that can make money. See my earlier blog on EMC’s backup portfolio.

Stay tuned. More on VDP coming soon!

What do NetApp ONTAP and Symantec NetBackup have in common?

A friend of mine forwarded this link to the interview SearchStorage.com recently did with Dave Hitz, one of the founders of NetApp. It is an interesting read, and the major topic is the new clustering capabilities in ONTAP 8. When he was asked about EMC’s Isilon, I thought his response hit a home run.

“If you look at features EMC can support, you end up with a complete list. If you break apart their architectures and look at the same feature list by architecture, you end up finding the main feature Isilon has is clustering, which is great. Unfortunately, it’s not in combination with the full suite of rich data management capabilities. That’s the No. 1 difference Ontap has: it’s the same Ontap that has all this cool stuff in it,” said Dave Hitz.

The context here is the fact that the foundational technology powering all storage systems from NetApp is ONTAP (with E-series being an outlier), and customers get the choice of footprint and features to match their workloads. EMC’s storage division, on the other hand, provides different products for overlapping sets of workloads, like VNX, VMAX, Isilon etc.

If you think about it, this response applies when you look at other business units from EMC as well. My favorite is EMC’s Backup and Recovery Services (BRS) division. They have four different products, Avamar, Data Domain, NetWorker and HomeBase, pretty much serving the same market. If I were to fit Dave’s quote into the context of backup and recovery and use Symantec’s NetBackup as the competitor for EMC’s backup products, it would go something like this.

If you look at features EMC can support as a vendor for backup and recovery, you end up with a near-complete list. If you break apart their architectures and look at the same feature list by architecture, you end up finding that the main value Data Domain has is storage reduction at target with federation capabilities for limited application workloads. Avamar has full management capabilities but only for smaller workloads. NetWorker has decent long-term retention capabilities and track record but has been on life support. HomeBase provides Bare Metal Recovery. Unfortunately, none of these products comes with a full suite of rich data management capabilities for end-to-end protection that can bring down capital and operational expenses in managing recovery points. That’s the No. 1 difference NetBackup has: it’s the same NetBackup that has all that cool stuff in one platform and a lot more innovations like managing snapshots, replicas, virtualized applications, backup acceleration etc.

As always, the standard disclaimer applies here. This is just my opinion. Although I work for Symantec, the above statement should not be considered as the view of my employer.

Will EMC BRS kill Avamar or NetWorker?

EMC World 2012 has come and gone. Those watching the Backup and Recovery Services (BRS) division will have noticed a drastic shift in strategy since last year. Are Avamar’s days numbered?

Surprised? Let me explain. Remember the “Tape sucks! Move on!” campaign sung by BRS last year? They even mocked Google for recovering from tapes. They wanted the world to look at Avamar and Data Domain, the two products with spinning disks, as the houses of backups. The other child, NetWorker, was mostly ignored and was on life support just to get by in the era of tapes.

BRS seems to have come to grips with reality to some extent. The incremental updates to Avamar and the revelation of NetWorker 8 features tend to indicate that BRS is taking a 180-degree turn.

No real updates for Avamar Data Store: All the announced business critical application support in Avamar is for both Data Domain Boost and the Avamar native client. Hyper-V, which is popular among SMB workloads, is now available through Boost to a Data Domain target. Last year, BRS’ announcement was that DD is for specific workloads and Avamar Data Store is for everything else. Now Boost is getting more attention and the Avamar engine by itself pretty much stays the same. The blackout windows in Avamar Data Store already annoy customers. Data Domain deduplication engine is preferred for target dedupe and DD Boost will replace source side deduplication eventually? Inspired by Symantec’s Dedupe Everywhere strategy?

Note: Thanks to Ian’s comment clarifying that the newer application support is available for Avamar as well, not just for Data Domain through DD Boost.

Emergence of Media Access Node: BRS realized that customers with longer retention requirements would not buy into the ‘keep it on disk’ message. Tape provides economies of scale. Modern tape technologies are superior in performance and reliability. Now, BRS ships a NetWorker node under the covers as the Media Access Node in Avamar to copy rehydrated data to tape in NetWorker tape format.

NetWorker 8.0 getting a facelift: Although NetWorker was ignored in keynotes, BRS made a deliberate attempt this year to show what is happening to NetWorker. It was headed for the morgue but has now been pulled back and is getting revved up. There is a long road ahead to convince customers, but BRS says it is putting as many resources on NetWorker as it did on Avamar. Not to mention the newfound love, Spectralogic, to compete with IBM and Oracle.

If you pay closer attention, all that Avamar got is to make things better for Data Domain (Boost expansion, multi-stream support…) and NetWorker (data stored in NetWorker tape format). In a nutshell, BRS wants everyone to keep backup data in either Data Domain dedupe format or NetWorker tape format. Once the NetWorker and Data Domain Boost combination can support backups over the WAN, Avamar may not have anything to offer. From an operating margin perspective, Avamar as a product may become a dog in the BCG growth-share matrix? The one eventually going to the morgue looks to be the Avamar dedupe engine?

vSphere changed block tracking: A powerful weapon for backup applications to shrink backup window

Changed block tracking is not a new technology. Those who have used Storage Foundation for Oracle would know that the VERITAS file system (VxFS) provides no-data check points, which can be used by backup applications to identify and back up just the changed blocks from the file systems where database files are housed. This integration has been in NetBackup since version 4.5, which was released 10 years ago! It is still used by Fortune 500 companies to protect mission critical Oracle databases that would otherwise require a large backup window with traditional RMAN streaming backups.

VMware introduced changed block tracking (CBT) in vSphere 4.0, and it is available for virtual machines at version 7 or higher. NetBackup 7.0 added support for CBT right away. Backing up VMware vSphere environments got faster. When a VM has CBT turned on, it can track changes to virtual machine disk (VMDK) sectors. Its impact on VM performance is marginal. Backup applications with VADP (vStorage APIs for Data Protection) support can use an API (named QueryChangedDiskAreas) to identify and copy the blocks changed since a particular point in time. This time point is identified using an argument named ChangeId in the API call.
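To make the flow concrete, here is a minimal sketch of how a backup application might walk a disk and collect changed areas. Everything in it is an assumption for illustration: `query` is a stand-in for the real QueryChangedDiskAreas call (it does not reproduce the actual signature), `ChangeInfo` is a simplified result shape, and `fake_query` only simulates a VM so the loop can run.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class ChangeInfo:
    """Simplified, hypothetical result of one changed-area query."""
    start_offset: int                     # start of the region this answer covers
    length: int                           # size of the region this answer covers
    changed_areas: List[Tuple[int, int]]  # (start, length) of each changed extent


def collect_changed_areas(
    query: Callable[[int, str], ChangeInfo], disk_size: int, change_id: str
) -> List[Tuple[int, int]]:
    """Page through the disk, gathering extents changed since change_id."""
    areas: List[Tuple[int, int]] = []
    offset = 0
    while offset < disk_size:
        info = query(offset, change_id)
        areas.extend(info.changed_areas)
        if info.length <= 0:  # guard against a stuck stand-in
            break
        offset = info.start_offset + info.length
    return areas


def fake_query(offset: int, change_id: str) -> ChangeInfo:
    """Simulated VM: answers in 100-byte windows from a fixed change list."""
    window = 100
    all_changes = [(10, 5), (120, 30), (250, 10)]
    hits = [(s, n) for s, n in all_changes if offset <= s < offset + window]
    return ChangeInfo(offset, window, hits)


print(collect_changed_areas(fake_query, disk_size=300, change_id="example-id"))
# [(10, 5), (120, 30), (250, 10)]
```

Only the extents returned need to be read from the snapshot and sent to backup storage, which is where the backup window savings come from.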

VMware has made this quite easy for backup vendors to implement. But powerful weapons can be dangerous when not used with the utmost care. An unfortunate problem in Avamar’s implementation of CBT came to light recently. I am not picking on Avamar developers here. It is not possible to predict all the edge cases during development, and they are working hard to fix this data loss situation. As an engineer myself, I truly empathize with Avamar developers for getting themselves into this unfortunate situation. This blog is a humble attempt to explain what happened, as I got a few questions from the field seeking input on the use of CBT after EMC reported the issues in Avamar.

As we know, VADP lets you query the changed disk areas to get all the changes in a VMDK since a point in time corresponding to a previous snapshot. Once the changed blocks are identified, those blocks are transferred to the backup storage. The way the changed blocks are used by the backup application to create the recovery point (i.e. backup image) varies from vendor to vendor.

No matter how the recovery point is synthesized, the backup application must make sure that the changed blocks are accurately associated with the correct VMDK, because a VM can have many disks. As you can imagine, if the blocks are associated with the wrong disk in the backup image, the image is not an accurate representation of the source. A recovery from this backup image will fail or will result in corrupt data on the restored VM.

The correct way to identify VMDKs is by their UUIDs, which are always unique. Positional identifiers like controller-target-LUN at the VM level are not reliable, as those numbers could change when some of the VMDKs are removed or new ones are added to a VM. This is an example of the disk re-order problem. This re-order can also happen for non-user initiated operations. In Avamar’s case, the problem was that the changed blocks belonging to one VMDK were getting associated with a different VMDK in backup storage on account of VMDK re-ordering. Thus the resulting backup image (recovery point) did not represent the actual state of the VMDK being protected.
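A toy illustration of the re-order problem follows. The disk names and UUID strings are made up, and plain list position stands in for a positional identifier such as controller-target-LUN.

```python
def by_position(disks):
    """Index disks by their slot in the VM (a positional identifier)."""
    return {slot: d["name"] for slot, d in enumerate(disks)}


def by_uuid(disks):
    """Index disks by UUID, which does not depend on ordering."""
    return {d["uuid"]: d["name"] for d in disks}


# Hypothetical VM with three disks at the time of the previous backup.
before = [
    {"uuid": "uuid-a", "name": "os.vmdk"},
    {"uuid": "uuid-b", "name": "logs.vmdk"},
    {"uuid": "uuid-c", "name": "data.vmdk"},
]
# The logs disk is removed, so the data disk moves from slot 2 to slot 1.
after = [before[0], before[2]]

# Positional lookup: slot 1 now silently refers to a different disk.
print(by_position(before)[1], "->", by_position(after)[1])
# logs.vmdk -> data.vmdk

# UUID lookup: the same key still refers to the same disk.
print(by_uuid(before)["uuid-c"], "->", by_uuid(after)["uuid-c"])
# data.vmdk -> data.vmdk
```

A backup catalog keyed by slot would apply the data disk's changed blocks on top of the old logs disk image, which is the kind of mismatch described above.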

To make matters worse, there was a cascading effect. It appears that Avamar’s way of generating a recovery point is to use the previous backup as the base. If a disk re-order happens after the nth backup, all backups after the nth backup are affected on account of the cascading effect, because new backups inherit their base from the corrupted image.

This sounds scary. That is how I started getting questions on reliability of CBT for backups from the field. Symantec supports CBT in both Backup Exec and NetBackup. Are Symantec customers safe?

Yes, Symantec customers using NetBackup and Backup Exec are safe.

How do Symantec NetBackup and Backup Exec handle re-ordering? Block level tracking and its associated risks were well thought out during the implementation. Block level tracking is not something new for Symantec engineering, because such situations were accounted for in the design of VxFS’s no-data check point block level tracking several years ago.

There are multiple layers of resiliency built into Symantec’s implementation of CBT support. I shall share oversimplified explanations for two of those that are relevant to ensuring data integrity here.

Using UUID to accurately associate ChangeId to correct VMDK: We already touched on this. UUID is always unique and using it to associate the previous point in time for VMDK is safe. Even when VMDKs get re-ordered in a VM, UUID stays the same. Thus both NetBackup and Backup Exec always associate the changed blocks to the correct VM disk.

Superior architecture that eliminates the ‘cascading-effect’: Generating a corrupted recovery point is bad. What is worse is to use it as the base for newer recovery points. The corruption goes on and hurts the business if left unnoticed for a long time. NetBackup and Backup Exec never directly inject changed blocks into an existing backup to create a new recovery point. The changed blocks are referenced separately in the backup storage. During a restore, NetBackup recreates the point in time at run-time. This is the reason NetBackup and Backup Exec are able to support block level incremental backups even to tape media! Thus a corrupted backup (should that ever happen) never ‘propagates’ corruption to future backups.
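The general idea of keeping changed blocks separate and assembling a point in time only at restore can be sketched in a few lines. This is a toy model under my own assumptions (dictionaries of block number to data, a hypothetical `restore` function) and not Symantec's actual implementation.

```python
def restore(full, incrementals, upto):
    """Rebuild a point in time by overlaying stored increments on a copy.

    full         : block number -> data from the full backup
    incrementals : list of dicts, each holding only that backup's changed blocks
    upto         : how many incrementals to apply (0 means the full backup)
    """
    image = dict(full)  # work on a copy; stored backups are never modified
    for changed_blocks in incrementals[:upto]:
        image.update(changed_blocks)
    return image


full_backup = {0: "A0", 1: "B0", 2: "C0"}
stored_incrementals = [
    {1: "B1"},           # first incremental: only block 1 changed
    {0: "A2", 2: "C2"},  # second incremental: blocks 0 and 2 changed
]

print(restore(full_backup, stored_incrementals, upto=1))
# {0: 'A0', 1: 'B1', 2: 'C0'}
print(restore(full_backup, stored_incrementals, upto=2))
# {0: 'A2', 1: 'B1', 2: 'C2'}
```

Because each backup is written once and only read at restore time, nothing is rewritten in place, which is also what makes the approach workable on sequential media like tape.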