Recently I came across a blog post from Szamos Attila. He ran a deduplication contest between Data Domain and Veeam. His was a very small test environment, just 12 virtual machines with 133GB of data. His observations were significant, so I thought I would share them here.
You can read more about Mr. Attila’s tests at his blog here.
What does this tell you right off the bat? Well, Veeam’s deduplication is slow. Since they do not charge for deduplication separately, that is not a big deal, right?
Not exactly; there is much more to this story if you take a look at the big picture.
First of all, note that this is a very small data set (just 133GB, even my laptop has more data!). Veeam’s deduplication is not really a true deduplication engine that fingerprints data segments and stores only one copy of each. It is basically a data reduction technique that works only within a predefined set of backup files. Veeam refers to this set of backup files as a backup repository. You can only run one backup job to a given repository at any given time. Hence, if you want to back up two virtual machines concurrently, you need to send them to two different backup repositories. If you do that, your backup data is not deduplicated across those two jobs. Thus data reduction with Veeam’s deduplication and concurrent processing of jobs work against each other: the more jobs you run in parallel, the less redundant data gets removed. This is a major drawback, as VMs generally contain a lot of redundant data. In fact, Veeam recommends running deduplication mainly for a backup set where all the VMs come from the same template.
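To make the point about fingerprinting concrete, here is a minimal sketch of what a fingerprint-based deduplication store does, and why splitting backups across separate stores loses the savings. Everything in it is my own illustration: the `ChunkStore` class, the 4-byte fixed chunk size and the toy VM contents are assumptions, not the way Data Domain or Veeam actually implement anything.

```python
import hashlib

CHUNK_SIZE = 4  # tiny fixed-size chunks, chosen only to keep the example readable


def chunks(data: bytes, size: int = CHUNK_SIZE):
    """Split data into fixed-size segments."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    """Hypothetical store that keeps one copy of each unique segment."""

    def __init__(self):
        self.segments = {}  # fingerprint -> segment bytes

    def write(self, data: bytes):
        """Store data; return the recipe (list of fingerprints) needed to rebuild it."""
        recipe = []
        for seg in chunks(data):
            fp = hashlib.sha256(seg).hexdigest()
            self.segments.setdefault(fp, seg)  # only the first copy is kept
            recipe.append(fp)
        return recipe

    def read(self, recipe):
        """Rebuild the original data from its recipe."""
        return b"".join(self.segments[fp] for fp in recipe)

    def stored_bytes(self):
        return sum(len(seg) for seg in self.segments.values())


# Two toy "VMs" cloned from the same template: identical OS blocks, different data.
vm_a = b"OS__OS__OS__AAAA"
vm_b = b"OS__OS__OS__BBBB"

# Case 1: both VMs land in one shared store.
shared = ChunkStore()
recipe_a = shared.write(vm_a)
shared.write(vm_b)

# Case 2: each VM goes to its own store, as with two concurrent jobs.
repo_1, repo_2 = ChunkStore(), ChunkStore()
repo_1.write(vm_a)
repo_2.write(vm_b)

shared_total = shared.stored_bytes()
split_total = repo_1.stored_bytes() + repo_2.stored_bytes()
print(shared_total, split_total)
```

In this toy case the shared store keeps the common OS segment once for both VMs, while the two separate stores each keep their own copy of it. That is the trade-off described above, just at a scale you can read in one glance.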
Secondly, note that even with a single backup repository, this tiny data set (of just 133GB) took twice as long to back up as it did with Data Domain’s deduplication. Now think about a small business environment with a few terabytes of data. Imagine the time it would take to protect that data. When it comes to an enterprise data center (100s of terabytes), you must depend on a target-based deduplication solution like Data Domain to get the job done.
So, can I simply let Veeam do the data movement and count on Data Domain to do the deduplication? That is one way to solve this problem. But you have a multitude of other issues with that approach because of the way Veeam does restores.
Veeam does not have a good way to let application administrators in the guest operating system (e.g. the Exchange administrator on a VM running Microsoft Exchange) self-serve their restore needs. First, the application administrator submits a ticket for the restore. Then the backup administrator mounts the VMDK files from backup using a temporary VM that starts up on a production ESX host. Even to restore a small object, you have to allocate resources for the entire VM on the ESX host (the marketing name for this multi-step restore is U-AIR). As this VM needs to ‘run’ from backup storage, it is not recommended to keep the backup image on deduplicated storage served over the network. Target deduplication devices are designed for streaming data sequentially, so the random I/O pattern caused by running a VM from such storage is painfully slow. This is even stated by the partners who offer deduplication storage for Veeam. HP ran tests with Veeam using the HP StoreOnce target deduplication appliance and published a white paper on them; please see this white paper in Business Week, in particular the section on Performance Considerations.
Note also that in Veeam’s reverse incremental backup strategy, typically only the most recent backup stays as a single image. If you are unfortunate enough to need a restore from a copy that is not the most recent one, performance degrades further while running the temporary VM from backup storage, as a lot of random I/O needs to happen at the back end.
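The reason older restore points cost more can be seen in a toy model of a reverse incremental chain. This is only a sketch under my own assumptions: the `ReverseIncrementalChain` class, the block dictionaries and the lookup counter are hypothetical and say nothing about Veeam’s actual on-disk format. The newest point is kept as a full image, and each backup saves the previous contents of the blocks it overwrites so that older points can be rebuilt.

```python
class ReverseIncrementalChain:
    """Toy model: newest restore point is a full image, older ones are rollbacks."""

    def __init__(self, full_image: dict):
        self.latest = dict(full_image)  # block number -> contents
        self.rollbacks = []             # oldest first; block -> previous contents

    def backup(self, changed_blocks: dict):
        """Apply changed blocks to the full image, saving what they replaced."""
        rollback = {blk: self.latest.get(blk) for blk in changed_blocks}
        self.rollbacks.append(rollback)
        self.latest.update(changed_blocks)

    def restore(self, points_back: int):
        """Rebuild the image from `points_back` backups ago.

        Returns the image and the number of extra block lookups needed.
        """
        image = dict(self.latest)
        lookups = 0
        start = len(self.rollbacks) - points_back
        # Walk from the newest rollback file back to the requested point.
        for rollback in reversed(self.rollbacks[start:]):
            for blk, old in rollback.items():
                lookups += 1
                if old is None:
                    image.pop(blk, None)  # block did not exist at that point
                else:
                    image[blk] = old
        return image, lookups


chain = ReverseIncrementalChain({0: "a0", 1: "b0", 2: "c0"})
chain.backup({0: "a1"})            # first incremental run
chain.backup({1: "b2", 2: "c2"})   # second incremental run

newest, cost_newest = chain.restore(0)
oldest, cost_oldest = chain.restore(2)
print(cost_newest, cost_oldest)
```

Restoring the newest point needs no extra lookups, while every step further back adds scattered block reads from the rollback files. On storage built for sequential streaming, those scattered reads are exactly the expensive part.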
Even after you have patiently waited for the VM to start up from backup storage, the application administrator still needs to figure out how to restore the required objects. If the object is not in the currently mounted backup image, he or she has to send another ticket to the Veeam administrator to mount a different backup image on a temporary VM. This saga continues until the application administrator finds the correct object. What a pain!
There you have it. On one side you have scalability and backup performance issues if using Veeam’s deduplication. On the other side, you have poor recovery performance and usability issues when using a target deduplication appliance with Veeam. This is the deduplication dilemma!
The good news is that target deduplication devices work well with NetBackup and Backup Exec. Both products provide user interfaces for application administrators so that they can self-serve their recovery needs. At the same time, VM backup and recovery remain agentless. The V-Ray powered NetBackup and Backup Exec can stream the actual object directly from the backup, with no need to mount it using a temporary VM.