The Big Hole in EMC’s Big Data Backup Story

One of the crucial roles of the marketing team in any organization is to communicate the value of its products and services. It is not uncommon (pardon the double negative) for organizations to show the best side of their story while deliberately hiding the weaker aspects in the fine print. The left side of the picture below is a snapshot of the breakfast cereal (General Mills’ Total) that came with my breakfast order at a Sheraton while travelling on business.

EMC appears to have a Big Hole in its Big Data Backup

Note that General Mills claims 100% of the daily value of 11 vitamins and minerals, but with an asterisk. The claim is true only if I consume a 53g serving, but the box has only 33g!

Although I may have felt a bit taken aback as a consumer, I enjoyed giving my General Mills friends a bit of a hard time and moved on. This is a small transaction.

What if you were responsible for a transaction worth tens of thousands of dollars and were pitched a glass half-full story like this? It does happen. That General Mills cereal box is what came to my mind when I saw this blog from EMC on protecting Big Data (Teradata) workloads using EMC ‘Big Data backup solution’.

General Mills had the courtesy to state in the fine print that part of the vitamins and minerals is missing from its box. EMC’s blog didn’t really call out what was missing from its ‘box’, aka the Data Domain device, to protect a Teradata workload using the Teradata Data Stream Architecture. In fact, it is missing the real brain of the solution: NetBackup!

First, a little bit of history and some naked truth. Teradata has been working with NetBackup for over a decade to provide data protection for its workloads. In fact, Teradata sells the NetBackup Agent for Teradata to its customers. This agent pushes the data stream to NetBackup media servers. This is where the real workload-aware intelligence (the real brain for this Big Data backup) is built. Once a NetBackup media server receives the data stream, it can store it on any supported storage: NetBackup Deduplication Pool, NetBackup Advanced Disk Pool, NetBackup OpenStorage Pool or even a tape storage unit! When it comes to NetBackup OpenStorage Pool, it does not matter who the OpenStorage partner is; it can be EMC Data Domain, Quantum DXi,… The naked truth is that the backend devices are dumb storage devices from the view of the NetBackup Agent for Teradata (the Teradata BAR component depicted in the blog).

EMC’s blog appears to have been designed to mislead the reader. It tends to imply that there is some sort of special sauce built natively into Data Domain (or Data Domain Boost) for Teradata BAR stream. The blog is trying to attach EMC to Big Data type workloads through marketing. May I say that the hole is quite big in EMC’s Big Data backup story!

I am speculating that EMC has been telling this story for a while in private engagements with clients. Note that the blog simply displays some of EMC’s slides that are marked ‘confidential’. The author forgot to remove the marking before publishing them. In closed meetings with joint customers of Teradata and NetBackup, a slide like this will create the illusion that Data Domain has something special for Teradata backup. Now the truth has leaked!

Client Direct: NetBackup vs. NetWorker

NetBackup introduced the Client Direct capability a few years back with the NetBackup 7.0 release. This is a breakthrough innovation in backup infrastructure architecture. Traditionally, backup is a process where data is read from the production client, transmitted over the wire in its entirety to a backup server and then written to storage. The emergence of target dedupe appliances behind a backup server meant that backup data now takes multiple hops through the network. It hops from client to backup server first, then from backup server to deduplication appliance. NetBackup changed this game. The NetBackup client can dedupe the backup stream at source and send deduplicated data directly to NetBackup’s deduplication pool, for example a NetBackup 5020 deduplication appliance, as illustrated below.

NetBackup Client Direct

This architecture is possible in NetBackup because it has several innovations that reduce the impact of running deduplication at the production client.

  1. NetBackup Accelerator: This technology features a platform independent track log that intelligently detects changed files without the need for enumerating the entire file system. Then it optimally synthesizes a full backup image at the storage. The result: Full backups can be run using the resources needed to run an incremental backup.
  2. NetBackup Client Side Deduplication Cache: This enables the production client to run deduplication by comparing the generated fingerprints for the chunks in the changed file (detected intelligently as explained in 1) against the previous backup set without shipping the fingerprints to storage for comparison. The result: Superior federated deduplication without the slow chatter across the network.
  3. Intelligent Hybrid Chunking that is not CPU bound: Deduplication chunking is done typically using variable block method or fixed block method. The first one is CPU intensive and the second one is less efficient in data reduction rate. NetBackup uses the best of both worlds by using intelligent hybrid chunking. As deduplication-fingerprinting logic is built into the client, it can start the chunking exactly after identifying the object boundaries. Thus you get the advantage of not being CPU bound while also not suffering from low deduplication rate.
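
The idea behind the client-side fingerprint cache in item 2 can be illustrated with a minimal sketch. This is not NetBackup's implementation: the fixed 4-byte chunk size, the use of SHA-256, and the function names `chunk`, `fingerprint` and `backup` are all assumptions chosen to keep the example small. It only shows how a local cache of fingerprints from earlier backups lets a client decide which chunks to send without asking the storage target.

```python
import hashlib

CHUNK_SIZE = 4  # hypothetical, tiny fixed-size chunks so the example is easy to follow


def chunk(data: bytes, size: int = CHUNK_SIZE) -> list:
    """Split data into fixed-size chunks (the last one may be shorter)."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def fingerprint(piece: bytes) -> str:
    """Fingerprint a chunk with SHA-256 (an assumed choice of hash)."""
    return hashlib.sha256(piece).hexdigest()


def backup(data: bytes, cache: set) -> list:
    """Return only the chunks whose fingerprints are not in the local cache.

    The cache stands in for the fingerprints of previous backups kept on the
    client, so no network round trip is needed to decide whether a chunk is new.
    """
    to_send = []
    for piece in chunk(data):
        fp = fingerprint(piece)
        if fp not in cache:
            cache.add(fp)
            to_send.append(piece)
    return to_send


cache = set()
first = backup(b"AAAABBBBCCCCDDDD", cache)   # every chunk is new
second = backup(b"AAAABBBBXXXXDDDD", cache)  # only one chunk changed
print(len(first), len(second))  # 4 1
```

In this toy run the second backup ships a single chunk, because the other three fingerprints are already in the client's cache.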

Reducing the impact on the production client’s resources, reducing the impact on the production network, reducing the number of hops and reducing the impact on the backup server (translation: increased scalability) make NetBackup Client Direct a unique feature. The popularity of this feature has made ‘Client Direct’ a common name that appears in RFPs for backup solutions.

The pressure is causing other backup vendors to come up with their own ‘Client Direct’. EMC announced last week that NetWorker 8.0 will have this capability and even named it ‘Client Direct’ so that the checkboxes in RFPs can be ticked. A closer look reveals that NetWorker Client Direct is suitable for a checkbox in an RFP, but really not ready for primetime as is.

  1. NetWorker Client has no intelligent detection of changed files. NetWorker also does not have any sort of optimized synthetics. The result: Running full backups with NetWorker Client Direct will use a significant amount of processing power from production clients.
  2. The NetWorker client does not federate deduplication; it is done by DD Boost. As these two are essentially unaware of each other’s format, there is no way to cache fingerprints of the chunks from previous backups. That means excessive chitchat with the target Data Domain device during backups.
  3. DD Boost is the process of offloading some of the Data Domain deduplication processing to other systems. In this case, the production clients are taking that load. As clearly documented in Data Domain SISL architecture “SISL takes the pressure off of disk accesses as a bottleneck so that the system relies on the speed of the CPU to deliver inline deduplication performance”. Translation: CPU bound chunking. When this is offloaded to production clients, it can severely affect the performance of production systems with large backup workloads.

Even though EMC can mark the checkboxes in RFPs, its specialists are less likely to encourage POCs with NetWorker Client Direct. In a neck-and-neck battle, it appears that NetWorker has a long road ahead to match NetBackup Client Direct.

EMC or HP: Who is stretching the truth on deduplication system performance?

EMC proudly announced the availability of Data Domain 990 during EMC World 2012 on May 21st. The claim in the news release was that the system could back up as much as 248 TB in an 8-hour backup window with 31 TB/hr throughput. Further, it claimed that it is 6x faster than the closer competitor.

The pride was shattered within 2 weeks. Even Kardashian’s marriage lasted longer than the claim. HP announced that it could protect up to 100 TB/hr using its StoreOnce family of products. EMC looked at it with tears and finally responded as given here:

EMC said HP’s decision was “puzzling”, and argued the comparison was not fair because HP’s claim was for four hardware systems working on four storage pools compared to EMC’s figures for one system and one pool. Deduplication, which removes copies of data from storage to improve usage, only works within pools of data.

Now is time for a reality check.

Number of systems involved in deduplication processing: EMC’s claim is that Data Domain 990 is a single-head unit while HP StoreOnce B6200 is a multi-node system. At first look, it sounds like a legitimate argument, but the reality is that EMC has no reason to shed crocodile tears about this. Here is why.

The 31 TB/hr rate for Data Domain 990 comes from Data Domain Boost, the software component that offloads most of the processor-intensive deduplication processing to backup servers and/or application servers. The unit by itself is not doing all the work. The story is no different for HP B6200; it makes use of StoreOnce Catalyst software, which does something similar to what Data Domain Boost does for Data Domain 990.

The absolute number of processing heads shouldn’t matter in this case as the actual performance numbers are skewed on account of distributed processing. I would even give credit to HP, as their solution is highly available with two nodes serving one storage pool. Backups are the last line of defense in an enterprise. High Availability brings additional customer value.

Number of name spaces: A single name space provides deduplication across all the workloads ingested into the storage pool. Data Domain 990 is a single name space device with one processing head. You buy HP B6200 in the form of two nodes and storage, known as couplets. It is not crystal clear from HP’s documentation whether multiple couplets can share the same name space or whether they use dedicated name spaces. I am giving EMC the benefit of the doubt that it did the research before making the statement on this. Some of the defensive comments HP made after EMC’s reaction tend to indicate that HP stretched the truth a little here.

HP marketing veep Craig Nunes says an 8-node B6200 is a single system because it is managed as one and has a single namespace. The single namespace is segmented into four individual namespaces, one per couplet, and, he says, “next year I could do a firmware update and change that”.

So, I am inclined to support EMC from this point unless someone can confirm from HP’s documentation that a four-couplet unit uses a single name space.

Truth in comparisons: 

EMC’s claim: 6x faster than closer competitor. HP’s claim: 3 times faster (backups) than closest competitor

The statements won’t actually tell you how the ‘closer/closest’ competitor is decided. EMC defines its closer competitor based on IDC’s report on market share for Purpose-Built Backup Appliances (PBBA), and it is referring to IBM. It chose IBM for the comparison because IBM has the poorest number. The other vendors in the list (HP at 25 TB/hr without Catalyst and Symantec at 23.7 TB/hr for its NetBackup 5220) have solutions superior to IBM’s! EMC cannot even claim 2x (let alone 6x) if the closest comparison were based on performance itself.

HP defined closest competitor in terms of the actual performance. They compared against EMC’s 31 TB/hr to make the 3 times faster claim with 100 TB/hr.
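
The arithmetic behind these claims is easy to check. The short sketch below uses only the throughput figures quoted in this post; the variable names are mine, and the numbers are vendor claims, not measurements.

```python
# Throughput figures (TB/hr) as quoted in this post; all are vendor claims.
emc_dd990 = 31.0
hp_b6200_with_catalyst = 100.0
hp_without_catalyst = 25.0
symantec_nbu_5220 = 23.7

# EMC's capacity claim: 31 TB/hr sustained over an 8-hour window.
window_hours = 8
capacity_tb = emc_dd990 * window_hours
print(capacity_tb)  # 248.0

# HP's "3 times faster" claim, measured against EMC's 31 TB/hr.
hp_vs_emc = hp_b6200_with_catalyst / emc_dd990
print(round(hp_vs_emc, 2))  # 3.23

# What EMC could claim if "closest" were decided by performance.
emc_vs_hp = emc_dd990 / hp_without_catalyst
emc_vs_symantec = emc_dd990 / symantec_nbu_5220
print(round(emc_vs_hp, 2), round(emc_vs_symantec, 2))  # 1.24 1.31
```

Against the two nearest performers on the list, EMC's advantage works out to roughly 1.2x to 1.3x, well short of 2x.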

Verdict: Always ask questions on metrics! It is easy to make a claim while staying vague on details.

Will EMC BRS kill Avamar or NetWorker?

EMC World 2012 has come and gone. Those watching the Backup and Recovery Services (BRS) division will have noticed a drastic shift in strategy since last year. Is Avamar counting its days?

Surprised? Let me explain. Remember the “Tape sucks! Move on!” campaign sung by BRS last year? They even mocked Google for recovering from tapes. They wanted the world to look at Avamar and Data Domain, the two products with spinning disks, as the houses of backups. The other child, NetWorker, was mostly ignored and was on life support just to get by in the era of tapes.

BRS seems to have come to grips with reality to some extent. The incremental updates to Avamar and the revelation of NetWorker 8 features tend to indicate that BRS is taking a 180-degree turn.

No real updates for Avamar Data Store: All the announced business-critical application support in Avamar is for both Data Domain Boost and the Avamar native client. Hyper-V, which is popular among SMB workloads, is now available through Boost to a Data Domain target. Last year, BRS’ announcement was that DD is for specific workloads and Avamar Data Store is for everything else. Now Boost is getting more attention and the Avamar engine by itself pretty much stays the same. The blackout windows in Avamar Data Store already annoy customers. Is the Data Domain deduplication engine preferred for target dedupe, and will DD Boost eventually replace source-side deduplication? Inspired by Symantec’s Dedupe Everywhere strategy?

Note: Thanks to Ian for his comment clarifying that the newer application support is available for Avamar as well, not just for Data Domain through DD Boost.

Emergence of Media Access Node: BRS realized that customers with longer retention requirements would not buy into the ‘keep it on disk’ message. Tape provides economies of scale. Modern tape technologies are superior in performance and reliability. Now, BRS ships a NetWorker node under the covers as the Media Access Node in Avamar to copy rehydrated data onto tape in NetWorker tape format.

NetWorker 8.0 getting a facelift: Although NetWorker was ignored in keynotes, BRS made a deliberate attempt this year to show what is happening to NetWorker. It was headed for the morgue but has now been pulled back and is getting revved up. There is a long road ahead to convince customers, but BRS says it is putting as many resources on NetWorker as it did on Avamar. Not to mention the newfound love, Spectralogic, to compete with IBM and Oracle.

If you pay closer attention, all that Avamar got is meant to make things better for Data Domain (Boost expansion, multi-stream support…) and NetWorker (data stored in NetWorker tape format). In a nutshell, BRS wants everyone to keep backup data in either Data Domain dedupe format or NetWorker tape format. Once the NetWorker and Data Domain Boost combination can support backups over WAN, Avamar may not have anything to offer. From an operating margin perspective, might Avamar as a product become a dog in the BCG growth-share matrix? Is the one eventually going to the morgue the Avamar dedupe engine?

Deduplication Storage Pool Reliability: The devil is in the details

As you guys already know, I do travel a lot and attend trade shows where I represent Symantec. While I was briefing a visitor at Symantec booth on NetBackup 5020 appliance, he asked a question which was quite interesting. “We have requested RFPs from multiple vendors for deploying deduplication solution for backups. EMC sales team told us that Data Domain 800 series is better than NetBackup 5020 appliances in terms of reliability. They said that if one node in a multi-node NetBackup 5020 goes down, the entire deduplication pool goes down. What do you think about it?”

I thanked him for his question. I took a good 20 minutes to explain the situation. I thought it would be nice to document this in a blog for a fair comparison.

Let us compare configurations based on Data Domain 860 and NetBackup 5020. Let us say that the customer is looking to create a 96TB deduplication pool right now. He may need more storage in the future.

With Data Domain 860, it would require four ES30 shelves (with 2TB drives) to create this capacity. Plus you need the 860 head unit. With NetBackup 5020, you would need three nodes.

Implementing a 96TB deduplication pool

Thus, the EMC solution has a total of 5 components (1 head and 4 shelves). EMC’s 96TB deduplication pool will go down if any of the five components fail.

The Symantec solution has a total of three components (3 NetBackup 5020 nodes). Symantec’s 96TB deduplication pool will go down if any of the three components fail.

Observation 1: The EMC solution has more single points of failure than Symantec’s solution for a given capacity.

Let us dig deeper. Let us look at the components that actually store data, the storage modules.

Each Data Domain ES30 shelf will have 15 spindles: 12 data drives, 2 parity drives and 1 hot spare. Each shelf can withstand 3 concurrent drive failures.

Each NetBackup 5020 node has 22 spindles (not counting the two drives in RAID1 for the system disk): 18 data drives, 2 parity drives and 2 hot spares. This configuration can withstand four concurrent drive failures.

Both systems use SATA drives. The theoretical [1] annualized failure rate (AFR) for a SATA drive is approximately 1.46%. Robin Harris’ StorageMojo [2] blog has some great information on a study done by Google. He cites a calculated AFR of 2.88%.

Since we are actually comparing the overall storage modules (ES30 storage shelf vs. NetBackup 5020 storage shelf), let us not worry about the absolute value of AFR of a disk drive. For our discussion, let us assume that both Symantec and Data Domain are buying disks from the same manufacturer. Let the AFR be 3% to simplify probability calculations.

An AFR of 3% indicates that the probability of a SATA drive failing within a year is 3/100.

In the case of Data Domain 860 with ES30 shelves, you will lose data if more than 3 drives fail in a year and the failed drives are not replaced. The probability of four drives failing in a year can be calculated using conditional probability [3], treating drive failures as independent. The value is (3/100)^4 = 0.000081%

In the case of a NetBackup 5020 node, you will lose data if more than 4 drives fail in a year and are not replaced. The probability here is (3/100)^5 = 0.00000243%

Note that the probability of data loss is low in both cases even if you don’t replace the failed drives for a year. This is why RAID6 and hot spares play a significant role in delivering storage reliability. That is the main point I want to make here. However, the probability of losing data on an ES30 shelf is 33 times higher than the probability of losing data in a NetBackup 5020! The reason is the extra hot spare in the NetBackup 5020 node, which provides additional protection.

Observation 2: From storage module perspective, although the absolute probability of losing data is quite low for both EMC and Symantec solutions, the relative probability of losing data on EMC’s ES30 shelf is 33 times higher than that in NetBackup 5020 if drives have identical AFR.
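
For readers who want to reproduce the numbers, here is a minimal sketch of the calculation. It uses the same simplifications as the text: an assumed AFR of 3%, independent drive failures, and the probability of a given set of n drives all failing within the year. The function name `p_all_fail` is mine.

```python
AFR = 0.03  # assumed annualized failure rate per drive, as in the text


def p_all_fail(n: int, afr: float = AFR) -> float:
    """Probability that n given drives all fail in a year, assuming independence."""
    return afr ** n


p_es30 = p_all_fail(4)  # ES30 shelf tolerates 3 failures, so the 4th loses data
p_5020 = p_all_fail(5)  # NetBackup 5020 node tolerates 4, so the 5th loses data

print(f"{p_es30 * 100:.6f}%")      # 0.000081%
print(f"{p_5020 * 100:.8f}%")      # 0.00000243%
print(round(p_es30 / p_5020, 1))   # 33.3
```

The ratio is simply 1/AFR, which is why one extra tolerated failure makes the roughly 33x difference at a 3% AFR.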

So, don’t you disagree with what the EMC sales rep reportedly said about NetBackup 5020 appliances? The devil is always in the details, isn’t it?

Disclaimer: As I have already stated on the About Me page, the thoughts expressed here are my own. My employer or school has not endorsed/supported any of the content in this blog. If there are errors in this post, contact me at @AbdulRasheed127 on Twitter and I will be happy to correct them. I am not entertaining comments until I invest in a good spam blocker, sorry for the inconvenience 🙁


  1. Annualized Failure Rate (AFR) and Mean Time between Failures (MTBF) in: Seagate Barracuda ES SATA Product Manual, Page 29, Chapter 2.12: Reliability
  2. Robin Harris. Google’s Disk Failure Experience
  3. Conditional Probability: P(AB) = P(A)*P(B|A)

If A and B are independent outcomes, P(B|A) = P(B)

In which case, P(AB) = P(A) * P(B)

Turning cheap disk storage into an intelligent deduplication pool in NetBackup

5. NetBackup Intelligent Deduplication Pool

Deduplication for backups does not need an introduction. In fact, deduplication is what made disk storage a viable alternative to tapes. Deduplication storage is available from several vendors in the form of pre-packaged storage and software. Most of the backup vendors also provide some level of data reduction using deduplication or deduplication-like features.

Often we hear that backups of virtual environments are ideal for deduplication. While I agree with this statement, several articles tend to give the wrong perception when it comes to why it is a good idea.

The general wisdom goes like this. As there are many instances of guest operating systems, there are many duplicate files and hence deduplication is recommended. A vendor may use this reasoning to sell you the deduplication appliance or to differentiate their backup product from others. This is a short-sighted view. First of all, multiple instances of the same version of an operating system are possible even when your environment is not virtualized; hence that argument is weak. Secondly, operating system files contribute less than 10% of your data in most virtual machines hosting production applications. Hence, if a vendor tells you that you need to group virtual machines from the same template into one backup job to make use of ‘deduplication’, what they provide is not true deduplication. Typically such techniques involve simply using the block-level tracking provided by vStorage APIs for Data Protection (VADP) combined with excessive compression. Data reduction does not go beyond a given backup job.

Behold NetBackup Intelligent Deduplication. We talked about NetBackup media servers before. Attach cheap disk storage of your choice and turn on NetBackup Intelligent Deduplication by running a wizard. Your storage transforms into a powerful deduplication pool that deduplicates inline across multiple backup jobs. You can deduplicate at the target (i.e. the media server) or you can let it deduplicate at the source, if you have configured a dedicated VMware backup host.

Why is this referred to as an intelligent deduplication pool? When backup streams arrive, the deduplication engine sees the actual objects (files, database objects, application objects etc.) through a set of patent pending technologies referred to as Symantec V-Ray. Thus it deduplicates blocks after accurately identifying exact object boundaries. Compare this to third party target deduplication devices where the backup stream is blindly chopped to guess the boundaries and identify duplicate segments.

The other aspect of NetBackup Intelligent Deduplication pool is its scale-out capability: the ability to grow storage and processing capacity independently as your environment grows. The storage capacity can be grown from 1TB to 32TB, thereby letting you protect 100s of terabytes of backup images. In addition, you can add media servers to do dedupe processing on behalf of the media server hosting the deduplication storage. The scale-out capability can also be established by simply adding VMware backup hosts. The global deduplication occurs across multiple backup jobs, multiple VMware backup hosts and multiple media servers. It is scale-out in multiple dimensions! A typical NetBackup environment can protect multiple vSphere environments and deduplicate across virtual machines in all of them.

Deduplication for dollar zero?

One of the data protection experts asked me a question after reading my blog on Deduplication Dilemma: Veeam or Data Domain.

I am paraphrasing his question as our conversation was limited to 140 characters at a time through Twitter.

“Have you seen this best practice blog on Veeam with Exagrid? Here is the blog. It says not to do reverse incremental backups. The test Mr. Attila ran was incomplete. The Veeam deduplication at the first pass is poor, but after that it is worth it, right?”

These are all great questions. I thought of dissecting each aspect and sharing it here. Before I do that, I want to make it clear that deduplication devices are fantastic for use in backups. These work great with backup applications that really offer the ability to restore individual objects. If the backup application ‘knows’ how to retrieve specific objects from backup storage, target deduplication adds a lot of value. That is why NetBackup, Backup Exec, TSM, NetWorker and the like play well with target deduplication appliances. Veeam, on the other hand, simply mounts the VMDK file from the backup store and asks the application administrator to fish for the item he/she is looking for. This is where Veeam falls apart if you try to deploy it in medium to large environments. Although target deduplication appliances are disk based, they are optimized more for sequential access as backup jobs mostly follow a sequential I/O pattern. When you perform random I/O on these devices (as happens when a VM is run directly from one), there is a limit to how well those devices can perform.

Exagrid: a great company helping out customers

Exagrid has an advantage here. It has the flexibility to keep the most recent backup in hydrated form (Exagrid uses post-process deduplication), which works well with Veeam if you employ reverse incremental backups. In reverse incremental backups, the most recent backup is always a full backup. You can eliminate the performance issues inherent in mounting the image on an ESX host when the image is being served in hydrated form. This is good from the recovery performance perspective. However, Exagrid recommends not turning on the reverse incremental method because it burdens the appliance during backups. This is another dilemma; you have to pick backup performance or recovery performance (RTO), not both.
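
To make the reverse incremental idea concrete, here is a generic sketch of the technique. It is not Veeam’s on-disk format or Exagrid’s logic; the block-dictionary model and the function names `reverse_incremental` and `restore` are assumptions for illustration. Each backup rewrites blocks inside the latest full image and sets the displaced blocks aside as a reverse delta.

```python
def reverse_incremental(full: dict, changed: dict) -> dict:
    """Apply changed blocks to the full image in place; return the reverse delta.

    The reverse delta records what each touched block held before this backup
    (None if the block did not exist yet).
    """
    delta = {}
    for block_id, new_data in changed.items():
        delta[block_id] = full.get(block_id)
        full[block_id] = new_data
    return delta


def restore(full: dict, deltas: list, steps_back: int) -> dict:
    """Rebuild an older restore point by undoing the newest deltas first."""
    image = dict(full)
    for delta in list(reversed(deltas))[:steps_back]:
        for block_id, old_data in delta.items():
            if old_data is None:
                image.pop(block_id, None)
            else:
                image[block_id] = old_data
    return image


full = {0: b"a", 1: b"b"}  # initial full backup
deltas = []
deltas.append(reverse_incremental(full, {1: b"B"}))            # backup 2
deltas.append(reverse_incremental(full, {0: b"A", 2: b"c"}))   # backup 3

print(full)                      # latest point is always a complete image
print(restore(full, deltas, 1))  # state as of backup 2
print(restore(full, deltas, 2))  # state as of the initial full
```

Restoring the latest point needs no reconstruction at all, while every backup has to read and rewrite blocks inside the existing full image. In this model, that rewrite is extra work during the backup window, which fits the trade-off described above.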

Let me reiterate this. The problem is not with Exagrid in this case. They are sincerely trying to help customers who happened to choose Veeam. Exagrid is doing the right thing; you want to find methods to help customers achieve ROI no matter what backup solution they ended up choosing. I take my hat off to Exagrid in respect.

Now let us take a closer look at other recommendations from Exagrid to alleviate the pain points with Veeam.

Turn off compression in Veeam and Optimize for Local target: Note that Exagrid suggested turning off compression and choosing the Optimize for Local target option. These settings have the effect of eliminating most of what Veeam’s deduplication offers. By choosing those options, you let the real guy (the Exagrid appliance) do the work.

Weren’t Mr. Attila’s tests incomplete?

Mr. Attila stopped tests after the initial backup. The advantage of deduplication is visible only on subsequent backups. Hence his tests weren’t complete. However, as I stated in the blog, that test simply triggered my own research. I wasn’t basing my opinions just on Mr. Attila’s tests. I should have mentioned this in the earlier blog, but it was already becoming too big.

As I mentioned in the blog earlier, Veeam deduplication capabilities are limited. Quoting Exagrid this time: “Once the ExaGrid has all the data, it can look at the entire job at a more granular level and compress and dedupe within jobs and also across jobs! Veeam can’t do this because it has data constantly streaming into it from the SAN or ESX host, so it’s much harder to get a “big picture” of all the data.”   

If Veeam’s deduplication is the only thing you have, the problem is not just limited to the initial backup. Here are a few other reasons why a target deduplication is important when using Veeam.

  1. The deduplication is limited to a job. Veeam’s manual recommends putting VMs created from the same template into a single job to achieve that dedupe rate. It is true that VMs created from the same template have a lot of redundant OS files and whitespace, so the dedupe rate will be good at the beginning. But these are just the skins or shells of your enterprise production data. The real meat is the actual data, which is less likely to be the same across multiple VMs. We are better off giving that task to the real deduplication engines!
  2. Let us say you have a job with 20 production VMs. You are going to install something new on one of the VMs, so you prefer to do a one-time backup before making any changes. Veeam requires you to create a new job to do this. This is not only inconvenient, but now you lose the advantage of incremental backup. You have to stream the entire VM again. Can we afford this in a production environment?
  3. Veeam incremental backups are heavily dependent on vCenter server. If you move a VM from one vCenter to another or if you had to rebuild your vCenter (Veeam cannot protect an enterprise grade vCenter running on a physical system, but let us not go there for now), you need to start seeding full backups for all your VMs. For example, if you want to migrate from a traditional vCenter server running 4.x to a vCSA 5.0, expect to reseed all the backups again.

My point is that Veeam deduplication is not something you can count on to protect a medium to large environment with these limitations. It has the price of $0 for a reason.

NetBackup and Backup Exec let you take advantage of target deduplication appliances to their fullest potential. As these platforms track which image has the objects the application administrator is looking for, they can simply retrieve those objects alone from backup storage. Application administrators can self-serve their needs, with no need for a 20th-century ticket system! The journey to the Cloud starts with empowering users to self-serve their needs from the Cloud.