Deduplication Storage Pool Reliability: The devil is in the details

As you guys already know, I travel a lot and attend trade shows where I represent Symantec. While I was briefing a visitor at the Symantec booth on the NetBackup 5020 appliance, he asked quite an interesting question. “We have requested RFPs from multiple vendors for deploying deduplication solution for backups. EMC sales team told us that Data Domain 800 series is better than NetBackup 5020 appliances in terms of reliability. They said that if one node in a multi-node NetBackup 5020 goes down, the entire deduplication pool goes down. What do you think about it?”

I thanked him for his question and took a good 20 minutes to explain the situation. I thought it would be nice to document this in a blog post for a fair comparison.

Let us compare configurations based on the Data Domain 860 and the NetBackup 5020. Let us say that the customer is looking to create a 96TB deduplication pool right now and may need more storage in the future.

With the Data Domain 860, this capacity requires four ES30 shelves (with 2TB drives), plus the 860 head unit. With the NetBackup 5020, you would need three nodes.

Implementing a 96TB deduplication pool

Thus, the EMC solution has a total of five components (one head unit and four shelves). EMC’s 96TB deduplication pool will go down if any one of the five components fails.

The Symantec solution has a total of three components (three NetBackup 5020 nodes). Symantec’s 96TB deduplication pool will go down if any one of the three components fails.

Observation 1: The EMC solution has more single points of failure than Symantec’s solution for a given capacity.
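To see why the component count matters, here is a minimal sketch of a pool that goes down when any one component fails. The per-component failure probability `q` is a made-up value used only for illustration, not vendor data, and the sketch assumes every head unit, shelf, and node fails independently with the same probability.

```python
def pool_failure_probability(component_failure_prob: float, components: int) -> float:
    """Probability that at least one of `components` independent parts fails."""
    return 1.0 - (1.0 - component_failure_prob) ** components


# Hypothetical per-component annual failure probability (an assumption, not measured).
q = 0.01

emc = pool_failure_probability(q, 5)       # 1 head unit + 4 ES30 shelves
symantec = pool_failure_probability(q, 3)  # 3 NetBackup 5020 nodes

print(f"EMC pool (5 components):      {emc:.4%}")
print(f"Symantec pool (3 components): {symantec:.4%}")
```

Whatever value you pick for `q`, the pool with more components in series comes out with the higher chance of an outage.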

Let us dig deeper. Let us look at the components that actually store data, the storage modules.

Each Data Domain ES30 shelf has 15 spindles: 12 data drives, 2 parity drives and 1 hot spare. Each shelf can withstand three concurrent drive failures.

Each NetBackup 5020 node has 22 spindles (not counting the two drives in RAID1 for the system disk): 18 data drives, 2 parity drives and 2 hot spares. This configuration can withstand four concurrent drive failures.
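The drive counts above can be written down as a small sketch. The `StorageModule` class and its field names are hypothetical helpers for this post, and `tolerated_failures` follows the counting used here: each parity drive and each hot spare absorbs one failure. Strictly speaking, a hot spare only helps once the rebuild onto it has finished.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StorageModule:
    name: str
    data_drives: int
    parity_drives: int
    hot_spares: int

    @property
    def spindles(self) -> int:
        # Total drives in the module, excluding any separate system disks.
        return self.data_drives + self.parity_drives + self.hot_spares

    @property
    def tolerated_failures(self) -> int:
        # The post's counting: parity drives plus hot spares.
        return self.parity_drives + self.hot_spares


ES30 = StorageModule("Data Domain ES30 shelf", data_drives=12, parity_drives=2, hot_spares=1)
NBU5020 = StorageModule("NetBackup 5020 node", data_drives=18, parity_drives=2, hot_spares=2)

for module in (ES30, NBU5020):
    print(f"{module.name}: {module.spindles} spindles, "
          f"tolerates {module.tolerated_failures} drive failures")
```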

Both systems use SATA drives. The theoretical [1] annualized failure rate (AFR) for a SATA drive is approximately 1.46%. Robin Harris’ StorageMojo [2] blog has some great information on a study done by Google. He cites a calculated AFR of 2.88%.

Since we are actually comparing the overall storage modules (ES30 storage shelf vs. NetBackup 5020 storage shelf), let us not worry about the absolute value of AFR of a disk drive. For our discussion, let us assume that both Symantec and Data Domain are buying disks from the same manufacturer. Let the AFR be 3% to simplify probability calculations.

An AFR of 3% means that the probability of a given SATA drive failing within a year is 3/100.

In the case of a Data Domain 860 with ES30 shelves, you will lose data if more than three drives in a shelf fail within a year and the failed drives are not replaced. The probability of four drives failing in a year can be calculated using conditional probability [3], treating the failures as independent. The value is (3/100)^4 = 0.00000081, or 0.000081%.

In the case of a NetBackup 5020 node, you will lose data if more than four drives fail within a year and are not replaced. The probability here is (3/100)^5 = 0.0000000243, or 0.00000243%.
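The two figures can be checked with a few lines of Python. This is only a sketch of the simplified model used in this post: it multiplies the AFR once per failed drive and assumes the failures are independent. The function name `prob_all_fail` is made up for illustration, and `Fraction` is used so the arithmetic is exact.

```python
from fractions import Fraction

AFR = Fraction(3, 100)  # the 3% annualized failure rate assumed in this post


def prob_all_fail(afr: Fraction, drives: int) -> Fraction:
    """Probability that `drives` independent drives all fail within a year."""
    return afr ** drives


es30_loss = prob_all_fail(AFR, 4)     # one more failure than an ES30 shelf tolerates
nbu5020_loss = prob_all_fail(AFR, 5)  # one more failure than a 5020 node tolerates

print(f"ES30 shelf:          {float(es30_loss * 100):.8f}%")
print(f"NetBackup 5020 node: {float(nbu5020_loss * 100):.8f}%")
```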

Note that the probability of data loss is low in both cases, even if you don’t replace the failed drives for a year. This is why RAID6 and hot spares play a significant role in delivering storage reliability. That is the main point I want to make here. However, the probability of losing data on an ES30 shelf is about 33 times higher than the probability of losing data on a NetBackup 5020 node (0.000081% / 0.00000243% ≈ 33)! The reason is the extra hot spare in the NetBackup 5020 node, which provides additional protection.

Observation 2: From a storage module perspective, although the absolute probability of losing data is quite low for both the EMC and Symantec solutions, the relative probability of losing data on EMC’s ES30 shelf is about 33 times higher than on a NetBackup 5020 node if the drives have identical AFR.
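The factor of 33 is tied to the 3% AFR assumed earlier. In this simplified model the ratio is AFR^4 / AFR^5, which reduces to 1 / AFR, so a lower AFR makes the gap wider. Here is a rough sketch using the three AFR values mentioned in this post; `relative_risk` is a hypothetical helper name.

```python
def relative_risk(afr: float) -> float:
    """Ratio of afr**4 to afr**5 under the post's model, which simplifies to 1 / afr."""
    return afr ** 4 / afr ** 5


afr_values = [
    ("Seagate manual [1]", 0.0146),
    ("StorageMojo [2]", 0.0288),
    ("Assumed in this post", 0.03),
]

for label, afr in afr_values:
    print(f"{label:22s} AFR = {afr:.2%}  ES30 vs. 5020 ratio: about {relative_risk(afr):.1f}x")
```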

So, wouldn’t you disagree with what the EMC sales rep reportedly said about NetBackup 5020 appliances? The devil is always in the details, isn’t it?

Disclaimer: As I have already stated on the About Me page at MrVRay.com, the thoughts expressed here are my own. Neither my employer nor my school has endorsed or supported any of the content in this blog. If there are errors in this post, contact me at @AbdulRasheed127 on Twitter and I will be happy to correct them. I am not entertaining comments until I invest in a good spam blocker, sorry for the inconvenience 🙁

References:

  1. Annualized Failure Rate (AFR) and Mean Time between Failures (MTBF) in: Seagate Barracuda ES SATA Product Manual, Page 29, Chapter 2.12: Reliability
  2. Robin Harris. Google’s Disk Failure Experience
  3. Conditional Probability: P(AB) = P(A) * P(B|A). If A and B are independent outcomes, P(B|A) = P(B), in which case P(AB) = P(A) * P(B).
