Fault Tolerance and Fault Avoidance: Looking Beyond Data Center Tiers
August 28, 2018
As the general argument goes, the fault tolerance of a tier 4 data center may be overkill for all but the most mission-critical applications of the largest enterprises. When it comes time for a business to decide, perhaps the perspective should shift to the equal need for fault avoidance.
According to the accepted Uptime Institute standard, tier 4 data center specifications call for two parallel power and cooling systems with no single point of failure (commonly described as 2N). While this level of fault tolerance often comes at a premium price, many enterprises see the security, reliability and redundancy as worth it for the reduction in potential downtime compared with a tier 3 data center.
This absence of a single point of failure across all components is certainly nothing to scoff at when it comes to the performance of the computer equipment. Knowing that any component can be removed at any time, on a planned basis, without disrupting the compute systems is a major plus. But even with the understanding that comes from reading a comprehensive data center tier level guide, it becomes apparent that thinking should go beyond the tier levels to a colocation data center’s ability to provide fault avoidance.
Fault avoidance rests on the fact that many of the complications that lead to data center downtime can be prevented with equipment and systems monitoring, a proactive, trained staff with thorough procedures, and strict maintenance protocols. In other words, fault tolerance, while important, is reactive, whereas fault avoidance focuses on prevention, which is equally important.
Whether it is a tier 4 data center or a tier 3 data center, enterprises should be looking closely at these other fault avoidance parameters and systems. For instance, does the facility use a sophisticated and proven building management system (BMS) and building automation system (BAS)? These crucial systems gather sensor data from data center equipment so that operators can monitor its health in real time. The collected data can then be used to trigger an automated response or to direct proactive technician intervention.
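As a rough illustration of how gathered sensor data can lead to either an automated response or a technician call, here is a minimal Python sketch. The sensor names, threshold values and action labels are hypothetical and chosen only for the example; a real BMS or BAS would define its own points and limits.

```python
from dataclasses import dataclass


@dataclass
class Threshold:
    warn: float      # at or above this, direct a technician to investigate
    critical: float  # at or above this, trigger an automated response


# Hypothetical limits for illustration only; not taken from any real BMS.
THRESHOLDS = {
    "supply_air_temp_c": Threshold(warn=27.0, critical=32.0),
    "ups_load_pct": Threshold(warn=80.0, critical=95.0),
}


def evaluate(sensor: str, value: float) -> str:
    """Map one sensor reading to an action label."""
    limit = THRESHOLDS.get(sensor)
    if limit is None:
        return "unknown_sensor"
    if value >= limit.critical:
        return "automated_response"
    if value >= limit.warn:
        return "notify_technician"
    return "ok"


# Example readings (made up for the sketch).
readings = {"supply_air_temp_c": 28.5, "ups_load_pct": 62.0}
actions = {name: evaluate(name, value) for name, value in readings.items()}
print(actions)
```

The two-level threshold mirrors the split described above: a warning level hands the issue to staff before it becomes a fault, while a critical level acts without waiting for a person.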
Since we have yet to reach the ideal of the truly automated data center, highly skilled operations teams must work in tandem with these systems to anticipate problems before they occur and to troubleshoot issues quickly when they do arise. When evaluating tier 4 data center specifications, a clear understanding of these operators' methods and procedures is just as important as the specifics that make the data center fault tolerant.
For the majority of enterprises that choose a tier 3 data center, it’s vital that the operator provide a detailed description of the step-by-step approach to making fault avoidance a reality. This should include everything from monitoring systems and safety requirements to proactive remediation and backout procedures for unexpected events. This proactive and preventive approach via system monitoring, automation and predictive maintenance is the foundation of fault avoidance.
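One simple way to picture predictive maintenance is to fit a trend line to recent readings and estimate when it will cross a service limit. The Python sketch below does this with an ordinary least-squares fit; the function name, the vibration readings and the limit are all hypothetical, and real predictive maintenance tools typically use far richer models.

```python
from typing import Optional, Sequence


def hours_until_limit(
    readings: Sequence[float],
    limit: float,
    interval_hours: float = 1.0,
) -> Optional[float]:
    """Estimate hours until a rising trend reaches `limit`.

    Fits a straight line to evenly spaced readings and extrapolates from
    the most recent sample. Returns None when there are too few readings
    or the trend is flat or falling.
    """
    n = len(readings)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(readings) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(readings))
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope = num / den  # change per sample
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    crossing_index = (limit - intercept) / slope
    remaining_samples = crossing_index - (n - 1)
    return max(remaining_samples, 0.0) * interval_hours


# Hypothetical hourly fan vibration readings (mm/s) and service limit.
vibration = [2.0, 2.1, 2.3, 2.4, 2.6]
estimate = hours_until_limit(vibration, limit=4.5)
print(f"Estimated hours until service limit: {estimate:.1f}")
```

An estimate like this lets a facilities team schedule maintenance during a planned window instead of reacting after the equipment fails.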
While a tier 4 data center is designed for the highest availability (commonly cited as 99.995 percent uptime, just short of five nines), superior operator maintenance schedules do their part in bringing tier 3 data centers closer to that goal in a more cost-effective way for most businesses. In other words, the redundant systems of a tier 3 data center, combined with a BMS and/or BAS and a skilled 24×7 facilities team, can provide the high availability that most businesses need.
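To put availability targets in concrete terms, the short Python sketch below converts an uptime percentage into the downtime it allows per year. It shows five nines (99.999 percent) alongside 99.995 and 99.982 percent, figures commonly cited for tier 4 and tier 3 facilities; those two figures do not come from this article and should be treated as assumptions.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability_pct: float) -> float:
    """Return the downtime per year allowed by an availability percentage."""
    return (1 - availability_pct / 100) * MINUTES_PER_YEAR


# Availability figures are commonly cited values, used here as assumptions.
targets = [
    ("five nines", 99.999),
    ("tier 4 (commonly cited)", 99.995),
    ("tier 3 (commonly cited)", 99.982),
]

for label, pct in targets:
    minutes = downtime_minutes_per_year(pct)
    print(f"{label}: {pct}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Seen this way, the gap between the two tiers is on the order of an hour per year, which is the margin that strong fault avoidance practices aim to narrow.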
The additional fault tolerance of a tier 4 data center will always have its place for a percentage of enterprises. But the systems and processes of superior tier 3 data centers can still dramatically lower an enterprise's exposure to potential downtime through fault avoidance. By looking beyond data center tiers and putting fault tolerance and fault avoidance on equal footing, enterprises can make better decisions that balance real-world cost and uptime.