Expecting The Unexpected: Analyzing A Data Center Cooling Failure

CFD simulation software can simulate time-dependent aspects and various failure scenarios for data center managers.


Data center thermal management is often a reactive process. Servers issue warning messages, monitoring alarms activate, or employees raise concerns about general temperature levels or hotspots, and only then does management decide what to do next. Incremental issues, once known, can be resolved or mitigated with the necessary steps; however, what happens when a potential thermal issue only occurs during critical operations? How does one go about discovering these issues before they happen?

To reduce risk, it is necessary to test the data center’s cooling system under various cooling failure scenarios to find potential thermal issues before they happen. Because physically testing cooling failures in a live data center is impractical, we look to computational fluid dynamics (CFD) simulation software to model these scenarios within data center digital twins.

Analyze the effects of failure over time to discover thermal issues

Normal operations are generally simulated using steady-state digital twin models, which reflect the assumption that the workload and cooling systems do not vary for a period of time; however, in a cooling failure, the environment will change substantially and relatively quickly, so time-dependent phenomena must be considered to model the behavior accurately.

As examples, let’s look at a few of the time-dependent aspects one should consider when analyzing a data center’s resilience to cooling system-related failures. The aspects we’ll discuss are the loss of power to the computer room air handler (CRAH) or building air handling unit (AHU) fans and the loss of power to the chilled water pumps. When fan power is lost, airflow, and hence cooling, is not lost immediately. The fans retain considerable rotational inertia and thus take time to come to a complete stop. Similarly, we should remember that the fans will also take time to restart.
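To make the fan coast-down concrete, here is a minimal sketch of how airflow might decay after power loss. It assumes the aerodynamic load torque scales with the square of fan speed and that airflow is proportional to speed (the fan affinity law). The function name and the 5-second time constant are hypothetical, chosen only for illustration; a real fan would need measured or vendor-supplied data.

```python
def fan_airflow_fraction(t_s: float, tau_s: float = 5.0) -> float:
    """Fraction of rated airflow t_s seconds after fan power is lost.

    Assumption: load torque is proportional to speed squared, so
    I * dw/dt = -k * w**2, which integrates to w(t) = w0 / (1 + t / tau)
    with tau = I / (k * w0). Airflow is taken as proportional to speed.
    tau_s is a hypothetical time constant, not a measured value.
    """
    if t_s < 0:
        raise ValueError("t_s must be non-negative")
    return 1.0 / (1.0 + t_s / tau_s)


if __name__ == "__main__":
    for t in (0, 5, 15, 45):
        print(f"t = {t:>2} s  airflow = {fan_airflow_fraction(t):.0%}")
```

Under these assumptions the airflow halves after one time constant but has a long tail, which is why a steady-state model that switches airflow off instantly would tend to be pessimistic in the first few seconds.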

When the chilled water pumps fail, water becomes stationary inside the CRAH cooling coils. Initially, the thermal inertia of the heat exchanger and of the remaining water resists changes in temperature. As time progresses, the water temperature will eventually match the temperature of the air passing over the coils. Depending on the type of power redundancy system installed in the data center, one or both failure scenarios could be applicable.
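The warming of the stagnant coil water can be sketched with a simple lumped-capacitance model, in which the coil metal and trapped water are treated as a single thermal mass relaxing toward the air temperature. This is an illustrative simplification, not what a CFD solver does: the air temperature is held constant here, and the function name and all default values are hypothetical.

```python
import math


def coil_water_temp_c(
    t_s: float,
    t_water0_c: float = 10.0,  # hypothetical initial water temperature
    t_air_c: float = 30.0,     # hypothetical (constant) air temperature
    tau_s: float = 120.0,      # hypothetical time constant, tau = C / (U * A)
) -> float:
    """Temperature of stagnant coil water t_s seconds after the pumps stop.

    Lumped-capacitance assumption: C * dT/dt = U * A * (T_air - T),
    giving T(t) = T_air - (T_air - T0) * exp(-t / tau).
    """
    if t_s < 0:
        raise ValueError("t_s must be non-negative")
    return t_air_c - (t_air_c - t_water0_c) * math.exp(-t_s / tau_s)


if __name__ == "__main__":
    for t in (0, 60, 120, 600):
        print(f"t = {t:>3} s  water = {coil_water_temp_c(t):.1f} C")
```

The time constant groups the heat capacity of the coil and water (C) with the air-side heat transfer (U times A), so a heavier coil or weaker airflow stretches the window before cooling is effectively lost.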

Of course, not all the thermal inertia of the cooling system is confined to the CRAH heat exchangers. There is considerable thermal inertia stored in the chilled water loop. If a chiller were to fail while the pumps continued to run on backup power, a data center using the chilled water loop would see a significant delay before experiencing increases in CRAH supply temperatures. Depending on the amount of water in the chilled water loop, overheating issues may not occur for many minutes or even hours.
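A rough energy balance shows why the loop volume matters so much. The sketch below estimates how long a well-mixed loop could absorb the IT heat load before the water warms by a chosen amount, using t = m * c * dT / Q. The loop volume, heat load, and allowable rise in the example are hypothetical, and the calculation ignores piping, stratification, and any residual chiller capacity.

```python
WATER_DENSITY_KG_M3 = 1000.0  # approximate density of water
WATER_CP_J_KG_K = 4186.0      # approximate specific heat of water


def loop_ride_through_s(
    volume_m3: float, heat_load_kw: float, allowable_rise_k: float
) -> float:
    """Seconds until a well-mixed chilled water loop warms by allowable_rise_k.

    Energy balance: stored energy = m * c * dT, absorbed at a constant
    rate equal to the heat load. All inputs are caller-supplied estimates.
    """
    if volume_m3 <= 0 or heat_load_kw <= 0 or allowable_rise_k <= 0:
        raise ValueError("all inputs must be positive")
    mass_kg = volume_m3 * WATER_DENSITY_KG_M3
    stored_j = mass_kg * WATER_CP_J_KG_K * allowable_rise_k
    return stored_j / (heat_load_kw * 1000.0)


if __name__ == "__main__":
    # Hypothetical example: 20 m^3 loop, 500 kW load, 5 K allowable rise.
    seconds = loop_ride_through_s(20.0, 500.0, 5.0)
    print(f"ride-through: {seconds:.0f} s ({seconds / 60:.1f} min)")
```

With these illustrative numbers the loop buys roughly 14 minutes; doubling the water volume or halving the load doubles that window, which is the kind of sensitivity a transient simulation can then refine.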

While we have talked primarily about the facility cooling system, it is important not to forget the IT equipment. Servers, much like CRAHs and other cooling system components, also have thermal inertia. The thermal inertia of an individual server is much less than that of the large volume of metal and water in a CRAH heat exchanger; however, a server’s thermal mass is strongly coupled to the cooling air, and the combined thermal inertia of a large number of servers may add the critical seconds needed for cooling systems to come back online.

Safeguard your data center with CFD simulation software

It is important for data center operators to understand how long they have in each particular scenario when planning their failure response protocols. CFD simulation software can simulate both these time-dependent aspects and various failure scenarios to enable data center managers to understand resilience time windows before other mitigations (such as application shutdown) need to be deployed to protect the IT equipment. Likewise, CFD simulation software helps data center managers design the appropriate power redundancy system to meet that time requirement.
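As an illustration of what a resilience time window means in practice, the sketch below takes a time series of maximum server inlet temperature, such as one exported from a transient simulation, and finds when it first reaches a limit. The function name, the sample data, and the 32 °C limit are all hypothetical; the appropriate limit depends on the equipment and the operator's own policy.

```python
from typing import Optional, Sequence


def time_to_threshold_s(
    times_s: Sequence[float], temps_c: Sequence[float], limit_c: float
) -> Optional[float]:
    """First time the temperature reaches limit_c, or None if it never does.

    Linearly interpolates between samples. times_s must be increasing
    and the same length as temps_c.
    """
    if len(times_s) != len(temps_c):
        raise ValueError("times_s and temps_c must have the same length")
    if not times_s:
        return None
    if temps_c[0] >= limit_c:
        return times_s[0]
    for i in range(1, len(times_s)):
        t0, t1 = times_s[i - 1], times_s[i]
        y0, y1 = temps_c[i - 1], temps_c[i]
        if y1 >= limit_c:
            # y0 < limit_c <= y1 here, so the denominator is positive.
            return t0 + (limit_c - y0) * (t1 - t0) / (y1 - y0)
    return None


if __name__ == "__main__":
    # Hypothetical simulation output: time since failure and max inlet temperature.
    times = [0.0, 60.0, 120.0, 180.0, 240.0]
    temps = [24.0, 26.0, 29.0, 33.0, 36.0]
    print(time_to_threshold_s(times, temps, limit_c=32.0))
```

Comparing this window against the restart time of the backup power and cooling systems indicates whether a given scenario needs further mitigation.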

Learn how to future-proof designs and assess operational decisions in a safe, virtual environment using Cadence data center solutions.
