Understanding The Total Cost Of Ownership In HPC And AI Systems

The initial purchase cost typically accounts for only about half of the total expenses incurred over a system’s useful life.


Cost is often the deciding factor in an organization’s purchasing decisions, particularly for high-tech investments. Large purchases take a great deal of preparation and planning, and when organizations evaluate proposals for new procurements, the initial capital cost of the system often receives significant attention. While that cost is a critical component of assessing return on investment (ROI), a joint survey conducted by Hyperion Research, Intel, and Ansys indicates that the initial purchase cost typically accounts for only about half of the total expenses incurred over the system’s useful life. The other half goes to operating and maintaining the system over that life. To fully evaluate ROI, an organization needs to understand the total cost of ownership (TCO).

The complexity of TCO in HPC and AI

Organizations use several methods to obtain resources for high-performance computing (HPC), artificial intelligence (AI), and advanced technical computing, and each method calls for its own TCO calculation. This variety makes cost-related decisions harder.

These methods include:

  • On-premises infrastructure through capital expenditures (CapEx)
  • On-premises infrastructure expensed through operating expense-based (OpEx) business models
  • OpEx-based infrastructure using cloud resources

Each approach encompasses a wide range of expenses influenced by corporate governance directives, IT organizational practices, and industry-specific needs. Additionally, the diversity of simulation workloads complicates TCO comparisons. These factors play a central role in the preparation of TCO calculations, which in turn inform ROI predictions. But why are these TCO predictions so crucial?

Figure: Expected life span of on-premises HPC/AI systems

Importance of TCO

Organizations must continually invest in HPC, AI, and advanced computing to achieve their missions. However, with limited budget increases, providing credible ROI estimates is essential for justifying these investments. Accurate TCO analyses that include all associated costs are crucial for planning procurements and securing funding from senior management. Comparing on-premises and cloud models side by side helps outline the advantages and disadvantages of each approach.

On-premises versus cloud TCO model profiles

On-premises TCO model

On-premises TCO calculations include various elements, such as:

  • Air cooling systems
  • Employee salaries
  • Employee training
  • Energy consumption
  • Facilities-related costs
  • Liquid cooling systems
  • Planned system downtime
  • Subscription-based licensing
  • System maintenance and support

These elements can be grouped into three profiles — simple, moderate, or complex — with each profile building on the previous one. More comprehensive models include additional elements, providing a more complete and credible analysis.

Element                           Complex   Moderate   Simple
Initial purchase price            X         X          X
System maintenance and support    X         X          X
Energy consumption                X         X          X
Subscription-based licensing      X         X          X
Air cooling systems               X         X
Employee salaries                 X         X
Facilities-related costs          X         X
Liquid cooling systems            X
Employee training                 X
Planned system downtime           X
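
As a minimal sketch of how the three profiles build on one another, the Python below totals a lifetime TCO for each profile. Every dollar figure, the five-year life span, and the names `ONPREM_PROFILES` and `onprem_tco` are illustrative assumptions, not survey data.

```python
# Illustrative only: all dollar amounts and the 5-year life span are assumptions.
SIMPLE = {
    "system_maintenance_and_support",
    "energy_consumption",
    "subscription_based_licensing",
}
MODERATE = SIMPLE | {
    "air_cooling_systems",
    "employee_salaries",
    "facilities_related_costs",
}
COMPLEX = MODERATE | {
    "liquid_cooling_systems",
    "employee_training",
    "planned_system_downtime",
}
ONPREM_PROFILES = {"simple": SIMPLE, "moderate": MODERATE, "complex": COMPLEX}


def onprem_tco(purchase_price, annual_costs, profile, years):
    """Return the purchase price plus the profile's recurring costs over `years`."""
    elements = ONPREM_PROFILES[profile]
    annual_total = sum(
        cost for name, cost in annual_costs.items() if name in elements
    )
    return purchase_price + annual_total * years


# Hypothetical annual costs in US dollars.
annual_costs = {
    "system_maintenance_and_support": 400_000,
    "energy_consumption": 1_200_000,
    "subscription_based_licensing": 300_000,
    "air_cooling_systems": 250_000,
    "employee_salaries": 1_500_000,
    "facilities_related_costs": 1_000_000,
    "liquid_cooling_systems": 1_000_000,
    "employee_training": 100_000,
    "planned_system_downtime": 250_000,
}
purchase_price = 30_000_000  # hypothetical initial purchase price

for profile in ("simple", "moderate", "complex"):
    total = onprem_tco(purchase_price, annual_costs, profile, years=5)
    share = purchase_price / total
    print(f"{profile:<9} TCO ${total:,}  purchase share {share:.0%}")
```

With these made-up inputs, the purchase price looks like a larger share of TCO under the simple profile than under the complex one, which is why a model that omits elements can understate lifetime cost.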

Cloud TCO model

Cloud TCO calculations also encompass a wide range of elements, including:

  • Cloud computing costs
  • Storage costs
  • Software licensing costs
  • Networking costs
  • Staff and HPC specialist salaries
  • Employee training costs
  • Additional managed services costs
  • Data egress fees

Cloud TCO models similarly can be grouped into three categories, with more elements leading to a more thorough analysis.

Element                           Complex   Moderate   Simple
Compute                           X         X          X
Storage                           X         X          X
Software licensing                X         X          X
Networking                        X         X          X
Staff and specialist salaries     X         X
Employee training                 X         X
Additional managed services       X
Data egress                       X
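
Unlike on-premises costs, several cloud elements scale with usage. The sketch below separates usage-driven charges (compute, storage, data egress) from fixed annual costs. The `CloudUsage` class, the `cloud_annual_tco` function, the rate keys, and all prices are hypothetical placeholders, not any provider's actual pricing.

```python
from dataclasses import dataclass


@dataclass
class CloudUsage:
    node_hours: int         # compute node-hours consumed per year
    storage_tb_months: int  # terabytes stored, summed over 12 months
    egress_tb: int          # terabytes transferred out of the cloud per year


def cloud_annual_tco(usage, rates, fixed_costs):
    """Return usage-driven charges plus fixed annual costs, in dollars."""
    usage_cost = (
        usage.node_hours * rates["node_hour"]
        + usage.storage_tb_months * rates["storage_tb_month"]
        + usage.egress_tb * rates["egress_tb"]
    )
    return usage_cost + sum(fixed_costs.values())


# Hypothetical rates in US dollars per unit.
rates = {"node_hour": 3, "storage_tb_month": 20, "egress_tb": 90}

# Hypothetical fixed annual costs in US dollars.
fixed_costs = {
    "software_licensing": 300_000,
    "networking": 50_000,
    "staff_and_specialist_salaries": 600_000,
    "employee_training": 40_000,
    "additional_managed_services": 100_000,
}

usage = CloudUsage(node_hours=500_000, storage_tb_months=2_400, egress_tb=200)
print(f"Annual cloud TCO: ${cloud_annual_tco(usage, rates, fixed_costs):,}")
```

Keeping usage and rates as separate inputs makes it easy to rerun the estimate when a workload grows or a price changes.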

Annual TCO model expenses

For on-premises systems, annual expenses for elements like energy consumption, employee salaries, facilities-related costs, and liquid cooling systems often exceed $1 million each. In contrast, average annual cloud TCO expenses are lower, with major costs including cloud compute, storage, and specialist salaries.

Figure: Estimated annual expenses for each on-premises TCO element

Impact of GPUs on TCO models

The adoption of graphics processing units (GPUs) in both on-premises and cloud environments significantly affects TCO. GPUs increase the initial purchase price and drive up power consumption and cooling costs. Survey data shows varying degrees of GPU use across different applications and verticals, impacting TCO calculations.
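
To make the power and cooling effect concrete, here is a rough sketch of an annual energy cost estimate. It uses power usage effectiveness (PUE, the ratio of total facility energy to IT equipment energy) to fold cooling overhead into the figure. PUE is not mentioned in the survey summary, and the power draws, PUE value, and electricity price are all assumptions chosen for illustration.

```python
HOURS_PER_YEAR = 8760  # 24 hours x 365 days


def annual_energy_cost(it_power_kw, pue, price_per_kwh, hours=HOURS_PER_YEAR):
    """Estimate yearly energy cost in dollars for a system at constant load.

    it_power_kw:   average power drawn by the IT equipment, in kilowatts
    pue:           power usage effectiveness (facility energy / IT energy)
    price_per_kwh: electricity price in dollars per kilowatt-hour
    """
    return it_power_kw * pue * hours * price_per_kwh


# Hypothetical systems: a CPU-only cluster and a GPU-accelerated one.
cpu_only = annual_energy_cost(it_power_kw=200, pue=1.4, price_per_kwh=0.10)
with_gpus = annual_energy_cost(it_power_kw=500, pue=1.4, price_per_kwh=0.10)

print(f"CPU-only:  ${cpu_only:,.0f} per year")
print(f"With GPUs: ${with_gpus:,.0f} per year")
```

The estimate assumes constant load all year. A real model would weight power draw by utilization and, per the paragraph above, by how much of each application's runtime actually lands on GPUs.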

Figure: Percentage of on-premises application runtime on graphics processing units (GPUs)

Comparing on-premises and cloud TCO models

Comparing on-premises and cloud TCO models is complex due to the CapEx nature of on-premises infrastructure versus the OpEx nature of cloud resources. Employee salaries and training are significant annual operating expenses in both models. However, differences in organizational and governance models further complicate the comparison. For instance, some organizations integrate their on-premises HPC infrastructure into general IT budgets, while others maintain independent HPC data centers.

When comparing spending, it’s essential to note that workloads may vary significantly. Even at a single site running the same workload both on-premises and in the cloud, the cloud HPC workload isn’t a direct copy of the on-premises workload. Therefore, true comparisons should include other factors like performance — for example, time to solution — and feature capabilities.
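
One possible way to bring time to solution into the comparison is to normalize annual cost by the number of jobs a resource can complete. The sketch below is only one such normalization. The function name `cost_per_job` and every input value are hypothetical, and it assumes jobs of uniform length running at a fixed level of concurrency.

```python
def cost_per_job(annual_tco, hours_per_job, concurrent_jobs, available_hours):
    """Return annual TCO divided by the jobs completed in a year.

    hours_per_job:   time to solution for one job, in hours
    concurrent_jobs: how many jobs run side by side
    available_hours: hours per year the resource is actually usable
    """
    jobs_per_year = available_hours / hours_per_job * concurrent_jobs
    return annual_tco / jobs_per_year


# Hypothetical inputs; real values depend on the site and workload.
on_premises = cost_per_job(
    annual_tco=6_000_000, hours_per_job=12, concurrent_jobs=10, available_hours=8_000
)
cloud = cost_per_job(
    annual_tco=2_400_000, hours_per_job=8, concurrent_jobs=4, available_hours=8_000
)

print(f"On-premises: ${on_premises:,.2f} per job")
print(f"Cloud:       ${cloud:,.2f} per job")
```

A per-job figure like this says nothing about feature capabilities, so it supplements a spending comparison rather than replacing it.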

TCO best practices for users and vendors

Organizations should develop credible TCO analyses to justify major expenditures, especially in an environment of tight IT budgets. Steps to create a TCO model include:

  • Identifying data center expenses
  • Prioritizing expenses
  • Coordinating with finance for anticipated allocations
  • Understanding management’s emphasis on CapEx versus OpEx
  • Testing the model with previous procurements

A comprehensive TCO model facilitates apples-to-apples comparisons between on-premises and cloud resources, despite the inherent complexity.
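
The last step, testing the model with previous procurements, can be sketched as a simple back-test: run the model on systems whose actual lifetime costs are already known and measure how far off it was. The record layout, the system names, and the dollar figures below are hypothetical.

```python
from statistics import mean


def percentage_errors(records):
    """Map each system name to (predicted - actual) / actual."""
    return {
        r["name"]: (r["predicted"] - r["actual"]) / r["actual"] for r in records
    }


def mean_absolute_percentage_error(records):
    """Average size of the model's error, ignoring its direction."""
    return mean(abs(err) for err in percentage_errors(records).values())


# Hypothetical past procurements: model output versus actual lifetime cost.
past_procurements = [
    {"name": "previous_cluster_a", "predicted": 9_000_000, "actual": 10_000_000},
    {"name": "previous_cluster_b", "predicted": 21_000_000, "actual": 20_000_000},
]

for name, err in percentage_errors(past_procurements).items():
    print(f"{name}: {err:+.1%}")
print(f"MAPE: {mean_absolute_percentage_error(past_procurements):.1%}")
```

A consistently negative error would suggest the model is leaving out cost elements, which is a cue to move toward a more comprehensive profile.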

Calculating a better future with HPC and AI

Understanding and calculating TCO is vital for organizations investing in HPC, AI, and advanced computing resources. Vendors should assist customers in developing credible TCO models by providing evidence of how their solutions optimize and minimize expenses while also offering flexibility in accommodating user-specific TCO model requests. In addition, real-world comparisons of on-premises and cloud resource consumption should be included in TCO analyses.

Accurate TCO models provide a “big picture” look at expenses. They enable better ROI assessments, justify expenditures, and support efficient resource use. Both users and vendors play crucial roles in developing and using these models to drive informed decisions in technology investments. When organizations look beyond initial costs to the full TCO, they are better prepared to make long-term investments.

To learn more, register for the webinar “Total Cost of Ownership of HPC and AI for Engineering” or download the full survey.


