Nvidia Air Redefines Infrastructure-as-Code

Enterprises increasingly use infrastructure-as-code (IaC) to systematically provision cloud resources and containerized workloads. IaC is a critical part of modern software development pipelines: it ensures consistency and helps companies respond to problems or experiment with new business ideas.

At the Nvidia GTC conference, Nvidia engineers detailed their work to build a digital twin of data center infrastructure. This work promises to extend IaC and continuous integration/continuous deployment practices to physical data center design.

Nvidia has used these new tools internally to improve its own data center design and is now starting to integrate them into Nvidia Air. This will complement the company's other digital twin offerings, such as Nvidia Drive for autonomous vehicles, Isaac for robots and Clara for healthcare.

With Nvidia Air, enterprises can build a complete digital twin of a data center’s physical and logical layout before installing the first switch in the data center. They can then continue to use the same simulations, visualizations and AI tools once the data center is in production. Today, most design assets are essentially put away and forgotten once a data center goes live, which in many ways reflects the old waterfall style of testing and development before Agile emerged.

Lost assets

These challenges are only becoming more complex with the need for new AI infrastructure that pushes the boundaries of computing, networking, storage, power and thermal management. “Many classic supercomputers cost millions of dollars and take months or even years to implement,” said Marc Hamilton, vice president of solutions architecture and engineering at Nvidia.

Designing a data center is an extremely complex team sport that draws on diverse skills. The building itself and the arrangement of racks and other components can be designed in Autodesk tools. The cables, servers, switches and storage are designed with various 3D CAD tools.

Teams often turn to other tools, such as Ansys computational fluid dynamics simulations, to model airflow and heat. These kinds of simulations are usually done during design, but once the computer goes into production, the operations team never sees them. If a problem arises, the operations team must start over to figure out how to improve airflow or address an overheating problem.

Nvidia worked with design tools from many vendors in the past, and the resulting files were often incompatible across engineering teams. Transferring files between tools was generally time-consuming, and in some cases the formats could not be converted at all. When an engineer changed the layout to improve thermal properties, the change wasn’t always passed on to the teams designing heat sinks or cable routing.

Design for reuse

So Nvidia turned to the Omniverse to see if there was a better way to connect these workflows. Omniverse is built on top of a common database called Nucleus, which lets all of the engineering tools present their data in a format shared across tools and teams. The Omniverse helps teams move back and forth between a photorealistic view of the data center as built, overlaid with live thermal data, and an analysis of the predicted impact of various changes, such as moving two busy servers further apart.

Most engineering simulations are run on powerful workstations. The Omniverse lets teams move more of the complex engineering and simulation workloads to tens of thousands of GPUs in the cloud and then share the results with the enterprise and its partners.

Another benefit of feeding simulations back into the Omniverse is that new simulations can take advantage of improvements in the core algorithms. One of the most important aspects of data center design is the computational fluid dynamics used to understand the airflow, heating and cooling of the system. Hamilton’s team worked with Nvidia Modulus, a software development kit that uses AI to build surrogate models for physics. This allows them to simulate many more scenarios, such as small differences in temperature settings or physical placement, in the same amount of time.
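To make the surrogate idea concrete, here is a minimal Python sketch. It is not Modulus and does not use its API: a plain least-squares line stands in for the AI-based surrogate, and `slow_thermal_simulation`, its made-up formula and the spacing values are all hypothetical placeholders for an expensive CFD run. The point is only the workflow: run the slow simulation a few times, fit a cheap approximation, then sweep many scenarios with the approximation.

```python
def slow_thermal_simulation(spacing_m: float) -> float:
    """Hypothetical stand-in for an expensive CFD run.

    Returns an inlet temperature (deg C) for a given server spacing (m).
    The formula is invented purely for illustration.
    """
    return 45.0 - 8.0 * spacing_m


def fit_linear_surrogate(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares.

    Returns a cheap callable that approximates the slow simulation.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


# Run the slow simulation only a handful of times...
training_spacings = [0.5, 1.0, 1.5, 2.0]
training_temps = [slow_thermal_simulation(s) for s in training_spacings]
surrogate = fit_linear_surrogate(training_spacings, training_temps)

# ...then evaluate many candidate layouts cheaply with the surrogate.
candidates = [0.5 + 0.01 * i for i in range(151)]  # 0.50 m to 2.00 m
best_spacing = min(candidates, key=surrogate)  # spacing with lowest predicted temperature
```

A real surrogate would be trained on far richer inputs (geometry, fan speeds, workloads) and validated against held-out simulation runs before anyone relied on its predictions.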

Nvidia is now extending these modeling capabilities to Base Command, its set of tools for monitoring and managing data center services. Today, when conditions change in the data center, such as a temperature spike, teams have only a rough idea of what could be causing it.

Nvidia is also exploring ways to extend Omniverse simulation capabilities to support logical infrastructure. This would make it easier to develop and test best practices for network setup, power line construction and other tasks. This was one of the reasons Nvidia acquired Mellanox. “We started thinking about applying tools like Omniverse for simulation, prediction and monitoring before making changes to the network,” Hamilton says.

Hardware DevOps

Amit Katz, vice president of Nvidia Spectrum Platform, said the use of digital twins in data center design is akin to the adoption of automation in the data center at the turn of the century. In the 1990s, engineers typically typed CLI commands directly into live data center environments. And sometimes they typed the wrong commands.

At the turn of the century, developers began to develop and deliver IaC against test environments that mimicked the real thing. Tools such as service virtualization and test harnesses allowed teams to simulate API calls to enterprise and external services before putting things into production. Now, in 2022, Katz believes the world is going through a similar transition toward simulating physical infrastructure as well.

Katz said, “We’re seeing digital twins for end-to-end data center validation, not just for switches, but for the entire data center.” Later, Nvidia Air could act as a recommendation engine for proposing and prioritizing fixes and changes to data center designs and layout.

This can also simplify the exchange of assets and configurations between teams. In the same way that IaC made sure that development, testing and operations teams were working with the same code, this will extend the same benefits to developers, network operators and data scientists using this infrastructure.

The vision is that digital twins help teams plan the data center down to every cable entry. Then, as teams begin installing systems, the digital twin makes it easier to check that each cable is routed correctly and, if not, what needs to change. If something later goes wrong, such as a malfunction or a power supply failure, teams can pre-test different solutions in the digital twin and make changes with greater confidence of success.
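The cable check described above can be pictured as a comparison between the planned cabling in the twin and what is actually observed on the floor. The Python sketch below is an assumption-laden illustration, not Nvidia Air's data model or API: the port names, the dictionary layout and the `diff_cabling` helper are all hypothetical.

```python
def diff_cabling(planned, observed):
    """Compare planned cabling with observed cabling.

    Both arguments map a source port to the port it connects to.
    Returns a list of (source_port, issue, expected_target, actual_target).
    """
    issues = []
    for source_port, planned_target in planned.items():
        actual_target = observed.get(source_port)
        if actual_target is None:
            # The plan calls for a cable that is not installed.
            issues.append((source_port, "missing", planned_target, None))
        elif actual_target != planned_target:
            # A cable is installed but lands on the wrong port.
            issues.append((source_port, "misrouted", planned_target, actual_target))
    for source_port, actual_target in observed.items():
        if source_port not in planned:
            # A cable exists that the plan does not know about.
            issues.append((source_port, "unexpected", None, actual_target))
    return issues


# Hypothetical plan from the digital twin and a hypothetical on-site survey.
planned = {
    "leaf01:swp1": "server01:eth0",
    "leaf01:swp2": "server02:eth0",
    "leaf01:swp3": "server03:eth0",
}
observed = {
    "leaf01:swp1": "server01:eth0",
    "leaf01:swp2": "server03:eth0",  # plugged into the wrong server
    "leaf01:swp4": "server04:eth0",  # not in the plan
}

cabling_issues = diff_cabling(planned, observed)
for source_port, issue, expected, actual in cabling_issues:
    print(f"{source_port}: {issue} (expected {expected}, found {actual})")
```

In practice the observed side would come from discovery data in the live network rather than a hand-written dictionary.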

This would help close the loop between the greater flexibility available in the cloud and the better cost available for on-premise deployments.

“You can think of it as cloud flexibility with on-premises economy,” Katz said.

This post, Nvidia Air Redefines Infrastructure-as-Code, was originally published at https://venturebeat.com/2022/03/25/nvidia-air-gives-new-meaning-to-infrastructure-as-code/