Case Study: German University

Maximizing Advanced Research with AI and Machine Learning

Industry: Research, academia, materials science and engineering

Customer: a prominent German university

Introduction

This German university is renowned for its science and engineering programs in disciplines such as materials science, energy engineering, communication systems, and electrical engineering, and it ranks among the top 10 in Germany according to the QS World University Rankings. The university has produced many outstanding scientists and engineers, including multiple Nobel Prize winners.

Machine learning (ML) in chemistry and molecular dynamics (MD) have become increasingly important in many areas of research, including statistical physics, biology, and materials science. Together, ML and MD make it possible to simulate realistic, complex physical models numerically. Demand for these simulations is growing rapidly and requires state-of-the-art high-performance computing (HPC). To bolster its scientific research, the university sought an updated GPU cluster with the following requirements:

1) Provide at least 32 GPU nodes (ML nodes) for ML chemistry research, with high double-precision computing capabilities.

2) Provide as many GPU nodes (MD nodes) as possible for MD research, with high single-precision computing capabilities.
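The distinction between double-precision and single-precision requirements can be illustrated with a small numerical experiment. The following is a minimal sketch, not part of the university's workload: it repeatedly adds 0.1 and compares the accumulated rounding error in the two formats. The helper names to_float32 and accumulate are hypothetical, and single precision is emulated on the CPU with Python's standard struct module rather than measured on a GPU.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (IEEE 754 double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(step: float, n: int, single: bool) -> float:
    """Add `step` to a running total n times.

    When `single` is True, the step and the running total are rounded to
    single precision after every addition, emulating float32 arithmetic.
    """
    if single:
        step = to_float32(step)
    total = 0.0
    for _ in range(n):
        total += step
        if single:
            total = to_float32(total)
    return total


n = 100_000
expected = 10_000.0  # the mathematically exact value of n * 0.1

err_single = abs(accumulate(0.1, n, single=True) - expected)
err_double = abs(accumulate(0.1, n, single=False) - expected)

print(f"single-precision error: {err_single:.3e}")
print(f"double-precision error: {err_double:.3e}")
```

The single-precision error is many orders of magnitude larger than the double-precision error. This is one reason why some workloads specify double precision, while others accept single precision in exchange for higher throughput.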

Challenges

1) Machine learning and chemistry research

The university uses machine learning extensively for experimental research in chemistry, and its ever-growing dataset now covers a wide range of chemistry research areas. Today, advanced ML models can make accurate predictions across a chemistry dataset after training on only 1-2% of the data. This data efficiency and accuracy have driven a sharp rise in demand for parallel HPC.

2) Molecular dynamics applications

The university uses MD to simulate the time-dependent properties of macromolecules and of biological systems such as proteins, and to support pharmacology research. These simulations require immense computing power, and demand for them is growing rapidly.

Solution

The Aivres AI+HPC solution, built on AMD-CPU servers, has greatly enhanced the university’s scientific research capabilities. The floating-point performance for model inference and training exceeded the university’s original expectations by 115%.

Based on the university’s requirements and initial requests, Aivres made alternative recommendations that were better optimized for the client’s use case. For the MD nodes, the university originally requested a 4x A100 NVLink GPU server architecture. Aivres recommended an 8x A100 solution with Multi-Instance GPU (MIG), which maximized performance while minimizing costs. For the ML nodes, the university originally requested RTX 3090 GPUs. Aivres recommended A40 GPUs, a data center GPU with better CUDA core, Tensor Core, video memory, and compute specifications.

The university had relatively low requirements for server PCIe expansion, making the Aivres NF5488A5 and NF5468A5 servers an ideal choice. Alternative server products offer more PCIe slot expansion capacity, but this is a costly feature that the university would not fully utilize. Consequently, the NF5488A5 and NF5468A5 servers fulfilled all project requirements while also reducing procurement costs.

Detailed solution:

1) Create an MD node with advanced computing performance

Because molecular simulation theory is complex, the HPC requirements in this field are extremely high. Aivres recommended its NF5488A5 AI server. The NF5488A5 uses the NVIDIA NVSwitch interconnect to link 8x NVIDIA A100 SXM4 Tensor Core 40GB GPUs at high speed, combining their AI performance, communication bandwidth, and video memory capacity to deliver very high performance in a 4U space. Configured with 2x AMD Rome CPUs with PCIe 4.0 and the xGMI-2 interconnect, it provides powerful general-purpose computing performance that can meet an assortment of HPC MD requirements.
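To give a sense of what MIG means for an MD node with 8x A100 40GB GPUs, the sketch below counts how many isolated GPU instances a single node could expose under each MIG profile. The per-GPU limits are taken from NVIDIA's published MIG profiles for the A100 40GB. The case study does not say which profile the university chose, so the function name instances_per_node and the assumption that every GPU uses the same profile are illustrative only.

```python
# GPUs per MD node, as stated in the case study (8x A100 40GB).
GPUS_PER_NODE = 8

# Maximum number of identical MIG instances per A100 40GB GPU, by profile name.
# Assumption: every GPU in the node is partitioned with the same profile.
MAX_INSTANCES_PER_GPU = {
    "1g.5gb": 7,
    "2g.10gb": 3,
    "3g.20gb": 2,
    "4g.20gb": 1,
    "7g.40gb": 1,
}


def instances_per_node(profile: str, gpus: int = GPUS_PER_NODE) -> int:
    """Return how many MIG instances one node exposes for a given profile."""
    return MAX_INSTANCES_PER_GPU[profile] * gpus


for profile in MAX_INSTANCES_PER_GPU:
    print(f"{profile:>8}: {instances_per_node(profile):2d} instances per node")
```

With the smallest profile, one node can serve up to 56 independent jobs, which suits many small MD runs. The full-GPU profile keeps all 8 GPUs whole for larger simulations.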

2) ML nodes supporting diverse machine learning tasks for chemistry research

The software and algorithms used for ML and chemistry research have diverse computing characteristics. Different software packages use the CPU, memory, and disk very differently, and computing resource utilization also varies considerably from task to task. Therefore, Aivres recommended the NF5468A5 AI server, which supports 8x NVIDIA A100 GPUs fully interconnected with third-generation NVLink in a 4U space. A single machine can provide 5 petaFLOPS of training performance with extremely high data throughput. This is sufficient to process massive ML chemistry datasets while improving training efficiency.

Results and Impact

The university’s HPC system, integrated with Aivres AI servers, is successfully running applications such as TensorFlow, PyTorch, Quantum ESPRESSO, and VASP, as well as scientific research software such as NAMD, LAMMPS, AMBER, VMD, and GROMACS. In addition, the university ran HPL tests using the hpc-benchmarks:21.4-hpl Docker image and measured 80 TFLOPS. Drawing on the project’s HPL requirements and its deep understanding of HPC for scientific computing, Aivres’ HPC application analysis team delivered a performance optimization that raised the result by roughly 15% to 91 TFLOPS, greatly exceeding the customer’s expectations.
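The HPL figures above can be sanity-checked with a few lines of arithmetic. The sketch below uses only the two values quoted in this case study (80 TFLOPS before optimization and 91 TFLOPS after). The function name relative_gain is illustrative, and both inputs are assumed to be rounded to whole TFLOPS.

```python
def relative_gain(before: float, after: float) -> float:
    """Return the fractional improvement of `after` over `before`."""
    if before <= 0:
        raise ValueError("baseline must be positive")
    return (after - before) / before


baseline_tflops = 80.0   # HPL result before optimization, as quoted
optimized_tflops = 91.0  # HPL result after optimization, as quoted

gain = relative_gain(baseline_tflops, optimized_tflops)
print(f"HPL improvement: {gain:.2%}")
```

With the rounded figures, the computed gain is 13.75%. That is close to the reported boost, and the gap is plausibly due to rounding in the quoted TFLOPS values.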