Moderator: Dennis Abts (NVIDIA Corporation)
Panelists: Steve Scott (Microsoft Corporation), Duncan Roweth (Hewlett Packard Enterprise (HPE)), Larry Dennison (NVIDIA Corporation), Jeffrey Vetter (Oak Ridge National Laboratory (ORNL)), Venkatram Vishwanath (Argonne National Laboratory (ANL)), Alireza Ghaffarkhah (Google LLC), Brian Towles (Google LLC)
Abstract: High-performance computing (HPC) and machine learning (ML) systems share many design goals. Despite this commonality, large-scale systems are increasingly heterogeneous, with SmartNICs, DPUs, CPUs, GPUs, and FPGAs all intermingled in a system organization that can exploit that heterogeneity across the entire system. The interconnection network ties together these heterogeneous processing elements to provide a consistent system-wide programming model for putting those resources to work.
Every large-scale workload requires both computation and communication as two sides of the same coin – computed results must be communicated to and consumed by other cooperating processing elements. This panel discussion seeks to explore whether domain-specific accelerators (GPUs, TPUs, TSPs, etc.) require a similarly domain-specific network to extract performance from the accelerator at the system level. This raises the question: “Are we converging (toward converged HPC/ML) or diverging for these performance-critical workloads?”