The difficulty of parallel programming has led to two conspicuous trends in HPC: the lagging performance of legacy codes and the emergence of the computational scientist.
In my previous two posts I discussed trends in HPC that relate to hardware evolution. The first was the saturation of processor clock speed and the concomitant increase in on-processor core count. The second was the emergence of new compute architectures, most notably GPUs, in the last decade. In this post I will discuss the growing challenge of parallel computing for technical software developers.
A modern high-performance parallel code has at least three levels of parallelism. MPI is used at the top level, across nodes. A node here is defined as a rack-mounted computer running a single instance of the operating system, usually containing two multi-core processors. Within the node there is thread-level parallelism, implemented either in OpenMP or in a lower-level threading library like pthreads. Finally, at the core level, SIMD parallelism is implemented with AVX/AVX2. These are instruction set extensions that let the processor operate on a whole vector of data with a single instruction. For example, the Intel Haswell chip has 256-bit vector registers that hold 8 single precision floating point values. Addition, subtraction, multiplication, division and other operations can be applied to all 8 values in the vector at once. Programming effectively at three different levels of parallelism on multiple computational kernels is exceptionally challenging. Where CPU codes have the general structure of MPI+OpenMP+AVX2, GPU codes have MPI+CUDA. In either case there is significant complexity associated with developing efficient parallel algorithms.
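To make the three levels concrete, here is a minimal sketch of how a single loop over `n` elements might be carved up: first across MPI ranks, then across threads within a rank, then into SIMD vectors of 8 values within a thread. It is plain Python bookkeeping, not real MPI, OpenMP or AVX code. The function names `split_range` and `decompose`, and the rank and thread counts in the example, are assumptions chosen purely for illustration.

```python
def split_range(start, stop, parts):
    """Split [start, stop) into `parts` contiguous chunks of near-equal size."""
    base, extra = divmod(stop - start, parts)
    chunks = []
    lo = start
    for p in range(parts):
        # The first `extra` chunks each take one leftover element.
        hi = lo + base + (1 if p < extra else 0)
        chunks.append((lo, hi))
        lo = hi
    return chunks


def decompose(n, ranks, threads_per_rank, simd_width=8):
    """Map (rank, thread) -> (lo, hi, full_vectors, remainder).

    Level 1: ranks split the global index range (the MPI level).
    Level 2: threads split each rank's range (the OpenMP/pthreads level).
    Level 3: each thread's range is processed `simd_width` values at a time
             (the AVX/AVX2 level); `remainder` is the scalar tail loop.
    """
    plan = {}
    for rank, (r_lo, r_hi) in enumerate(split_range(0, n, ranks)):
        thread_chunks = split_range(r_lo, r_hi, threads_per_rank)
        for thread, (t_lo, t_hi) in enumerate(thread_chunks):
            full_vectors, remainder = divmod(t_hi - t_lo, simd_width)
            plan[(rank, thread)] = (t_lo, t_hi, full_vectors, remainder)
    return plan


# Hypothetical example: 1000 elements, 2 ranks, 4 threads per rank.
plan = decompose(n=1000, ranks=2, threads_per_rank=4)
```

Even in this toy form the developer has to reason about load balance at two levels and about the leftover elements that do not fill a vector. In a real code each level also brings its own communication, synchronization and data layout concerns.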
Over the history of HPC, most codes have migrated from being primarily compute bound in the early days to primarily memory bound for at least the last decade. A compute-bound code is limited by how fast the processor can execute instructions, while a memory-bound code is limited by how fast it receives data from the memory system. Increasing the processor clock would benefit the former and have no effect on the latter. Advances in processor capability have outpaced the capacity of memory systems to deliver data. The multi-core transition in the mid-2000s and the expanding length of the SIMD vector registers have exacerbated this problem by aggregating the demands of multiple cores that have to be fed more data from the same data pipe. To address this so-called "memory wall", CPU vendors have increased cache size and added more layers of memory. This helps because data that is reused in a calculation is more likely to be found in faster-access cache than back in main memory, where access is very slow. GPU vendors have taken a different approach, which hides latency by interleaving the requests from thousands of simultaneous and independent threads, sort of a hyperthreading approach on steroids. Optimizing a memory-bound code is inherently more difficult than optimizing a compute-bound one because of the complexity and variety of multi-level caches. Compilers do some of the work, but the best results in my experience come from careful and attentive coding. To produce optimally performing parallel codes today, developers require a greater understanding of the hardware architecture and memory layout than in the past.
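The distinction between compute bound and memory bound can be put into numbers with a simple estimate in the spirit of the roofline model: attainable performance is the smaller of the processor's peak rate and the memory bandwidth multiplied by the kernel's arithmetic intensity (floating point operations per byte moved). The sketch below is only that, a back-of-the-envelope estimate. The function names and the hardware figures (500 GFLOP/s peak, 60 GB/s bandwidth) are hypothetical values chosen for illustration, not measurements of any particular chip.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Upper bound on performance: limited by compute or by memory traffic."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def classify(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Label a kernel by which limit it hits first on the given machine."""
    # Below this intensity the memory system cannot keep the cores busy.
    ridge = peak_gflops / bandwidth_gb_s
    return "memory bound" if flops_per_byte < ridge else "compute bound"


# Single precision vector add, c[i] = a[i] + b[i]:
# 1 flop per element, 12 bytes moved (two 4-byte loads, one 4-byte store).
vector_add_intensity = 1.0 / 12.0

# Hypothetical machine, assumed numbers for illustration only.
PEAK = 500.0       # GFLOP/s
BANDWIDTH = 60.0   # GB/s

baseline = attainable_gflops(PEAK, BANDWIDTH, vector_add_intensity)
faster_clock = attainable_gflops(2 * PEAK, BANDWIDTH, vector_add_intensity)
faster_memory = attainable_gflops(PEAK, 2 * BANDWIDTH, vector_add_intensity)
kind = classify(PEAK, BANDWIDTH, vector_add_intensity)
```

On these assumed numbers the vector add is limited to roughly 5 GFLOP/s, about 1% of peak. Doubling the peak compute rate leaves that estimate unchanged, while doubling the memory bandwidth doubles it, which is the behavior described above for a memory-bound code.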
Where hardware evolution is guided by Moore's law, I like to say that software is guided by the Law of More. Researchers and programmers need to know more about the underlying hardware, they are spending more time on software development, and the problem is getting more acute with successive generations of hardware. High performance parallel computing is hard and getting harder. The emergence of new architectures like GPUs and the eventual arrival of Intel Xeon Phi will continue this trend. This growing difficulty of parallel programming has led to two additional conspicuous trends in HPC: the lagging performance of legacy codes and the emergence of the computational scientist.
This post is one of several based on themes I presented in a keynote talk delivered in mid-September 2015 at the 2nd annual EAGE workshop on High Performance Computing in the Upstream in Dubai.
Vincent Natoli is the president and founder of Stone Ridge Technology. He is a computational physicist with 30 years of experience in the field of high-performance computing. He holds bachelor's and master's degrees from MIT, a PhD in Physics from the University of Illinois Urbana-Champaign and a master's in Technology Management from the Wharton School at the University of Pennsylvania.