Parallel and High-Performance Computing Project

For the course Design of Parallel and High-Performance Computing we were tasked with implementing and optimizing an algorithm in a distributed setting. We decided to analyze, implement, and optimize the distributed sum of outer products (DSOP), an operation commonly used in deep learning. One approach to accelerating the training of deep neural networks is to replicate the model across multiple nodes and have each node independently compute the forward and backward passes. The gradients, however, need to be synchronized across the nodes. This is where DSOP comes in.

The operation is as follows: there are p nodes, each holding two vectors, one of size n and one of size m. We need to compute the outer product of each node's pair of vectors (an n×m matrix) and sum all of these products. In the end, every node must hold a copy of the resulting matrix.
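To make that concrete, here is a minimal serial sketch of the operation itself (the function name and the row-major layout are illustrative choices of mine, not taken from our codebase):

```cpp
#include <cstddef>
#include <vector>

// Sum of outer products: given p pairs of vectors (x_i of size n, y_i of size m),
// compute S = sum_i x_i * y_i^T, an n-by-m matrix stored row-major.
std::vector<double> sum_of_outer_products(
    const std::vector<std::vector<double>>& xs,  // p vectors of size n
    const std::vector<std::vector<double>>& ys)  // p vectors of size m
{
    const std::size_t n = xs[0].size(), m = ys[0].size();
    std::vector<double> S(n * m, 0.0);
    for (std::size_t i = 0; i < xs.size(); ++i)      // one outer product per node
        for (std::size_t r = 0; r < n; ++r)
            for (std::size_t c = 0; c < m; ++c)
                S[r * m + c] += xs[i][r] * ys[i][c];
    return S;
}
```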

There are two obvious first approaches, sketched below: either we first distribute all the vectors, and then each node computes every outer product and sums them up; or each node first computes the outer product of the vectors it already holds, then all the matrices are distributed, and finally each node sums them up. The first approach involves a lot of redundant computation but minimal data transfer (each node receives p*(n+m) values). The second avoids the redundant outer-product computation but potentially moves far more data (each node receives p*n*m values).
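In MPI terms, the two baselines look roughly like this (a simplified illustration rather than our actual code; error handling is omitted and the function names are mine):

```cpp
#include <mpi.h>
#include <algorithm>
#include <vector>

// Baseline 1: gather all vectors, then compute everything locally.
// Each node receives p*(n+m) values but redoes all p outer products.
void dsop_gather_vectors(const std::vector<double>& x,  // local vector, size n
                         const std::vector<double>& y,  // local vector, size m
                         std::vector<double>& S,        // output, size n*m
                         int p, int n, int m)
{
    std::vector<double> all_x(p * n), all_y(p * m);
    MPI_Allgather(x.data(), n, MPI_DOUBLE, all_x.data(), n, MPI_DOUBLE, MPI_COMM_WORLD);
    MPI_Allgather(y.data(), m, MPI_DOUBLE, all_y.data(), m, MPI_DOUBLE, MPI_COMM_WORLD);
    std::fill(S.begin(), S.end(), 0.0);
    for (int i = 0; i < p; ++i)
        for (int r = 0; r < n; ++r)
            for (int c = 0; c < m; ++c)
                S[r * m + c] += all_x[i * n + r] * all_y[i * m + c];
}

// Baseline 2: compute the local outer product, then sum the matrices.
// No redundant computation, but each node receives p*n*m values.
void dsop_reduce_matrices(const std::vector<double>& x, const std::vector<double>& y,
                          std::vector<double>& S, int n, int m)
{
    std::vector<double> local(n * m);
    for (int r = 0; r < n; ++r)
        for (int c = 0; c < m; ++c)
            local[r * m + c] = x[r] * y[c];
    MPI_Allreduce(local.data(), S.data(), n * m, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
}
```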

In this project, we explored many different approaches, both as models and as concrete implementations, and we tested and benchmarked them on ETH Zurich's HPC cluster Euler III. In the final version we came up with, each node is assigned a chunk of the final matrix to compute before the chunks are distributed. This approach yielded a median speedup of around 2.5x. A complete overview can be found in our paper, along with the source code on GitHub.
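To give a feel for the chunked idea, here is a rough sketch (assuming, purely for brevity, that n is divisible by p; the scheduling in our actual implementation differs in details covered in the paper):

```cpp
#include <mpi.h>
#include <vector>

// Chunked scheme: gather the vectors, let each rank compute only its own
// block of rows of S, then exchange the blocks so everyone has the full matrix.
void dsop_chunked(const std::vector<double>& x, const std::vector<double>& y,
                  std::vector<double>& S,  // output, size n*m
                  int p, int rank, int n, int m)
{
    std::vector<double> all_x(p * n), all_y(p * m);
    MPI_Allgather(x.data(), n, MPI_DOUBLE, all_x.data(), n, MPI_DOUBLE, MPI_COMM_WORLD);
    MPI_Allgather(y.data(), m, MPI_DOUBLE, all_y.data(), m, MPI_DOUBLE, MPI_COMM_WORLD);

    const int rows = n / p, first = rank * rows;  // this rank's block of rows
    std::vector<double> chunk(rows * m, 0.0);
    for (int i = 0; i < p; ++i)
        for (int r = 0; r < rows; ++r)
            for (int c = 0; c < m; ++c)
                chunk[r * m + c] += all_x[i * n + first + r] * all_y[i * m + c];

    // Every rank contributes its block; afterwards each holds the full matrix.
    MPI_Allgather(chunk.data(), rows * m, MPI_DOUBLE,
                  S.data(), rows * m, MPI_DOUBLE, MPI_COMM_WORLD);
}
```

The appeal of this layout is that each outer product is computed exactly once across all ranks, while the matrix traffic is split into blocks instead of every rank shipping a full n×m matrix.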

I learned a lot from this project. First and foremost, after the intensive lectures in parallel and high-performance computing, it allowed me to go in depth and to validate and apply the theory I had learned. It was also my first serious project writing C++ and MPI code, and my first time running everything on an HPC cluster. As this was a five-person project, a solid amount of communication and coordination between the students was necessary. While everyone took on that role in part, it helped me gain experience in managing, evaluating, and coordinating work.