Andrew Moa Blog Site

Matrix multiplication operation (Ⅲ) - using MPI parallel acceleration

MPI is a message-passing standard for parallel computing and currently the most widely used programming interface on high-performance computing clusters. It communicates through messages between processes and can employ cores across multiple nodes for parallel computation, which OpenMP cannot do. MPI has implementations on different platforms, such as MS-MPI and Intel MPI on Windows, and OpenMPI and MPICH on Linux.

1. MPI parallel acceleration of loop calculation

1.1 C Implementation

MPI requires initializing the program's runtime environment and establishing a message-passing mechanism; the arrays to be computed must also be partitioned and distributed to the individual processes. With OpenMP and other parallel libraries these operations are handled internally, and the programmer does not need to know how they are implemented. Using MPI, however, the programmer must manually allocate the global and local buffers of each process and control every message exchange, which undoubtedly adds extra learning cost.
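As a rough sketch of this pattern (assuming square N×N matrices with N divisible by the number of processes; the buffer and function names are illustrative, not taken from the article), the root process scatters rows of A, broadcasts B to every process, each process multiplies its own block of rows, and the partial results are gathered back on the root:

```c
#include <mpi.h>
#include <stdlib.h>

#define N 1024  /* matrix dimension, assumed divisible by the number of processes */

int main(int argc, char *argv[])
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                          /* rows of A handled by each process */
    double *a = NULL, *c = NULL;
    double *b = malloc(N * N * sizeof(double));   /* every process holds a full copy of B */
    double *a_local = malloc(rows * N * sizeof(double));
    double *c_local = calloc(rows * N, sizeof(double));

    if (rank == 0) {                              /* only the root owns the global matrices */
        a = malloc(N * N * sizeof(double));
        c = malloc(N * N * sizeof(double));
        for (int i = 0; i < N * N; i++) {
            a[i] = (double)rand() / RAND_MAX;
            b[i] = (double)rand() / RAND_MAX;
        }
    }

    /* distribute rows of A and broadcast B to all processes */
    MPI_Scatter(a, rows * N, MPI_DOUBLE, a_local, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);
    MPI_Bcast(b, N * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    /* each process computes its own block of rows of C = A * B */
    for (int i = 0; i < rows; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                c_local[i * N + j] += a_local[i * N + k] * b[k * N + j];

    /* collect the partial results back on the root process */
    MPI_Gather(c_local, rows * N, MPI_DOUBLE, c, rows * N, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(a_local); free(c_local); free(b);
    if (rank == 0) { free(a); free(c); }
    MPI_Finalize();
    return 0;
}
```

A program like this would be launched through `mpiexec` or `mpirun` with the desired number of processes, for example `mpiexec -n 8 ./matmul_mpi`.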

34 minutes to read
Andrew Moa

Matrix multiplication operation (Ⅱ) - accelerated operations based on the BLAS library

BLAS was originally developed in Fortran as a linear algebra library and was later given C/C++ interfaces. As a core component of modern high-performance computing, its interface has become a standard. There are open-source implementations such as Netlib BLAS, GotoBLAS, and its successor OpenBLAS. Commercially, each vendor provides an implementation for its own platform, such as Intel's MKL, NVIDIA's cuBLAS in CUDA, and AMD's AOCL and ROCm. Some are optimized for CPUs, while others use GPUs for parallel acceleration. This article implements matrix operations with different BLAS libraries and analyzes the performance differences between the implementations.
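As a minimal sketch (assuming the CBLAS interface with row-major, double-precision matrices; the wrapper function name is illustrative), a single `cblas_dgemm` call can replace a hand-written triple loop:

```c
#include <cblas.h>

/* Compute C = A * B for n-by-n row-major matrices via the CBLAS interface. */
void matmul_blas(int n, const double *a, const double *b, double *c)
{
    /* dgemm computes C = alpha * A * B + beta * C; here alpha = 1.0, beta = 0.0 */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, a, n,
                b, n,
                0.0, c, n);
}
```

The same call works against any conforming implementation; only the link line changes, e.g. `-lopenblas` for OpenBLAS or the vendor-specific link options for MKL.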
23 minutes to read
Andrew Moa

Matrix multiplication operation (I) - using OpenMP to speed up loop calculation

Mention matrices and anyone who has studied science or engineering will recall the dread of linear algebra classes. Matrix multiplication is indispensable in all kinds of industrial and scientific numerical computation and appears in many benchmarking suites; the time a matrix multiplication takes is an important indicator of a computer's floating-point performance. The purpose of this article is to compare the performance of different implementation approaches, and of different computing platforms, through matrix multiplication, providing a reference for high-performance computing development.
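As a minimal sketch of the loop-based approach such a comparison typically starts from (the kernel shown is an assumption, not necessarily the article's exact code), the naive triple loop can be parallelized by letting OpenMP distribute the outer loop across threads:

```c
/* Naive triple-loop multiplication C = A * B for n-by-n row-major matrices,
 * with the outer loop distributed across threads by OpenMP. */
void matmul_omp(int n, const double *a, const double *b, double *c)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}
```

Compiling with the appropriate flag (e.g. `-fopenmp` for GCC/Clang) enables the pragma; without it the code still builds and runs serially.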
42 minutes to read
Andrew Moa