Andrew Moa Blog Site

Compile rocblas-rocm-6.2.4 under Windows

While demonstrating matrix operation acceleration earlier, I wanted to try AMD’s own ROCm. After compiling and running the program, I hit the following error:

rocBLAS error: Cannot read D:\example\efficiency_v3\rocm\build\Release\/rocblas/library/TensileLibrary.dat: No such file or directory for GPU arch : gfx1150
 List of available TensileLibrary Files : 

According to the official website, ROCm does not support the Radeon 880M integrated graphics (Ryzen AI 9 365 processor)[1]. It will not work unless you compile rocBLAS yourself.
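For context, rocBLAS loads its Tensile kernel library for the detected GPU architecture when a BLAS routine such as GEMM is invoked, which is when this lookup fails. Below is a minimal sketch of the kind of call that triggers it, not the article’s actual benchmark; it assumes a working HIP runtime, the ROCm 6.x header layout, and placeholder matrix sizes and data:

```c
/* Minimal rocBLAS SGEMM sketch (assumes ROCm 6.x headers; in older
 * releases the include path is <rocblas.h>). Error handling is kept
 * to the one status check that matters here. */
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    const int n = 512;                      /* square matrices for simplicity */
    const float alpha = 1.0f, beta = 0.0f;
    size_t bytes = (size_t)n * n * sizeof(float);

    float *hA = malloc(bytes), *hB = malloc(bytes), *hC = malloc(bytes);
    for (int i = 0; i < n * n; i++) { hA[i] = 1.0f; hB[i] = 2.0f; }

    float *dA, *dB, *dC;
    hipMalloc((void **)&dA, bytes);
    hipMalloc((void **)&dB, bytes);
    hipMalloc((void **)&dC, bytes);
    hipMemcpy(dA, hA, bytes, hipMemcpyHostToDevice);
    hipMemcpy(dB, hB, bytes, hipMemcpyHostToDevice);

    rocblas_handle handle;
    rocblas_create_handle(&handle);

    /* C = alpha * A * B + beta * C; this call is where rocBLAS looks up
     * the Tensile library for the current GPU arch (e.g. gfx1150). */
    rocblas_status st = rocblas_sgemm(handle,
                                      rocblas_operation_none,
                                      rocblas_operation_none,
                                      n, n, n,
                                      &alpha, dA, n, dB, n,
                                      &beta, dC, n);
    if (st != rocblas_status_success)
        fprintf(stderr, "rocblas_sgemm failed: %d\n", (int)st);

    hipMemcpy(hC, dC, bytes, hipMemcpyDeviceToHost);
    printf("C[0] = %f\n", hC[0]);           /* expect 2 * n */

    rocblas_destroy_handle(&handle);
    hipFree(dA); hipFree(dB); hipFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```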

12 minutes to read · Andrew Moa

Matrix multiplication operation (Ⅲ) - using MPI parallel acceleration

MPI is a message-passing standard for parallel computing and is currently the most commonly used programming interface on high-performance computing clusters. MPI communicates through inter-process messages and can employ multiple cores across nodes for parallel computing, which OpenMP cannot do. MPI has implementations on different platforms, such as MS-MPI and Intel MPI on Windows, and OpenMPI and MPICH on Linux.

1. MPI parallel acceleration loop calculation

1.1 C Implementation

An MPI program must initialize the runtime and establish its message-passing mechanism; the array to be computed is then partitioned and distributed to the different processes. With OpenMP and other parallel libraries, these operations happen internally and the programmer need not care how they are implemented. Using MPI, by contrast, the programmer must manually allocate the global and per-process local buffers and control every message exchange, which undoubtedly adds learning cost.
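To make that extra bookkeeping concrete, here is a minimal sketch (not the article’s full implementation) in which rank 0 owns the global array, MPI_Scatter distributes local slices, and MPI_Gather collects the results; it assumes the array length divides evenly by the process count:

```c
/* Minimal MPI sketch: rank 0 owns the global buffer, every rank gets
 * a local slice, computes on it, and results are gathered back. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);            /* initialize the MPI runtime */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1024;                /* global problem size */
    int local_n = n / size;            /* assumes size divides n evenly */

    double *global = NULL;
    if (rank == 0) {                   /* global buffer lives on rank 0 only */
        global = malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) global[i] = (double)i;
    }

    double *local = malloc(local_n * sizeof(double));

    /* distribute slices of the global array to every process */
    MPI_Scatter(global, local_n, MPI_DOUBLE,
                local, local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int i = 0; i < local_n; i++)  /* each rank computes its part */
        local[i] *= 2.0;

    /* collect the partial results back on rank 0 */
    MPI_Gather(local, local_n, MPI_DOUBLE,
               global, local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        printf("global[n-1] = %f\n", global[n - 1]);  /* expect 2*(n-1) */
        free(global);
    }
    free(local);
    MPI_Finalize();
    return 0;
}
```

A sketch like this would typically be built with an MPI compiler wrapper and launched through the runtime, e.g. `mpicc sketch.c -o sketch && mpiexec -n 4 ./sketch`.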

34 minutes to read · Andrew Moa

Matrix multiplication operation (Ⅱ) - Accelerated operation based on BLAS library

BLAS was originally developed in Fortran as a linear algebra library and was later ported to C/C++. As a core component of modern high-performance computing, it has become a de facto standard. There are open-source implementations such as Netlib BLAS, GotoBLAS, and its successor OpenBLAS. Commercially, each vendor ships an implementation tuned for its own platform, such as Intel’s MKL, NVIDIA’s cuBLAS, and AMD’s AOCL and ROCm. Some are optimized for CPUs, while others use GPU parallel acceleration. This article implements matrix operations with several BLAS libraries and analyzes the performance differences between the implementations.
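The CPU-side libraries generally share the standard CBLAS interface, so code written against it can be relinked against Netlib BLAS, OpenBLAS, or MKL without changes. As a minimal illustration (sketched here against an OpenBLAS-style cblas.h; the header name and link flags vary by distribution), a DGEMM call looks like this:

```c
/* Minimal CBLAS DGEMM sketch: C = 1.0 * A * B + 0.0 * C, row-major. */
#include <cblas.h>
#include <stdio.h>

int main(void)
{
    enum { M = 2, K = 3, N = 2 };
    double A[M * K] = {1, 2, 3,
                       4, 5, 6};            /* 2x3, row-major */
    double B[K * N] = {1, 0,
                       0, 1,
                       1, 1};               /* 3x2, row-major */
    double C[M * N] = {0};

    /* leading dimensions are the row lengths in row-major layout */
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                M, N, K, 1.0, A, K, B, N, 0.0, C, N);

    for (int i = 0; i < M; i++)
        printf("%6.1f %6.1f\n", C[i * N], C[i * N + 1]);
    /* expected output:
          4.0    5.0
         10.0   11.0 */
    return 0;
}
```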
23 minutes to read · Andrew Moa