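Matrix Multiplication (Part 2): Accelerating with BLAS Libraries
BLAS began as a linear algebra library written in Fortran and was later made available to C/C++. As a core component of modern high-performance computing, it has become a de facto standard interface. Open-source implementations include Netlib BLAS, GotoBLAS, and its successor OpenBLAS. Vendors also ship implementations tuned for their own platforms, such as Intel's MKL, NVIDIA's CUDA (cuBLAS), and AMD's AOCL and ROCm (rocBLAS). Some of these are optimized for CPUs, while others rely on GPU parallelism. This article implements matrix multiplication with several BLAS libraries and analyzes the performance differences between them.
1. CPU-Parallel BLAS Libraries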
1.1 Intel MKL
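main.c is the same as in Section 2.1 of "Matrix Multiplication (Part 1): Accelerating Loops with OpenMP". blas.c includes MKL's CBLAS header and calls the gemm functions to perform the matrix multiplication.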
#include <mkl_cblas.h>

/* Single-precision C = A * B for n x n row-major matrices (alpha = 1, beta = 0). */
void matrix_multiply_float(int n, float A[], float B[], float C[])
{
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0f, A, n, B, n, 0.0f, C, n);
}

/* Double-precision C = A * B for n x n row-major matrices (alpha = 1, beta = 0). */
void matrix_multiply_double(int n, double A[], double B[], double C[])
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n, 1.0, A, n, B, n, 0.0, C, n);
}
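To make the gemm arguments above concrete, the following is a minimal, unoptimized sketch of what a row-major, no-transpose sgemm call computes: C = alpha * A * B + beta * C. The name ref_sgemm_rowmajor is a hypothetical helper introduced here for illustration; it is not part of MKL or any BLAS library, and it only mirrors the meaning of the m, n, k, alpha, lda, ldb, beta, and ldc arguments.

```c
#include <stddef.h>

/* Reference row-major SGEMM without transposition: C = alpha * A * B + beta * C.
 * A is m x k (leading dimension lda), B is k x n (ldb), C is m x n (ldc).
 * ref_sgemm_rowmajor is a hypothetical name, not a real BLAS routine. */
void ref_sgemm_rowmajor(int m, int n, int k,
                        float alpha, const float *A, int lda,
                        const float *B, int ldb,
                        float beta, float *C, int ldc)
{
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            /* Dot product of row i of A and column j of B. */
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += A[(size_t)i * lda + p] * B[(size_t)p * ldb + j];
            }
            /* BLAS convention: with beta == 0 the old contents of C are not read. */
            float *c = &C[(size_t)i * ldc + j];
            *c = (beta == 0.0f) ? alpha * acc : alpha * acc + beta * (*c);
        }
    }
}
```

With m = n = k, lda = ldb = ldc = n, alpha = 1, and beta = 0, this reduces to the plain product C = A * B that matrix_multiply_float requests from MKL. A triple loop like this is useful for checking results on small matrices, not for performance.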
CMakeLists.txt links both MKL and OpenMP. MKL's default threading layer is built on OpenMP, so the OpenMP runtime must be linked as well.
cmake_minimum_required(VERSION 3.14)
project(mkl LANGUAGES C)
set(CMAKE_C_STANDARD 11)
set(EXECUTE_FILE_NAME ${PROJECT_NAME}_${CMAKE_C_COMPILER_FRONTEND_VARIANT}_${CMAKE_C_COMPILER_ID}_${CMAKE_C_COMPILER_VERSION})
string(TOLOWER ${EXECUTE_FILE_NAME} EXECUTE_FILE_NAME)
set(MKL_LINK static)
# Enable OpenMP
find_package(OpenMP REQUIRED)
# Enable MKL
find_package(MKL CONFIG REQUIRED)
set(SRC_LIST
    src/main.c
    src/blas.c
)
add_executable(${EXECUTE_FILE_NAME} ${SRC_LIST})
target_compile_options(${EXECUTE_FILE_NAME} PUBLIC
    $<TARGET_PROPERTY:MKL::MKL,INTERFACE_COMPILE_OPTIONS>
)
target_include_directories(${EXECUTE_FILE_NAME} PUBLIC
    $<TARGET_PROPERTY:MKL::MKL,INTERFACE_INCLUDE_DIRECTORIES>
)
target_link_libraries(${EXECUTE_FILE_NAME} PUBLIC
    OpenMP::OpenMP_C
    $<LINK_ONLY:MKL::MKL>
)
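The test machine is an AMD laptop with an AI 9 365w processor. The program was built and run on Windows with clang-cl from VS2022. The Release build produces the following output: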
PS D:\example\efficiency_v3\c\mkl\build\Release> ."D:/example/efficiency_v3/c/mkl/build/Release/mkl_msvc_clang_19.1.5.exe" -l 10 -n 12
Using float precision for matrix multiplication.
1 : 4096 x 4096 Matrix multiply wall time : 0.218935s(627.761Gflops)
2 : 4096 x 4096 Matrix multiply wall time : 0.211711s(649.183Gflops)
3 : 4096 x 4096 Matrix multiply wall time : 0.215178s(638.722Gflops)
4 : 4096 x 4096 Matrix multiply wall time : 0.223452s(615.072Gflops)
5 : 4096 x 4096 Matrix multiply wall time : 0.202687s(678.085Gflops)
6 : 4096 x 4096 Matrix multiply wall time : 0.203175s(676.455Gflops)
7 : 4096 x 4096 Matrix multiply wall time : 0.225790s(608.702Gflops)
8 : 4096 x 4096 Matrix multiply wall time : 0.204435s(672.287Gflops)
9 : 4096 x 4096 Matrix multiply wall time : 0.217666s(631.421Gflops)
10 : 4096 x 4096 Matrix multiply wall time : 0.217374s(632.270Gflops)
Average Gflops: 642.996, Max Gflops: 678.085
Average Time: 0.214040s, Min Time: 0.202687s
PS D:\example\efficiency_v3\c\mkl\build\Release> ."D:/example/efficiency_v3/c/mkl/build/Release/mkl_msvc_clang_19.1.5.exe" -l 10 -n 12 -double
Using double precision for matrix multiplication.
1 : 4096 x 4096 Matrix multiply wall time : 0.400238s(343.393Gflops)
2 : 4096 x 4096 Matrix multiply wall time : 0.365257s(376.280Gflops)
3 : 4096 x 4096 Matrix multiply wall time : 0.375613s(365.906Gflops)
4 : 4096 x 4096 Matrix multiply wall time : 0.353108s(389.226Gflops)
5 : 4096 x 4096 Matrix multiply wall time : 0.380444s(361.260Gflops)
6 : 4096 x 4096 Matrix multiply wall time : 0.381736s(360.036Gflops)
7 : 4096 x 4096 Matrix multiply wall time : 0.392378s(350.272Gflops)
8 : 4096 x 4096 Matrix multiply wall time : 0.382949s(358.897Gflops)
9 : 4096 x 4096 Matrix multiply wall time : 0.401440s(342.365Gflops)
10 : 4096 x 4096 Matrix multiply wall time : 0.413794s(332.143Gflops)
Average Gflops: 357.978, Max Gflops: 389.226
Average Time: 0.384696s, Min Time: 0.353108s
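The Gflops figures in the logs above are consistent with counting 2n^3 floating-point operations per n x n multiply (one multiplication and one addition for each inner-product term) and dividing by the wall time. The sketch below reproduces that arithmetic. The helper name gemm_gflops is hypothetical, and the source of main.c is not shown here, so this is an assumption about how the numbers are derived rather than the author's actual code.

```c
#include <assert.h>

/* Hypothetical helper: throughput in Gflop/s for one n x n by n x n multiply,
 * assuming 2 * n^3 floating-point operations in total. */
double gemm_gflops(int n, double seconds)
{
    assert(n > 0 && seconds > 0.0);
    double flops = 2.0 * (double)n * (double)n * (double)n;
    return flops / seconds / 1e9;
}
```

For example, gemm_gflops(4096, 0.218935) gives roughly 627.76, matching the first float run, and gemm_gflops(4096, 0.400238) gives roughly 343.39, matching the first double run. The reported "Average Gflops" equals the arithmetic mean of the ten per-run Gflops values.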
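To run Intel MKL programs properly on an AMD machine, it is recommended to set the environment variable MKL_DEBUG_CPU_TYPE=5. Note that this variable is reported to be ignored by MKL 2020 Update 1 and later, so it may have no effect with recent oneMKL releases.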