Iris Publishers
CUDA-based Accelerated FEM Calculation
Authored by Cankun Zheng
The Finite Element Method (FEM) had been
widely used in engineering applications. There are two bottleneck issues in the
FEM calculation. The first one is how to minimize computer memory demand so
that to be able to run large-scale simulations using limited computational resources.
The second one is how to carry out simulation as fast as possible to complete a
large-scale simulation job. A standard finite element analysis always results
in solving simultaneous linear equations. Overall, a global stiffness matrix of
a finite element simulation often has sparse, symmetric, and positive definite
characteristics. To overcome the first issue, we proposed integrating and
storing stiffness matrix using the Compressed Sparse Row (CSR) format in this
paper. We proposed to carry out parallel computing to solve simultaneous
equations using the cuBlAS, cuSPARSE, and cuSOLVER libraries on the Compute
Unified Device Architecture (CUDA) platform to overcome the second issue. The
aim of this article is to explore the possibility to solve simultaneous
equations with sparse matrix using CUDA libraries. Three algorithms of cuSOLVER
libraries were evaluated through case studies. The three algorithms include the
QR factorization, incomplete LU factorization, and the conjugate gradient
methods. We have evaluated the efficiency, accuracy and capacity of the
algorithms. Two type of case studies were investigated in the numerical
experiments. First type of sparse matrix were downloaded from the Florida
Sparse Matrix Collection and second type of matrix were assembled from truss
structure using FEM. Numerical experiment results showed that, the accuracy of
incomplete LU factorization is unacceptable, the capacity of the QR
factorization is weak, since the GPU memory demand is higher. The efficiency,
capacity and accuracy of the conjugate gradient methods all are satisfied.
However, it may break down for some particular cases.
Keywords: FEM; cuSPARSE; cuBLAS; cuSOLVER;
Compressed sparse row format matrix; QR Factorization; Incomplete LU
Factorization; Conjugate gradient
Introduction
For many large-scale FEM calculations,
solving sparse linear equations takes up most of the computing time and
resources. Taking the static analysis of structures using the finite element
method as an example, the solution of sparse linear equations usually accounts
for more than 70% of the whole analysis time [1]. With the continuous expansion
of engineering scale and the continuous improvement of engineering complexity,
higher requirements are put forward for the scale and speed of solving sparse
linear equations. Therefore, the use of high-performance computing is a
critical way to solve sparse linear equations.
Nowadays, the direct method and the
iterative method are the two most popular methods to solve sparse linear
equations. The direct process is to transform the coefficient matrix of the
equationsinto a triangular matrix and then use the back-substitution method or
the pursuit method to get the solution of the equations. The iterative approach
is a process of constructing an infinite sequence to approximate the exact
solution from an approximate initial valueof the solution. The rest of the paper
is organized as follows: Section 2 gives a brief description of the CUDA
platform. We focus onthree essential libraries of CUDA Toolkit, cuBLAS,
cuSOLVER, and cuSPARSE. Section 3 defines the CSR storage format, which was used
in this paper. Section 4 demonstrates how cuBLAS, cuSOLVER and cuSPARSE were
implemented in detail and error analysis. Section 5 shows two types of case
studies to solve sparse simultaneousequations. In the first type of case study,
the matrix was download from the Sparse Matrix Library [2]. The second type of
case study is an example to analyze truss structure using FEM. Section 6 is the
Global Journal of Engineering Sciences Volume 8-Issue 4Citation: Cankun Zheng,
Wenhao Lin, Jiujiang Zhu. CUDA-based Accelerated FEM Calculation. Glob J Eng
Sci. 8(4): 2021. GJES.MS.ID.000690. DOI: 10.33552/GJES.2021.08.000690.Page 2 of
8 conclusion.
CUDA
Platform
The CUDA platform allows developers to use
the C language as a development language and also supports other programming languages
or interfaces. Under the CUDA architecture, a program is divided into two
parts: the Host side, which is executed serially on the CPU, and the Device
side (also known as the kernel), which is implemented in parallel on the graphics
card. Usually, the execution model of a complete program is that the host side
prepares the data and copies it to the graphics card’s memory. The GPU executes
the kernel program of the device side for parallel calculation. After the
calculation is completed, the result is sent back to the host side from the
memory of the graphics card [3]. The cuBLAS library implements BLAS (Basic
Linear Algebra Subprograms) on top of the NVIDIA ®CUDATM runtime [4]. The cuBLAS
library is mainly used for matrix operation. It contains two sets of API. One
is the commonly used cuBLAS API, which requires users to allocate GPU memory
space and fill in data according to the specified format. The other one is
cuBLASXT API, which an assign data at
the CPU end, then call function. It will automatically manage memory and
perform computation [5]. In practice, the first set of APIs is usually used.
The cuSPARSE library contains a set of basic
linear algebra subroutines used for handling sparse matrices. The goal of the
standard library is a matrix with a certain number of (structurally) zero elements
that represent >95% of the total entries. It is implemented on top of the
NVIDIA ®CUDATM runtime (part of CUDA Toolkit). And it is designed to be called
from C and C++ [6].
cuSPARSE is acollection of column
interfaces. And the main advantage is that itonly needs to call the
corresponding API when the users write the host code. So, it can save a lot of
development time, and this library also supports many data formats. cuSlOLVER
is an advanced library. It is based on the cuBLAS and cuSPARES library and
includes matrix factorization and solving equations. It contains three independent
library files, including cuolverDN, cuSolverSP, and cuSolverRF. The cuSolverDN
can performdense matrix factorization (LU, QR, SVD, LDLT), while the cuSolverSP
is a new toolset to manipulate sparse matrices, such as to solvesimultaneous
equations with sparse matrix. The cuSolverRF is asparse refactoring package
[7].
Storage
Format
In solving the simultaneous sparse linear
equation, the coefficient matrix is sparse. Therefore, the selection of storage
format is crucial to reduce the storage of unnecessary zeros, and it is also imperative
to improve the performance of its operation. We mainly introduce the compressed
sparse row format used in this paper. CSR format is a mainstream general sparse
matrix representation format, which stores column indices and nonzero values in
array columns and array values [8].
Implementation
Method and Relative Error Analysis Based on cuBlAS, cuSPARSE, and cuSOLVER
The solution methods of large sparse linear
equations usually include the direct method and the iterative method. QR
Factorization QR factorization is based on matrix factorization. Based on elimination,
the linear equations are transformed into several equivalent sub-problems,
which are easy to be calculated and solved in turn. Theoretically, the exact
solution of the system of equations can be obtained by the direct method in a
fixed number of steps without considering the accumulative numerical error.
• Apply for variable memory space on the
GPU and transfer Col, row, Val and b from the CPU to the GPU.
• Find the final X value by calling
cusolverSpDcsrlsvqr.
• Return the X value from the GPU side to
the CPU side.
Incomplete
LU Factorization
Incomplete LU factorization is an iterative
solution method, which is related to the LU factorization method. The LU
factorization method is a direct solution method. When the matrix A is asparse
matrix, the matrix L and U lose the sparse characteristics of the matrix A. The
storage cost of the matrix that loses the sparse property will increase, so the
incomplete LU factorization method Citation: Cankun Zheng, Wenhao Lin, Jiujiang
Zhu. CUDA-based Accelerated FEM Calculation. Glob J Eng Sci. 8(4): 2021.
GJES.MS.ID.000690. DOI: 10.33552/GJES.2021.08.000690.Global Journal of
Engineering Sciences Volume 8-Issue 4 Page 3 of 8 is proposed in [9], so that
the iterative matrix K in A = K – R satisfies:
K = L X U and the matrices L and U have the
sparse property. The algorithm of incomplete LU factorization can be described as
follows: Operations on sparse matrices stored in CSR format have been provided
in the cuSPARSE library. Call the cusparseDcsrilu02 function to decompose the
matrix into a unit lower triangular matrix Land an upper triangular matrix U.
Call the cusparseDcsrsv2 _ solve function to find the value of the matrix Z,
and then repeat the call to find the final value X. In this way, the algorithm
of incomplete LU factorization is completed by calling the corresponding
function through the cuSPARSE library. This article refers to this algorithm as
cuSPARSE_LU. The specific algorithm flow is as follows:
• Apply for variable memory space on the
GPU and transfer Col, row, Val and b from the CPU to the GPU.
• By calling cusparseCreateMatDescr,
cusparseSetMatIndex Base, cusparseSetMatType, cusparSESetMatFillMode, CusparseSetMatDiagType
creates descriptors of matrix A, matrix L, and matrix U, and opens the
corresponding memory by calling csru02Info_t and csrsv2Info_t.
• Analyze the LU by calling
cusparseDcsru02_analysis, cusparseDcsrsv2_analysis.
• The matrix K is decomposed into a lower
triangular matrix Land an upper triangular matrix U by calling
cusparseDcsrilu02 and cusparseXcsrilu02_zeroPivot.
• Solve the value of matrix Z by calling
cusparseDcsrsv2_to solve.
• Find the final X value by calling
cusparseDcsrsv2_to solve.
• Return the X value from the GPU side to
the CPU side.
Conjugate
gradient method
The conjugate gradient method is one of the
most commonly used iterative methods for solving linear equations [10]. The conjugate
gradient method uses successive approximation, generally using the iterative
formula to get a series of approximate solutionsgradually approaching the real
solution, and finally get the approximate solution satisfying certain error
tolerance.
The conjugate gradient algorithm is
described as follows:
1) Set Initial guessed value X0.
The solution of this conjugate gradient
method calls the cuBLAS and cuSPARSE libraries. Although each step of the
conjugate gradient method is serial, it mainly involves sparse matrix and
vector multiplication, vector and vector dot multiplication, vector update, and
scalar division in the whole algorithm process. These operations can all
involve data-level parallelism, thus enabling GPU parallel computing. That is
to say, GPU is responsible for the parallel calculation between matrix and
vector, vector and vector before iteration and in each iteration, and CPU is
responsible for the control of iteration loop and convergence condition, as
well as the operation of scalar multiplication. Operations on sparse matrices
store in CSR format have been provided in the cuSPARSE library. By calling the
cusparseSpMV function in the cuSPARSE library, a sparse matrix stored in CSR
format can be multiplied by a vector. The operation of two-vector dot
multiplication can be realized by calling the cublasDdot function in the cuBLAS
library. The vector update is implemented by calling the cublasDaxpy function
in the cuBLAS library. In this way, the cuSPARSE library and the cuBLAS library
can be used to call the corresponding functions for the conjugate gradient
algorithm. This article refers to this algorithm as cuSPARSE_CG.
The specific algorithm flow path is as
follows:
Global Journal of Engineering Sciences
Volume 8-Issue 4Citation: Cankun Zheng, Wenhao Lin, Jiujiang Zhu. CUDA-based
Accelerated FEM Calculation. Glob J Eng Sci. 8(4): 2021. GJES.MS.ID.000690.
DOI: 10.33552/GJES.2021.08.000690.
Page 4 of 8
• Apply memory space for variables on the
GPU, and trans-
Comparison of QR Factorization, Incomplete
LU Factorization and Conjugate Gradient Method for Solving Truss Structure
using FEM
The overall stiffness matrix of the truss
elements in this test is simple, the length L, cross-sectional area A and
elastic modulus E of the truss are all set as 1, and the relative residual
error is set at 10-6. The simple overall stiffness matrix results in that the
QR factorization and incomplete LU factorization method have break neck
calculation speeds. For example, to solve 100000 orders matrix equations, the
computational time of the conjugate gradient is about 4.07 times more than that
of the incomplete LU factorization, while the conjugate gradient computational
time is about 2.64 times more than that of the QR factorization. Table 4
compares the computation time of solving the global stiffness matrix of the
truss elements using the QR factorization, incomplete LU factorization, and the
conjugate gradient method. The relative error analysis and comparison are shown
in Table 5. The accuracies of both QR factorization and incomplete LU
factorization are satisfied. For large-scale simulation, the conjugate gradient
method was not convergent, i.e. the iteration steps are less than the total
freedom of the system. For this particular structure, the conjugate gradient
method only convergent when iteration steps equal the total freedom of the
system. If the iteration steps equal the total freedom of the system, iteration
solution equals exact solution. Table 4: Comparison of calculation time of
three algorithms for truss structure.
Reason
analysis: Although the computational
efficiency of the conjugate gradient method and QR factorization in the Florida
sparse library is slower than the incomplete LU factorization, the relative
error of QR factorization and the conjugate gradient is much smaller than the
incomplete LU factorization. Due to in incomplete LU factorization, some
elements of the matrix R in Eq. (5) were dopped off; this causes incomplete LU
factorization to be just an approximate method. Our testing results demonstrate
that the accuracy of incomplete LU factorization is unacceptable for accurate
engineering simulation. In practice, usually, incomplete LU factorization is
only suitable to be used as a preprocessing. Although
both efficiency and accuracy of QR
factorization are satisfied, our numerical experiments show that the GPU memory
consumption of QR factorization is much higher than the other two methods. It means
QR factorization is not suitable for large scall simulation, at least for some
cheap GPUs. Our testing shows that, in most cases, the efficiency, accuracy, and
capacity of conjugate gradient all are good enough for FEM simulation; however
conjugate gradient may break down for some particular cases. For example, in
our truss structure FEM simulation, if the iteration step is less than the total
freedoms of the system, the residual error keeps as constant without change. If
and only if when the iteration step equals the freedoms of the system, the residual
error becomes zero, i.e., the iteration solution reaches the exact solution.
For large-scale simulation, it is not feasible to make iteration steps equal
the freedoms of the system. It needs a very long iteration time.
Conclusion
and Future Work
In this paper, we have compared the
computational efficiency, accuracy, and capability of three algorithms: QR
factorization, incomplete LU factorization, and conjugate gradient method using
cuBlAS, cuSPARSE, and cuSOLVER libraries. We have reached the following conclusions:
In terms of computational efficiency, the
incomplete LU factorization is the fastest, and the conjugate gradient is the
slowest. The running speed of the incomplete LU factorization is 11.6 times
faster than that of the conjugate gradient. The running speed of the QR
factorization is 3.3 times faster than that of the conjugate gradient. In
terms of computational accuracy, both the QR factorization and the conjugate
gradient methods are good enough for accurate engineering simulation. The
incomplete LU factorization method is unacceptable. In terms of computational
capacity, both the incomplete LU factorization and the conjugate gradient
methods are acceptable. The QR factorization method requires too much GPU memory,
so it is not suitable for large-scale simulation. To compare all three
algorithms, we will find the conjugate gradient method is the best candidate
for large-scale FEM simulation. However, the conjugate gradient method may
break down for some particular cases. We proposed to combine the incomplete LU
factorization and the conjugate gradient method in our future work. To use
incomplete LU factorization as a preprocessing to carry out the precondition
conjugate gradient simulation.
Acknowledgment
This research is supported by the project
of “High Education funding of Guangdong Province: 2018KZDXM072”; the College
Students’ Innovation and Entrepreneurship Project: 5031700602O6; the Natural
Science Foundation of Guangdong Province:
Conflict
of Interest
No conflict of interest.
To read more about
this article: https://irispublishers.com/gjes/pdf/GJES.MS.ID.000691.pdf
Indexing List of Iris
Publishers: https://medium.com/@irispublishers/what-is-the-indexing-list-of-iris-publishers-4ace353e4eee
Iris
publishers google scholar citations: https://scholar.google.co.in/scholar?hl=en&as_sdt=0%2C5&q=irispublishers&btnG=
Comments
Post a Comment