Iris Publishers

Authored by Cankun Zheng

The Finite Element Method (FEM) had been widely used in engineering applications. There are two bottleneck issues in the FEM calculation. The first one is how to minimize computer memory demand so that to be able to run large-scale simulations using limited computational resources. The second one is how to carry out simulation as fast as possible to complete a large-scale simulation job. A standard finite element analysis always results in solving simultaneous linear equations. Overall, a global stiffness matrix of a finite element simulation often has sparse, symmetric, and positive definite characteristics. To overcome the first issue, we proposed integrating and storing stiffness matrix using the Compressed Sparse Row (CSR) format in this paper. We proposed to carry out parallel computing to solve simultaneous equations using the cuBlAS, cuSPARSE, and cuSOLVER libraries on the Compute Unified Device Architecture (CUDA) platform to overcome the second issue. The aim of this article is to explore the possibility to solve simultaneous equations with sparse matrix using CUDA libraries. Three algorithms of cuSOLVER libraries were evaluated through case studies. The three algorithms include the QR factorization, incomplete LU factorization, and the conjugate gradient methods. We have evaluated the efficiency, accuracy and capacity of the algorithms. Two type of case studies were investigated in the numerical experiments. First type of sparse matrix were downloaded from the Florida Sparse Matrix Collection and second type of matrix were assembled from truss structure using FEM. Numerical experiment results showed that, the accuracy of incomplete LU factorization is unacceptable, the capacity of the QR factorization is weak, since the GPU memory demand is higher. The efficiency, capacity and accuracy of the conjugate gradient methods all are satisfied. However, it may break down for some particular cases.

Keywords: FEM; cuSPARSE; cuBLAS; cuSOLVER; Compressed sparse row format matrix; QR Factorization; Incomplete LU Factorization; Conjugate gradient

Introduction

For many large-scale FEM calculations, solving sparse linear equations takes up most of the computing time and resources. Taking the static analysis of structures using the finite element method as an example, the solution of sparse linear equations usually accounts for more than 70% of the whole analysis time [1]. With the continuous expansion of engineering scale and the continuous improvement of engineering complexity, higher requirements are put forward for the scale and speed of solving sparse linear equations. Therefore, the use of high-performance computing is a critical way to solve sparse linear equations.

Nowadays, the direct method and the iterative method are the two most popular methods to solve sparse linear equations. The direct process is to transform the coefficient matrix of the equationsinto a triangular matrix and then use the back-substitution method or the pursuit method to get the solution of the equations. The iterative approach is a process of constructing an infinite sequence to approximate the exact solution from an approximate initial valueof the solution. The rest of the paper is organized as follows: Section 2 gives a brief description of the CUDA platform. We focus onthree essential libraries of CUDA Toolkit, cuBLAS, cuSOLVER, and cuSPARSE. Section 3 defines the CSR storage format, which was used in this paper. Section 4 demonstrates how cuBLAS, cuSOLVER and cuSPARSE were implemented in detail and error analysis. Section 5 shows two types of case studies to solve sparse simultaneousequations. In the first type of case study, the matrix was download from the Sparse Matrix Library [2]. The second type of case study is an example to analyze truss structure using FEM. Section 6 is the Global Journal of Engineering Sciences Volume 8-Issue 4Citation: Cankun Zheng, Wenhao Lin, Jiujiang Zhu. CUDA-based Accelerated FEM Calculation. Glob J Eng Sci. 8(4): 2021. GJES.MS.ID.000690. DOI: 10.33552/GJES.2021.08.000690.Page 2 of 8 conclusion.

CUDA Platform

The CUDA platform allows developers to use the C language as a development language and also supports other programming languages or interfaces. Under the CUDA architecture, a program is divided into two parts: the Host side, which is executed serially on the CPU, and the Device side (also known as the kernel), which is implemented in parallel on the graphics card. Usually, the execution model of a complete program is that the host side prepares the data and copies it to the graphics card’s memory. The GPU executes the kernel program of the device side for parallel calculation. After the calculation is completed, the result is sent back to the host side from the memory of the graphics card [3]. The cuBLAS library implements BLAS (Basic Linear Algebra Subprograms) on top of the NVIDIA ®CUDATM runtime [4]. The cuBLAS library is mainly used for matrix operation. It contains two sets of API. One is the commonly used cuBLAS API, which requires users to allocate GPU memory space and fill in data according to the specified format. The other one is cuBLASXT API, which an assign data at the CPU end, then call function. It will automatically manage memory and perform computation [5]. In practice, the first set of APIs is usually used.

The cuSPARSE library contains a set of basic linear algebra subroutines used for handling sparse matrices. The goal of the standard library is a matrix with a certain number of (structurally) zero elements that represent >95% of the total entries. It is implemented on top of the NVIDIA ®CUDATM runtime (part of CUDA Toolkit). And it is designed to be called from C and C++ [6].

cuSPARSE is acollection of column interfaces. And the main advantage is that itonly needs to call the corresponding API when the users write the host code. So, it can save a lot of development time, and this library also supports many data formats. cuSlOLVER is an advanced library. It is based on the cuBLAS and cuSPARES library and includes matrix factorization and solving equations. It contains three independent library files, including cuolverDN, cuSolverSP, and cuSolverRF. The cuSolverDN can performdense matrix factorization (LU, QR, SVD, LDLT), while the cuSolverSP is a new toolset to manipulate sparse matrices, such as to solvesimultaneous equations with sparse matrix. The cuSolverRF is asparse refactoring package [7].

Storage Format

In solving the simultaneous sparse linear equation, the coefficient matrix is sparse. Therefore, the selection of storage format is crucial to reduce the storage of unnecessary zeros, and it is also imperative to improve the performance of its operation. We mainly introduce the compressed sparse row format used in this paper. CSR format is a mainstream general sparse matrix representation format, which stores column indices and nonzero values in array columns and array values [8].

Implementation Method and Relative Error Analysis Based on cuBlAS, cuSPARSE, and cuSOLVER

The solution methods of large sparse linear equations usually include the direct method and the iterative method. QR Factorization QR factorization is based on matrix factorization. Based on elimination, the linear equations are transformed into several equivalent sub-problems, which are easy to be calculated and solved in turn. Theoretically, the exact solution of the system of equations can be obtained by the direct method in a fixed number of steps without considering the accumulative numerical error.

• Apply for variable memory space on the GPU and transfer Col, row, Val and b from the CPU to the GPU.

• Find the final X value by calling cusolverSpDcsrlsvqr.

• Return the X value from the GPU side to the CPU side.

Incomplete LU Factorization

Incomplete LU factorization is an iterative solution method, which is related to the LU factorization method. The LU factorization method is a direct solution method. When the matrix A is asparse matrix, the matrix L and U lose the sparse characteristics of the matrix A. The storage cost of the matrix that loses the sparse property will increase, so the incomplete LU factorization method Citation: Cankun Zheng, Wenhao Lin, Jiujiang Zhu. CUDA-based Accelerated FEM Calculation. Glob J Eng Sci. 8(4): 2021. GJES.MS.ID.000690. DOI: 10.33552/GJES.2021.08.000690.Global Journal of Engineering Sciences Volume 8-Issue 4 Page 3 of 8 is proposed in [9], so that the iterative matrix K in A = K – R satisfies:

K = L X U and the matrices L and U have the sparse property. The algorithm of incomplete LU factorization can be described as follows: Operations on sparse matrices stored in CSR format have been provided in the cuSPARSE library. Call the cusparseDcsrilu02 function to decompose the matrix into a unit lower triangular matrix Land an upper triangular matrix U. Call the cusparseDcsrsv2 _ solve function to find the value of the matrix Z, and then repeat the call to find the final value X. In this way, the algorithm of incomplete LU factorization is completed by calling the corresponding function through the cuSPARSE library. This article refers to this algorithm as cuSPARSE_LU. The specific algorithm flow is as follows:

• Apply for variable memory space on the GPU and transfer Col, row, Val and b from the CPU to the GPU.

• By calling cusparseCreateMatDescr, cusparseSetMatIndex Base, cusparseSetMatType, cusparSESetMatFillMode, CusparseSetMatDiagType creates descriptors of matrix A, matrix L, and matrix U, and opens the corresponding memory by calling csru02Info_t and csrsv2Info_t.

• Analyze the LU by calling cusparseDcsru02_analysis, cusparseDcsrsv2_analysis.

• The matrix K is decomposed into a lower triangular matrix Land an upper triangular matrix U by calling cusparseDcsrilu02 and cusparseXcsrilu02_zeroPivot.

• Solve the value of matrix Z by calling cusparseDcsrsv2_to solve.

• Find the final X value by calling cusparseDcsrsv2_to solve.

• Return the X value from the GPU side to the CPU side.

Conjugate gradient method

The conjugate gradient method is one of the most commonly used iterative methods for solving linear equations [10]. The conjugate gradient method uses successive approximation, generally using the iterative formula to get a series of approximate solutionsgradually approaching the real solution, and finally get the approximate solution satisfying certain error tolerance.

The conjugate gradient algorithm is described as follows:

1) Set Initial guessed value X0.

The solution of this conjugate gradient method calls the cuBLAS and cuSPARSE libraries. Although each step of the conjugate gradient method is serial, it mainly involves sparse matrix and vector multiplication, vector and vector dot multiplication, vector update, and scalar division in the whole algorithm process. These operations can all involve data-level parallelism, thus enabling GPU parallel computing. That is to say, GPU is responsible for the parallel calculation between matrix and vector, vector and vector before iteration and in each iteration, and CPU is responsible for the control of iteration loop and convergence condition, as well as the operation of scalar multiplication. Operations on sparse matrices store in CSR format have been provided in the cuSPARSE library. By calling the cusparseSpMV function in the cuSPARSE library, a sparse matrix stored in CSR format can be multiplied by a vector. The operation of two-vector dot multiplication can be realized by calling the cublasDdot function in the cuBLAS library. The vector update is implemented by calling the cublasDaxpy function in the cuBLAS library. In this way, the cuSPARSE library and the cuBLAS library can be used to call the corresponding functions for the conjugate gradient algorithm. This article refers to this algorithm as cuSPARSE_CG.

The specific algorithm flow path is as follows:

Global Journal of Engineering Sciences Volume 8-Issue 4Citation: Cankun Zheng, Wenhao Lin, Jiujiang Zhu. CUDA-based Accelerated FEM Calculation. Glob J Eng Sci. 8(4): 2021. GJES.MS.ID.000690. DOI: 10.33552/GJES.2021.08.000690.

Page 4 of 8

• Apply memory space for variables on the GPU, and trans-

Comparison of QR Factorization, Incomplete LU Factorization and Conjugate Gradient Method for Solving Truss Structure using FEM

The overall stiffness matrix of the truss elements in this test is simple, the length L, cross-sectional area A and elastic modulus E of the truss are all set as 1, and the relative residual error is set at 10-6. The simple overall stiffness matrix results in that the QR factorization and incomplete LU factorization method have break neck calculation speeds. For example, to solve 100000 orders matrix equations, the computational time of the conjugate gradient is about 4.07 times more than that of the incomplete LU factorization, while the conjugate gradient computational time is about 2.64 times more than that of the QR factorization. Table 4 compares the computation time of solving the global stiffness matrix of the truss elements using the QR factorization, incomplete LU factorization, and the conjugate gradient method. The relative error analysis and comparison are shown in Table 5. The accuracies of both QR factorization and incomplete LU factorization are satisfied. For large-scale simulation, the conjugate gradient method was not convergent, i.e. the iteration steps are less than the total freedom of the system. For this particular structure, the conjugate gradient method only convergent when iteration steps equal the total freedom of the system. If the iteration steps equal the total freedom of the system, iteration solution equals exact solution. Table 4: Comparison of calculation time of three algorithms for truss structure.

Reason analysis: Although the computational efficiency of the conjugate gradient method and QR factorization in the Florida sparse library is slower than the incomplete LU factorization, the relative error of QR factorization and the conjugate gradient is much smaller than the incomplete LU factorization. Due to in incomplete LU factorization, some elements of the matrix R in Eq. (5) were dopped off; this causes incomplete LU factorization to be just an approximate method. Our testing results demonstrate that the accuracy of incomplete LU factorization is unacceptable for accurate engineering simulation. In practice, usually, incomplete LU factorization is only suitable to be used as a preprocessing. Although

both efficiency and accuracy of QR factorization are satisfied, our numerical experiments show that the GPU memory consumption of QR factorization is much higher than the other two methods. It means QR factorization is not suitable for large scall simulation, at least for some cheap GPUs. Our testing shows that, in most cases, the efficiency, accuracy, and capacity of conjugate gradient all are good enough for FEM simulation; however conjugate gradient may break down for some particular cases. For example, in our truss structure FEM simulation, if the iteration step is less than the total freedoms of the system, the residual error keeps as constant without change. If and only if when the iteration step equals the freedoms of the system, the residual error becomes zero, i.e., the iteration solution reaches the exact solution. For large-scale simulation, it is not feasible to make iteration steps equal the freedoms of the system. It needs a very long iteration time.

Conclusion and Future Work

In this paper, we have compared the computational efficiency, accuracy, and capability of three algorithms: QR factorization, incomplete LU factorization, and conjugate gradient method using cuBlAS, cuSPARSE, and cuSOLVER libraries. We have reached the following conclusions:

 In terms of computational efficiency, the incomplete LU factorization is the fastest, and the conjugate gradient is the slowest. The running speed of the incomplete LU factorization is 11.6 times faster than that of the conjugate gradient. The running speed of the QR factorization is 3.3 times faster than that of the conjugate gradient.  In terms of computational accuracy, both the QR factorization and the conjugate gradient methods are good enough for accurate engineering simulation. The incomplete LU factorization method is unacceptable. In terms of computational capacity, both the incomplete LU factorization and the conjugate gradient methods are acceptable. The QR factorization method requires too much GPU memory, so it is not suitable for large-scale simulation. To compare all three algorithms, we will find the conjugate gradient method is the best candidate for large-scale FEM simulation. However, the conjugate gradient method may break down for some particular cases. We proposed to combine the incomplete LU factorization and the conjugate gradient method in our future work. To use incomplete LU factorization as a preprocessing to carry out the precondition conjugate gradient simulation.

Acknowledgment

This research is supported by the project of “High Education funding of Guangdong Province: 2018KZDXM072”; the College Students’ Innovation and Entrepreneurship Project: 5031700602O6; the Natural Science Foundation of Guangdong Province:

Conflict of Interest

No conflict of interest.

To read more about this article: https://irispublishers.com/gjes/pdf/GJES.MS.ID.000691.pdf

Indexing List of Iris Publishers: https://medium.com/@irispublishers/what-is-the-indexing-list-of-iris-publishers-4ace353e4eee

Iris publishers google scholar citations: https://scholar.google.co.in/scholar?hl=en&as_sdt=0%2C5&q=irispublishers&btnG=

Search This Blog

Iris Publishers Journal Of Engineering Sciences

Iris Publishers

Comments

Post a Comment

Popular posts from this blog

Gully Erosion and Landslides in Southeastern Nigeria: Causes, Consequences and Control Measures

Iris Publishers

Importance of technology acceptance assessment for successful implementation and development of new technologies