大規模なスパース性を有する線形システムの解法は、計算科学、工学のアプリケーション分野において重要な位置づけであり、これらは一般に直接法や反復法の手法が使用されている。直接法を利用した解法では、信頼性は高いがメモリ容量の問題により大規模な問題が解けないことも多く、従来から反復法を用いた解法が利用されている。反復法には様々な解法が存在するが、ここでは、正定値対称行列を扱える Conjugate Gradient(CG) 法と非対称線形システムを対象とする Bi-Conjugate Gradient Stabilized(BiCGStab)法を CUDA Library(cuSPARSE, cuBLAS) を利用して OpenACC プログラミングモデルで作成したプログラムを例示する。なお、使用言語は Fortran とする。
ここで例示するプログラムのアルゴリズムは以下の参考文献を参照し Fortran で組んだものである。
プログラムは、基本的に反復法のアルゴリズムに対して CUDA cuSPARSE/cuBLAS ライブラリを使用してコーディングする形態となる。OpenACC ディレクティブは、デバイス側のデータとなり得る変数・配列に対してディレクティブにより指示を行うだけの用途で使用する。すなわち、OpenACC により配列、変数のデバイスデータのマネージメントだけを行うこと以外は、一般的な Fortran プログラムとなる。CUDAプログラム言語は一切使用しない。
使用する OpenACC のディレクティブは、デバイスデータが使用されるスコープ範囲を指定する data 構文を使用することと、CUDA ライブラリルーチンをコールする際に渡す実引数が「デバイスポインタ」であることを、コンパイルに伝えるために使用する、host_data use_device ディレクティブの利用の二つとなる。
以下の例は、PGI 17.9 バージョンを使用しての結果である。使用しているバージョンを確かめたい場合は、以下のように -V コマンド・オプションを指定する。
$ pgfortran -V pgfortran 17.9-0 64-bit target on x86-64 Linux -tp haswell PGI Compilers and Tools Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
ここで例示する Fortran プログラムには、(1) 前処理なしの CG アルゴリズムを適用したルーチン(subroutine CG)、(2) 不完全 LU 分解の前処理を施した CG ルーチン(subroutine ICLU_CG)、(3) 不完全コレスキー分解の前処理を施した CG ルーチン(subroutine ICCL_CG) を組み込んだ。いずれも、cuBLAS と cuSPARSE ルーチンを使用して CG 実行を実現している。以下のアルゴリズムは、不完全コレスキー分解の前処理を適用した場合の処理となる。不完全 LU 分解を適用する場合も、分解方法が異なるが基本的に同じ流れとなる。
参照アルゴリズム NVIDIA CUDA Toolkit Documentation
入力する係数行列のストレージ・フォーマットは、CSR 形式としている。この詳細は、NVIDIA cuSPARSE ドキュメントを参考にして欲しい。CSR 形式のデータを用意して、このプログラムに入力するインタフェースを作成すれば、各種 CG 実行が可能である。
係数行列を CSR 形式で表現する際に必要な入力用データ配列は、以下のとおりである。
nz | 整数 | 行列内の非ゼロ要素の数 |
csrValA | 配列 | 行メジャー形式で A のすべての非ゼロ値を保持するための長さ nz のデータ配列 in row-major format. |
csrRowPtrA | 配列 | 配列 csrColIndA および csrValA にインデックスを保持するための長さ m + 1 の整数配列。 この配列の最初の m 個のエントリは、 i = i、...、m の i 行目の最初の非ゼロ要素のインデックスを含み、最後のエントリは nnz + csrRowPtrA(0)を含みます。 一般に、csrRowPtrA(0)は、 0 あるいは 1 から開始されることを表すインデックスベースを表し、これは、それぞれ 0 または 1 となる |
csrColIndA | 配列 | 配列 csrValA の対応する要素の列インデックスを含む長さ nz の整数配列 |
Incomplete-LU preconditioned CG iterative method
Incomplete-Cholesky preconditioned CG iterative method
! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved. ! ! Permission is hereby granted, free of charge, to any person obtaining a ! copy of this software and associated documentation files (the "Software"), ! to deal in the Software without restriction, including without limitation ! the rights to use, copy, modify, merge, publish, distribute, sublicense, ! and/or sell copies of the Software, and to permit persons to whom the ! Software is furnished to do so, subject to the following conditions: ! ! The above copyright notice and this permission notice shall be included in ! all copies or substantial portions of the Software. ! ! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR ! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, ! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL ! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER ! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING ! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER ! DEALINGS IN THE SOFTWARE. ! module precision_m ! integer, parameter, public :: fp_kind = kind(0.0) ! Single precision integer, parameter, public :: fp_kind = selected_real_kind(15,305) ! Double precision end module precision_m module measure_time integer :: clock_start,clock_end,clock_rate character*40 time_tag contains integer function init_clock call system_clock(count_rate=clock_rate) ! Find the rate clock_start = 0; clock_end = 0 end function integer function start_clock call system_clock(count=clock_start) ! Start timing end function integer function end_clock call system_clock(count=clock_end) ! End timing end function integer function print_time(time_tag) character*40 time_tag print '(/1x,a,2x,f10.3,a/)', time_tag, & REAL(clock_end-clock_start,kind(0d0))/REAL(clock_rate,kind(0d0)), " (sec)" end function end module measure_time module CG_routines use precision_m implicit none integer :: nprint integer, parameter :: idbg = 2 integer, parameter :: max_iter = 10000 real(fp_kind), parameter :: tol = 1.0d-12 integer, parameter :: index_base = 1 ! CSR format integer :: nz ! nonzero element of A (m x n ; n = m) integer, allocatable, dimension(:) :: csrRowPtrA ! m + 1 integer, allocatable, dimension(:) :: csrColIndA ! real(fp_kind), allocatable, dimension(:) :: csrValA ! real(fp_kind), allocatable, dimension(:) :: X real(fp_kind), allocatable, dimension(:) :: rhs contains subroutine tridaiag_matgen(n) !------------------------------------------------------------- ! genTridiag: generate a random tridiagonal symmetric matrix ! A matrix : sparse CSR storage format ! N x N symmetric positive definite matrix !------------------------------------------------------------- integer :: n ! size of matrix A integer :: i, js real(fp_kind) :: rnd, rand external rand nz = (n-2)*3 + 4 ! non-zero elements allocate( csrRowPtrA(n+1) ) allocate( csrColIndA(nz) ) allocate( csrValA(nz) ) ! matrix A call random_number(rnd) csrRowPtrA(1) = 1 ! i = 1 csrColIndA(1) = 1 ; csrValA(1) = rnd + 10.0 csrColIndA(2) = 2 ; csrValA(2) = rnd do i = 2, n if ( i > 2 ) then csrRowPtrA(i) = csrRowPtrA(i-1) + 3 else csrRowPtrA(2) = 3 end if js = (i-2)*3 + 2 + index_base csrColIndA(js) = i - 1 csrColIndA(js+1) = i csrColIndA(js+2) = i + 1 call random_number(rnd) csrValA(js) = csrValA(js-1) csrValA(js+1) = rnd + 10.0 if ( i < n ) then call random_number(rnd) csrValA(js+2) = rnd end if end do csrRowPtrA(n+1) = nz + csrRowPtrA(1) ! allocate X, rhs allocate( X(n) ) allocate( rhs(n) ) ! Initilize rhs rhs = 1.0 ! check the number of non-zero element print '(1x, a, 1x, i10, 1x , a, i10)', " Matrix size =", n, " x ", n print *, " Check a number of non-zero element" print *, " nz =", nz end subroutine tridaiag_matgen subroutine laplace_2nd_matgen(n) !------------------------------------------------------------- ! genTridiag: generate a tridiagonal symmetric matrix ! A matrix : sparse CSR storage format ! a second order, regular, Laplacian operator !------------------------------------------------------------- integer :: n ! size of matrix A integer :: n_route integer :: i, idx integer :: ix, iy if ( n /= sqrt(real(n))*sqrt(real(n)) ) stop "N number" nz = 5*N-4*int(sqrt(real(n))) ! non-zero elements allocate( csrRowPtrA(1:n+1) ) allocate( csrColIndA(1:nz) ) allocate( csrValA(1:nz) ) ! allocate X, rhs allocate( rhs(1:n) ) ; rhs = 0.0 ! Initialize RHS n_route = sqrt(real(n)) ! laplace dimension idx = 1 ! matrix A do i = 1, n ix = mod(i-1,n_route) iy = (i-1)/n_route !print *,"ix,iy=",ix,iy csrRowPtrA(i) = idx ! up if ( iy > 0 ) then csrValA(idx) = 1.0 csrColIndA(idx) = i - n_route idx = idx + 1 else rhs(i) = rhs(i) - 1.0 end if ! left if ( ix > 0 ) then csrValA(idx) = 1.0 csrColIndA(idx) = i - 1 idx = idx + 1 else rhs(i) = rhs(i) - 0.0 end if ! center csrValA(idx) = -4.0 csrColIndA(idx) = i idx = idx + 1 ! right if ( ix < n_route-1 ) then csrValA(idx) = 1.0 csrColIndA(idx) = i + 1 idx = idx + 1 else rhs(i) = rhs(i) - 0.0 end if ! down if ( iy < n_route-1 ) then csrValA(idx) = 1.0 csrColIndA(idx) = i + n_route idx = idx + 1 else rhs(i) = rhs(i) - 0.0 end if end do csrRowPtrA(n+1) = idx ! check the number of non-zero element print '(1x, a, 1x, i10, 1x , a, i10)', " Matrix size =", n, " x ", n print *, " Check a number of non-zero element" print *, " idx =", idx, "nz =", nz ! allocate X, rhs allocate( X(n) ) end subroutine laplace_2nd_matgen !============================================== subroutine CG(n) !---------------------------------------------- ! Conjugate gradient without preconditioning !---------------------------------------------- use precision_m use measure_time use cublas_v2 use cusparse implicit none integer :: n ! size of matrix type(cusparseHandle) :: cusparse_H type(cublasHandle) :: cublas_H type(cusparseMatDescr) :: descrA integer :: cusparse_status integer :: cublas_status integer :: status, ierr_code real(fp_kind) :: r0, r1, alpha, alpham1, beta real(fp_kind) :: dot, nalpha real(fp_kind) :: error, rsum, diff real(fp_kind), allocatable, dimension(:) :: r real(fp_kind), allocatable, dimension(:) :: p real(fp_kind), allocatable, dimension(:) :: q real(fp_kind),parameter:: one = 1.0, zero = 0.0, onem1 = -1.0 integer :: i, j, k allocate( r(n), p(n) ) allocate( q(n) ) ! Conjugate gradient without preconditioning print *,"-------------------------------------------------------------" print *,"Conjugate gradient without preconditioning" print *,"-------------------------------------------------------------" ! Initilize X, rhs X = 0.0 ! or Initial guess X is set ! Copy rhs to r (Initial value r for rhs which is "b" vector) do i = 1, n r(i) = rhs(i) ! set r = rhs, where Ax = rhs (Ax = b) end do ! creater handle for cuBLAS status = cublasCreate(cublas_H) ! creater handle for CUSPARSE and matrix A descriptor status = cusparseCreate(cusparse_H) if (status /= CUSPARSE_STATUS_SUCCESS) write(*,*) 'cusparseCreate error: ', status status = cusparseCreateMatDescr(descrA) ! matrix A properties status = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL) status = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE) ! base = 1 alpha = 1.0 beta = 0.0 k = 0 r0 = 0.0 status = init_clock status = start_clock !$acc data copy(r, X) copyin(csrValA, csrRowPtrA, csrColIndA) & !$acc create(p, q) ! === Initial guess x0 and residual r1 = r - Ax0 !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, q, X, r) cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, & n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, X, zero, q) ! q = A x0 cublas_status = cublasDaxpy(cublas_H, n, onem1, q, 1, r, 1) ! r = r - q cublas_status = cublasDdot (cublas_H, n, r, 1, r, 1, r1) ! r1= (r, r) !$acc end host_data do while ( r1 > tol*tol .and. k < max_iter ) k = k + 1 if ( k == 1 ) then !$acc host_data use_device(r,p) cublas_status = cublasDcopy(cublas_H, n, r, 1, p, 1) ! p = r !$acc end host_data else beta = r1/r0; ! beta = r(i) / r(i-1) ! p = r + beta * p !$acc host_data use_device(r,p) cublas_status = cublasDscal(cublas_H, n, beta, p, 1) ! p = beta*p cublas_status = cublasDaxpy(cublas_H, n, one, r, 1, p, 1) ! p = r + p !$acc end host_data end if ! Compute q = A p !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, q ,p) cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, & n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, p, zero, q) cublas_status = cublasDdot(cublas_H, n, p, 1, q, 1, dot) ! dot = p{T} * q !$acc end host_data alpha = r1/dot ! x = x + alpha*p !$acc host_data use_device(p, X) cublas_status = cublasDaxpy(cublas_H, n, alpha, p, 1, X, 1) !$acc end host_data nalpha = -alpha ! r = r - alpha*q !$acc host_data use_device(q, r) cublas_status = cublasDaxpy(cublas_H, n, nalpha, q, 1, r, 1) !$acc end host_data r0 = r1 ! r0 = r1 !$acc host_data use_device(r) cublas_status = cublasDdot(cublas_H, n, r, 1, r, 1, r1) ! r1 = (r, r) !$acc end host_data if (idbg == 1 ) print *, "iteration=", k, " residual =", sqrt(r1) end do !$acc end data status = end_clock print *, "iteration=", k, " residual =", sqrt(r1) if ( sqrt(r1) < tol ) then print *, "CG without preconditioning test is passed" else print *, "CG without preconditioning test is failed" end if time_tag = "Elapased time CG without preconditioning : " status = print_time(time_tag) ! print *, "X=",X ! check result error = 0.0 do i = 1, n rsum = 0.0 do j = csrRowPtrA(i), csrRowPtrA(i+1)-1 rsum = rsum + csrValA(j) * X(csrColIndA(j)) end do diff = abs(rsum - rhs(i)) if (diff > error) error = diff end do print *,"error =",error ! output result if (idbg == 2) then nprint = n if ( n > 10 ) nprint = 10 do i = 1, nprint print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i) end do print *,"" do i = n-nprint, n print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i) end do end if status = cublasDestroy(cublas_H) status = cusparseDestroy(cusparse_H) deallocate( q, r, p ) end subroutine CG !====================================================== subroutine ICLU_CG(n) !------------------------------------------------------ ! Incomplete-LU preconditioned CG -- ILU(0) ! 0 fill-in and no pivoting ! A is an m × m sparse matrix defined in CSR storage format !------------------------------------------------------ use precision_m use measure_time use cublas_v2 use cusparse implicit none integer :: n ! size of matrix type(cusparseHandle) :: cusparse_H type(cublasHandle) :: cublas_H type(cusparseMatDescr) :: descr2 type(cusparseMatDescr) :: descrL type(cusparseMatDescr) :: descrU type(cusparseSolveAnalysisInfo) :: infoSolA, infoILU integer :: cusparse_status integer :: cublas_status integer :: status, ierr_code real(fp_kind) :: r1, alpha, alpham1, beta real(fp_kind) :: abc, efg real(fp_kind) :: nalpha real(fp_kind) :: error, rsum, diff real(fp_kind), allocatable, dimension(:) :: valsILU0 real(fp_kind), allocatable, dimension(:) :: Lov, z, r_p, z_p real(fp_kind), allocatable, dimension(:) :: r real(fp_kind), allocatable, dimension(:) :: p real(fp_kind), allocatable, dimension(:) :: q real(fp_kind),parameter:: one = 1.0, zero = 0.0, onem1 = -1.0 integer :: i, j, k allocate( r(n), p(n) ) allocate( q(n) ) allocate( valsILU0(nz) ) allocate( Lov(n), z(n), r_p(n) ,z_p(n) ) ! Conjugate gradient using incomplete-LU print '(/1x,a)',"-------------------------------------------------------------" print '(1x,a)', "Preconditioned Conjugate Gradient using incomplete-LU(0)" print *, "-------------------------------------------------------------" ! Initilize X, rhs X = 0.0 ! or Initial guess X is set ! Copy rhs to r (Initial value r for rhs which is "b" vector) do i = 1, n r(i) = rhs(i) ! set r = rhs, where Ax = rhs (Ax = b) end do alpha = 1.0 beta = 0.0 k = 0 ! Clock status = init_clock status = start_clock ! creater handle for cuBLAS status = cublasCreate(cublas_H) ! creater handle for CUSPARSE status = cusparseCreate(cusparse_H) ! matrix A descriptor(descr2) and properties status = cusparseCreateMatDescr(descr2) status = cusparseSetMatType(descr2, CUSPARSE_MATRIX_TYPE_GENERAL) ! must be GENERAL for using csrilu0 status = cusparseSetMatIndexBase(descr2, CUSPARSE_INDEX_BASE_ONE) ! base = 1 ! initiialize the analysis info for A matrix (InfoSolA = sparsity pattern of A) status = cusparseCreateSolveAnalysisInfo(infoSolA) ! initiialize the ILU0 preconditioner Info (infoILU) status = cusparseCreateSolveAnalysisInfo(infoILU) ! Setup for Lower Matrix status = cusparseCreateMatDescr (descrL) status = cusparseSetMatType (descrL, CUSPARSE_MATRIX_TYPE_GENERAL) status = cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ONE) status = cusparseSetMatFillMode (descrL, CUSPARSE_FILL_MODE_LOWER) status = cusparseSetMatDiagType (descrL, CUSPARSE_DIAG_TYPE_UNIT) ! Setup for Upper Matrix status = cusparseCreateMatDescr (descrU) status = cusparseSetMatType (descrU, CUSPARSE_MATRIX_TYPE_GENERAL) status = cusparseSetMatIndexBase(descrU, CUSPARSE_INDEX_BASE_ONE) status = cusparseSetMatFillMode (descrU, CUSPARSE_FILL_MODE_UPPER) status = cusparseSetMatDiagType (descrU, CUSPARSE_DIAG_TYPE_NON_UNIT) ! copy original vals to new ILU0-vals valsILU0 = csrValA !$acc data copy(r, X) copyin(valsILU0, csrValA, csrRowPtrA, csrColIndA) & !$acc create(p, q, Lov, z, r_p, z_p) ! === Initial guess x0 and residual r1 = r - Ax0 !$acc host_data use_device( csrValA, csrRowPtrA, csrColIndA, X, q, r) cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, & n, n, nz, one, descr2, csrValA, csrRowPtrA, csrColIndA, X, zero, q) ! q = A x0 status = cublasDaxpy(cublas_H, n, onem1, q, 1, r, 1) ! r = r - q status = cublasDdot (cublas_H, n, r, 1, r, 1, r1) ! r1 = (r, r) !$acc end host_data !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, valsILU0) ! the analysis of the sparsity pattern for the Non-Transpose case (triangular linear system) infoSolA/descr2 cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, & n, nz, descr2, csrValA, csrRowPtrA, csrColIndA, infoSolA) ! the analysis phase of the solution of a sparse linear system for infoILU0/descrU cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, & n, nz, descrU, csrValA, csrRowPtrA, csrColIndA, infoILU) ! Incomplete-LU factorization with 0 level fill-in and no pivoting for A matrix ! output : valsILU0 cusparse_status = cusparseDcsrilu0(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, descr2, valsILU0, & csrRowPtrA, csrColIndA, infoSolA); !$acc end host_data status = end_clock time_tag = "The analysis phase of ILU(0)" status = print_time(time_tag) status = start_clock do while ( r1 > tol*tol .and. k <= max_iter ) ! Solve using valsILU0 vector from Incomplete-LU factorization !$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, r, Lov, z) ! Forward Solve, L part for infoSolA(the sparsity pattern of A = descL) cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, & valsILU0, csrRowPtrA, csrColIndA, infoSolA, r, Lov) ! Back Substitution (U part for infoILU = descU) cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, & valsILU0, csrRowPtrA, csrColIndA, infoILU, Lov, z) ! solve z !$acc end host_data k = k + 1 if ( k == 1 ) then !$acc host_data use_device(z, p) cublas_status = cublasDcopy(cublas_H, n, z, 1, p, 1) ! p = z !$acc end host_data else !$acc host_data use_device(r, z, r_p, z_p) cublas_status = cublasDdot(cublas_H, n, r, 1, z, 1, abc) ! abc = (r, z) r(i) cublas_status = cublasDdot(cublas_H, n, r_p, 1, z_p, 1, efg) ! efg = (r_p, z_p) r(i-1) !$acc end host_data beta = abc / efg ! beta = r(i) / r(i-1) ! p = z + beta * p !$acc host_data use_device(z,p) cublas_status = cublasDscal(cublas_H, n, beta, p, 1) ! p = beta*p cublas_status = cublasDaxpy(cublas_H, n, one, z, 1, p, 1) ! p = z + p !$acc end host_data end if ! Compute q = A p !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, r, z, p, q) cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, & n, n, nz, one, descrU, csrValA, csrRowPtrA, csrColIndA, p, zero, q); cublas_status = cublasDdot(cublas_H, n, r, 1, z, 1, abc) ! abc = (r, z) cublas_status = cublasDdot(cublas_H, n, p, 1, q, 1, efg) ! efg = (p, q) !$acc end host_data alpha = abc / efg nalpha = -alpha !$acc host_data use_device(p, X, r, r_p, z, z_p, q) cublas_status = cublasDaxpy(cublas_H, n, alpha, p, 1, X, 1) ! X = X + alpha * p cublas_status = cublasDcopy(cublas_H, n, r, 1, r_p, 1) ! r_p = r cublas_status = cublasDcopy(cublas_H, n, z, 1, z_p, 1) ! z_p = z cublas_status = cublasDaxpy(cublas_H, n, nalpha, q, 1, r, 1) ! r = r - alpha * q cublas_status = cublasDdot (cublas_H, n, r, 1, r, 1, r1) ! r1 = (r, r) !$acc end host_data if (idbg == 1 ) print *, "iteration=", k, " residual =", sqrt(r1) end do !$acc end data print *, "iteration=", k, " residual =", sqrt(r1) if ( sqrt(r1) < tol ) then print *, "Preconditioned Conjugate Gradient using ILU is passed" else print *, "Preconditioned Conjugate Gradient using ILU is failed" end if status = end_clock time_tag = "Elapased time for GG with ILU(0) :" status = print_time(time_tag) ! check result error = 0.0 do i = 1, n rsum = 0.0 do j = csrRowPtrA(i), csrRowPtrA(i+1)-1 rsum = rsum + csrValA(j) * X(csrColIndA(j)) end do diff = abs(rsum - rhs(i)) if (diff > error) error = diff end do print *,"error =",error ! output result if (idbg == 2) then nprint = n if ( n > 10 ) nprint = 10 do i = 1, nprint print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i) end do print *,"" do i = n-nprint, n print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i) end do end if status = cublasDestroy(cublas_H) status = cusparseDestroy(cusparse_H) status = cusparseDestroyMatDescr(descr2) status = cusparseDestroyMatDescr(descrL) status = cusparseDestroyMatDescr(descrU) status = cusparseDestroySolveAnalysisInfo(infoSolA) status = cusparseDestroySolveAnalysisInfo(infoILU) deallocate( q, r, p ) deallocate( Lov, z, valsILU0, r_p, z_p ) end subroutine ICLU_CG !====================================================== subroutine ICCL_CG(n) !------------------------------------------------------ ! Incomplete-Cholesky preconditioned CG -- IC(0) ! 0 fill-in and no pivoting ! A is an m × m Hermitian/symmetric positive definite sparse matrix ! defined in CSR storage format !------------------------------------------------------ use precision_m use measure_time use cublas_v2 use cusparse implicit none integer :: n ! size of matrix type(cusparseHandle) :: cusparse_H type(cublasHandle) :: cublas_H type(cusparseMatDescr) :: descr3 type(cusparseMatDescr) :: descrL type(cusparseSolveAnalysisInfo) :: infoSolA, infoCL, infoCLt integer :: cusparse_status integer :: cublas_status integer :: status, ierr_code real(fp_kind) :: r1, alpha, alpham1, beta real(fp_kind) :: abc, efg real(fp_kind) :: nalpha real(fp_kind) :: error, rsum, diff real(fp_kind), allocatable, dimension(:) :: valsCL real(fp_kind), allocatable, dimension(:) :: Lov, z, r_p, z_p real(fp_kind), allocatable, dimension(:) :: r real(fp_kind), allocatable, dimension(:) :: p real(fp_kind), allocatable, dimension(:) :: q real(fp_kind),parameter:: one = 1.0, zero = 0.0, onem1 = -1.0 integer :: i, j, k allocate( r(n), p(n) ) allocate( q(n) ) allocate( valsCL(nz) ) allocate( Lov(n), z(n), r_p(n) ,z_p(n) ) ! Conjugate gradient using incomplete-Cholesky for Symmetric matrix print '(/1x,a)',"-------------------------------------------------------------" print '(1x,a)', "Preconditioned Conjugate Gradient using incomplete-Cholesky(0)" print *, "-------------------------------------------------------------" ! Initilize X, rhs X = 0.0 ! or Initial guess X is set ! Copy rhs to r (Initial value r for rhs which is "b" vector) do i = 1, n r(i) = rhs(i) ! set r = rhs, where Ax = rhs (Ax = b) end do alpha = 1.0 beta = 0.0 k = 0 ! Clock status = init_clock status = start_clock ! creater handle for cuBLAS status = cublasCreate(cublas_H) ! creater handle for CUSPARSE status = cusparseCreate(cusparse_H) ! matrix A descriptor(descr3) and properties status = cusparseCreateMatDescr(descr3) status = cusparseSetMatType(descr3, CUSPARSE_MATRIX_TYPE_GENERAL) ! descr3 as a "GENERAL" or "SYMMETRIC" status = cusparseSetMatIndexBase(descr3, CUSPARSE_INDEX_BASE_ONE) ! base = 1 ! CAUTION - this program don't generate a lower triangular data for the matrix A ! So, this program don't use a lower triangular data for the incomplete cholesky preconditioner ! initiialize the analysis info for A matrix (InfoSolA = sparsity pattern of A) status = cusparseCreateSolveAnalysisInfo(infoSolA) ! initiialize the CL(0) preconditioner Info status = cusparseCreateSolveAnalysisInfo(infoCL) status = cusparseCreateSolveAnalysisInfo(infoCLt) ! Setup for Lower Matrix status = cusparseCreateMatDescr (descrL) status = cusparseSetMatType (descrL, CUSPARSE_MATRIX_TYPE_GENERAL) ! not TRIANGULAR status = cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ONE) status = cusparseSetMatFillMode (descrL, CUSPARSE_FILL_MODE_LOWER) status = cusparseSetMatDiagType (descrL, CUSPARSE_DIAG_TYPE_NON_UNIT) ! copy original vals to new vals-CL valsCL = csrValA !$acc data copy(r, X) copyin(valsCL, csrValA, csrRowPtrA, csrColIndA) & !$acc create(p, q, Lov, z, r_p, z_p) ! === Initial guess x0 and r1 = r - Ax0 !$acc host_data use_device( csrValA, csrRowPtrA, csrColIndA, X, q, r) cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, & n, n, nz, one, descr3, csrValA, csrRowPtrA, csrColIndA, X, zero, q); status = cublasDaxpy(cublas_H, n, onem1, q, 1, r, 1) status = cublasDdot (cublas_H, n, r, 1, r, 1, r1) ! Initial r1 = (r, r) !$acc end host_data !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA) ! the analysis phase of the solution of a sparse linear system for infoCL/descrL cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, & n, nz, descrL, csrValA, csrRowPtrA, csrColIndA, infoCL) cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_TRANSPOSE, & n, nz, descrL, csrValA, csrRowPtrA, csrColIndA, infoCLt) !$acc end host_data status = end_clock time_tag = "The analysis phase of IC(0)" status = print_time(time_tag) status = start_clock do while ( r1 > tol*tol .and. k <= max_iter ) ! Solve using valsCL vector from Incomplete-Cholesky factorization M z = r !$acc host_data use_device(valsCL, csrRowPtrA, csrColIndA, r, Lov, z) ! infoCLt triangular solve cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_TRANSPOSE, n, one, descrL, & valsCL, csrRowPtrA, csrColIndA, infoCLt, r, Lov) ! InfoCL triangular solve cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, & valsCL, csrRowPtrA, csrColIndA, infoCL, Lov, z) ! solve z !$acc end host_data k = k + 1 if ( k == 1 ) then !$acc host_data use_device(z, p) cublas_status = cublasDcopy(cublas_H, n, z, 1, p, 1) ! p = z !$acc end host_data else !$acc host_data use_device(r, z, r_p, z_p) cublas_status = cublasDdot(cublas_H, n, r, 1, z, 1, abc) ! abc = (r, z) r(i) cublas_status = cublasDdot(cublas_H, n, r_p, 1, z_p, 1, efg) ! efg = (r_p, z_p) r(i-1) !$acc end host_data beta = abc / efg ! beta = r(i) / r(i-1) ! p = z + beta * p !$acc host_data use_device(z,p) cublas_status = cublasDscal(cublas_H, n, beta, p, 1) ! p = beta*p cublas_status = cublasDaxpy(cublas_H, n, one, z, 1, p, 1) ! p = z + p !$acc end host_data end if ! Compute q = A p !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, r, z, p, q) cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, & n, n, nz, one, descr3, csrValA, csrRowPtrA, csrColIndA, p, zero, q); cublas_status = cublasDdot(cublas_H, n, r, 1, z, 1, abc) ! abc = (r, z) cublas_status = cublasDdot(cublas_H, n, p, 1, q, 1, efg) ! efg = (p, q) !$acc end host_data alpha = abc / efg nalpha = -alpha !$acc host_data use_device(p, X, r, r_p, z, z_p, q) cublas_status = cublasDaxpy(cublas_H, n, alpha, p, 1, X, 1) ! X = X + alpha * p cublas_status = cublasDcopy(cublas_H, n, r, 1, r_p, 1) ! r_p = r cublas_status = cublasDcopy(cublas_H, n, z, 1, z_p, 1) ! z_p = z cublas_status = cublasDaxpy(cublas_H, n, nalpha, q, 1, r, 1) ! r = r - alpha * q cublas_status = cublasDdot (cublas_H, n, r, 1, r, 1, r1) ! r1 = (r, r) !$acc end host_data if (idbg == 1 ) print *, "iteration=", k, " residual =", sqrt(r1) end do !$acc end data print *, "iteration=", k, " residual =", sqrt(r1) if ( sqrt(r1) < tol ) then print *, "Preconditioned Conjugate Gradient using cholesky is passed" else print *, "Preconditioned Conjugate Gradient using cholesky is failed" end if status = end_clock time_tag = "Elapased time for GG with IC(0) :" status = print_time(time_tag) ! check result error = 0.0 do i = 1, n rsum = 0.0 do j = csrRowPtrA(i), csrRowPtrA(i+1)-1 rsum = rsum + csrValA(j) * X(csrColIndA(j)) end do diff = abs(rsum - rhs(i)) if (diff > error) error = diff end do print *,"error =",error ! output result if (idbg == 2) then nprint = n if ( n > 10 ) nprint = 10 do i = 1, nprint print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i) end do print *,"" do i = n-nprint, n print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i) end do end if status = cublasDestroy(cublas_H) status = cusparseDestroy(cusparse_H) status = cusparseDestroyMatDescr(descr3) status = cusparseDestroyMatDescr(descrL) status = cusparseDestroySolveAnalysisInfo(infoSolA) status = cusparseDestroySolveAnalysisInfo(infoCL) status = cusparseDestroySolveAnalysisInfo(infoCLt) deallocate( q, r, p ) deallocate( Lov, z, valsCL, r_p, z_p ) end subroutine ICCL_CG End module CG_routines program main use precision_m use CG_routines implicit none integer :: num_mat integer :: i, j !------------------------------------------------------------- ! A matrix : sparse CSR type matrix ! N x N symmetric positive definite matrix (num_mat) !------------------------------------------------------------- num_mat = 16 num_mat = 16384 ! num_mat = 1048576 ! num_mat = 4000000 print '(/1x,a,d15.10,2x,a,i8/)'," tol = ", tol, "max_iteration =", max_iter ! Select matrix data ! call tridaiag_matgen (num_mat) call laplace_2nd_matgen (num_mat) ! Print CSR Matrix data ! print *, "csrValA", csrValA ! print *, "csrRowPtrA",csrRowPtrA ! print *, "csrColIndA", csrColIndA ! Conjugate gradient without preconditioning call CG (num_mat) ! Preconditioned Conjugate Gradient using Incomplete-LU ! factorization with 0 fill-in and no pivoting call ICLU_CG (num_mat) ! Preconditioned Conjugate Gradient using Incomplete-Cholesky ! factorization call ICCL_CG (num_mat) end program main
コンパイルオプションとして、-acc -Mcudalib=cusparse,cublas -Mcuda オプションを付けてコンパイルリンクする必要がある。例示しているプログラムで使用しているテスト用の係数行列は、極めて収束性の良いデータを使用しているため、前処理なしの CG 計算が速い結果となっているが、実用上での係数行列の特性は固有値分布が様々なため、 CG 計算では収束出来ない場合も生じる。また、「前処理」を実施したとしても係数行列の素性によっては収束できない場合も多い。上記のプログラムは、OpenACC + CUDA library を使用して CG 法の実装モデルを示したにすぎない。実際の利用としては、対称行列に対象を限った不完全コレスキー分解の前処理付きルーチンよりも、不完全 LU 分解の前処理を施した CG ルーチン(subroutine ICLU_CG)を推奨する。あるいは、この後説明する BiCGStab 法を実装した CG ルーチン を利用する方が良いだろう。なお、PGI 17.9から CUDA Toolkit 9.0 ルーチンがコンパイラシステムに統合されたため、-ta=cuda9.0 を指定した。この指定は、cuda7.5 or cuda8.0 でも構わない。
$ pgf90 main.f90 -Mcudalib=cublas,cusparse -acc -ta=tesla,cc60,cuda9.0,cc35 -Minfo=accel -Mcuda -O2
cg:
255, Generating create(q(:))
Generating copy(r(:),x(:))
Generating copyin(csrrowptra(:),csrcolinda(:),csrvala(:))
Generating create(p(:))
iclu_cg:
457, Generating create(q(:))
Generating copyin(valsilu0(:))
Generating create(r_p(:))
Generating copyin(csrrowptra(:),csrcolinda(:))
Generating create(lov(:),p(:))
Generating copyin(csrvala(:))
Generating copy(x(:),r(:))
Generating create(z_p(:),z(:))
iccl_cg:
691, Generating create(q(:))
Generating copyin(valscl(:))
Generating create(r_p(:))
Generating copyin(csrrowptra(:),csrcolinda(:))
Generating create(lov(:),p(:))
Generating copyin(csrvala(:))
Generating copy(x(:),r(:))
Generating create(z_p(:),z(:))
[kato@photon32 CG]$ a.out
tol = .1000000000D-11 max_iteration = 10000
Matrix size = 1048576 x 1048576
Check a number of non-zero element
idx = 5238785 nz = 5238784
-------------------------------------------------------------
Conjugate gradient without preconditioning
-------------------------------------------------------------
iteration= 3531 residual = 9.9759469342430417E-013
CG without preconditioning test is passed
Elapased time CG without preconditioning 2.616 (sec)
error = 3.7858605139717838E-014
x( 1)=0.499998958505E+00
x( 2)=0.697650643317E+00
x( 3)=0.790607780743E+00
x( 4)=0.841500545499E+00
x( 5)=0.872868905925E+00
x( 6)=0.893963985576E+00
x( 7)=0.909079245320E+00
x( 8)=0.920430160460E+00
x( 9)=0.929263761574E+00
x( 10)=0.936332649288E+00
x(1048566)=0.114542005659E-04
x(1048567)=0.104132673657E-04
x(1048568)=0.937223195617E-05
x(1048569)=0.833110455416E-05
x(1048570)=0.728989537764E-05
x(1048571)=0.624861464650E-05
x(1048572)=0.520727258128E-05
x(1048573)=0.416587940269E-05
x(1048574)=0.312444533254E-05
x(1048575)=0.208298059361E-05
x(1048576)=0.104149540857E-05
-------------------------------------------------------------
Preconditioned Conjugate Gradient using incomplete-LU(0)
-------------------------------------------------------------
The analysis phase of ILU(0) 0.151 (sec)
iteration= 1237 residual = 9.8237116899063428E-013
Preconditioned Conjugate Gradient using ILU is passed
Elapased time for GG with ILU(0) : 24.738 (sec)
error = 2.3092638912203256E-014
x( 1)=0.499998958505E+00
x( 2)=0.697650643317E+00
x( 3)=0.790607780743E+00
x( 4)=0.841500545499E+00
x( 5)=0.872868905925E+00
x( 6)=0.893963985576E+00
x( 7)=0.909079245320E+00
x( 8)=0.920430160460E+00
x( 9)=0.929263761574E+00
x( 10)=0.936332649288E+00
x(1048566)=0.114542005653E-04
x(1048567)=0.104132673654E-04
x(1048568)=0.937223195581E-05
x(1048569)=0.833110455365E-05
x(1048570)=0.728989537732E-05
x(1048571)=0.624861464626E-05
x(1048572)=0.520727258084E-05
x(1048573)=0.416587940225E-05
x(1048574)=0.312444533231E-05
x(1048575)=0.208298059340E-05
x(1048576)=0.104149540829E-05
-------------------------------------------------------------
Preconditioned Conjugate Gradient using incomplete-Cholesky(0)
-------------------------------------------------------------
The analysis phase of IC(0) 0.138 (sec)
iteration= 1468 residual = 9.9588119511240996E-013
Preconditioned Conjugate Gradient using cholesky is passed
Elapased time for GG with IC(0) : 29.747 (sec)
error = 2.4313884239290928E-014
x( 1)=0.499998958505E+00
x( 2)=0.697650643317E+00
x( 3)=0.790607780743E+00
x( 4)=0.841500545499E+00
x( 5)=0.872868905925E+00
x( 6)=0.893963985576E+00
x( 7)=0.909079245320E+00
x( 8)=0.920430160460E+00
x( 9)=0.929263761574E+00
x( 10)=0.936332649288E+00
x(1048566)=0.114542005654E-04
x(1048567)=0.104132673654E-04
x(1048568)=0.937223195580E-05
x(1048569)=0.833110455365E-05
x(1048570)=0.728989537735E-05
x(1048571)=0.624861464631E-05
x(1048572)=0.520727258092E-05
x(1048573)=0.416587940233E-05
x(1048574)=0.312444533239E-05
x(1048575)=0.208298059346E-05
x(1048576)=0.104149540832E-05
次に、非対称および対称正定値(spd)線形システムに適用できる、非定常反復法の一つである Bi-Conjugate Gradient Stabilized (BiCGStab) を cuBLAS ならびに cuSPARSE ルーチンを利用して作成した Fortran プログラムを紹介する。OpenACC ディレクティブを使用して、デバイス側のデータとなり得る変数・配列引数を指定するだけで、GPU上で動作するコードが作成できる。CUDA C/Fortran 言語で記述する際のようなデバイス用の変数、配列の割付等は一切不要である。実装アルゴリズムは、以下の NVIDIA Incomplete-LU and Cholesky Preconditioned Iterative Methods Using cuSPARSE and cuBLAS で説明されている内容を適用した。
反復法の収束性を早めるために、不完全 LU 分解による前処理を施している。係数行列の不完全 LU 分解は、0 level fill-in and no pivoting の cusparseDcsrilu0 ルーチンを使用する。これによって、反復計算前に、線形システムの係数行列のスペクトルを変更する。
入力する係数行列のストレージ・フォーマットは、CSR 形式としている。この詳細は、上述した。CSR 形式のデータを用意して、このプログラムに入力するインタフェースを作成すれば、BiCGStab 実行が可能である。
参照アルゴリズム NVIDIA CUDA Toolkit Documentation
! Copyright (c) 2017, Tsutomu Kato, SofTek Systems, Inc. All rights reserved. ! ! ! Permission is hereby granted, free of charge, to any person obtaining a ! copy of this software and associated documentation files (the "Software"), ! to deal in the Software without restriction, including without limitation ! the rights to use, copy, modify, merge, publish, distribute, sublicense, ! and/or sell copies of the Software, and to permit persons to whom the ! Software is furnished to do so, subject to the following conditions: ! ! The above copyright notice and this permission notice shall be included in ! all copies or substantial portions of the Software. ! ! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR ! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, ! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL ! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER ! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING ! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER ! DEALINGS IN THE SOFTWARE. ! module precision_m ! integer, parameter, public :: fp_kind = kind(0.0) ! Single precision integer, parameter, public :: fp_kind = selected_real_kind(15,305) ! Double precision end module precision_m module measure_time integer :: clock_start,clock_end,clock_rate character*40 time_tag contains integer function init_clock call system_clock(count_rate=clock_rate) ! Find the rate clock_start = 0; clock_end = 0 end function integer function start_clock call system_clock(count=clock_start) ! Start timing end function integer function end_clock call system_clock(count=clock_end) ! End timing end function integer function print_time(time_tag) character*40 time_tag print '(/1x,a,2x,f10.3,a/)', time_tag, & REAL(clock_end-clock_start,kind(0d0))/REAL(clock_rate,kind(0d0)), " (sec)" end function end module measure_time module CG_routines use precision_m implicit none integer :: nprint integer, parameter :: idbg = 2 integer, parameter :: max_iter = 10000 real(fp_kind), parameter :: tol = 1.0d-13 integer, parameter :: index_base = 1 integer :: nz ! nonzero element of A (m x n ; n = m) integer, allocatable, dimension(:) :: csrRowPtrA ! m + 1 integer, allocatable, dimension(:) :: csrColIndA ! real(fp_kind), allocatable, dimension(:) :: csrValA ! real(fp_kind), allocatable, dimension(:) :: X real(fp_kind), allocatable, dimension(:) :: rhs contains subroutine tridaiag_matgen(n) !------------------------------------------------------------- ! genTridiag: generate a random tridiagonal symmetric matrix ! A matrix : sparse CSR storage format ! N x N symmetric positive definite matrix !------------------------------------------------------------- integer :: n ! size of matrix A integer :: i, js real(fp_kind) :: rnd, rand external rand nz = (n-2)*3 + 4 ! non-zero elements allocate( csrRowPtrA(n+1) ) allocate( csrColIndA(nz) ) allocate( csrValA(nz) ) ! matrix A call random_number(rnd) csrRowPtrA(1) = 1 ! i = 1 csrColIndA(1) = 1 ; csrValA(1) = rnd + 10.0 csrColIndA(2) = 2 ; csrValA(2) = rnd do i = 2, n if ( i > 2 ) then csrRowPtrA(i) = csrRowPtrA(i-1) + 3 else csrRowPtrA(2) = 3 end if js = (i-2)*3 + 2 + index_base csrColIndA(js) = i - 1 csrColIndA(js+1) = i csrColIndA(js+2) = i + 1 call random_number(rnd) csrValA(js) = csrValA(js-1) csrValA(js+1) = rnd + 10.0 if ( i < n ) then call random_number(rnd) csrValA(js+2) = rnd end if end do csrRowPtrA(n+1) = nz + csrRowPtrA(1) ! allocate X, rhs allocate( X(n) ) allocate( rhs(n) ) ! Initilize rhs rhs = 1.0 ! check the number of non-zero element print '(1x, a, 1x, i10, 1x , a, i10)', " Matrix size =", n, " x ", n print *, " Check a number of non-zero element" print *, " nz =", nz end subroutine tridaiag_matgen subroutine laplace_2nd_matgen(n) !------------------------------------------------------------- ! genTridiag: generate a tridiagonal symmetric matrix ! A matrix : sparse CSR storage format ! a second order, regular, Laplacian operator !------------------------------------------------------------- integer :: n ! size of matrix A integer :: n_route integer :: i, idx integer :: ix, iy if ( n /= sqrt(real(n))*sqrt(real(n)) ) stop "N number" nz = 5*N-4*int(sqrt(real(n))) ! non-zero elements allocate( csrRowPtrA(1:n+1) ) allocate( csrColIndA(1:nz) ) allocate( csrValA(1:nz) ) ! allocate rhs allocate( rhs(1:n) ) ; rhs =0.0 ! Initialize RHS n_route = sqrt(real(n)) ! laplace dimension idx = 1 ! matrix A do i = 1, n ix = mod(i-1,n_route) iy = (i-1)/n_route !print *,"ix,iy=",ix,iy csrRowPtrA(i) = idx ! up if ( iy > 0 ) then csrValA(idx) = 1.0 csrColIndA(idx) = i - n_route idx = idx + 1 else rhs(i) = rhs(i) - 1.0 end if ! left if ( ix > 0 ) then csrValA(idx) = 1.0 csrColIndA(idx) = i - 1 idx = idx + 1 else rhs(i) = rhs(i) - 0.0 end if ! center csrValA(idx) = -4.0 csrColIndA(idx) = i idx = idx + 1 ! right if ( ix < n_route-1 ) then csrValA(idx) = 1.0 csrColIndA(idx) = i + 1 idx = idx + 1 else rhs(i) = rhs(i) - 0.0 end if ! down if ( iy < n_route-1 ) then csrValA(idx) = 1.0 csrColIndA(idx) = i + n_route idx = idx + 1 else rhs(i) = rhs(i) - 0.0 end if end do csrRowPtrA(n+1) = idx ! check the number of non-zero element print '(1x, a, 1x, i10, 1x , a, i10)', " Matrix size =", n, " x ", n print *, " Check a number of non-zero element" print *, " idx =", idx, "nz =", nz ! allocate X allocate( X(n) ) end subroutine laplace_2nd_matgen !============================================================= subroutine BiCGStab(n) !------------------------------------------------------------- ! Bi-Conjugate Gradient Stabilized (BiCGStab) ! A is an m × m sparse matrix defined in CSR storage format !------------------------------------------------------------- use precision_m use measure_time use cublas_v2 use cusparse implicit none integer :: n ! size of matrix type(cusparseHandle) :: cusparse_H type(cublasHandle) :: cublas_H type(cusparseMatDescr) :: descrA type(cusparseMatDescr) :: descrL type(cusparseMatDescr) :: descrU type(cusparseSolveAnalysisInfo) :: infoA, infoL, infoU integer :: cusparse_status integer :: cublas_status integer :: status, ierr_code real(fp_kind) :: alpha, alpham1, beta, omega, omegam1 real(fp_kind) :: rho, rho_p real(fp_kind) :: nrmr, nrmr0, temp1, temp2 real(fp_kind) :: error, rsum, diff real(fp_kind), allocatable, dimension(:) :: valsILU0 real(fp_kind), allocatable, dimension(:) :: Lov, z real(fp_kind), allocatable, dimension(:) :: r, r_tlda real(fp_kind), allocatable, dimension(:) :: p real(fp_kind), allocatable, dimension(:) :: q real(fp_kind), allocatable, dimension(:) :: t, s real(fp_kind),parameter:: one = 1.0, zero = 0.0, onem1 = -1.0 integer :: i, j, k allocate( r(n), p(n), r_tlda(n) ) allocate( q(n) ) allocate( valsILU0(nz) ) allocate( Lov(n), z(n), t(n) , s(n) ) ! Bi-Conjugate Gradient Stabilized (BiCGStab) using incomplete-LU print '(/1x,a)',"-------------------------------------------------------------" print '(1x,a)', " Bi-Conjugate Gradient Stabilized (BiCGStab)" print *, "-------------------------------------------------------------" ! Initilize X, rhs X = 0.0 ! or Initial guess X is set ! Copy rhs to r (Initial value r for rhs which is "b" vector) do i = 1, n r(i) = rhs(i) ! set r = rhs, where Ax = rhs (Ax = b) end do alpha = 1.0 beta = 0.0 rho = 0.0 k = 0 ! Clock status = init_clock status = start_clock ! creater handle for cuBLAS status = cublasCreate(cublas_H) ! creater handle for CUSPARSE status = cusparseCreate(cusparse_H) ! matrix A descriptor(descrA) and properties status = cusparseCreateMatDescr(descrA) status = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL) ! must be GENERAL for using csrilu0 status = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE) ! base = 1 ! initiialize the analysis info for A matrix (infoA = sparsity pattern of A) status = cusparseCreateSolveAnalysisInfo(infoA) ! initiialize the L U preconditioner Info status = cusparseCreateSolveAnalysisInfo(infoL) status = cusparseCreateSolveAnalysisInfo(infoU) ! Setup for Lower Matrix status = cusparseCreateMatDescr (descrL) status = cusparseSetMatType (descrL, CUSPARSE_MATRIX_TYPE_GENERAL) status = cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ONE) status = cusparseSetMatFillMode (descrL, CUSPARSE_FILL_MODE_LOWER) status = cusparseSetMatDiagType (descrL, CUSPARSE_DIAG_TYPE_UNIT) ! Setup for Upper Matrix status = cusparseCreateMatDescr (descrU) status = cusparseSetMatType (descrU, CUSPARSE_MATRIX_TYPE_GENERAL) status = cusparseSetMatIndexBase(descrU, CUSPARSE_INDEX_BASE_ONE) status = cusparseSetMatFillMode (descrU, CUSPARSE_FILL_MODE_UPPER) status = cusparseSetMatDiagType (descrU, CUSPARSE_DIAG_TYPE_NON_UNIT) ! copy original vals to new ILU0-vals valsILU0 = csrValA !print *,"X=",X !print *,"r_init=",r !$acc data copy(r, X) copyin(valsILU0, csrValA, csrRowPtrA, csrColIndA ) & !$acc create(p, q, Lov, z, r_tlda ) create(t, s) ! === Initial guess x0 and residual r = b - Ax0 where r = b !$acc host_data use_device( csrValA, csrRowPtrA, csrColIndA, X, r, r_tlda, p ) cusparse_status = cusparseDcsrmv(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, & n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, X, zero, q) ! q = A x0, q is temporary status = cublasDaxpy(cublas_H, n, onem1, q, 1, r, 1) ! r = r - A x0 ! set as r_tlda = r and p = r status = cublasDcopy(cublas_H, n, r, 1, r_tlda, 1) ! r_tlda = r status = cublasDcopy(cublas_H, n, r, 1, p, 1) ! p = r status = cublasDnrm2(cublas_H, n, r, 1, nrmr0) ! nrmr0 = ||r||2 !$acc end host_data print *,"nrmr0=",nrmr0, "tol*nrmr0=",tol*nrmr0 !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, valsILU0) ! the analysis of the sparsity pattern for the Non-Transpose case (triangular linear system) infoL/descrL cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, & n, nz, descrL, csrValA, csrRowPtrA, csrColIndA, infoL) ! the analysis phase of the solution of a sparse linear system for infoU/descrU cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, & n, nz, descrU, csrValA, csrRowPtrA, csrColIndA, infoU) ! Incomplete-LU factorization with 0 level fill-in and no pivoting for A matrix ! output : valsILU0 cusparse_status = cusparseDcsrilu0(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, descrA, valsILU0, & csrRowPtrA, csrColIndA, infoL); !$acc end host_data status = end_clock time_tag = "The analysis phase of ILU(0)" status = print_time(time_tag) status = start_clock do while ( k <= max_iter ) k = k + 1 rho_p = rho !$acc host_data use_device(r_tlda, r) status = cublasDdot(cublas_H, n, r_tlda, 1, r, 1, rho) ! rho = (r_tlda, r) !$acc end host_data !!if ( rho == 0.0 ) then !! print *,"rho is zero, BiCG calculation condition does not hold. STOP!", k, "times" !! exit !!end if if ( k > 1 ) then !!if ( omega == 0.0 ) then !! print *,"omega is zero, BiCG calculation condition does not hold. STOP!", k, "times" !! exit !!end if beta = (rho/rho_p) * (alpha/omega) omegam1 = -omega !$acc host_data use_device(q, p, r) cublas_status = cublasDaxpy(cublas_H, n, omegam1, q, 1, p, 1) ! p = p -omega * q cublas_status = cublasDscal(cublas_H, n, beta, p, 1) ! p = beta * p cublas_status = cublasDaxpy(cublas_H, n, one, r, 1, p, 1) ! p = p + r !$acc end host_data end if ! Solve using valsILU0 vector from Incomplete-LU factorization, Solve M z = p !$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, p, Lov, z) ! Forwadaard Solve, L part for infoA(the sparsity pattern of descL) cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, & valsILU0, csrRowPtrA, csrColIndA, infoL, p, Lov) ! Back Substitution (U part for infoU = descU) cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, & valsILU0, csrRowPtrA, csrColIndA, infoU, Lov, z) ! solve z !$acc end host_data ! Compute q = A z !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, r_tlda, z, q) cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, & n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, z, zero, q) ! q = A z status = cublasDdot(cublas_H, n, r_tlda, 1, q, 1, temp1) ! temp1 = (r_tlda, q) !$acc end host_data alpha = rho / temp1 ! alpha = rho / (r_tlda, q) alpham1 = -alpha !$acc host_data use_device(q, z, X, r) cublas_status = cublasDaxpy(cublas_H, n, alpham1, q, 1, r, 1) ! r = r - alpha * q cublas_status = cublasDaxpy(cublas_H, n, alpha, z, 1, X, 1) ! X = X + alpha * z cublas_status = cublasDnrm2(cublas_H, n, r, 1, nrmr) ! nrmr = || r ||2 !$acc end host_data ! check for convergence if ( nrmr < tol*nrmr0 ) exit ! solve M s = r !$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, r, Lov, s, t) ! Forward Solve, L part for infoA(the sparsity pattern of descL) cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, & valsILU0, csrRowPtrA, csrColIndA, infoL, r, Lov) ! Back Substitution (U part for infoU = descU) cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, & valsILU0, csrRowPtrA, csrColIndA, infoU, Lov, s) ! solve s ! Compute t = A s cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, & ! t = A s n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, s, zero, t) status = cublasDdot(cublas_H, n, t, 1, r, 1, temp1) ! temp1 = (t, r) status = cublasDdot(cublas_H, n, t, 1, t, 1, temp2) ! temp2 = (t, t) !$acc end host_data omega = temp1/temp2 omegam1 = -omega !$acc host_data use_device(s, t, X, r) status = cublasDaxpy(cublas_H, n, omega, s, 1, X, 1) ! X = X + omega * s status = cublasDaxpy(cublas_H, n, omegam1, t, 1, r, 1) ! r = r - omega * t status = cublasDnrm2(cublas_H, n, r, 1, nrmr) ! nrmr = || r ||2 !$acc end host_data ! check for convergence if ( nrmr < tol*nrmr0 ) exit if (idbg == 1 ) print *, "iteration=", k, " residual =", nrmr end do !$acc end data print *, "iteration=", k, " residual =", nrmr if ( nrmr < tol*nrmr0 ) then print *, "Bi-Conjugate Gradient Stabilized (BiCGStab) is passed" else print *, "Bi-Conjugate Gradient Stabilized (BiCGStab) is failed" end if status = end_clock time_tag = "Elapased time for BiCGStab with ILU(0) :" status = print_time(time_tag) ! check result error = 0.0 do i = 1, n rsum = 0.0 do j = csrRowPtrA(i), csrRowPtrA(i+1)-1 rsum = rsum + csrValA(j) * X(csrColIndA(j)) end do diff = abs(rsum - rhs(i)) if (diff > error) error = diff end do print *,"error =",error ! output result if (idbg == 2) then nprint = n if ( n > 10 ) nprint = 10 do i = 1, nprint print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i) end do print *,"" do i = n-nprint, n print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i) end do end if status = cublasDestroy(cublas_H) status = cusparseDestroy(cusparse_H) status = cusparseDestroyMatDescr(descrA) status = cusparseDestroyMatDescr(descrL) status = cusparseDestroyMatDescr(descrU) status = cusparseDestroySolveAnalysisInfo(infoA) status = cusparseDestroySolveAnalysisInfo(infoL) status = cusparseDestroySolveAnalysisInfo(infoU) deallocate( q, r, r_tlda, p ) deallocate( Lov, z, valsILU0, t, s ) end subroutine BiCGStab End module CG_routines program main use precision_m use CG_routines implicit none integer :: num_mat integer :: i, j !------------------------------------------------------------- ! A matrix : sparse CSR type matrix ! N x N symmetric positive definite matrix (num_mat) !------------------------------------------------------------- num_mat = 16 ! num_mat = 16384 num_mat = 1048576 ! num_mat = 4000000 print '(/1x,a,d15.10,2x,a,i8/)'," tol = ", tol, "max_iteration =", max_iter ! Select matrix data ! call tridaiag_matgen (num_mat) call laplace_2nd_matgen (num_mat) ! Print CSR Matrix data ! print *, "csrValA", csrValA ! print *, "csrRowPtrA",csrRowPtrA ! print *, "csrColIndA", csrColIndA ! Preconditioned Conjugate Gradient using Incomplete-LU ! factorization with 0 fill-in and no pivoting call BiCGStab (num_mat) end program main
コンパイルオプションとして、-acc -Mcudalib=cusparse,cublas -Mcuda オプションを付けてコンパイルリンクする必要がある。例示しているプログラムで使用しているテスト用の係数行列は正定値対称行列である。非対称行列にも対応するため、これについては、後述する。なお、PGI 17.9 から CUDA Toolkit 9.0 ルーチンがコンパイラシステムに統合されたため、-ta=cuda9.0 を指定した。この指定は、cuda7.5 or cuda8.0 でも構わない。
$ pgf90 main.f90 -Mcudalib=cublas,cusparse -acc -ta=tesla,cc60,cuda9.0,cc35 -Minfo=accel -Mcuda -O2
cg:
255, Generating create(q(:))
Generating copy(r(:),x(:))
Generating copyin(csrrowptra(:),csrcolinda(:),csrvala(:))
Generating create(p(:))
iclu_cg:
457, Generating create(q(:))
Generating copyin(valsilu0(:))
Generating create(r_p(:))
Generating copyin(csrrowptra(:),csrcolinda(:))
Generating create(lov(:),p(:))
Generating copyin(csrvala(:))
Generating copy(x(:),r(:))
Generating create(z_p(:),z(:))
iccl_cg:
691, Generating create(q(:))
Generating copyin(valscl(:))
Generating create(r_p(:))
Generating copyin(csrrowptra(:),csrcolinda(:))
Generating create(lov(:),p(:))
Generating copyin(csrvala(:))
Generating copy(x(:),r(:))
Generating create(z_p(:),z(:))
[kato@photon32 CG]$ a.out
tol = .1000000000D-12 max_iteration = 10000
Matrix size = 1048576 x 1048576
Check a number of non-zero element
idx = 5238785 nz = 5238784
----------------------------------------------
Bi-Conjugate Gradient Stabilized (BiCGStab)
----------------------------------------------
nrmr0= 32.00000000000000 tol*nrmr0= 3.2000000000000001E-012
The analysis phase of ILU(0) 0.625 (sec)
iteration= 981 residual = 3.0842327270017723E-012
Bi-Conjugate Gradient Stabilized (BiCGStab) is passed
Elapased time for BiCGStab with ILU(0) : 39.010 (sec)
error = 2.6423307986078726E-014
x( 1)=0.499998958505E+00
x( 2)=0.697650643317E+00
x( 3)=0.790607780743E+00
x( 4)=0.841500545499E+00
x( 5)=0.872868905925E+00
x( 6)=0.893963985576E+00
x( 7)=0.909079245320E+00
x( 8)=0.920430160460E+00
x( 9)=0.929263761574E+00
x( 10)=0.936332649288E+00
x(1048566)=0.114542005404E-04
x(1048567)=0.104132673427E-04
x(1048568)=0.937223193541E-05
x(1048569)=0.833110453556E-05
x(1048570)=0.728989536154E-05
x(1048571)=0.624861463278E-05
x(1048572)=0.520727256966E-05
x(1048573)=0.416587939333E-05
x(1048574)=0.312444532565E-05
x(1048575)=0.208298058897E-05
x(1048576)=0.104149540607E-05