大規模なスパース性を有する線形システムの解法は、計算科学、工学のアプリケーション分野において重要な位置づけであり、これらは一般に直接法や反復法の手法が使用されている。直接法を利用した解法では、信頼性は高いがメモリ容量の問題により大規模な問題が解けないことも多く、従来から反復法を用いた解法が利用されている。反復法には様々な解法が存在するが、ここでは、正定値対称行列を扱える Conjugate Gradient(CG) 法と非対称線形システムを対象とする Bi-Conjugate Gradient Stabilized(BiCGStab)法を CUDA Library(cuSPARSE, cuBLAS) を利用して OpenACC プログラミングモデルで作成したプログラムを例示する。なお、使用言語は Fortran とする。
ここで例示するプログラムのアルゴリズムは以下の参考文献を参照し Fortran で組んだものである。
プログラムは、基本的に反復法のアルゴリズムに対して CUDA cuSPARSE/cuBLAS ライブラリを使用してコーディングする形態となる。OpenACC ディレクティブは、デバイス側のデータとなり得る変数・配列に対してディレクティブにより指示を行うだけの用途で使用する。すなわち、OpenACC により配列、変数のデバイスデータのマネージメントだけを行うこと以外は、一般的な Fortran プログラムとなる。CUDAプログラム言語は一切使用しない。
使用する OpenACC のディレクティブは、デバイスデータが使用されるスコープ範囲を指定する data 構文を使用することと、CUDA ライブラリルーチンをコールする際に渡す実引数が「デバイスポインタ」であることを、コンパイルに伝えるために使用する、host_data use_device ディレクティブの利用の二つとなる。
以下の例は、PGI 17.9 バージョンを使用しての結果である。使用しているバージョンを確かめたい場合は、以下のように -V コマンド・オプションを指定する。
$ pgfortran -V pgfortran 17.9-0 64-bit target on x86-64 Linux -tp haswell PGI Compilers and Tools Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
ここで例示する Fortran プログラムには、(1) 前処理なしの CG アルゴリズムを適用したルーチン(subroutine CG)、(2) 不完全 LU 分解の前処理を施した CG ルーチン(subroutine ICLU_CG)、(3) 不完全コレスキー分解の前処理を施した CG ルーチン(subroutine ICCL_CG) を組み込んだ。いずれも、cuBLAS と cuSPARSE ルーチンを使用して CG 実行を実現している。以下のアルゴリズムは、不完全コレスキー分解の前処理を適用した場合の処理となる。不完全 LU 分解を適用する場合も、分解方法が異なるが基本的に同じ流れとなる。
参照アルゴリズム NVIDIA CUDA Toolkit Documentation
入力する係数行列のストレージ・フォーマットは、CSR 形式としている。この詳細は、NVIDIA cuSPARSE ドキュメントを参考にして欲しい。CSR 形式のデータを用意して、このプログラムに入力するインタフェースを作成すれば、各種 CG 実行が可能である。
係数行列を CSR 形式で表現する際に必要な入力用データ配列は、以下のとおりである。
| nz | 整数 | 行列内の非ゼロ要素の数 |
| csrValA | 配列 | 行メジャー形式で A のすべての非ゼロ値を保持するための長さ nz のデータ配列 in row-major format. |
| csrRowPtrA | 配列 | 配列 csrColIndA および csrValA にインデックスを保持するための長さ m + 1 の整数配列。 この配列の最初の m 個のエントリは、 i = i、...、m の i 行目の最初の非ゼロ要素のインデックスを含み、最後のエントリは nnz + csrRowPtrA(0)を含みます。 一般に、csrRowPtrA(0)は、 0 あるいは 1 から開始されることを表すインデックスベースを表し、これは、それぞれ 0 または 1 となる |
| csrColIndA | 配列 | 配列 csrValA の対応する要素の列インデックスを含む長さ nz の整数配列 |
Incomplete-LU preconditioned CG iterative method
Incomplete-Cholesky preconditioned CG iterative method
! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
module precision_m
! integer, parameter, public :: fp_kind = kind(0.0) ! Single precision
integer, parameter, public :: fp_kind = selected_real_kind(15,305) ! Double precision
end module precision_m
module measure_time
integer :: clock_start,clock_end,clock_rate
character*40 time_tag
contains
integer function init_clock
call system_clock(count_rate=clock_rate) ! Find the rate
clock_start = 0; clock_end = 0
end function
integer function start_clock
call system_clock(count=clock_start) ! Start timing
end function
integer function end_clock
call system_clock(count=clock_end) ! End timing
end function
integer function print_time(time_tag)
character*40 time_tag
print '(/1x,a,2x,f10.3,a/)', time_tag, &
REAL(clock_end-clock_start,kind(0d0))/REAL(clock_rate,kind(0d0)), " (sec)"
end function
end module measure_time
module CG_routines
use precision_m
implicit none
integer :: nprint
integer, parameter :: idbg = 2
integer, parameter :: max_iter = 10000
real(fp_kind), parameter :: tol = 1.0d-12
integer, parameter :: index_base = 1
! CSR format
integer :: nz ! nonzero element of A (m x n ; n = m)
integer, allocatable, dimension(:) :: csrRowPtrA ! m + 1
integer, allocatable, dimension(:) :: csrColIndA !
real(fp_kind), allocatable, dimension(:) :: csrValA
!
real(fp_kind), allocatable, dimension(:) :: X
real(fp_kind), allocatable, dimension(:) :: rhs
contains
subroutine tridaiag_matgen(n)
!-------------------------------------------------------------
! genTridiag: generate a random tridiagonal symmetric matrix
! A matrix : sparse CSR storage format
! N x N symmetric positive definite matrix
!-------------------------------------------------------------
integer :: n ! size of matrix A
integer :: i, js
real(fp_kind) :: rnd, rand
external rand
nz = (n-2)*3 + 4 ! non-zero elements
allocate( csrRowPtrA(n+1) )
allocate( csrColIndA(nz) )
allocate( csrValA(nz) )
! matrix A
call random_number(rnd)
csrRowPtrA(1) = 1 ! i = 1
csrColIndA(1) = 1 ; csrValA(1) = rnd + 10.0
csrColIndA(2) = 2 ; csrValA(2) = rnd
do i = 2, n
if ( i > 2 ) then
csrRowPtrA(i) = csrRowPtrA(i-1) + 3
else
csrRowPtrA(2) = 3
end if
js = (i-2)*3 + 2 + index_base
csrColIndA(js) = i - 1
csrColIndA(js+1) = i
csrColIndA(js+2) = i + 1
call random_number(rnd)
csrValA(js) = csrValA(js-1)
csrValA(js+1) = rnd + 10.0
if ( i < n ) then
call random_number(rnd)
csrValA(js+2) = rnd
end if
end do
csrRowPtrA(n+1) = nz + csrRowPtrA(1)
! allocate X, rhs
allocate( X(n) )
allocate( rhs(n) )
! Initilize rhs
rhs = 1.0
! check the number of non-zero element
print '(1x, a, 1x, i10, 1x , a, i10)', " Matrix size =", n, " x ", n
print *, " Check a number of non-zero element"
print *, " nz =", nz
end subroutine tridaiag_matgen
subroutine laplace_2nd_matgen(n)
!-------------------------------------------------------------
! genTridiag: generate a tridiagonal symmetric matrix
! A matrix : sparse CSR storage format
! a second order, regular, Laplacian operator
!-------------------------------------------------------------
integer :: n ! size of matrix A
integer :: n_route
integer :: i, idx
integer :: ix, iy
if ( n /= sqrt(real(n))*sqrt(real(n)) ) stop "N number"
nz = 5*N-4*int(sqrt(real(n))) ! non-zero elements
allocate( csrRowPtrA(1:n+1) )
allocate( csrColIndA(1:nz) )
allocate( csrValA(1:nz) )
! allocate X, rhs
allocate( rhs(1:n) ) ; rhs = 0.0 ! Initialize RHS
n_route = sqrt(real(n)) ! laplace dimension
idx = 1
! matrix A
do i = 1, n
ix = mod(i-1,n_route)
iy = (i-1)/n_route
!print *,"ix,iy=",ix,iy
csrRowPtrA(i) = idx
! up
if ( iy > 0 ) then
csrValA(idx) = 1.0
csrColIndA(idx) = i - n_route
idx = idx + 1
else
rhs(i) = rhs(i) - 1.0
end if
! left
if ( ix > 0 ) then
csrValA(idx) = 1.0
csrColIndA(idx) = i - 1
idx = idx + 1
else
rhs(i) = rhs(i) - 0.0
end if
! center
csrValA(idx) = -4.0
csrColIndA(idx) = i
idx = idx + 1
! right
if ( ix < n_route-1 ) then
csrValA(idx) = 1.0
csrColIndA(idx) = i + 1
idx = idx + 1
else
rhs(i) = rhs(i) - 0.0
end if
! down
if ( iy < n_route-1 ) then
csrValA(idx) = 1.0
csrColIndA(idx) = i + n_route
idx = idx + 1
else
rhs(i) = rhs(i) - 0.0
end if
end do
csrRowPtrA(n+1) = idx
! check the number of non-zero element
print '(1x, a, 1x, i10, 1x , a, i10)', " Matrix size =", n, " x ", n
print *, " Check a number of non-zero element"
print *, " idx =", idx, "nz =", nz
! allocate X, rhs
allocate( X(n) )
end subroutine laplace_2nd_matgen
!==============================================
subroutine CG(n)
!----------------------------------------------
! Conjugate gradient without preconditioning
!----------------------------------------------
use precision_m
use measure_time
use cublas_v2
use cusparse
implicit none
integer :: n ! size of matrix
type(cusparseHandle) :: cusparse_H
type(cublasHandle) :: cublas_H
type(cusparseMatDescr) :: descrA
integer :: cusparse_status
integer :: cublas_status
integer :: status, ierr_code
real(fp_kind) :: r0, r1, alpha, alpham1, beta
real(fp_kind) :: dot, nalpha
real(fp_kind) :: error, rsum, diff
real(fp_kind), allocatable, dimension(:) :: r
real(fp_kind), allocatable, dimension(:) :: p
real(fp_kind), allocatable, dimension(:) :: q
real(fp_kind),parameter:: one = 1.0, zero = 0.0, onem1 = -1.0
integer :: i, j, k
allocate( r(n), p(n) )
allocate( q(n) )
! Conjugate gradient without preconditioning
print *,"-------------------------------------------------------------"
print *,"Conjugate gradient without preconditioning"
print *,"-------------------------------------------------------------"
! Initilize X, rhs
X = 0.0 ! or Initial guess X is set
! Copy rhs to r (Initial value r for rhs which is "b" vector)
do i = 1, n
r(i) = rhs(i) ! set r = rhs, where Ax = rhs (Ax = b)
end do
! creater handle for cuBLAS
status = cublasCreate(cublas_H)
! creater handle for CUSPARSE and matrix A descriptor
status = cusparseCreate(cusparse_H)
if (status /= CUSPARSE_STATUS_SUCCESS) write(*,*) 'cusparseCreate error: ', status
status = cusparseCreateMatDescr(descrA)
! matrix A properties
status = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)
status = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE) ! base = 1
alpha = 1.0
beta = 0.0
k = 0
r0 = 0.0
status = init_clock
status = start_clock
!$acc data copy(r, X) copyin(csrValA, csrRowPtrA, csrColIndA) &
!$acc create(p, q)
! === Initial guess x0 and residual r1 = r - Ax0
!$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, q, X, r)
cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, X, zero, q) ! q = A x0
cublas_status = cublasDaxpy(cublas_H, n, onem1, q, 1, r, 1) ! r = r - q
cublas_status = cublasDdot (cublas_H, n, r, 1, r, 1, r1) ! r1= (r, r)
!$acc end host_data
do while ( r1 > tol*tol .and. k < max_iter )
k = k + 1
if ( k == 1 ) then
!$acc host_data use_device(r,p)
cublas_status = cublasDcopy(cublas_H, n, r, 1, p, 1) ! p = r
!$acc end host_data
else
beta = r1/r0; ! beta = r(i) / r(i-1)
! p = r + beta * p
!$acc host_data use_device(r,p)
cublas_status = cublasDscal(cublas_H, n, beta, p, 1) ! p = beta*p
cublas_status = cublasDaxpy(cublas_H, n, one, r, 1, p, 1) ! p = r + p
!$acc end host_data
end if
! Compute q = A p
!$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, q ,p)
cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, p, zero, q)
cublas_status = cublasDdot(cublas_H, n, p, 1, q, 1, dot) ! dot = p{T} * q
!$acc end host_data
alpha = r1/dot
! x = x + alpha*p
!$acc host_data use_device(p, X)
cublas_status = cublasDaxpy(cublas_H, n, alpha, p, 1, X, 1)
!$acc end host_data
nalpha = -alpha
! r = r - alpha*q
!$acc host_data use_device(q, r)
cublas_status = cublasDaxpy(cublas_H, n, nalpha, q, 1, r, 1)
!$acc end host_data
r0 = r1 ! r0 = r1
!$acc host_data use_device(r)
cublas_status = cublasDdot(cublas_H, n, r, 1, r, 1, r1) ! r1 = (r, r)
!$acc end host_data
if (idbg == 1 ) print *, "iteration=", k, " residual =", sqrt(r1)
end do
!$acc end data
status = end_clock
print *, "iteration=", k, " residual =", sqrt(r1)
if ( sqrt(r1) < tol ) then
print *, "CG without preconditioning test is passed"
else
print *, "CG without preconditioning test is failed"
end if
time_tag = "Elapased time CG without preconditioning : "
status = print_time(time_tag)
! print *, "X=",X
! check result
error = 0.0
do i = 1, n
rsum = 0.0
do j = csrRowPtrA(i), csrRowPtrA(i+1)-1
rsum = rsum + csrValA(j) * X(csrColIndA(j))
end do
diff = abs(rsum - rhs(i))
if (diff > error) error = diff
end do
print *,"error =",error
! output result
if (idbg == 2) then
nprint = n
if ( n > 10 ) nprint = 10
do i = 1, nprint
print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
end do
print *,""
do i = n-nprint, n
print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
end do
end if
status = cublasDestroy(cublas_H)
status = cusparseDestroy(cusparse_H)
deallocate( q, r, p )
end subroutine CG
!======================================================
subroutine ICLU_CG(n)
!------------------------------------------------------
! Incomplete-LU preconditioned CG -- ILU(0)
! 0 fill-in and no pivoting
! A is an m × m sparse matrix defined in CSR storage format
!------------------------------------------------------
use precision_m
use measure_time
use cublas_v2
use cusparse
implicit none
integer :: n ! size of matrix
type(cusparseHandle) :: cusparse_H
type(cublasHandle) :: cublas_H
type(cusparseMatDescr) :: descr2
type(cusparseMatDescr) :: descrL
type(cusparseMatDescr) :: descrU
type(cusparseSolveAnalysisInfo) :: infoSolA, infoILU
integer :: cusparse_status
integer :: cublas_status
integer :: status, ierr_code
real(fp_kind) :: r1, alpha, alpham1, beta
real(fp_kind) :: abc, efg
real(fp_kind) :: nalpha
real(fp_kind) :: error, rsum, diff
real(fp_kind), allocatable, dimension(:) :: valsILU0
real(fp_kind), allocatable, dimension(:) :: Lov, z, r_p, z_p
real(fp_kind), allocatable, dimension(:) :: r
real(fp_kind), allocatable, dimension(:) :: p
real(fp_kind), allocatable, dimension(:) :: q
real(fp_kind),parameter:: one = 1.0, zero = 0.0, onem1 = -1.0
integer :: i, j, k
allocate( r(n), p(n) )
allocate( q(n) )
allocate( valsILU0(nz) )
allocate( Lov(n), z(n), r_p(n) ,z_p(n) )
! Conjugate gradient using incomplete-LU
print '(/1x,a)',"-------------------------------------------------------------"
print '(1x,a)', "Preconditioned Conjugate Gradient using incomplete-LU(0)"
print *, "-------------------------------------------------------------"
! Initilize X, rhs
X = 0.0 ! or Initial guess X is set
! Copy rhs to r (Initial value r for rhs which is "b" vector)
do i = 1, n
r(i) = rhs(i) ! set r = rhs, where Ax = rhs (Ax = b)
end do
alpha = 1.0
beta = 0.0
k = 0
! Clock
status = init_clock
status = start_clock
! creater handle for cuBLAS
status = cublasCreate(cublas_H)
! creater handle for CUSPARSE
status = cusparseCreate(cusparse_H)
! matrix A descriptor(descr2) and properties
status = cusparseCreateMatDescr(descr2)
status = cusparseSetMatType(descr2, CUSPARSE_MATRIX_TYPE_GENERAL) ! must be GENERAL for using csrilu0
status = cusparseSetMatIndexBase(descr2, CUSPARSE_INDEX_BASE_ONE) ! base = 1
! initiialize the analysis info for A matrix (InfoSolA = sparsity pattern of A)
status = cusparseCreateSolveAnalysisInfo(infoSolA)
! initiialize the ILU0 preconditioner Info (infoILU)
status = cusparseCreateSolveAnalysisInfo(infoILU)
! Setup for Lower Matrix
status = cusparseCreateMatDescr (descrL)
status = cusparseSetMatType (descrL, CUSPARSE_MATRIX_TYPE_GENERAL)
status = cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ONE)
status = cusparseSetMatFillMode (descrL, CUSPARSE_FILL_MODE_LOWER)
status = cusparseSetMatDiagType (descrL, CUSPARSE_DIAG_TYPE_UNIT)
! Setup for Upper Matrix
status = cusparseCreateMatDescr (descrU)
status = cusparseSetMatType (descrU, CUSPARSE_MATRIX_TYPE_GENERAL)
status = cusparseSetMatIndexBase(descrU, CUSPARSE_INDEX_BASE_ONE)
status = cusparseSetMatFillMode (descrU, CUSPARSE_FILL_MODE_UPPER)
status = cusparseSetMatDiagType (descrU, CUSPARSE_DIAG_TYPE_NON_UNIT)
! copy original vals to new ILU0-vals
valsILU0 = csrValA
!$acc data copy(r, X) copyin(valsILU0, csrValA, csrRowPtrA, csrColIndA) &
!$acc create(p, q, Lov, z, r_p, z_p)
! === Initial guess x0 and residual r1 = r - Ax0
!$acc host_data use_device( csrValA, csrRowPtrA, csrColIndA, X, q, r)
cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
n, n, nz, one, descr2, csrValA, csrRowPtrA, csrColIndA, X, zero, q) ! q = A x0
status = cublasDaxpy(cublas_H, n, onem1, q, 1, r, 1) ! r = r - q
status = cublasDdot (cublas_H, n, r, 1, r, 1, r1) ! r1 = (r, r)
!$acc end host_data
!$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, valsILU0)
! the analysis of the sparsity pattern for the Non-Transpose case (triangular linear system) infoSolA/descr2
cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
n, nz, descr2, csrValA, csrRowPtrA, csrColIndA, infoSolA)
! the analysis phase of the solution of a sparse linear system for infoILU0/descrU
cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
n, nz, descrU, csrValA, csrRowPtrA, csrColIndA, infoILU)
! Incomplete-LU factorization with 0 level fill-in and no pivoting for A matrix
! output : valsILU0
cusparse_status = cusparseDcsrilu0(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, descr2, valsILU0, &
csrRowPtrA, csrColIndA, infoSolA);
!$acc end host_data
status = end_clock
time_tag = "The analysis phase of ILU(0)"
status = print_time(time_tag)
status = start_clock
do while ( r1 > tol*tol .and. k <= max_iter )
! Solve using valsILU0 vector from Incomplete-LU factorization
!$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, r, Lov, z)
! Forward Solve, L part for infoSolA(the sparsity pattern of A = descL)
cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, &
valsILU0, csrRowPtrA, csrColIndA, infoSolA, r, Lov)
! Back Substitution (U part for infoILU = descU)
cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, &
valsILU0, csrRowPtrA, csrColIndA, infoILU, Lov, z) ! solve z
!$acc end host_data
k = k + 1
if ( k == 1 ) then
!$acc host_data use_device(z, p)
cublas_status = cublasDcopy(cublas_H, n, z, 1, p, 1) ! p = z
!$acc end host_data
else
!$acc host_data use_device(r, z, r_p, z_p)
cublas_status = cublasDdot(cublas_H, n, r, 1, z, 1, abc) ! abc = (r, z) r(i)
cublas_status = cublasDdot(cublas_H, n, r_p, 1, z_p, 1, efg) ! efg = (r_p, z_p) r(i-1)
!$acc end host_data
beta = abc / efg ! beta = r(i) / r(i-1)
! p = z + beta * p
!$acc host_data use_device(z,p)
cublas_status = cublasDscal(cublas_H, n, beta, p, 1) ! p = beta*p
cublas_status = cublasDaxpy(cublas_H, n, one, z, 1, p, 1) ! p = z + p
!$acc end host_data
end if
! Compute q = A p
!$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, r, z, p, q)
cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
n, n, nz, one, descrU, csrValA, csrRowPtrA, csrColIndA, p, zero, q);
cublas_status = cublasDdot(cublas_H, n, r, 1, z, 1, abc) ! abc = (r, z)
cublas_status = cublasDdot(cublas_H, n, p, 1, q, 1, efg) ! efg = (p, q)
!$acc end host_data
alpha = abc / efg
nalpha = -alpha
!$acc host_data use_device(p, X, r, r_p, z, z_p, q)
cublas_status = cublasDaxpy(cublas_H, n, alpha, p, 1, X, 1) ! X = X + alpha * p
cublas_status = cublasDcopy(cublas_H, n, r, 1, r_p, 1) ! r_p = r
cublas_status = cublasDcopy(cublas_H, n, z, 1, z_p, 1) ! z_p = z
cublas_status = cublasDaxpy(cublas_H, n, nalpha, q, 1, r, 1) ! r = r - alpha * q
cublas_status = cublasDdot (cublas_H, n, r, 1, r, 1, r1) ! r1 = (r, r)
!$acc end host_data
if (idbg == 1 ) print *, "iteration=", k, " residual =", sqrt(r1)
end do
!$acc end data
print *, "iteration=", k, " residual =", sqrt(r1)
if ( sqrt(r1) < tol ) then
print *, "Preconditioned Conjugate Gradient using ILU is passed"
else
print *, "Preconditioned Conjugate Gradient using ILU is failed"
end if
status = end_clock
time_tag = "Elapased time for GG with ILU(0) :"
status = print_time(time_tag)
! check result
error = 0.0
do i = 1, n
rsum = 0.0
do j = csrRowPtrA(i), csrRowPtrA(i+1)-1
rsum = rsum + csrValA(j) * X(csrColIndA(j))
end do
diff = abs(rsum - rhs(i))
if (diff > error) error = diff
end do
print *,"error =",error
! output result
if (idbg == 2) then
nprint = n
if ( n > 10 ) nprint = 10
do i = 1, nprint
print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
end do
print *,""
do i = n-nprint, n
print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
end do
end if
status = cublasDestroy(cublas_H)
status = cusparseDestroy(cusparse_H)
status = cusparseDestroyMatDescr(descr2)
status = cusparseDestroyMatDescr(descrL)
status = cusparseDestroyMatDescr(descrU)
status = cusparseDestroySolveAnalysisInfo(infoSolA)
status = cusparseDestroySolveAnalysisInfo(infoILU)
deallocate( q, r, p )
deallocate( Lov, z, valsILU0, r_p, z_p )
end subroutine ICLU_CG
!======================================================
subroutine ICCL_CG(n)
!------------------------------------------------------
! Incomplete-Cholesky preconditioned CG -- IC(0)
! 0 fill-in and no pivoting
! A is an m × m Hermitian/symmetric positive definite sparse matrix
! defined in CSR storage format
!------------------------------------------------------
use precision_m
use measure_time
use cublas_v2
use cusparse
implicit none
integer :: n ! size of matrix
type(cusparseHandle) :: cusparse_H
type(cublasHandle) :: cublas_H
type(cusparseMatDescr) :: descr3
type(cusparseMatDescr) :: descrL
type(cusparseSolveAnalysisInfo) :: infoSolA, infoCL, infoCLt
integer :: cusparse_status
integer :: cublas_status
integer :: status, ierr_code
real(fp_kind) :: r1, alpha, alpham1, beta
real(fp_kind) :: abc, efg
real(fp_kind) :: nalpha
real(fp_kind) :: error, rsum, diff
real(fp_kind), allocatable, dimension(:) :: valsCL
real(fp_kind), allocatable, dimension(:) :: Lov, z, r_p, z_p
real(fp_kind), allocatable, dimension(:) :: r
real(fp_kind), allocatable, dimension(:) :: p
real(fp_kind), allocatable, dimension(:) :: q
real(fp_kind),parameter:: one = 1.0, zero = 0.0, onem1 = -1.0
integer :: i, j, k
allocate( r(n), p(n) )
allocate( q(n) )
allocate( valsCL(nz) )
allocate( Lov(n), z(n), r_p(n) ,z_p(n) )
! Conjugate gradient using incomplete-Cholesky for Symmetric matrix
print '(/1x,a)',"-------------------------------------------------------------"
print '(1x,a)', "Preconditioned Conjugate Gradient using incomplete-Cholesky(0)"
print *, "-------------------------------------------------------------"
! Initilize X, rhs
X = 0.0 ! or Initial guess X is set
! Copy rhs to r (Initial value r for rhs which is "b" vector)
do i = 1, n
r(i) = rhs(i) ! set r = rhs, where Ax = rhs (Ax = b)
end do
alpha = 1.0
beta = 0.0
k = 0
! Clock
status = init_clock
status = start_clock
! creater handle for cuBLAS
status = cublasCreate(cublas_H)
! creater handle for CUSPARSE
status = cusparseCreate(cusparse_H)
! matrix A descriptor(descr3) and properties
status = cusparseCreateMatDescr(descr3)
status = cusparseSetMatType(descr3, CUSPARSE_MATRIX_TYPE_GENERAL) ! descr3 as a "GENERAL" or "SYMMETRIC"
status = cusparseSetMatIndexBase(descr3, CUSPARSE_INDEX_BASE_ONE) ! base = 1
! CAUTION - this program don't generate a lower triangular data for the matrix A
! So, this program don't use a lower triangular data for the incomplete cholesky preconditioner
! initiialize the analysis info for A matrix (InfoSolA = sparsity pattern of A)
status = cusparseCreateSolveAnalysisInfo(infoSolA)
! initiialize the CL(0) preconditioner Info
status = cusparseCreateSolveAnalysisInfo(infoCL)
status = cusparseCreateSolveAnalysisInfo(infoCLt)
! Setup for Lower Matrix
status = cusparseCreateMatDescr (descrL)
status = cusparseSetMatType (descrL, CUSPARSE_MATRIX_TYPE_GENERAL) ! not TRIANGULAR
status = cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ONE)
status = cusparseSetMatFillMode (descrL, CUSPARSE_FILL_MODE_LOWER)
status = cusparseSetMatDiagType (descrL, CUSPARSE_DIAG_TYPE_NON_UNIT)
! copy original vals to new vals-CL
valsCL = csrValA
!$acc data copy(r, X) copyin(valsCL, csrValA, csrRowPtrA, csrColIndA) &
!$acc create(p, q, Lov, z, r_p, z_p)
! === Initial guess x0 and r1 = r - Ax0
!$acc host_data use_device( csrValA, csrRowPtrA, csrColIndA, X, q, r)
cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
n, n, nz, one, descr3, csrValA, csrRowPtrA, csrColIndA, X, zero, q);
status = cublasDaxpy(cublas_H, n, onem1, q, 1, r, 1)
status = cublasDdot (cublas_H, n, r, 1, r, 1, r1) ! Initial r1 = (r, r)
!$acc end host_data
!$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA)
! the analysis phase of the solution of a sparse linear system for infoCL/descrL
cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
n, nz, descrL, csrValA, csrRowPtrA, csrColIndA, infoCL)
cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_TRANSPOSE, &
n, nz, descrL, csrValA, csrRowPtrA, csrColIndA, infoCLt)
!$acc end host_data
status = end_clock
time_tag = "The analysis phase of IC(0)"
status = print_time(time_tag)
status = start_clock
do while ( r1 > tol*tol .and. k <= max_iter )
! Solve using valsCL vector from Incomplete-Cholesky factorization M z = r
!$acc host_data use_device(valsCL, csrRowPtrA, csrColIndA, r, Lov, z)
! infoCLt triangular solve
cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_TRANSPOSE, n, one, descrL, &
valsCL, csrRowPtrA, csrColIndA, infoCLt, r, Lov)
! InfoCL triangular solve
cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, &
valsCL, csrRowPtrA, csrColIndA, infoCL, Lov, z) ! solve z
!$acc end host_data
k = k + 1
if ( k == 1 ) then
!$acc host_data use_device(z, p)
cublas_status = cublasDcopy(cublas_H, n, z, 1, p, 1) ! p = z
!$acc end host_data
else
!$acc host_data use_device(r, z, r_p, z_p)
cublas_status = cublasDdot(cublas_H, n, r, 1, z, 1, abc) ! abc = (r, z) r(i)
cublas_status = cublasDdot(cublas_H, n, r_p, 1, z_p, 1, efg) ! efg = (r_p, z_p) r(i-1)
!$acc end host_data
beta = abc / efg ! beta = r(i) / r(i-1)
! p = z + beta * p
!$acc host_data use_device(z,p)
cublas_status = cublasDscal(cublas_H, n, beta, p, 1) ! p = beta*p
cublas_status = cublasDaxpy(cublas_H, n, one, z, 1, p, 1) ! p = z + p
!$acc end host_data
end if
! Compute q = A p
!$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, r, z, p, q)
cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
n, n, nz, one, descr3, csrValA, csrRowPtrA, csrColIndA, p, zero, q);
cublas_status = cublasDdot(cublas_H, n, r, 1, z, 1, abc) ! abc = (r, z)
cublas_status = cublasDdot(cublas_H, n, p, 1, q, 1, efg) ! efg = (p, q)
!$acc end host_data
alpha = abc / efg
nalpha = -alpha
!$acc host_data use_device(p, X, r, r_p, z, z_p, q)
cublas_status = cublasDaxpy(cublas_H, n, alpha, p, 1, X, 1) ! X = X + alpha * p
cublas_status = cublasDcopy(cublas_H, n, r, 1, r_p, 1) ! r_p = r
cublas_status = cublasDcopy(cublas_H, n, z, 1, z_p, 1) ! z_p = z
cublas_status = cublasDaxpy(cublas_H, n, nalpha, q, 1, r, 1) ! r = r - alpha * q
cublas_status = cublasDdot (cublas_H, n, r, 1, r, 1, r1) ! r1 = (r, r)
!$acc end host_data
if (idbg == 1 ) print *, "iteration=", k, " residual =", sqrt(r1)
end do
!$acc end data
print *, "iteration=", k, " residual =", sqrt(r1)
if ( sqrt(r1) < tol ) then
print *, "Preconditioned Conjugate Gradient using cholesky is passed"
else
print *, "Preconditioned Conjugate Gradient using cholesky is failed"
end if
status = end_clock
time_tag = "Elapased time for GG with IC(0) :"
status = print_time(time_tag)
! check result
error = 0.0
do i = 1, n
rsum = 0.0
do j = csrRowPtrA(i), csrRowPtrA(i+1)-1
rsum = rsum + csrValA(j) * X(csrColIndA(j))
end do
diff = abs(rsum - rhs(i))
if (diff > error) error = diff
end do
print *,"error =",error
! output result
if (idbg == 2) then
nprint = n
if ( n > 10 ) nprint = 10
do i = 1, nprint
print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
end do
print *,""
do i = n-nprint, n
print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
end do
end if
status = cublasDestroy(cublas_H)
status = cusparseDestroy(cusparse_H)
status = cusparseDestroyMatDescr(descr3)
status = cusparseDestroyMatDescr(descrL)
status = cusparseDestroySolveAnalysisInfo(infoSolA)
status = cusparseDestroySolveAnalysisInfo(infoCL)
status = cusparseDestroySolveAnalysisInfo(infoCLt)
deallocate( q, r, p )
deallocate( Lov, z, valsCL, r_p, z_p )
end subroutine ICCL_CG
End module CG_routines
program main
use precision_m
use CG_routines
implicit none
integer :: num_mat
integer :: i, j
!-------------------------------------------------------------
! A matrix : sparse CSR type matrix
! N x N symmetric positive definite matrix (num_mat)
!-------------------------------------------------------------
num_mat = 16
num_mat = 16384
! num_mat = 1048576
! num_mat = 4000000
print '(/1x,a,d15.10,2x,a,i8/)'," tol = ", tol, "max_iteration =", max_iter
! Select matrix data
! call tridaiag_matgen (num_mat)
call laplace_2nd_matgen (num_mat)
! Print CSR Matrix data
! print *, "csrValA", csrValA
! print *, "csrRowPtrA",csrRowPtrA
! print *, "csrColIndA", csrColIndA
! Conjugate gradient without preconditioning
call CG (num_mat)
! Preconditioned Conjugate Gradient using Incomplete-LU
! factorization with 0 fill-in and no pivoting
call ICLU_CG (num_mat)
! Preconditioned Conjugate Gradient using Incomplete-Cholesky
! factorization
call ICCL_CG (num_mat)
end program main
コンパイルオプションとして、-acc -Mcudalib=cusparse,cublas -Mcuda オプションを付けてコンパイルリンクする必要がある。例示しているプログラムで使用しているテスト用の係数行列は、極めて収束性の良いデータを使用しているため、前処理なしの CG 計算が速い結果となっているが、実用上での係数行列の特性は固有値分布が様々なため、 CG 計算では収束出来ない場合も生じる。また、「前処理」を実施したとしても係数行列の素性によっては収束できない場合も多い。上記のプログラムは、OpenACC + CUDA library を使用して CG 法の実装モデルを示したにすぎない。実際の利用としては、対称行列に対象を限った不完全コレスキー分解の前処理付きルーチンよりも、不完全 LU 分解の前処理を施した CG ルーチン(subroutine ICLU_CG)を推奨する。あるいは、この後説明する BiCGStab 法を実装した CG ルーチン を利用する方が良いだろう。なお、PGI 17.9から CUDA Toolkit 9.0 ルーチンがコンパイラシステムに統合されたため、-ta=cuda9.0 を指定した。この指定は、cuda7.5 or cuda8.0 でも構わない。
$ pgf90 main.f90 -Mcudalib=cublas,cusparse -acc -ta=tesla,cc60,cuda9.0,cc35 -Minfo=accel -Mcuda -O2
cg:
255, Generating create(q(:))
Generating copy(r(:),x(:))
Generating copyin(csrrowptra(:),csrcolinda(:),csrvala(:))
Generating create(p(:))
iclu_cg:
457, Generating create(q(:))
Generating copyin(valsilu0(:))
Generating create(r_p(:))
Generating copyin(csrrowptra(:),csrcolinda(:))
Generating create(lov(:),p(:))
Generating copyin(csrvala(:))
Generating copy(x(:),r(:))
Generating create(z_p(:),z(:))
iccl_cg:
691, Generating create(q(:))
Generating copyin(valscl(:))
Generating create(r_p(:))
Generating copyin(csrrowptra(:),csrcolinda(:))
Generating create(lov(:),p(:))
Generating copyin(csrvala(:))
Generating copy(x(:),r(:))
Generating create(z_p(:),z(:))
[kato@photon32 CG]$ a.out
tol = .1000000000D-11 max_iteration = 10000
Matrix size = 1048576 x 1048576
Check a number of non-zero element
idx = 5238785 nz = 5238784
-------------------------------------------------------------
Conjugate gradient without preconditioning
-------------------------------------------------------------
iteration= 3531 residual = 9.9759469342430417E-013
CG without preconditioning test is passed
Elapased time CG without preconditioning 2.616 (sec)
error = 3.7858605139717838E-014
x( 1)=0.499998958505E+00
x( 2)=0.697650643317E+00
x( 3)=0.790607780743E+00
x( 4)=0.841500545499E+00
x( 5)=0.872868905925E+00
x( 6)=0.893963985576E+00
x( 7)=0.909079245320E+00
x( 8)=0.920430160460E+00
x( 9)=0.929263761574E+00
x( 10)=0.936332649288E+00
x(1048566)=0.114542005659E-04
x(1048567)=0.104132673657E-04
x(1048568)=0.937223195617E-05
x(1048569)=0.833110455416E-05
x(1048570)=0.728989537764E-05
x(1048571)=0.624861464650E-05
x(1048572)=0.520727258128E-05
x(1048573)=0.416587940269E-05
x(1048574)=0.312444533254E-05
x(1048575)=0.208298059361E-05
x(1048576)=0.104149540857E-05
-------------------------------------------------------------
Preconditioned Conjugate Gradient using incomplete-LU(0)
-------------------------------------------------------------
The analysis phase of ILU(0) 0.151 (sec)
iteration= 1237 residual = 9.8237116899063428E-013
Preconditioned Conjugate Gradient using ILU is passed
Elapased time for GG with ILU(0) : 24.738 (sec)
error = 2.3092638912203256E-014
x( 1)=0.499998958505E+00
x( 2)=0.697650643317E+00
x( 3)=0.790607780743E+00
x( 4)=0.841500545499E+00
x( 5)=0.872868905925E+00
x( 6)=0.893963985576E+00
x( 7)=0.909079245320E+00
x( 8)=0.920430160460E+00
x( 9)=0.929263761574E+00
x( 10)=0.936332649288E+00
x(1048566)=0.114542005653E-04
x(1048567)=0.104132673654E-04
x(1048568)=0.937223195581E-05
x(1048569)=0.833110455365E-05
x(1048570)=0.728989537732E-05
x(1048571)=0.624861464626E-05
x(1048572)=0.520727258084E-05
x(1048573)=0.416587940225E-05
x(1048574)=0.312444533231E-05
x(1048575)=0.208298059340E-05
x(1048576)=0.104149540829E-05
-------------------------------------------------------------
Preconditioned Conjugate Gradient using incomplete-Cholesky(0)
-------------------------------------------------------------
The analysis phase of IC(0) 0.138 (sec)
iteration= 1468 residual = 9.9588119511240996E-013
Preconditioned Conjugate Gradient using cholesky is passed
Elapased time for GG with IC(0) : 29.747 (sec)
error = 2.4313884239290928E-014
x( 1)=0.499998958505E+00
x( 2)=0.697650643317E+00
x( 3)=0.790607780743E+00
x( 4)=0.841500545499E+00
x( 5)=0.872868905925E+00
x( 6)=0.893963985576E+00
x( 7)=0.909079245320E+00
x( 8)=0.920430160460E+00
x( 9)=0.929263761574E+00
x( 10)=0.936332649288E+00
x(1048566)=0.114542005654E-04
x(1048567)=0.104132673654E-04
x(1048568)=0.937223195580E-05
x(1048569)=0.833110455365E-05
x(1048570)=0.728989537735E-05
x(1048571)=0.624861464631E-05
x(1048572)=0.520727258092E-05
x(1048573)=0.416587940233E-05
x(1048574)=0.312444533239E-05
x(1048575)=0.208298059346E-05
x(1048576)=0.104149540832E-05
次に、非対称および対称正定値(spd)線形システムに適用できる、非定常反復法の一つである Bi-Conjugate Gradient Stabilized (BiCGStab) を cuBLAS ならびに cuSPARSE ルーチンを利用して作成した Fortran プログラムを紹介する。OpenACC ディレクティブを使用して、デバイス側のデータとなり得る変数・配列引数を指定するだけで、GPU上で動作するコードが作成できる。CUDA C/Fortran 言語で記述する際のようなデバイス用の変数、配列の割付等は一切不要である。実装アルゴリズムは、以下の NVIDIA Incomplete-LU and Cholesky Preconditioned Iterative Methods Using cuSPARSE and cuBLAS で説明されている内容を適用した。
反復法の収束性を早めるために、不完全 LU 分解による前処理を施している。係数行列の不完全 LU 分解は、0 level fill-in and no pivoting の cusparseDcsrilu0 ルーチンを使用する。これによって、反復計算前に、線形システムの係数行列のスペクトルを変更する。
入力する係数行列のストレージ・フォーマットは、CSR 形式としている。この詳細は、上述した。CSR 形式のデータを用意して、このプログラムに入力するインタフェースを作成すれば、BiCGStab 実行が可能である。
参照アルゴリズム NVIDIA CUDA Toolkit Documentation
! Copyright (c) 2017, Tsutomu Kato, SofTek Systems, Inc. All rights reserved.
!
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
module precision_m
! integer, parameter, public :: fp_kind = kind(0.0) ! Single precision
integer, parameter, public :: fp_kind = selected_real_kind(15,305) ! Double precision
end module precision_m
module measure_time
integer :: clock_start,clock_end,clock_rate
character*40 time_tag
contains
integer function init_clock
call system_clock(count_rate=clock_rate) ! Find the rate
clock_start = 0; clock_end = 0
end function
integer function start_clock
call system_clock(count=clock_start) ! Start timing
end function
integer function end_clock
call system_clock(count=clock_end) ! End timing
end function
integer function print_time(time_tag)
character*40 time_tag
print '(/1x,a,2x,f10.3,a/)', time_tag, &
REAL(clock_end-clock_start,kind(0d0))/REAL(clock_rate,kind(0d0)), " (sec)"
end function
end module measure_time
module CG_routines
use precision_m
implicit none
integer :: nprint
integer, parameter :: idbg = 2
integer, parameter :: max_iter = 10000
real(fp_kind), parameter :: tol = 1.0d-13
integer, parameter :: index_base = 1
integer :: nz ! nonzero element of A (m x n ; n = m)
integer, allocatable, dimension(:) :: csrRowPtrA ! m + 1
integer, allocatable, dimension(:) :: csrColIndA !
real(fp_kind), allocatable, dimension(:) :: csrValA
!
real(fp_kind), allocatable, dimension(:) :: X
real(fp_kind), allocatable, dimension(:) :: rhs
contains
subroutine tridaiag_matgen(n)
!-------------------------------------------------------------
! genTridiag: generate a random tridiagonal symmetric matrix
! A matrix : sparse CSR storage format
! N x N symmetric positive definite matrix
!-------------------------------------------------------------
integer :: n ! size of matrix A
integer :: i, js
real(fp_kind) :: rnd, rand
external rand
nz = (n-2)*3 + 4 ! non-zero elements
allocate( csrRowPtrA(n+1) )
allocate( csrColIndA(nz) )
allocate( csrValA(nz) )
! matrix A
call random_number(rnd)
csrRowPtrA(1) = 1 ! i = 1
csrColIndA(1) = 1 ; csrValA(1) = rnd + 10.0
csrColIndA(2) = 2 ; csrValA(2) = rnd
do i = 2, n
if ( i > 2 ) then
csrRowPtrA(i) = csrRowPtrA(i-1) + 3
else
csrRowPtrA(2) = 3
end if
js = (i-2)*3 + 2 + index_base
csrColIndA(js) = i - 1
csrColIndA(js+1) = i
csrColIndA(js+2) = i + 1
call random_number(rnd)
csrValA(js) = csrValA(js-1)
csrValA(js+1) = rnd + 10.0
if ( i < n ) then
call random_number(rnd)
csrValA(js+2) = rnd
end if
end do
csrRowPtrA(n+1) = nz + csrRowPtrA(1)
! allocate X, rhs
allocate( X(n) )
allocate( rhs(n) )
! Initilize rhs
rhs = 1.0
! check the number of non-zero element
print '(1x, a, 1x, i10, 1x , a, i10)', " Matrix size =", n, " x ", n
print *, " Check a number of non-zero element"
print *, " nz =", nz
end subroutine tridaiag_matgen
subroutine laplace_2nd_matgen(n)
!-------------------------------------------------------------
! genTridiag: generate a tridiagonal symmetric matrix
! A matrix : sparse CSR storage format
! a second order, regular, Laplacian operator
!-------------------------------------------------------------
integer :: n ! size of matrix A
integer :: n_route
integer :: i, idx
integer :: ix, iy
if ( n /= sqrt(real(n))*sqrt(real(n)) ) stop "N number"
nz = 5*N-4*int(sqrt(real(n))) ! non-zero elements
allocate( csrRowPtrA(1:n+1) )
allocate( csrColIndA(1:nz) )
allocate( csrValA(1:nz) )
! allocate rhs
allocate( rhs(1:n) ) ; rhs =0.0 ! Initialize RHS
n_route = sqrt(real(n)) ! laplace dimension
idx = 1
! matrix A
do i = 1, n
ix = mod(i-1,n_route)
iy = (i-1)/n_route
!print *,"ix,iy=",ix,iy
csrRowPtrA(i) = idx
! up
if ( iy > 0 ) then
csrValA(idx) = 1.0
csrColIndA(idx) = i - n_route
idx = idx + 1
else
rhs(i) = rhs(i) - 1.0
end if
! left
if ( ix > 0 ) then
csrValA(idx) = 1.0
csrColIndA(idx) = i - 1
idx = idx + 1
else
rhs(i) = rhs(i) - 0.0
end if
! center
csrValA(idx) = -4.0
csrColIndA(idx) = i
idx = idx + 1
! right
if ( ix < n_route-1 ) then
csrValA(idx) = 1.0
csrColIndA(idx) = i + 1
idx = idx + 1
else
rhs(i) = rhs(i) - 0.0
end if
! down
if ( iy < n_route-1 ) then
csrValA(idx) = 1.0
csrColIndA(idx) = i + n_route
idx = idx + 1
else
rhs(i) = rhs(i) - 0.0
end if
end do
csrRowPtrA(n+1) = idx
! check the number of non-zero element
print '(1x, a, 1x, i10, 1x , a, i10)', " Matrix size =", n, " x ", n
print *, " Check a number of non-zero element"
print *, " idx =", idx, "nz =", nz
! allocate X
allocate( X(n) )
end subroutine laplace_2nd_matgen
!=============================================================
subroutine BiCGStab(n)
!-------------------------------------------------------------
! Bi-Conjugate Gradient Stabilized (BiCGStab)
! A is an m × m sparse matrix defined in CSR storage format
!-------------------------------------------------------------
use precision_m
use measure_time
use cublas_v2
use cusparse
implicit none
integer :: n ! size of matrix
type(cusparseHandle) :: cusparse_H
type(cublasHandle) :: cublas_H
type(cusparseMatDescr) :: descrA
type(cusparseMatDescr) :: descrL
type(cusparseMatDescr) :: descrU
type(cusparseSolveAnalysisInfo) :: infoA, infoL, infoU
integer :: cusparse_status
integer :: cublas_status
integer :: status, ierr_code
real(fp_kind) :: alpha, alpham1, beta, omega, omegam1
real(fp_kind) :: rho, rho_p
real(fp_kind) :: nrmr, nrmr0, temp1, temp2
real(fp_kind) :: error, rsum, diff
real(fp_kind), allocatable, dimension(:) :: valsILU0
real(fp_kind), allocatable, dimension(:) :: Lov, z
real(fp_kind), allocatable, dimension(:) :: r, r_tlda
real(fp_kind), allocatable, dimension(:) :: p
real(fp_kind), allocatable, dimension(:) :: q
real(fp_kind), allocatable, dimension(:) :: t, s
real(fp_kind),parameter:: one = 1.0, zero = 0.0, onem1 = -1.0
integer :: i, j, k
allocate( r(n), p(n), r_tlda(n) )
allocate( q(n) )
allocate( valsILU0(nz) )
allocate( Lov(n), z(n), t(n) , s(n) )
! Bi-Conjugate Gradient Stabilized (BiCGStab) using incomplete-LU
print '(/1x,a)',"-------------------------------------------------------------"
print '(1x,a)', " Bi-Conjugate Gradient Stabilized (BiCGStab)"
print *, "-------------------------------------------------------------"
! Initilize X, rhs
X = 0.0 ! or Initial guess X is set
! Copy rhs to r (Initial value r for rhs which is "b" vector)
do i = 1, n
r(i) = rhs(i) ! set r = rhs, where Ax = rhs (Ax = b)
end do
alpha = 1.0
beta = 0.0
rho = 0.0
k = 0
! Clock
status = init_clock
status = start_clock
! creater handle for cuBLAS
status = cublasCreate(cublas_H)
! creater handle for CUSPARSE
status = cusparseCreate(cusparse_H)
! matrix A descriptor(descrA) and properties
status = cusparseCreateMatDescr(descrA)
status = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL) ! must be GENERAL for using csrilu0
status = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE) ! base = 1
! initiialize the analysis info for A matrix (infoA = sparsity pattern of A)
status = cusparseCreateSolveAnalysisInfo(infoA)
! initiialize the L U preconditioner Info
status = cusparseCreateSolveAnalysisInfo(infoL)
status = cusparseCreateSolveAnalysisInfo(infoU)
! Setup for Lower Matrix
status = cusparseCreateMatDescr (descrL)
status = cusparseSetMatType (descrL, CUSPARSE_MATRIX_TYPE_GENERAL)
status = cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ONE)
status = cusparseSetMatFillMode (descrL, CUSPARSE_FILL_MODE_LOWER)
status = cusparseSetMatDiagType (descrL, CUSPARSE_DIAG_TYPE_UNIT)
! Setup for Upper Matrix
status = cusparseCreateMatDescr (descrU)
status = cusparseSetMatType (descrU, CUSPARSE_MATRIX_TYPE_GENERAL)
status = cusparseSetMatIndexBase(descrU, CUSPARSE_INDEX_BASE_ONE)
status = cusparseSetMatFillMode (descrU, CUSPARSE_FILL_MODE_UPPER)
status = cusparseSetMatDiagType (descrU, CUSPARSE_DIAG_TYPE_NON_UNIT)
! copy original vals to new ILU0-vals
valsILU0 = csrValA
!print *,"X=",X
!print *,"r_init=",r
!$acc data copy(r, X) copyin(valsILU0, csrValA, csrRowPtrA, csrColIndA ) &
!$acc create(p, q, Lov, z, r_tlda ) create(t, s)
! === Initial guess x0 and residual r = b - Ax0 where r = b
!$acc host_data use_device( csrValA, csrRowPtrA, csrColIndA, X, r, r_tlda, p )
cusparse_status = cusparseDcsrmv(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, X, zero, q) ! q = A x0, q is temporary
status = cublasDaxpy(cublas_H, n, onem1, q, 1, r, 1) ! r = r - A x0
! set as r_tlda = r and p = r
status = cublasDcopy(cublas_H, n, r, 1, r_tlda, 1) ! r_tlda = r
status = cublasDcopy(cublas_H, n, r, 1, p, 1) ! p = r
status = cublasDnrm2(cublas_H, n, r, 1, nrmr0) ! nrmr0 = ||r||2
!$acc end host_data
print *,"nrmr0=",nrmr0, "tol*nrmr0=",tol*nrmr0
!$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, valsILU0)
! the analysis of the sparsity pattern for the Non-Transpose case (triangular linear system) infoL/descrL
cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
n, nz, descrL, csrValA, csrRowPtrA, csrColIndA, infoL)
! the analysis phase of the solution of a sparse linear system for infoU/descrU
cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
n, nz, descrU, csrValA, csrRowPtrA, csrColIndA, infoU)
! Incomplete-LU factorization with 0 level fill-in and no pivoting for A matrix
! output : valsILU0
cusparse_status = cusparseDcsrilu0(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, descrA, valsILU0, &
csrRowPtrA, csrColIndA, infoL);
!$acc end host_data
status = end_clock
time_tag = "The analysis phase of ILU(0)"
status = print_time(time_tag)
status = start_clock
do while ( k <= max_iter )
k = k + 1
rho_p = rho
!$acc host_data use_device(r_tlda, r)
status = cublasDdot(cublas_H, n, r_tlda, 1, r, 1, rho) ! rho = (r_tlda, r)
!$acc end host_data
!!if ( rho == 0.0 ) then
!! print *,"rho is zero, BiCG calculation condition does not hold. STOP!", k, "times"
!! exit
!!end if
if ( k > 1 ) then
!!if ( omega == 0.0 ) then
!! print *,"omega is zero, BiCG calculation condition does not hold. STOP!", k, "times"
!! exit
!!end if
beta = (rho/rho_p) * (alpha/omega)
omegam1 = -omega
!$acc host_data use_device(q, p, r)
cublas_status = cublasDaxpy(cublas_H, n, omegam1, q, 1, p, 1) ! p = p -omega * q
cublas_status = cublasDscal(cublas_H, n, beta, p, 1) ! p = beta * p
cublas_status = cublasDaxpy(cublas_H, n, one, r, 1, p, 1) ! p = p + r
!$acc end host_data
end if
! Solve using valsILU0 vector from Incomplete-LU factorization, Solve M z = p
!$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, p, Lov, z)
! Forwadaard Solve, L part for infoA(the sparsity pattern of descL)
cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, &
valsILU0, csrRowPtrA, csrColIndA, infoL, p, Lov)
! Back Substitution (U part for infoU = descU)
cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, &
valsILU0, csrRowPtrA, csrColIndA, infoU, Lov, z) ! solve z
!$acc end host_data
! Compute q = A z
!$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, r_tlda, z, q)
cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, z, zero, q) ! q = A z
status = cublasDdot(cublas_H, n, r_tlda, 1, q, 1, temp1) ! temp1 = (r_tlda, q)
!$acc end host_data
alpha = rho / temp1 ! alpha = rho / (r_tlda, q)
alpham1 = -alpha
!$acc host_data use_device(q, z, X, r)
cublas_status = cublasDaxpy(cublas_H, n, alpham1, q, 1, r, 1) ! r = r - alpha * q
cublas_status = cublasDaxpy(cublas_H, n, alpha, z, 1, X, 1) ! X = X + alpha * z
cublas_status = cublasDnrm2(cublas_H, n, r, 1, nrmr) ! nrmr = || r ||2
!$acc end host_data
! check for convergence
if ( nrmr < tol*nrmr0 ) exit
! solve M s = r
!$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, r, Lov, s, t)
! Forward Solve, L part for infoA(the sparsity pattern of descL)
cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, &
valsILU0, csrRowPtrA, csrColIndA, infoL, r, Lov)
! Back Substitution (U part for infoU = descU)
cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, &
valsILU0, csrRowPtrA, csrColIndA, infoU, Lov, s) ! solve s
! Compute t = A s
cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, & ! t = A s
n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, s, zero, t)
status = cublasDdot(cublas_H, n, t, 1, r, 1, temp1) ! temp1 = (t, r)
status = cublasDdot(cublas_H, n, t, 1, t, 1, temp2) ! temp2 = (t, t)
!$acc end host_data
omega = temp1/temp2
omegam1 = -omega
!$acc host_data use_device(s, t, X, r)
status = cublasDaxpy(cublas_H, n, omega, s, 1, X, 1) ! X = X + omega * s
status = cublasDaxpy(cublas_H, n, omegam1, t, 1, r, 1) ! r = r - omega * t
status = cublasDnrm2(cublas_H, n, r, 1, nrmr) ! nrmr = || r ||2
!$acc end host_data
! check for convergence
if ( nrmr < tol*nrmr0 ) exit
if (idbg == 1 ) print *, "iteration=", k, " residual =", nrmr
end do
!$acc end data
print *, "iteration=", k, " residual =", nrmr
if ( nrmr < tol*nrmr0 ) then
print *, "Bi-Conjugate Gradient Stabilized (BiCGStab) is passed"
else
print *, "Bi-Conjugate Gradient Stabilized (BiCGStab) is failed"
end if
status = end_clock
time_tag = "Elapased time for BiCGStab with ILU(0) :"
status = print_time(time_tag)
! check result
error = 0.0
do i = 1, n
rsum = 0.0
do j = csrRowPtrA(i), csrRowPtrA(i+1)-1
rsum = rsum + csrValA(j) * X(csrColIndA(j))
end do
diff = abs(rsum - rhs(i))
if (diff > error) error = diff
end do
print *,"error =",error
! output result
if (idbg == 2) then
nprint = n
if ( n > 10 ) nprint = 10
do i = 1, nprint
print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
end do
print *,""
do i = n-nprint, n
print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
end do
end if
status = cublasDestroy(cublas_H)
status = cusparseDestroy(cusparse_H)
status = cusparseDestroyMatDescr(descrA)
status = cusparseDestroyMatDescr(descrL)
status = cusparseDestroyMatDescr(descrU)
status = cusparseDestroySolveAnalysisInfo(infoA)
status = cusparseDestroySolveAnalysisInfo(infoL)
status = cusparseDestroySolveAnalysisInfo(infoU)
deallocate( q, r, r_tlda, p )
deallocate( Lov, z, valsILU0, t, s )
end subroutine BiCGStab
End module CG_routines
program main
use precision_m
use CG_routines
implicit none
integer :: num_mat
integer :: i, j
!-------------------------------------------------------------
! A matrix : sparse CSR type matrix
! N x N symmetric positive definite matrix (num_mat)
!-------------------------------------------------------------
num_mat = 16
! num_mat = 16384
num_mat = 1048576
! num_mat = 4000000
print '(/1x,a,d15.10,2x,a,i8/)'," tol = ", tol, "max_iteration =", max_iter
! Select matrix data
! call tridaiag_matgen (num_mat)
call laplace_2nd_matgen (num_mat)
! Print CSR Matrix data
! print *, "csrValA", csrValA
! print *, "csrRowPtrA",csrRowPtrA
! print *, "csrColIndA", csrColIndA
! Preconditioned Conjugate Gradient using Incomplete-LU
! factorization with 0 fill-in and no pivoting
call BiCGStab (num_mat)
end program main
コンパイルオプションとして、-acc -Mcudalib=cusparse,cublas -Mcuda オプションを付けてコンパイルリンクする必要がある。例示しているプログラムで使用しているテスト用の係数行列は正定値対称行列である。非対称行列にも対応するため、これについては、後述する。なお、PGI 17.9 から CUDA Toolkit 9.0 ルーチンがコンパイラシステムに統合されたため、-ta=cuda9.0 を指定した。この指定は、cuda7.5 or cuda8.0 でも構わない。
$ pgf90 main.f90 -Mcudalib=cublas,cusparse -acc -ta=tesla,cc60,cuda9.0,cc35 -Minfo=accel -Mcuda -O2
cg:
255, Generating create(q(:))
Generating copy(r(:),x(:))
Generating copyin(csrrowptra(:),csrcolinda(:),csrvala(:))
Generating create(p(:))
iclu_cg:
457, Generating create(q(:))
Generating copyin(valsilu0(:))
Generating create(r_p(:))
Generating copyin(csrrowptra(:),csrcolinda(:))
Generating create(lov(:),p(:))
Generating copyin(csrvala(:))
Generating copy(x(:),r(:))
Generating create(z_p(:),z(:))
iccl_cg:
691, Generating create(q(:))
Generating copyin(valscl(:))
Generating create(r_p(:))
Generating copyin(csrrowptra(:),csrcolinda(:))
Generating create(lov(:),p(:))
Generating copyin(csrvala(:))
Generating copy(x(:),r(:))
Generating create(z_p(:),z(:))
[kato@photon32 CG]$ a.out
tol = .1000000000D-12 max_iteration = 10000
Matrix size = 1048576 x 1048576
Check a number of non-zero element
idx = 5238785 nz = 5238784
----------------------------------------------
Bi-Conjugate Gradient Stabilized (BiCGStab)
----------------------------------------------
nrmr0= 32.00000000000000 tol*nrmr0= 3.2000000000000001E-012
The analysis phase of ILU(0) 0.625 (sec)
iteration= 981 residual = 3.0842327270017723E-012
Bi-Conjugate Gradient Stabilized (BiCGStab) is passed
Elapased time for BiCGStab with ILU(0) : 39.010 (sec)
error = 2.6423307986078726E-014
x( 1)=0.499998958505E+00
x( 2)=0.697650643317E+00
x( 3)=0.790607780743E+00
x( 4)=0.841500545499E+00
x( 5)=0.872868905925E+00
x( 6)=0.893963985576E+00
x( 7)=0.909079245320E+00
x( 8)=0.920430160460E+00
x( 9)=0.929263761574E+00
x( 10)=0.936332649288E+00
x(1048566)=0.114542005404E-04
x(1048567)=0.104132673427E-04
x(1048568)=0.937223193541E-05
x(1048569)=0.833110453556E-05
x(1048570)=0.728989536154E-05
x(1048571)=0.624861463278E-05
x(1048572)=0.520727256966E-05
x(1048573)=0.416587939333E-05
x(1048574)=0.312444532565E-05
x(1048575)=0.208298058897E-05
x(1048576)=0.104149540607E-05