OpenACC プログラミング by PGI　-- 12-2章 CUDA Library/OpenACC を利用した線形方程式の反復解法--

　大規模なスパース性を有する線形システムの解法は、計算科学、工学のアプリケーション分野において重要な位置づけであり、これらは一般に直接法や反復法の手法が使用されている。直接法を利用した解法では、信頼性は高いがメモリ容量の問題により大規模な問題が解けないことも多く、従来から反復法を用いた解法が利用されている。反復法には様々な解法が存在するが、ここでは、正定値対称行列を扱える Conjugate Gradient(CG) 法と非対称線形システムを対象とする Bi-Conjugate Gradient Stabilized（BiCGStab）法を CUDA Library(cuSPARSE, cuBLAS) を利用して OpenACC プログラミングモデルで作成したプログラムを例示する。なお、使用言語は Fortran とする。

　ここで例示するプログラムのアルゴリズムは以下の参考文献を参照し Fortran で組んだものである。

　プログラムは、基本的に反復法のアルゴリズムに対して CUDA cuSPARSE/cuBLAS ライブラリを使用してコーディングする形態となる。OpenACC ディレクティブは、デバイス側のデータとなり得る変数・配列に対してディレクティブにより指示を行うだけの用途で使用する。すなわち、OpenACC により配列、変数のデバイスデータのマネージメントだけを行うこと以外は、一般的な Fortran プログラムとなる。CUDAプログラム言語は一切使用しない。
　使用する OpenACC のディレクティブは、デバイスデータが使用されるスコープ範囲を指定する data 構文を使用することと、CUDA ライブラリルーチンをコールする際に渡す実引数が「デバイスポインタ」であることを、コンパイルに伝えるために使用する、host_data use_device ディレクティブの利用の二つとなる。

　以下の例は、PGI 17.9 バージョンを使用しての結果である。使用しているバージョンを確かめたい場合は、以下のように -V コマンド・オプションを指定する。

$  pgfortran -V
pgfortran 17.9-0 64-bit target on x86-64 Linux -tp haswell
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

　ここで例示する Fortran プログラムには、(1) 前処理なしの CG アルゴリズムを適用したルーチン(subroutine CG)、(2) 不完全 LU 分解の前処理を施した CG ルーチン(subroutine ICLU_CG)、(3) 不完全コレスキー分解の前処理を施した CG ルーチン(subroutine ICCL_CG) を組み込んだ。いずれも、cuBLAS と cuSPARSE ルーチンを使用して CG 実行を実現している。以下のアルゴリズムは、不完全コレスキー分解の前処理を適用した場合の処理となる。不完全 LU 分解を適用する場合も、分解方法が異なるが基本的に同じ流れとなる。

参照アルゴリズム　NVIDIA CUDA Toolkit Documentation

$\begin{aligned} 1 : & 初期値設定 x_{0} 初期計算 r \leftarrow f - A x_{0} \end{aligned}$ $\begin{aligned} 2 : & for i \leftarrow 1, 2, ... 収束するまで繰り返す do \\ 3 : & Solve M z \leftarrow r & ⊳ Sparse lower and upper triangular solves \\ 4 : & ρ_{i} \leftarrow r^{T} z \\ 5 : & if i == 1 then \\ 6 : & p \leftarrow z \\ 7 : & else \\ 8 : & β \leftarrow \frac{ρ_{i}}{ρ_{i - 1}} \\ 9 : & p \leftarrow z + β p \\ 10 : & end if \\ 11 : & Compute q \leftarrow A p & ⊳ Sparse matrix-vector multiplication \\ 12 : & α \leftarrow \frac{ρ_{i}}{p^{T} q} \\ 13 : & x \leftarrow x + α p \\ 14 : & r \leftarrow r - α q \\ 15 : & end for \end{aligned}$

　入力する係数行列のストレージ・フォーマットは、CSR 形式としている。この詳細は、NVIDIA cuSPARSE ドキュメントを参考にして欲しい。CSR 形式のデータを用意して、このプログラムに入力するインタフェースを作成すれば、各種 CG 実行が可能である。

　係数行列を CSR 形式で表現する際に必要な入力用データ配列は、以下のとおりである。

nz	整数	行列内の非ゼロ要素の数
csrValA	配列	行メジャー形式で A のすべての非ゼロ値を保持するための長さ nz のデータ配列 in row-major format.
csrRowPtrA	配列	配列 csrColIndA および csrValA にインデックスを保持するための長さ m + 1 の整数配列。この配列の最初の m 個のエントリは、 i = i、...、m の i 行目の最初の非ゼロ要素のインデックスを含み、最後のエントリは nnz + csrRowPtrA（0）を含みます。一般に、csrRowPtrA（0）は、 0 あるいは 1 から開始されることを表すインデックスベースを表し、これは、それぞれ 0 または 1 となる
csrColIndA	配列	配列 csrValA の対応する要素の列インデックスを含む長さ nz の整数配列

Preconditioned Iterative Methods using OpenACC, cuSPARSE, cuBLAS

Incomplete-LU preconditioned CG iterative method
Incomplete-Cholesky preconditioned CG iterative method

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
module precision_m
!   integer, parameter, public :: fp_kind = kind(0.0)                  ! Single precision
    integer, parameter, public :: fp_kind = selected_real_kind(15,305) ! Double precision
end module precision_m
module measure_time
    integer :: clock_start,clock_end,clock_rate
    character*40 time_tag
    contains
    integer function init_clock
      call system_clock(count_rate=clock_rate) ! Find the rate
      clock_start = 0; clock_end = 0
    end function
    integer function start_clock
      call system_clock(count=clock_start) ! Start timing
    end function
    integer function end_clock
      call system_clock(count=clock_end)   ! End timing
    end function
    integer function print_time(time_tag)
      character*40 time_tag
      print '(/1x,a,2x,f10.3,a/)', time_tag, &
        REAL(clock_end-clock_start,kind(0d0))/REAL(clock_rate,kind(0d0)), " (sec)"
    end function
end module measure_time

module CG_routines
    use precision_m
    implicit none
    integer                  :: nprint
    integer, parameter       :: idbg = 2
    integer, parameter       :: max_iter = 10000
    real(fp_kind), parameter :: tol = 1.0d-12
    integer, parameter       :: index_base = 1
    ! CSR format 
    integer  :: nz                           ! nonzero element of A (m x n ; n = m)
    integer, allocatable, dimension(:)       :: csrRowPtrA  ! m + 1
    integer, allocatable, dimension(:)       :: csrColIndA  !
    real(fp_kind), allocatable, dimension(:) :: csrValA
!
    real(fp_kind), allocatable, dimension(:) :: X
    real(fp_kind), allocatable, dimension(:) :: rhs

  contains
    subroutine tridaiag_matgen(n)
!-------------------------------------------------------------
!   genTridiag: generate a random tridiagonal symmetric matrix
!   A matrix : sparse CSR storage format
!   N x N  symmetric positive definite matrix
!-------------------------------------------------------------
    integer       :: n ! size of matrix A
    integer       :: i, js
    real(fp_kind) :: rnd, rand
    external rand


    nz = (n-2)*3 + 4    ! non-zero elements
    allocate( csrRowPtrA(n+1) )
    allocate( csrColIndA(nz)  )
    allocate( csrValA(nz)     )

!   matrix A
    call random_number(rnd)
    csrRowPtrA(1) = 1  !  i = 1
    csrColIndA(1) = 1 ; csrValA(1) = rnd + 10.0
    csrColIndA(2) = 2 ; csrValA(2) = rnd
    do i = 2, n
      if ( i > 2 ) then
        csrRowPtrA(i) = csrRowPtrA(i-1) + 3
      else
        csrRowPtrA(2) = 3
      end if

      js = (i-2)*3 + 2 + index_base
      csrColIndA(js)   = i - 1
      csrColIndA(js+1) = i
      csrColIndA(js+2) = i + 1

      call random_number(rnd)
      csrValA(js) =  csrValA(js-1)
      csrValA(js+1) = rnd + 10.0

      if ( i < n )  then
        call random_number(rnd)
        csrValA(js+2) = rnd
      end if
    end do
    csrRowPtrA(n+1) = nz + csrRowPtrA(1)

! allocate X, rhs
    allocate( X(n)   )
    allocate( rhs(n) )

! Initilize rhs
    rhs = 1.0

!   check the number of non-zero element
    print '(1x, a, 1x, i10, 1x , a, i10)', " Matrix size =", n, " x ", n
    print *, " Check a number of non-zero element"
    print *, " nz =", nz

    end subroutine tridaiag_matgen

    subroutine laplace_2nd_matgen(n)
!-------------------------------------------------------------
!   genTridiag: generate a tridiagonal symmetric matrix
!   A matrix : sparse CSR storage format
!   a second order, regular, Laplacian operator
!-------------------------------------------------------------
    integer       :: n ! size of matrix A
    integer       :: n_route
    integer       :: i, idx
    integer       :: ix, iy

    if ( n /= sqrt(real(n))*sqrt(real(n)) )  stop "N number"

    nz = 5*N-4*int(sqrt(real(n)))   ! non-zero elements
    allocate( csrRowPtrA(1:n+1) )
    allocate( csrColIndA(1:nz)  )
    allocate( csrValA(1:nz)     )

! allocate X, rhs
    allocate( rhs(1:n) ) ; rhs = 0.0 ! Initialize RHS
    n_route = sqrt(real(n))          ! laplace dimension

    idx = 1
!   matrix A
    do i = 1, n
      ix = mod(i-1,n_route)
      iy = (i-1)/n_route
      !print *,"ix,iy=",ix,iy

      csrRowPtrA(i) = idx

      ! up
      if ( iy > 0 ) then
        csrValA(idx) = 1.0
        csrColIndA(idx) = i - n_route
        idx = idx + 1
      else
        rhs(i) = rhs(i) - 1.0
      end if

      ! left
      if ( ix > 0 ) then
        csrValA(idx) = 1.0
        csrColIndA(idx) = i - 1
        idx = idx + 1
      else
        rhs(i) = rhs(i) - 0.0
      end if

      ! center
        csrValA(idx) = -4.0
        csrColIndA(idx) = i
        idx = idx + 1

      ! right
      if ( ix < n_route-1 ) then
        csrValA(idx) = 1.0
        csrColIndA(idx) = i + 1
        idx = idx + 1
      else
        rhs(i) = rhs(i) - 0.0
      end if

      ! down
      if ( iy < n_route-1 ) then
        csrValA(idx) = 1.0
        csrColIndA(idx) = i + n_route
        idx = idx + 1
      else
        rhs(i) = rhs(i) - 0.0
      end if

    end do
    csrRowPtrA(n+1) =  idx

!   check the number of non-zero element
    print '(1x, a, 1x, i10, 1x , a, i10)', " Matrix size =", n, " x ", n
    print *, " Check a number of non-zero element"
    print *, " idx =", idx, "nz =", nz

! allocate X, rhs
    allocate( X(n) )

    end subroutine laplace_2nd_matgen
!==============================================

    subroutine CG(n)
!----------------------------------------------
!   Conjugate gradient without preconditioning
!----------------------------------------------
    use precision_m
    use measure_time

    use cublas_v2
    use cusparse
    implicit none

    integer                  :: n ! size of matrix

    type(cusparseHandle)     :: cusparse_H
    type(cublasHandle)       :: cublas_H
    type(cusparseMatDescr)   :: descrA
    integer                  :: cusparse_status
    integer                  :: cublas_status
    integer                  :: status, ierr_code
    real(fp_kind)            :: r0, r1, alpha, alpham1, beta
    real(fp_kind)            :: dot, nalpha
    real(fp_kind)            :: error, rsum, diff
    real(fp_kind), allocatable, dimension(:) :: r
    real(fp_kind), allocatable, dimension(:) :: p
    real(fp_kind), allocatable, dimension(:) :: q
    real(fp_kind),parameter::  one = 1.0, zero = 0.0, onem1 = -1.0

    integer                :: i, j, k

    allocate( r(n), p(n) )
    allocate( q(n) )

! Conjugate gradient without preconditioning

    print *,"-------------------------------------------------------------"
    print *,"Conjugate gradient without preconditioning"
    print *,"-------------------------------------------------------------"

! Initilize X, rhs
    X = 0.0  ! or Initial guess X is set

! Copy rhs to r (Initial value r for rhs which is "b" vector)
    do  i = 1, n
       r(i) = rhs(i)  ! set r = rhs, where Ax = rhs (Ax = b)
    end do

! creater handle for cuBLAS
    status = cublasCreate(cublas_H)

! creater handle for CUSPARSE and matrix A descriptor
    status = cusparseCreate(cusparse_H)
    if (status /= CUSPARSE_STATUS_SUCCESS) write(*,*) 'cusparseCreate error: ', status
    status = cusparseCreateMatDescr(descrA)

! matrix A properties
    status = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)
    status = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE)  ! base = 1

    alpha   = 1.0
    beta    = 0.0

    k = 0
    r0 = 0.0

       status = init_clock
       status = start_clock

!$acc data copy(r, X) copyin(csrValA, csrRowPtrA, csrColIndA) &
!$acc create(p, q)

   ! === Initial guess x0 and residual r1 = r - Ax0
   !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, q, X, r)
    cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
                      n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, X, zero, q)  ! q = A x0
    cublas_status = cublasDaxpy(cublas_H, n, onem1, q, 1, r, 1)                            ! r = r - q
    cublas_status = cublasDdot (cublas_H, n, r, 1, r, 1, r1)                               ! r1= (r, r)
   !$acc end host_data

    do while ( r1 > tol*tol .and. k < max_iter )
      k = k + 1

      if ( k == 1 ) then

      !$acc host_data use_device(r,p)
        cublas_status = cublasDcopy(cublas_H, n, r, 1, p, 1)              ! p = r
      !$acc end host_data

      else

        beta = r1/r0;                                      ! beta = r(i) / r(i-1)
                                                           ! p = r + beta * p
      !$acc host_data use_device(r,p)
        cublas_status = cublasDscal(cublas_H, n, beta, p, 1)       ! p = beta*p
        cublas_status = cublasDaxpy(cublas_H, n, one, r, 1, p, 1)  ! p = r + p
      !$acc end host_data

      end if

      ! Compute q = A p
      !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, q ,p)
        cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
                          n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, p, zero, q)
        cublas_status = cublasDdot(cublas_H, n, p, 1, q, 1, dot)   ! dot = p{T} * q
      !$acc end host_data

        alpha = r1/dot

      ! x = x + alpha*p
      !$acc host_data use_device(p, X)
        cublas_status = cublasDaxpy(cublas_H, n, alpha, p, 1, X, 1)
      !$acc end host_data

        nalpha = -alpha
      ! r = r - alpha*q
      !$acc host_data use_device(q, r)
        cublas_status = cublasDaxpy(cublas_H, n, nalpha, q, 1, r, 1)
      !$acc end host_data

        r0 = r1                                                   ! r0 = r1
      !$acc host_data use_device(r)
        cublas_status = cublasDdot(cublas_H, n, r, 1, r, 1, r1)   ! r1 = (r, r)
      !$acc end host_data

        if (idbg == 1 ) print *, "iteration=", k,  " residual =", sqrt(r1)
    end do
!$acc end data

    status = end_clock

    print *, "iteration=", k,  " residual =", sqrt(r1)
    if ( sqrt(r1) < tol ) then
      print *, "CG without preconditioning test is passed"
    else
      print *, "CG without preconditioning test is failed"
    end if

        time_tag = "Elapased time CG without preconditioning : "
        status = print_time(time_tag)

! print *, "X=",X
! check result
    error = 0.0
    do i = 1, n
      rsum = 0.0
      do  j =  csrRowPtrA(i), csrRowPtrA(i+1)-1
         rsum = rsum + csrValA(j) * X(csrColIndA(j))
      end do
      diff = abs(rsum - rhs(i))
      if (diff > error) error = diff
    end do
    print *,"error =",error

! output result
    if (idbg == 2) then
      nprint = n
      if ( n > 10 ) nprint = 10
      do i = 1, nprint
         print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
      end do
      print *,""
      do i = n-nprint, n
         print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
      end do
    end if

    status = cublasDestroy(cublas_H)
    status = cusparseDestroy(cusparse_H)
    deallocate( q, r,  p )

    end subroutine CG
!======================================================

    subroutine ICLU_CG(n)
!------------------------------------------------------
!   Incomplete-LU preconditioned CG -- ILU(0)
!   0 fill-in and no pivoting
!   A is an m × m sparse matrix defined in CSR storage format
!------------------------------------------------------
    use precision_m
    use measure_time

    use cublas_v2
    use cusparse
    implicit none

    integer                  :: n ! size of matrix

    type(cusparseHandle)     :: cusparse_H
    type(cublasHandle)       :: cublas_H
    type(cusparseMatDescr)   :: descr2
    type(cusparseMatDescr)   :: descrL
    type(cusparseMatDescr)   :: descrU
    type(cusparseSolveAnalysisInfo) :: infoSolA, infoILU
    integer                  :: cusparse_status
    integer                  :: cublas_status
    integer                  :: status, ierr_code
    real(fp_kind)            :: r1, alpha, alpham1, beta
    real(fp_kind)            :: abc, efg
    real(fp_kind)            :: nalpha
    real(fp_kind)            :: error, rsum, diff
    real(fp_kind), allocatable, dimension(:) :: valsILU0
    real(fp_kind), allocatable, dimension(:) :: Lov, z, r_p, z_p
    real(fp_kind), allocatable, dimension(:) :: r
    real(fp_kind), allocatable, dimension(:) :: p
    real(fp_kind), allocatable, dimension(:) :: q
    real(fp_kind),parameter::  one = 1.0, zero = 0.0, onem1 = -1.0

    integer                :: i, j, k

    allocate( r(n), p(n) )
    allocate( q(n) )
    allocate( valsILU0(nz) )
    allocate( Lov(n), z(n), r_p(n) ,z_p(n) )

! Conjugate gradient using incomplete-LU

    print '(/1x,a)',"-------------------------------------------------------------"
    print '(1x,a)', "Preconditioned Conjugate Gradient using incomplete-LU(0)"
    print *,        "-------------------------------------------------------------"

! Initilize X, rhs
    X = 0.0  ! or Initial guess X is set

! Copy rhs to r (Initial value r for rhs which is "b" vector)
    do  i = 1, n
       r(i) = rhs(i)     ! set r = rhs, where Ax = rhs (Ax = b)
    end do

    alpha   = 1.0
    beta    = 0.0
    k = 0

! Clock
      status = init_clock
      status = start_clock

! creater handle for cuBLAS
    status = cublasCreate(cublas_H)
! creater handle for CUSPARSE
    status = cusparseCreate(cusparse_H)

! matrix A descriptor(descr2) and properties
    status = cusparseCreateMatDescr(descr2)
    status = cusparseSetMatType(descr2, CUSPARSE_MATRIX_TYPE_GENERAL)  ! must be GENERAL for using csrilu0
    status = cusparseSetMatIndexBase(descr2, CUSPARSE_INDEX_BASE_ONE)  ! base = 1

! initiialize the analysis info for A matrix (InfoSolA = sparsity pattern of A)
    status = cusparseCreateSolveAnalysisInfo(infoSolA)

! initiialize the ILU0 preconditioner Info (infoILU)
    status = cusparseCreateSolveAnalysisInfo(infoILU)

! Setup for Lower Matrix
    status = cusparseCreateMatDescr (descrL)
    status = cusparseSetMatType     (descrL, CUSPARSE_MATRIX_TYPE_GENERAL)
    status = cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ONE)
    status = cusparseSetMatFillMode (descrL, CUSPARSE_FILL_MODE_LOWER)
    status = cusparseSetMatDiagType (descrL, CUSPARSE_DIAG_TYPE_UNIT)

! Setup for Upper Matrix
    status = cusparseCreateMatDescr (descrU)
    status = cusparseSetMatType     (descrU, CUSPARSE_MATRIX_TYPE_GENERAL)
    status = cusparseSetMatIndexBase(descrU, CUSPARSE_INDEX_BASE_ONE)
    status = cusparseSetMatFillMode (descrU, CUSPARSE_FILL_MODE_UPPER)
    status = cusparseSetMatDiagType (descrU, CUSPARSE_DIAG_TYPE_NON_UNIT)

! copy original vals to new ILU0-vals
    valsILU0 = csrValA

!$acc data copy(r, X) copyin(valsILU0, csrValA, csrRowPtrA, csrColIndA) &
!$acc create(p, q, Lov, z, r_p, z_p)

   ! === Initial guess x0 and residual r1 = r - Ax0
   !$acc host_data use_device( csrValA, csrRowPtrA, csrColIndA, X, q, r)
     cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
                       n, n, nz, one, descr2, csrValA, csrRowPtrA, csrColIndA, X, zero, q)   ! q = A x0
     status = cublasDaxpy(cublas_H, n, onem1, q, 1, r, 1)                                    ! r = r - q
     status = cublasDdot (cublas_H, n, r, 1, r, 1, r1)                                       ! r1 = (r, r)
   !$acc end host_data

   !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, valsILU0)
    ! the analysis of the sparsity pattern for the Non-Transpose case (triangular linear system) infoSolA/descr2
    cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
                     n, nz, descr2, csrValA, csrRowPtrA, csrColIndA, infoSolA)

    ! the analysis phase of the solution of a sparse linear system for infoILU0/descrU
    cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
                     n, nz, descrU, csrValA, csrRowPtrA, csrColIndA, infoILU)

    ! Incomplete-LU factorization with 0 level fill-in and no pivoting for A matrix
    ! output : valsILU0
    cusparse_status = cusparseDcsrilu0(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, descr2, valsILU0, &
                     csrRowPtrA, csrColIndA, infoSolA);
   !$acc end host_data

        status = end_clock
        time_tag = "The analysis phase of ILU(0)"
        status = print_time(time_tag)
        status = start_clock

    do while ( r1 > tol*tol .and. k <= max_iter )

   ! Solve using valsILU0 vector from Incomplete-LU factorization
   !$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, r, Lov, z)
      ! Forward Solve, L part for infoSolA(the sparsity pattern of A = descL)
      cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, &
                                              valsILU0, csrRowPtrA, csrColIndA, infoSolA, r, Lov)
      ! Back Substitution (U part for infoILU = descU)
      cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, &
                                              valsILU0, csrRowPtrA, csrColIndA, infoILU, Lov, z)  ! solve z
   !$acc end host_data

      k = k + 1
      if ( k == 1 ) then

      !$acc host_data use_device(z, p)
        cublas_status = cublasDcopy(cublas_H, n, z, 1, p, 1)            !  p = z
      !$acc end host_data

      else

      !$acc host_data use_device(r, z, r_p, z_p)
        cublas_status = cublasDdot(cublas_H, n, r, 1, z, 1, abc)        ! abc = (r, z)     r(i)
        cublas_status = cublasDdot(cublas_H, n, r_p, 1, z_p, 1, efg)    ! efg = (r_p, z_p) r(i-1)
      !$acc end host_data

        beta = abc / efg           ! beta = r(i) / r(i-1)

      ! p = z +  beta * p
      !$acc host_data use_device(z,p)
        cublas_status = cublasDscal(cublas_H, n, beta, p, 1)            ! p = beta*p
        cublas_status = cublasDaxpy(cublas_H, n, one, z, 1, p, 1)       ! p = z + p
      !$acc end host_data

      end if

      ! Compute q = A p
      !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, r, z, p, q)
        cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
                          n, n, nz, one, descrU, csrValA, csrRowPtrA, csrColIndA, p, zero, q);
        cublas_status = cublasDdot(cublas_H, n, r, 1, z, 1, abc)       ! abc = (r, z)
        cublas_status = cublasDdot(cublas_H, n, p, 1, q, 1, efg)       ! efg = (p, q)
      !$acc end host_data

        alpha = abc / efg
        nalpha = -alpha

      !$acc host_data use_device(p, X, r, r_p, z, z_p, q)
        cublas_status = cublasDaxpy(cublas_H, n, alpha, p, 1, X, 1)    ! X = X + alpha * p
        cublas_status = cublasDcopy(cublas_H, n, r, 1, r_p, 1)         ! r_p = r
        cublas_status = cublasDcopy(cublas_H, n, z, 1, z_p, 1)         ! z_p = z
        cublas_status = cublasDaxpy(cublas_H, n, nalpha, q, 1, r, 1)   ! r = r - alpha * q
        cublas_status = cublasDdot (cublas_H, n, r, 1, r, 1, r1)       ! r1 = (r, r)
      !$acc end host_data

        if (idbg == 1 ) print *, "iteration=", k,  " residual =", sqrt(r1)
    end do
!$acc end data

    print *, "iteration=", k,  " residual =", sqrt(r1)
    if ( sqrt(r1) < tol ) then
      print *, "Preconditioned Conjugate Gradient using ILU is passed"
    else
      print *, "Preconditioned Conjugate Gradient using ILU is failed"
    end if

        status = end_clock
        time_tag = "Elapased time for GG with ILU(0) :"
        status = print_time(time_tag)

! check result
    error = 0.0
    do i = 1, n
      rsum = 0.0
      do  j =  csrRowPtrA(i), csrRowPtrA(i+1)-1
         rsum = rsum + csrValA(j) * X(csrColIndA(j))
      end do
      diff = abs(rsum - rhs(i))
      if (diff > error) error = diff
    end do
    print *,"error =",error

! output result
    if (idbg == 2) then
      nprint = n
      if ( n > 10 ) nprint = 10
      do i = 1, nprint
         print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
      end do
      print *,""
      do i = n-nprint, n
         print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
      end do
    end if

    status = cublasDestroy(cublas_H)
    status = cusparseDestroy(cusparse_H)
    status = cusparseDestroyMatDescr(descr2)
    status = cusparseDestroyMatDescr(descrL)
    status = cusparseDestroyMatDescr(descrU)
    status = cusparseDestroySolveAnalysisInfo(infoSolA)
    status = cusparseDestroySolveAnalysisInfo(infoILU)
    deallocate( q, r,  p )
    deallocate( Lov, z, valsILU0, r_p, z_p )

    end subroutine ICLU_CG
!======================================================

    subroutine ICCL_CG(n)
!------------------------------------------------------
!   Incomplete-Cholesky preconditioned CG -- IC(0)
!   0 fill-in and no pivoting
!   A is an m × m Hermitian/symmetric positive definite sparse matrix
!   defined in CSR storage format
!------------------------------------------------------
    use precision_m
    use measure_time

    use cublas_v2
    use cusparse
    implicit none

    integer                  :: n ! size of matrix

    type(cusparseHandle)     :: cusparse_H
    type(cublasHandle)       :: cublas_H
    type(cusparseMatDescr)   :: descr3
    type(cusparseMatDescr)   :: descrL
    type(cusparseSolveAnalysisInfo) :: infoSolA, infoCL, infoCLt
    integer                  :: cusparse_status
    integer                  :: cublas_status
    integer                  :: status, ierr_code
    real(fp_kind)            :: r1, alpha, alpham1, beta
    real(fp_kind)            :: abc, efg
    real(fp_kind)            :: nalpha
    real(fp_kind)            :: error, rsum, diff
    real(fp_kind), allocatable, dimension(:) :: valsCL
    real(fp_kind), allocatable, dimension(:) :: Lov, z, r_p, z_p
    real(fp_kind), allocatable, dimension(:) :: r
    real(fp_kind), allocatable, dimension(:) :: p
    real(fp_kind), allocatable, dimension(:) :: q
    real(fp_kind),parameter::  one = 1.0, zero = 0.0, onem1 = -1.0

    integer                :: i, j, k

    allocate( r(n), p(n) )
    allocate( q(n) )
    allocate( valsCL(nz) )
    allocate( Lov(n), z(n), r_p(n) ,z_p(n) )

! Conjugate gradient using incomplete-Cholesky for Symmetric matrix

    print '(/1x,a)',"-------------------------------------------------------------"
    print '(1x,a)', "Preconditioned Conjugate Gradient using incomplete-Cholesky(0)"
    print *,        "-------------------------------------------------------------"

! Initilize X, rhs
    X = 0.0  ! or Initial guess X is set

! Copy rhs to r (Initial value r for rhs which is "b" vector)
    do  i = 1, n
       r(i) = rhs(i)     ! set r = rhs, where Ax = rhs (Ax = b)
    end do

    alpha   = 1.0
    beta    = 0.0

    k = 0

! Clock
       status = init_clock
       status = start_clock

! creater handle for cuBLAS
    status = cublasCreate(cublas_H)
! creater handle for CUSPARSE
    status = cusparseCreate(cusparse_H)

! matrix A descriptor(descr3) and properties
    status = cusparseCreateMatDescr(descr3)
    status = cusparseSetMatType(descr3, CUSPARSE_MATRIX_TYPE_GENERAL) ! descr3 as a "GENERAL" or "SYMMETRIC"
    status = cusparseSetMatIndexBase(descr3, CUSPARSE_INDEX_BASE_ONE) ! base = 1

! CAUTION - this program don't generate a lower triangular data for the matrix A
! So, this program don't use a lower triangular data for the incomplete cholesky preconditioner

! initiialize the analysis info for A matrix (InfoSolA = sparsity pattern of A)
    status = cusparseCreateSolveAnalysisInfo(infoSolA)

! initiialize the CL(0) preconditioner Info
    status = cusparseCreateSolveAnalysisInfo(infoCL)
    status = cusparseCreateSolveAnalysisInfo(infoCLt)

! Setup for Lower Matrix
    status = cusparseCreateMatDescr (descrL)
    status = cusparseSetMatType     (descrL, CUSPARSE_MATRIX_TYPE_GENERAL)   ! not TRIANGULAR
    status = cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ONE)
    status = cusparseSetMatFillMode (descrL, CUSPARSE_FILL_MODE_LOWER)
    status = cusparseSetMatDiagType (descrL, CUSPARSE_DIAG_TYPE_NON_UNIT)

! copy original vals to new vals-CL
    valsCL = csrValA

!$acc data copy(r, X) copyin(valsCL, csrValA, csrRowPtrA, csrColIndA) &
!$acc create(p, q, Lov, z, r_p, z_p)

   ! === Initial guess x0 and r1 = r - Ax0
   !$acc host_data use_device( csrValA, csrRowPtrA, csrColIndA, X, q, r)
     cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
                       n, n, nz, one, descr3, csrValA, csrRowPtrA, csrColIndA, X, zero, q);
     status = cublasDaxpy(cublas_H, n, onem1, q, 1, r, 1)
     status = cublasDdot (cublas_H, n, r, 1, r, 1, r1)                  ! Initial r1 = (r, r)
   !$acc end host_data

   !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA)

    ! the analysis phase of the solution of a sparse linear system for infoCL/descrL
     cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
                     n, nz, descrL, csrValA, csrRowPtrA, csrColIndA, infoCL)
     cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_TRANSPOSE, &
                     n, nz, descrL, csrValA, csrRowPtrA, csrColIndA, infoCLt)
   !$acc end host_data

        status = end_clock
        time_tag = "The analysis phase of IC(0)"
        status = print_time(time_tag)
        status = start_clock

    do while ( r1 > tol*tol .and. k <= max_iter )

   ! Solve using valsCL vector from Incomplete-Cholesky factorization M z = r
   !$acc host_data use_device(valsCL, csrRowPtrA, csrColIndA, r, Lov, z)
      ! infoCLt triangular solve
      cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_TRANSPOSE, n, one, descrL, &
                                              valsCL, csrRowPtrA, csrColIndA, infoCLt, r, Lov)
      ! InfoCL triangular solve
      cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, &
                                              valsCL, csrRowPtrA, csrColIndA, infoCL, Lov, z)  ! solve z
   !$acc end host_data

      k = k + 1
      if ( k == 1 ) then

      !$acc host_data use_device(z, p)
        cublas_status = cublasDcopy(cublas_H, n, z, 1, p, 1)            ! p = z
      !$acc end host_data

      else

      !$acc host_data use_device(r, z, r_p, z_p)
        cublas_status = cublasDdot(cublas_H, n, r, 1, z, 1, abc)        ! abc = (r, z)     r(i)
        cublas_status = cublasDdot(cublas_H, n, r_p, 1, z_p, 1, efg)    ! efg = (r_p, z_p) r(i-1)
      !$acc end host_data

        beta = abc / efg                                                ! beta = r(i) / r(i-1)

      ! p = z +  beta * p
      !$acc host_data use_device(z,p)
        cublas_status = cublasDscal(cublas_H, n, beta, p, 1)            ! p = beta*p
        cublas_status = cublasDaxpy(cublas_H, n, one, z, 1, p, 1)       ! p = z + p
      !$acc end host_data

      end if

      ! Compute q = A p
      !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, r, z, p, q)
        cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
                          n, n, nz, one, descr3, csrValA, csrRowPtrA, csrColIndA, p, zero, q);
        cublas_status = cublasDdot(cublas_H, n, r, 1, z, 1, abc)        ! abc = (r, z)
        cublas_status = cublasDdot(cublas_H, n, p, 1, q, 1, efg)        ! efg = (p, q)
      !$acc end host_data

        alpha = abc / efg
        nalpha = -alpha

      !$acc host_data use_device(p, X, r, r_p, z, z_p, q)
        cublas_status = cublasDaxpy(cublas_H, n, alpha, p, 1, X, 1)     ! X = X + alpha * p
        cublas_status = cublasDcopy(cublas_H, n, r, 1, r_p, 1)          ! r_p = r
        cublas_status = cublasDcopy(cublas_H, n, z, 1, z_p, 1)          ! z_p = z
        cublas_status = cublasDaxpy(cublas_H, n, nalpha, q, 1, r, 1)    ! r = r - alpha * q
        cublas_status = cublasDdot (cublas_H, n, r, 1, r, 1, r1)        ! r1 = (r, r)
      !$acc end host_data

        if (idbg == 1 ) print *, "iteration=", k,  " residual =", sqrt(r1)
    end do
!$acc end data

    print *, "iteration=", k,  " residual =", sqrt(r1)
    if ( sqrt(r1) < tol ) then
      print *, "Preconditioned Conjugate Gradient using cholesky is passed"
    else
      print *, "Preconditioned Conjugate Gradient using cholesky is failed"
    end if

        status = end_clock
        time_tag = "Elapased time for GG with IC(0) :"
        status = print_time(time_tag)

! check result
    error = 0.0
    do i = 1, n
      rsum = 0.0
      do  j =  csrRowPtrA(i), csrRowPtrA(i+1)-1
         rsum = rsum + csrValA(j) * X(csrColIndA(j))
      end do
      diff = abs(rsum - rhs(i))
      if (diff > error) error = diff
    end do
    print *,"error =",error

! output result
    if (idbg == 2) then
      nprint = n
      if ( n > 10 ) nprint = 10
      do i = 1, nprint
         print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
      end do
      print *,""
      do i = n-nprint, n
         print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
      end do
    end if

    status = cublasDestroy(cublas_H)
    status = cusparseDestroy(cusparse_H)
    status = cusparseDestroyMatDescr(descr3)
    status = cusparseDestroyMatDescr(descrL)
    status = cusparseDestroySolveAnalysisInfo(infoSolA)
    status = cusparseDestroySolveAnalysisInfo(infoCL)
    status = cusparseDestroySolveAnalysisInfo(infoCLt)
    deallocate( q, r, p )
    deallocate( Lov, z, valsCL, r_p, z_p )

    end subroutine ICCL_CG

End module CG_routines

program main
  use precision_m
  use CG_routines
  implicit none

  integer                :: num_mat
  integer                :: i, j

!-------------------------------------------------------------
!   A matrix : sparse CSR type matrix
!   N x N  symmetric positive definite matrix (num_mat)
!-------------------------------------------------------------

  num_mat = 16
  num_mat = 16384
! num_mat = 1048576
! num_mat = 4000000

  print '(/1x,a,d15.10,2x,a,i8/)'," tol = ", tol, "max_iteration =", max_iter

! Select matrix data

! call tridaiag_matgen    (num_mat)
  call laplace_2nd_matgen (num_mat)

! Print CSR Matrix data
!    print *, "csrValA", csrValA
!    print *, "csrRowPtrA",csrRowPtrA
!    print *, "csrColIndA", csrColIndA

! Conjugate gradient without preconditioning

  call CG (num_mat)

! Preconditioned Conjugate Gradient using Incomplete-LU
! factorization with 0 fill-in and no pivoting

  call ICLU_CG (num_mat)

! Preconditioned Conjugate Gradient using Incomplete-Cholesky
! factorization

  call ICCL_CG (num_mat)

end program main

　コンパイルオプションとして、-acc -Mcudalib=cusparse,cublas -Mcuda オプションを付けてコンパイルリンクする必要がある。例示しているプログラムで使用しているテスト用の係数行列は、極めて収束性の良いデータを使用しているため、前処理なしの CG 計算が速い結果となっているが、実用上での係数行列の特性は固有値分布が様々なため、 CG 計算では収束出来ない場合も生じる。また、「前処理」を実施したとしても係数行列の素性によっては収束できない場合も多い。上記のプログラムは、OpenACC + CUDA library を使用して CG 法の実装モデルを示したにすぎない。実際の利用としては、対称行列に対象を限った不完全コレスキー分解の前処理付きルーチンよりも、不完全 LU 分解の前処理を施した CG ルーチン(subroutine ICLU_CG)を推奨する。あるいは、この後説明する BiCGStab 法を実装した CG ルーチンを利用する方が良いだろう。なお、PGI 17.9から CUDA Toolkit 9.0 ルーチンがコンパイラシステムに統合されたため、-ta=cuda9.0 を指定した。この指定は、cuda7.5 or cuda8.0 でも構わない。

コンパイル＆実行

$ pgf90 main.f90 -Mcudalib=cublas,cusparse -acc -ta=tesla,cc60,cuda9.0,cc35 -Minfo=accel -Mcuda -O2
cg:
    255, Generating create(q(:))
         Generating copy(r(:),x(:))
         Generating copyin(csrrowptra(:),csrcolinda(:),csrvala(:))
         Generating create(p(:))
iclu_cg:
    457, Generating create(q(:))
         Generating copyin(valsilu0(:))
         Generating create(r_p(:))
         Generating copyin(csrrowptra(:),csrcolinda(:))
         Generating create(lov(:),p(:))
         Generating copyin(csrvala(:))
         Generating copy(x(:),r(:))
         Generating create(z_p(:),z(:))
iccl_cg:
    691, Generating create(q(:))
         Generating copyin(valscl(:))
         Generating create(r_p(:))
         Generating copyin(csrrowptra(:),csrcolinda(:))
         Generating create(lov(:),p(:))
         Generating copyin(csrvala(:))
         Generating copy(x(:),r(:))
         Generating create(z_p(:),z(:))
[kato@photon32 CG]$ a.out

  tol = .1000000000D-11  max_iteration =   10000

  Matrix size =    1048576  x    1048576
  Check a number of non-zero element
  idx =      5238785 nz =      5238784
 -------------------------------------------------------------
 Conjugate gradient without preconditioning
 -------------------------------------------------------------
 iteration=         3531  residual =   9.9759469342430417E-013
 CG without preconditioning test is passed

 Elapased time CG without preconditioning       2.616 (sec)

 error =   3.7858605139717838E-014
 x(      1)=0.499998958505E+00
 x(      2)=0.697650643317E+00
 x(      3)=0.790607780743E+00
 x(      4)=0.841500545499E+00
 x(      5)=0.872868905925E+00
 x(      6)=0.893963985576E+00
 x(      7)=0.909079245320E+00
 x(      8)=0.920430160460E+00
 x(      9)=0.929263761574E+00
 x(     10)=0.936332649288E+00

 x(1048566)=0.114542005659E-04
 x(1048567)=0.104132673657E-04
 x(1048568)=0.937223195617E-05
 x(1048569)=0.833110455416E-05
 x(1048570)=0.728989537764E-05
 x(1048571)=0.624861464650E-05
 x(1048572)=0.520727258128E-05
 x(1048573)=0.416587940269E-05
 x(1048574)=0.312444533254E-05
 x(1048575)=0.208298059361E-05
 x(1048576)=0.104149540857E-05

 -------------------------------------------------------------
 Preconditioned Conjugate Gradient using incomplete-LU(0)
 -------------------------------------------------------------

 The analysis phase of ILU(0)                   0.151 (sec)

 iteration=         1237  residual =   9.8237116899063428E-013
 Preconditioned Conjugate Gradient using ILU is passed

 Elapased time for GG with ILU(0) :            24.738 (sec)

 error =   2.3092638912203256E-014
 x(      1)=0.499998958505E+00
 x(      2)=0.697650643317E+00
 x(      3)=0.790607780743E+00
 x(      4)=0.841500545499E+00
 x(      5)=0.872868905925E+00
 x(      6)=0.893963985576E+00
 x(      7)=0.909079245320E+00
 x(      8)=0.920430160460E+00
 x(      9)=0.929263761574E+00
 x(     10)=0.936332649288E+00

 x(1048566)=0.114542005653E-04
 x(1048567)=0.104132673654E-04
 x(1048568)=0.937223195581E-05
 x(1048569)=0.833110455365E-05
 x(1048570)=0.728989537732E-05
 x(1048571)=0.624861464626E-05
 x(1048572)=0.520727258084E-05
 x(1048573)=0.416587940225E-05
 x(1048574)=0.312444533231E-05
 x(1048575)=0.208298059340E-05
 x(1048576)=0.104149540829E-05

 -------------------------------------------------------------
 Preconditioned Conjugate Gradient using incomplete-Cholesky(0)
 -------------------------------------------------------------

 The analysis phase of IC(0)                    0.138 (sec)

 iteration=         1468  residual =   9.9588119511240996E-013
 Preconditioned Conjugate Gradient using cholesky is passed

 Elapased time for GG with IC(0) :             29.747 (sec)

 error =   2.4313884239290928E-014
 x(      1)=0.499998958505E+00
 x(      2)=0.697650643317E+00
 x(      3)=0.790607780743E+00
 x(      4)=0.841500545499E+00
 x(      5)=0.872868905925E+00
 x(      6)=0.893963985576E+00
 x(      7)=0.909079245320E+00
 x(      8)=0.920430160460E+00
 x(      9)=0.929263761574E+00
 x(     10)=0.936332649288E+00

 x(1048566)=0.114542005654E-04
 x(1048567)=0.104132673654E-04
 x(1048568)=0.937223195580E-05
 x(1048569)=0.833110455365E-05
 x(1048570)=0.728989537735E-05
 x(1048571)=0.624861464631E-05
 x(1048572)=0.520727258092E-05
 x(1048573)=0.416587940233E-05
 x(1048574)=0.312444533239E-05
 x(1048575)=0.208298059346E-05
 x(1048576)=0.104149540832E-05

　次に、非対称および対称正定値（spd）線形システムに適用できる、非定常反復法の一つである Bi-Conjugate Gradient Stabilized (BiCGStab) を cuBLAS ならびに cuSPARSE ルーチンを利用して作成した Fortran プログラムを紹介する。OpenACC ディレクティブを使用して、デバイス側のデータとなり得る変数・配列引数を指定するだけで、GPU上で動作するコードが作成できる。CUDA C/Fortran 言語で記述する際のようなデバイス用の変数、配列の割付等は一切不要である。実装アルゴリズムは、以下の NVIDIA Incomplete-LU and Cholesky Preconditioned Iterative Methods Using cuSPARSE and cuBLAS で説明されている内容を適用した。

　反復法の収束性を早めるために、不完全 LU 分解による前処理を施している。係数行列の不完全 LU 分解は、0 level fill-in and no pivoting の cusparseDcsrilu0 ルーチンを使用する。これによって、反復計算前に、線形システムの係数行列のスペクトルを変更する。

　入力する係数行列のストレージ・フォーマットは、CSR 形式としている。この詳細は、上述した。CSR 形式のデータを用意して、このプログラムに入力するインタフェースを作成すれば、BiCGStab 実行が可能である。

参照アルゴリズム　NVIDIA CUDA Toolkit Documentation

$\begin{aligned} 1 : & 初期推定値設定 be x_{0}, compute r \leftarrow rhs - A x_{0} \\ 2 : & 初期セット p \leftarrow r and \tilde{r} \leftarrow r \end{aligned}$ $\begin{aligned} 3 : & for i \leftarrow 1, 2, ... 収束するまでループ繰り返し \\ 4 : & ρ_{i} \leftarrow {\tilde{r}}^{T} r \\ 5 : & if ρ_{i} == 0.0 then \\ 6 : & method failed \\ 7 : & end if \\ 8 : & if i > 1 then \\ 9 : & if ω == 0.0 then \\ 10 : & method failed \\ 11 : & end if \\ 12 : & β \leftarrow \frac{ρ_{i}}{ρ_{i - 1}} \times (\frac{α}{ω}) \\ 13 : & p \leftarrow r + β (p - ω v) \\ 14 : & end if \\ 15 : & Solve M \hat{p} \leftarrow p & ⊳ Sparse lower and upper triangular solves \\ 16 : & Compute q \leftarrow A \hat{p} & ⊳ Sparse matrix-vector multiplication \\ 17 : & α \leftarrow \frac{ρ_{i}}{{\tilde{r}}^{T} q} \\ 18 : & s \leftarrow r - α q \\ 19 : & x \leftarrow x + α \hat{p} \\ 20 : & if {‖ s ‖}_{2} \leq tol then \\ 21 : & method converged \\ 22 : & end if \\ 23 : & Solve M \hat{s} \leftarrow s & ⊳ Sparse lower and upper triangular solves \\ 24 : & Compute t \leftarrow A \hat{s} & ⊳ Sparse matrix-vector multiplication \\ 25 : & ω \leftarrow \frac{t^{T} s}{t^{T} t} \\ 26 : & x \leftarrow x + ω \hat{s} \\ 27 : & r \leftarrow s - ω t \\ 28 : & end for \end{aligned}$

! Copyright (c) 2017, Tsutomu Kato, SofTek Systems, Inc. All rights reserved. 
!
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
module precision_m
!   integer, parameter, public :: fp_kind = kind(0.0)                  ! Single precision
    integer, parameter, public :: fp_kind = selected_real_kind(15,305) ! Double precision
end module precision_m
module measure_time
    integer :: clock_start,clock_end,clock_rate
    character*40 time_tag
    contains
    integer function init_clock
      call system_clock(count_rate=clock_rate) ! Find the rate
      clock_start = 0; clock_end = 0
    end function
    integer function start_clock
      call system_clock(count=clock_start) ! Start timing
    end function
    integer function end_clock
      call system_clock(count=clock_end)   ! End timing
    end function
    integer function print_time(time_tag)
      character*40 time_tag
      print '(/1x,a,2x,f10.3,a/)', time_tag, &
        REAL(clock_end-clock_start,kind(0d0))/REAL(clock_rate,kind(0d0)), " (sec)"
    end function
end module measure_time

module CG_routines
    use precision_m
    implicit none
    integer                  :: nprint
    integer, parameter       :: idbg = 2
    integer, parameter       :: max_iter = 10000
    real(fp_kind), parameter :: tol = 1.0d-13
    integer, parameter       :: index_base = 1
    integer  :: nz                           ! nonzero element of A (m x n ; n = m)
    integer, allocatable, dimension(:)       :: csrRowPtrA  ! m + 1
    integer, allocatable, dimension(:)       :: csrColIndA  !
    real(fp_kind), allocatable, dimension(:) :: csrValA
!
    real(fp_kind), allocatable, dimension(:) :: X
    real(fp_kind), allocatable, dimension(:) :: rhs

  contains
    subroutine tridaiag_matgen(n)
!-------------------------------------------------------------
!   genTridiag: generate a random tridiagonal symmetric matrix
!   A matrix : sparse CSR storage format
!   N x N  symmetric positive definite matrix
!-------------------------------------------------------------
    integer       :: n ! size of matrix A
    integer       :: i, js
    real(fp_kind) :: rnd, rand
    external rand


    nz = (n-2)*3 + 4    ! non-zero elements
    allocate( csrRowPtrA(n+1) )
    allocate( csrColIndA(nz)  )
    allocate( csrValA(nz)     )

!   matrix A
    call random_number(rnd)
    csrRowPtrA(1) = 1  !  i = 1
    csrColIndA(1) = 1 ; csrValA(1) = rnd + 10.0
    csrColIndA(2) = 2 ; csrValA(2) = rnd
    do i = 2, n
      if ( i > 2 ) then
        csrRowPtrA(i) = csrRowPtrA(i-1) + 3
      else
        csrRowPtrA(2) = 3
      end if

      js = (i-2)*3 + 2 + index_base
      csrColIndA(js)   = i - 1
      csrColIndA(js+1) = i
      csrColIndA(js+2) = i + 1

      call random_number(rnd)
      csrValA(js) =  csrValA(js-1)
      csrValA(js+1) = rnd + 10.0

      if ( i < n )  then
        call random_number(rnd)
        csrValA(js+2) = rnd
      end if
    end do
    csrRowPtrA(n+1) = nz + csrRowPtrA(1)

! allocate X, rhs
    allocate( X(n)   )
    allocate( rhs(n) )

! Initilize rhs
    rhs = 1.0

!   check the number of non-zero element
    print '(1x, a, 1x, i10, 1x , a, i10)', " Matrix size =", n, " x ", n
    print *, " Check a number of non-zero element"
    print *, " nz =", nz

    end subroutine tridaiag_matgen

    subroutine laplace_2nd_matgen(n)
!-------------------------------------------------------------
!   genTridiag: generate a tridiagonal symmetric matrix
!   A matrix : sparse CSR storage format
!   a second order, regular, Laplacian operator
!-------------------------------------------------------------
    integer       :: n ! size of matrix A
    integer       :: n_route
    integer       :: i, idx
    integer       :: ix, iy

    if ( n /= sqrt(real(n))*sqrt(real(n)) )  stop "N number"

    nz = 5*N-4*int(sqrt(real(n)))   ! non-zero elements
    allocate( csrRowPtrA(1:n+1) )
    allocate( csrColIndA(1:nz)  )
    allocate( csrValA(1:nz)     )

! allocate  rhs
    allocate( rhs(1:n) ) ; rhs =0.0 ! Initialize RHS
    n_route = sqrt(real(n))         ! laplace dimension

    idx = 1
!   matrix A
    do i = 1, n
      ix = mod(i-1,n_route)
      iy = (i-1)/n_route
      !print *,"ix,iy=",ix,iy

      csrRowPtrA(i) = idx

      ! up
      if ( iy > 0 ) then
        csrValA(idx) = 1.0
        csrColIndA(idx) = i - n_route
        idx = idx + 1
      else
        rhs(i) = rhs(i) - 1.0
      end if

      ! left
      if ( ix > 0 ) then
        csrValA(idx) = 1.0
        csrColIndA(idx) = i - 1
        idx = idx + 1
      else
        rhs(i) = rhs(i) - 0.0
      end if

      ! center
        csrValA(idx) = -4.0
        csrColIndA(idx) = i
        idx = idx + 1

      ! right
      if ( ix < n_route-1 ) then
        csrValA(idx) = 1.0
        csrColIndA(idx) = i + 1
        idx = idx + 1
      else
        rhs(i) = rhs(i) - 0.0
      end if

      ! down
      if ( iy < n_route-1 ) then
        csrValA(idx) = 1.0
        csrColIndA(idx) = i + n_route
        idx = idx + 1
      else
        rhs(i) = rhs(i) - 0.0
      end if

    end do
    csrRowPtrA(n+1) =  idx

!   check the number of non-zero element
    print '(1x, a, 1x, i10, 1x , a, i10)', " Matrix size =", n, " x ", n
    print *, " Check a number of non-zero element"
    print *, " idx =", idx, "nz =", nz

! allocate X
    allocate( X(n) )

    end subroutine laplace_2nd_matgen
!=============================================================
    subroutine BiCGStab(n)
!-------------------------------------------------------------
!   Bi-Conjugate Gradient Stabilized (BiCGStab)
!   A is an m × m sparse matrix defined in CSR storage format
!-------------------------------------------------------------
    use precision_m
    use measure_time

    use cublas_v2
    use cusparse
    implicit none

    integer                  :: n ! size of matrix

    type(cusparseHandle)     :: cusparse_H
    type(cublasHandle)       :: cublas_H
    type(cusparseMatDescr)   :: descrA
    type(cusparseMatDescr)   :: descrL
    type(cusparseMatDescr)   :: descrU
    type(cusparseSolveAnalysisInfo) :: infoA, infoL, infoU
    integer                  :: cusparse_status
    integer                  :: cublas_status
    integer                  :: status, ierr_code
    real(fp_kind)            :: alpha, alpham1, beta, omega, omegam1
    real(fp_kind)            :: rho, rho_p
    real(fp_kind)            :: nrmr, nrmr0, temp1, temp2
    real(fp_kind)            :: error, rsum, diff
    real(fp_kind), allocatable, dimension(:) :: valsILU0
    real(fp_kind), allocatable, dimension(:) :: Lov, z
    real(fp_kind), allocatable, dimension(:) :: r, r_tlda
    real(fp_kind), allocatable, dimension(:) :: p
    real(fp_kind), allocatable, dimension(:) :: q
    real(fp_kind), allocatable, dimension(:) :: t, s
    real(fp_kind),parameter::  one = 1.0, zero = 0.0, onem1 = -1.0

    integer                :: i, j, k

    allocate( r(n), p(n), r_tlda(n) )
    allocate( q(n) )
    allocate( valsILU0(nz) )
    allocate( Lov(n), z(n), t(n) , s(n) )

! Bi-Conjugate Gradient Stabilized (BiCGStab) using incomplete-LU

    print '(/1x,a)',"-------------------------------------------------------------"
    print '(1x,a)', " Bi-Conjugate Gradient Stabilized (BiCGStab)"
    print *,        "-------------------------------------------------------------"

! Initilize X, rhs
    X = 0.0              ! or Initial guess X is set

! Copy rhs to r (Initial value r for rhs which is "b" vector)
    do  i = 1, n
       r(i) = rhs(i)     ! set r = rhs, where Ax = rhs (Ax = b)
    end do

    alpha   = 1.0
    beta    = 0.0
    rho     = 0.0
    k = 0

! Clock
      status = init_clock
      status = start_clock

! creater handle for cuBLAS
    status = cublasCreate(cublas_H)
! creater handle for CUSPARSE
    status = cusparseCreate(cusparse_H)

! matrix A descriptor(descrA) and properties
    status = cusparseCreateMatDescr(descrA)
    status = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)  ! must be GENERAL for using csrilu0
    status = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE)  ! base = 1

! initiialize the analysis info for A matrix (infoA = sparsity pattern of A)
    status = cusparseCreateSolveAnalysisInfo(infoA)

! initiialize the L U preconditioner Info
    status = cusparseCreateSolveAnalysisInfo(infoL)
    status = cusparseCreateSolveAnalysisInfo(infoU)

! Setup for Lower Matrix
    status = cusparseCreateMatDescr (descrL)
    status = cusparseSetMatType     (descrL, CUSPARSE_MATRIX_TYPE_GENERAL)
    status = cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ONE)
    status = cusparseSetMatFillMode (descrL, CUSPARSE_FILL_MODE_LOWER)
    status = cusparseSetMatDiagType (descrL, CUSPARSE_DIAG_TYPE_UNIT)

! Setup for Upper Matrix
    status = cusparseCreateMatDescr (descrU)
    status = cusparseSetMatType     (descrU, CUSPARSE_MATRIX_TYPE_GENERAL)
    status = cusparseSetMatIndexBase(descrU, CUSPARSE_INDEX_BASE_ONE)
    status = cusparseSetMatFillMode (descrU, CUSPARSE_FILL_MODE_UPPER)
    status = cusparseSetMatDiagType (descrU, CUSPARSE_DIAG_TYPE_NON_UNIT)

! copy original vals to new ILU0-vals
    valsILU0 = csrValA

!print *,"X=",X
!print *,"r_init=",r

!$acc data copy(r, X)  copyin(valsILU0, csrValA, csrRowPtrA, csrColIndA ) &
!$acc create(p, q, Lov, z, r_tlda ) create(t, s)

   ! === Initial guess x0 and residual r = b - Ax0 where r = b
   !$acc host_data use_device( csrValA, csrRowPtrA, csrColIndA, X, r, r_tlda, p )
     cusparse_status = cusparseDcsrmv(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
                       n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, X, zero, q) ! q = A x0, q is temporary
     status = cublasDaxpy(cublas_H, n, onem1, q, 1, r, 1)                                  ! r = r - A x0
     ! set as r_tlda = r and p = r
     status = cublasDcopy(cublas_H, n, r, 1, r_tlda, 1)                                    ! r_tlda = r
     status = cublasDcopy(cublas_H, n, r, 1, p, 1)                                         ! p      = r
     status = cublasDnrm2(cublas_H, n, r, 1, nrmr0)                                        ! nrmr0 = ||r||2
   !$acc end host_data

    print *,"nrmr0=",nrmr0, "tol*nrmr0=",tol*nrmr0

   !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, valsILU0)
    ! the analysis of the sparsity pattern for the Non-Transpose case (triangular linear system) infoL/descrL
    cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
                     n, nz, descrL, csrValA, csrRowPtrA, csrColIndA, infoL)

    ! the analysis phase of the solution of a sparse linear system for infoU/descrU
    cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
                     n, nz, descrU, csrValA, csrRowPtrA, csrColIndA, infoU)

    ! Incomplete-LU factorization with 0 level fill-in and no pivoting for A matrix
    ! output : valsILU0
    cusparse_status = cusparseDcsrilu0(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, descrA, valsILU0, &
                     csrRowPtrA, csrColIndA, infoL);
   !$acc end host_data

        status = end_clock
        time_tag = "The analysis phase of ILU(0)"
        status = print_time(time_tag)
        status = start_clock

    do while (  k <= max_iter )

      k = k + 1
      rho_p = rho

   !$acc host_data use_device(r_tlda, r)
      status = cublasDdot(cublas_H, n, r_tlda, 1, r, 1, rho)               ! rho = (r_tlda, r)
   !$acc end host_data

    !!if ( rho == 0.0 ) then
    !!  print *,"rho is zero, BiCG calculation condition does not hold. STOP!", k, "times"
    !!  exit
    !!end if

      if ( k > 1 ) then

    !!if ( omega == 0.0 ) then
    !!  print *,"omega is zero, BiCG calculation condition does not hold. STOP!", k, "times"
    !!  exit
    !!end if

        beta = (rho/rho_p) * (alpha/omega)
        omegam1 = -omega

      !$acc host_data use_device(q, p, r)
        cublas_status = cublasDaxpy(cublas_H, n, omegam1, q, 1, p, 1)   ! p = p -omega * q
        cublas_status = cublasDscal(cublas_H, n, beta, p, 1)            ! p = beta * p
        cublas_status = cublasDaxpy(cublas_H, n, one, r, 1, p, 1)       ! p = p + r
      !$acc end host_data

      end if

      ! Solve using valsILU0 vector from Incomplete-LU factorization, Solve  M z = p
      !$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, p, Lov, z)
      ! Forwadaard Solve, L part for infoA(the sparsity pattern of descL)
        cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, &
                                              valsILU0, csrRowPtrA, csrColIndA, infoL, p, Lov)

      ! Back Substitution (U part for infoU = descU)
        cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, &
                                              valsILU0, csrRowPtrA, csrColIndA, infoU, Lov, z)  ! solve z
      !$acc end host_data

      ! Compute q = A z
      !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, r_tlda, z, q)
        cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
                          n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, z, zero, q)   ! q = A z
        status = cublasDdot(cublas_H, n, r_tlda, 1, q, 1, temp1)        ! temp1 = (r_tlda, q)
      !$acc end host_data

        alpha   = rho / temp1                                           ! alpha = rho / (r_tlda, q)
        alpham1 = -alpha

      !$acc host_data use_device(q, z, X, r)
        cublas_status = cublasDaxpy(cublas_H, n, alpham1, q, 1, r, 1)   ! r = r - alpha * q
        cublas_status = cublasDaxpy(cublas_H, n, alpha,   z, 1, X, 1)   ! X = X + alpha * z
        cublas_status = cublasDnrm2(cublas_H, n, r, 1, nrmr)            ! nrmr = || r ||2
      !$acc end host_data

      ! check for convergence
        if ( nrmr < tol*nrmr0 ) exit

      ! solve M s = r
      !$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, r, Lov, s, t)
      ! Forward Solve, L part for infoA(the sparsity pattern of descL)
        cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, &
                                              valsILU0, csrRowPtrA, csrColIndA, infoL, r, Lov)
      ! Back Substitution (U part for infoU = descU)
        cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, &
                                              valsILU0, csrRowPtrA, csrColIndA, infoU, Lov, s)  ! solve s
      ! Compute t = A s
        cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &         ! t = A s
                          n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, s, zero, t)

        status = cublasDdot(cublas_H, n, t, 1, r, 1, temp1)             ! temp1 = (t, r)
        status = cublasDdot(cublas_H, n, t, 1, t, 1, temp2)             ! temp2 = (t, t)
      !$acc end host_data

        omega = temp1/temp2
        omegam1 = -omega

      !$acc host_data use_device(s, t, X, r)
        status = cublasDaxpy(cublas_H, n, omega, s, 1, X, 1)            ! X = X + omega * s
        status = cublasDaxpy(cublas_H, n, omegam1, t, 1, r, 1)          ! r = r - omega * t
        status = cublasDnrm2(cublas_H, n, r, 1, nrmr)                   ! nrmr = || r ||2
      !$acc end host_data

      ! check for convergence
        if ( nrmr < tol*nrmr0 ) exit

        if (idbg == 1 ) print *, "iteration=", k,  " residual =", nrmr
    end do
!$acc end data

    print *, "iteration=", k,  " residual =", nrmr
    if ( nrmr < tol*nrmr0 ) then
      print *, "Bi-Conjugate Gradient Stabilized (BiCGStab) is passed"
    else
      print *, "Bi-Conjugate Gradient Stabilized (BiCGStab) is failed"
    end if

        status = end_clock
        time_tag = "Elapased time for BiCGStab with ILU(0) :"
        status = print_time(time_tag)

! check result
    error = 0.0
    do i = 1, n
      rsum = 0.0
      do  j =  csrRowPtrA(i), csrRowPtrA(i+1)-1
         rsum = rsum + csrValA(j) * X(csrColIndA(j))
      end do
      diff = abs(rsum - rhs(i))
      if (diff > error) error = diff
    end do
    print *,"error =",error

! output result
    if (idbg == 2) then
      nprint = n
      if ( n > 10 ) nprint = 10
      do i = 1, nprint
         print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
      end do
      print *,""
      do i = n-nprint, n
         print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
      end do
    end if

    status = cublasDestroy(cublas_H)
    status = cusparseDestroy(cusparse_H)
    status = cusparseDestroyMatDescr(descrA)
    status = cusparseDestroyMatDescr(descrL)
    status = cusparseDestroyMatDescr(descrU)
    status = cusparseDestroySolveAnalysisInfo(infoA)
    status = cusparseDestroySolveAnalysisInfo(infoL)
    status = cusparseDestroySolveAnalysisInfo(infoU)
    deallocate( q, r, r_tlda, p )
    deallocate( Lov, z, valsILU0, t, s )

    end subroutine BiCGStab

End module CG_routines

program main
  use precision_m
  use CG_routines
  implicit none

  integer                :: num_mat
  integer                :: i, j

!-------------------------------------------------------------
!   A matrix : sparse CSR type matrix
!   N x N  symmetric positive definite matrix (num_mat)
!-------------------------------------------------------------

  num_mat = 16
! num_mat = 16384
  num_mat = 1048576
! num_mat = 4000000

  print '(/1x,a,d15.10,2x,a,i8/)'," tol = ", tol, "max_iteration =", max_iter

! Select matrix data

! call tridaiag_matgen    (num_mat)
  call laplace_2nd_matgen (num_mat)

! Print CSR Matrix data
!    print *, "csrValA", csrValA
!    print *, "csrRowPtrA",csrRowPtrA
!    print *, "csrColIndA", csrColIndA


! Preconditioned Conjugate Gradient using Incomplete-LU
! factorization with 0 fill-in and no pivoting

  call BiCGStab (num_mat)

end program main

　コンパイルオプションとして、-acc -Mcudalib=cusparse,cublas -Mcuda オプションを付けてコンパイルリンクする必要がある。例示しているプログラムで使用しているテスト用の係数行列は正定値対称行列である。非対称行列にも対応するため、これについては、後述する。なお、PGI 17.9 から CUDA Toolkit 9.0 ルーチンがコンパイラシステムに統合されたため、-ta=cuda9.0 を指定した。この指定は、cuda7.5 or cuda8.0 でも構わない。

コンパイル＆実行

$ pgf90 main.f90 -Mcudalib=cublas,cusparse -acc -ta=tesla,cc60,cuda9.0,cc35 -Minfo=accel -Mcuda -O2
cg:
    255, Generating create(q(:))
         Generating copy(r(:),x(:))
         Generating copyin(csrrowptra(:),csrcolinda(:),csrvala(:))
         Generating create(p(:))
iclu_cg:
    457, Generating create(q(:))
         Generating copyin(valsilu0(:))
         Generating create(r_p(:))
         Generating copyin(csrrowptra(:),csrcolinda(:))
         Generating create(lov(:),p(:))
         Generating copyin(csrvala(:))
         Generating copy(x(:),r(:))
         Generating create(z_p(:),z(:))
iccl_cg:
    691, Generating create(q(:))
         Generating copyin(valscl(:))
         Generating create(r_p(:))
         Generating copyin(csrrowptra(:),csrcolinda(:))
         Generating create(lov(:),p(:))
         Generating copyin(csrvala(:))
         Generating copy(x(:),r(:))
         Generating create(z_p(:),z(:))
[kato@photon32 CG]$ a.out

  tol = .1000000000D-12  max_iteration =   10000

  Matrix size =    1048576  x    1048576
  Check a number of non-zero element
  idx =      5238785 nz =      5238784

 ----------------------------------------------
  Bi-Conjugate Gradient Stabilized (BiCGStab)
 ----------------------------------------------
 nrmr0=    32.00000000000000      tol*nrmr0=   3.2000000000000001E-012

 The analysis phase of ILU(0)                   0.625 (sec)

 iteration=          981  residual =   3.0842327270017723E-012
 Bi-Conjugate Gradient Stabilized (BiCGStab) is passed

 Elapased time for BiCGStab with ILU(0) :            39.010 (sec)

 error =   2.6423307986078726E-014
 x(      1)=0.499998958505E+00
 x(      2)=0.697650643317E+00
 x(      3)=0.790607780743E+00
 x(      4)=0.841500545499E+00
 x(      5)=0.872868905925E+00
 x(      6)=0.893963985576E+00
 x(      7)=0.909079245320E+00
 x(      8)=0.920430160460E+00
 x(      9)=0.929263761574E+00
 x(     10)=0.936332649288E+00

 x(1048566)=0.114542005404E-04
 x(1048567)=0.104132673427E-04
 x(1048568)=0.937223193541E-05
 x(1048569)=0.833110453556E-05
 x(1048570)=0.728989536154E-05
 x(1048571)=0.624861463278E-05
 x(1048572)=0.520727256966E-05
 x(1048573)=0.416587939333E-05
 x(1048574)=0.312444532565E-05
 x(1048575)=0.208298058897E-05
 x(1048576)=0.104149540607E-05

OpenACC ディレクティブによるプログラミング

12 - 2 章　cuSPARSE/OpenACC を利用した線形方程式の反復解法プログラム

1. cuSPARSE を使用した Conjugate Gradient(CG) 共役勾配法プログラム

Conjugate Gradient(CG) using cuSPARSE/cuBLAS (OpenACC Fortran)

Preconditioned Iterative Methods using OpenACC, cuSPARSE, cuBLAS

コンパイル＆実行

2. cuSPARSE を使用した Bi-Conjugate Gradient Stabilized（BiCGStab）

コンパイル＆実行

OpenACC ディレクティブによるプログラミング

12 - 2 章 cuSPARSE/OpenACC を利用した線形方程式の反復解法プログラム

1. cuSPARSE を使用した Conjugate Gradient(CG) 共役勾配法プログラム

Conjugate Gradient(CG) using cuSPARSE/cuBLAS (OpenACC Fortran)

Preconditioned Iterative Methods using OpenACC, cuSPARSE, cuBLAS

コンパイル＆実行

2. cuSPARSE を使用した Bi-Conjugate Gradient Stabilized（BiCGStab）

コンパイル＆実行

12 - 2 章　cuSPARSE/OpenACC を利用した線形方程式の反復解法プログラム