Programming with OpenACC Directives

Chapter 12-4: Using a Linear Solver for Sparse Matrices (Direct Method, cuSOLVER Library)

1. Sparse LAPACK Routines (cuSOLVER SP) with an OpenACC Program Example

 In the previous chapter, 12-3, we showed how to solve a sparse linear system with an iterative method, using a Bi-Conjugate Gradient Stabilized (BiCGStab) program built on OpenACC + cuSPARSE. This chapter presents an example program that solves a sparse linear system with a direct method, i.e., the sparse-matrix counterpart of the LAPACK routines. This is achieved by calling functions from the cuSOLVER SP library. Because OpenACC is used, computation on a GPU accelerator becomes possible with very little effort. A Fortran program is used as the example here. A tar archive containing the source files and a Makefile is provided; please download the following file.

 Program tar file: openacc_cuSOLVER-SP.tar.gz

 The following results were obtained with the PGI 17.10 compiler. To check which version you are using, specify the -V command-line option as follows.

$  pgfortran -V
pgfortran 17.10-0 64-bit target on x86-64 Linux -tp haswell
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

 The GPU used is as follows.

$ pgaccelinfo

CUDA Driver Version:           9000
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  384.59  Wed Jul 19 23:53:34 PDT 2017

Device Number:                 0
Device Name:                   GeForce GTX 1080 Ti
Device Revision Number:        6.1
Global Memory Size:            11714691072
Number of Multiprocessors:     28
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1657 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             5505 MHz
Memory Bus Width:              352 bits
L2 Cache Size:                 2883584 bytes
Max Threads Per SMP:           2048
Async Engines:                 2
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc60

Converting from COO format to CSR format

 For an overview of the program that converts COO format to CSR format, please refer to the explanation in the previous chapter.

 One caution when using the cuSOLVER SP routines: when a symmetric matrix is stored with only its upper-triangular elements as data, it must be expanded so that the lower-triangular elements are stored as well before it is passed to a cuSOLVER SP routine. The mm_convert_csr.f90 program therefore expands symmetric matrices carrying the Matrix Market "symmetric" attribute into "general" matrices. Inside the program, the flag variable extendSymMatrix controls this expansion; setting it to 1 enables it.
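The effect of the extendSymMatrix = 1 expansion can be sketched as follows. This is a minimal illustration in Python (the actual conversion in mm_convert_csr.f90 is written in Fortran); it mirrors every off-diagonal upper-triangular COO entry into its lower-triangular counterpart.

```python
# Sketch of the extendSymMatrix = 1 expansion: a symmetric matrix stored as
# upper-triangular COO entries (1-based row/col/value triples) is mirrored
# into a full "general" pattern before being passed to cuSOLVER SP.
def extend_symmetric_coo(rows, cols, vals):
    full_r, full_c, full_v = [], [], []
    for r, c, v in zip(rows, cols, vals):
        full_r.append(r); full_c.append(c); full_v.append(v)
        if r != c:                      # mirror off-diagonal entries only
            full_r.append(c); full_c.append(r); full_v.append(v)
    return full_r, full_c, full_v

# Upper triangle of [[4, 1], [1, 3]]: the off-diagonal entry (1,2) gains
# its mirror (2,1), so the number of stored nonzeros grows from 3 to 4.
r, c, v = extend_symmetric_coo([1, 1, 2], [1, 2, 2], [4.0, 1.0, 3.0])
```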

 For details of the sparse-matrix CSR format required as input, refer to the NVIDIA cuSPARSE documentation. The input data arrays needed to represent the coefficient matrix in CSR format are as follows.

nz         integer  Number of nonzero elements in the matrix.
csrValA    array    Data array of length nz holding all nonzero values of A in row-major order.
csrRowPtrA array    Integer array of length m + 1 holding indices into the arrays csrColIndA and csrValA. The first m entries contain, for i = 1, ..., m, the index of the first nonzero element of row i, and the last entry contains nz + csrRowPtrA(0). In general, csrRowPtrA(0) reflects the index base, i.e., it is 0 or 1 depending on whether indexing starts at 0 or 1. An index base of 1 is used here.
csrColIndA array    Integer array of length nz containing the column index of the corresponding element of csrValA.
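A small worked example may help. The following pure-Python sketch (illustrative only; the provided program does this in Fortran) builds the three 1-based CSR arrays from a 3x3 sparse matrix.

```python
# Build 1-based CSR arrays from a dense matrix:
#     | 1 0 2 |
#     | 0 3 0 |
#     | 4 0 5 |
def dense_to_csr(a, index_base=1):
    csr_val, csr_col, csr_rowptr = [], [], [index_base]
    for row in a:
        for j, v in enumerate(row):
            if v != 0.0:
                csr_val.append(v)            # nonzeros in row-major order
                csr_col.append(j + index_base)
        csr_rowptr.append(len(csr_val) + index_base)
    return csr_val, csr_col, csr_rowptr

val, col, rowptr = dense_to_csr([[1.0, 0.0, 2.0],
                                 [0.0, 3.0, 0.0],
                                 [4.0, 0.0, 5.0]])
# val    = [1.0, 2.0, 3.0, 4.0, 5.0]   (csrValA: all nonzeros, row by row)
# col    = [1, 3, 2, 1, 3]             (csrColIndA: 1-based column of each value)
# rowptr = [1, 3, 4, 6]                (csrRowPtrA: m + 1 entries; last = nz + 1)
```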

Overview of the cuSOLVER SP routines

 Technical documentation for the NVIDIA cuSOLVER SP routines can be found on this page. cuSolverSP provides a new set of sparse direct-solver routines based on sparse QR factorization, and additionally provides solvers based on Cholesky factorization and on LU factorization. The program presented here is written so that all three of these solvers can be run. In numerical simulation, not every matrix has a sparsity pattern whose factorization parallelizes well, so the cuSolverSP library also provides routines that run on the CPU to handle such essentially sequential matrices (note that the LU-factorization routine exists only in a CPU version; as of CUDA 9.0, no GPU version is provided). For matrices with high parallelism, the GPU routines can deliver higher performance. Except for LU factorization, the CPU versions of the Cholesky and QR routines are parallelized with OpenMP and can run in parallel on multiple cores.

 The cuSOLVER SP routines include a matrix-reordering capability, controlled through a function argument. To reduce the fill-in that strongly affects factorization performance, two reordering schemes are provided: symrcm and symamd. The input parameter reorder enables symrcm (symmetric reverse Cuthill-McKee permutation) when reorder is 1, and symamd (symmetric approximate minimum degree ordering based on the quotient graph) when it is 2; otherwise, no reordering is performed.
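To see why reordering matters, the following toy Python sketch implements a simplified reverse Cuthill-McKee ordering (a stand-in for illustration only; the library's symrcm implementation differs) and shows how it shrinks the bandwidth of a small symmetric pattern, which in turn limits fill-in during factorization.

```python
from collections import deque

# Simplified reverse Cuthill-McKee ordering: BFS from a low-degree vertex,
# visiting neighbors in order of increasing degree, then reverse the order.
def rcm_order(adj):
    n = len(adj)
    visited = [False] * n
    order = []
    for start in sorted(range(n), key=lambda v: len(adj[v])):
        if visited[start]:
            continue
        visited[start] = True
        queue = deque([start])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in sorted(adj[v], key=lambda u: len(adj[u])):
                if not visited[w]:
                    visited[w] = True
                    queue.append(w)
    return order[::-1]            # the "reverse" in reverse Cuthill-McKee

def bandwidth(edges, perm):
    pos = {v: i for i, v in enumerate(perm)}
    return max(abs(pos[i] - pos[j]) for i, j in edges)

# A path graph with badly scattered labels: bandwidth 7 in the given order.
edges = [(0, 7), (7, 3), (3, 5), (5, 1), (1, 6), (6, 2), (2, 4)]
adj = [[] for _ in range(8)]
for i, j in edges:
    adj[i].append(j); adj[j].append(i)

perm = rcm_order(adj)
# bandwidth(edges, list(range(8))) == 7 ; bandwidth(edges, perm) == 1
```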

 If the matrix turns out to be singular during factorization, the routine reports the smallest row index at which this occurred and issues a WARNING; the computed result is then invalid.
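The meaning of this singularity report can be illustrated with a toy dense elimination in Python (purely illustrative; this is not cuSOLVER's algorithm): factorization stops at the first row whose pivot magnitude falls below the tolerance and reports that row index, or -1 if the matrix is invertible under the tolerance.

```python
# Toy illustration of the "singularity" output: Gaussian elimination
# without pivoting; report the first (0-based) row whose pivot magnitude
# drops below tol, or -1 if the matrix passes the tolerance check.
def find_singularity(a, tol):
    n = len(a)
    a = [row[:] for row in a]       # work on a copy
    for k in range(n):
        if abs(a[k][k]) <= tol:
            return k                # matrix judged singular at this row
        for i in range(k + 1, n):
            f = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= f * a[k][j]
    return -1                       # invertible under tol

# A well-conditioned matrix passes; a rank-deficient one (second row is a
# multiple of the first) loses its pivot at row 1.
assert find_singularity([[2.0, 1.0], [1.0, 2.0]], 1e-12) == -1
assert find_singularity([[1.0, 2.0], [2.0, 4.0]], 1e-12) == 1
```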

 The sparse linear-equation solver routines used here are summarized below. The sp_solvers.f90 routine in the provided program serves as the driver routine.

solver No.  Solver                  cuSOLVER function name       CPU version                            GPU version
1           Cholesky factorization  cusolverSpDcsrlsvchol[Host]  Yes (threads set via OMP_NUM_THREADS)  Yes
2           LU factorization        cusolverSpDcsrlsvlu[Host]    Yes (single-threaded only)             No
3           QR factorization        cusolverSpDcsrlsvqr[Host]    Yes (threads set via OMP_NUM_THREADS)  Yes

Reordering (reorder argument, common to all solvers): no reordering = 0, symrcm reordering = 1, symamd reordering = 2

The main program and the interface for using the cuSOLVER SP routines

 As of PGI 17.10, no Fortran interface module is provided for the cuSOLVER SP (sparse) library, which is developed in CUDA C/C++, so the one below (cusolverSP_mod.cuf) was written by hand. It is the interface that mediates the C-to-Fortran binding. If a cuSOLVER SP Fortran MODULE (presumably usable via "use cusolverSP") is eventually shipped with the PGI compiler system, the cusolverSP_mod.cuf routine below will no longer be needed.

cusolverSP_mod.cuf

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
module cuSOLVER_SP_Solver
  use iso_c_binding
  use cudafor
  use cusparse
  use cusolver_common
  implicit none

! enum, bind(C) ! cusolverStatus_t
!   enumerator :: CUSOLVER_STATUS_SUCCESS=0
!   enumerator :: CUSOLVER_STATUS_NOT_INITIALIZED=1
!   enumerator :: CUSOLVER_STATUS_ALLOC_FAILED=2
!   enumerator :: CUSOLVER_STATUS_INVALID_VALUE=3
!   enumerator :: CUSOLVER_STATUS_ARCH_MISMATCH=4
!   enumerator :: CUSOLVER_STATUS_MAPPING_ERROR=5
!   enumerator :: CUSOLVER_STATUS_EXECUTION_FAILED=6
!   enumerator :: CUSOLVER_STATUS_INTERNAL_ERROR=7
!   enumerator :: CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED=8
!   enumerator :: CUSOLVER_STATUS_NOT_SUPPORTED = 9
!   enumerator :: CUSOLVER_STATUS_ZERO_PIVOT=10
!   enumerator :: CUSOLVER_STATUS_INVALID_LICENSE=11
! end enum

  type cusolverSpHandle
    type(c_ptr) :: handle
  end type cusolverSpHandle

! cusolverSpCreate
  interface
     integer(c_int) function cusolverSpCreate(handle) bind(C,name='cusolverSpCreate')
       import cusolverSpHandle
       type(cusolverSpHandle) :: handle
     end function cusolverSpCreate
  end interface
! cusolverSpDestroy
  interface
     integer(c_int) function cusolverSpDestroy(handle) bind(C,name='cusolverSpDestroy')
       import cusolverSpHandle
       type(cusolverSpHandle), value :: handle
     end function cusolverSpDestroy
  end interface

! cusolverSpXcsrissymHost
  interface
     integer(c_int) function cusolverSpXcsrissymHost(handle, m, nnzA, descrA, &
                    csrRowPtrA, csrEndPtrA, csrColIndA, issym) bind(C,name='cusolverSpXcsrissymHost')
       use iso_c_binding
       import cusolverSpHandle, cusparseMatDescr
       type(cusparseMatDescr), value :: descrA
       type(cusolverSpHandle), value :: handle
       integer(c_int), value  :: m, nnzA
       integer(c_int)         :: issym
       integer(c_int)         :: csrRowPtrA(*), csrEndPtrA(*), csrColIndA(*)
     end function cusolverSpXcsrissymHost
  end interface

! cusolverSpXcsrsymrcmHost
  interface
     integer(c_int) function cusolverSpXcsrsymrcmHost(handle, m, nnzA, descrA, &
                    csrRowPtrA, csrColIndA, q) bind(C,name='cusolverSpXcsrsymrcmHost')
       use iso_c_binding
       import cusolverSpHandle, cusparseMatDescr
       type(cusparseMatDescr), value :: descrA
       type(cusolverSpHandle), value :: handle
       integer(c_int), value  :: m, nnzA
       integer(c_int)         :: q(*)
       integer(c_int)         :: csrRowPtrA(*), csrColIndA(*)
     end function cusolverSpXcsrsymrcmHost
  end interface

! cusolverSpXcsrsymmdqHost
  interface
     integer(c_int) function cusolverSpXcsrsymmdqHost(handle, m, nnzA, descrA, &
                    csrRowPtrA, csrColIndA, q) bind(C,name='cusolverSpXcsrsymmdqHost')
       use iso_c_binding
       import cusolverSpHandle, cusparseMatDescr
       type(cusparseMatDescr), value :: descrA
       type(cusolverSpHandle), value :: handle
       integer(c_int), value         :: m, nnzA
       integer(c_int)                :: q(*)
       integer(c_int)                :: csrRowPtrA(*), csrColIndA(*)
     end function cusolverSpXcsrsymmdqHost
  end interface

! cusolverSpXcsrperm_bufferSizeHost
  interface
     integer(c_int) function cusolverSpXcsrperm_bufferSizeHost(handle, m, n, nnzA, &
                    descrA, csrRowPtrA, csrColIndA, p, q, bufferSizeInBytes ) &
                    bind(C,name="cusolverSpXcsrperm_bufferSizeHost")
       use iso_c_binding
       import cusolverSpHandle, cusparseMatDescr
       type(cusolverSpHandle), value :: handle
       type(cusparseMatDescr), value :: descrA
       integer(c_int), value         :: m, n, nnzA
       integer(c_int)                :: p(*), q(*)
       integer(c_int)                :: csrRowPtrA(*), csrColIndA(*)
       integer(c_long)                  :: bufferSizeInBytes
     end function cusolverSpXcsrperm_bufferSizeHost
  end interface

! cusolverSpXcsrpermHost
  interface
    integer(c_int) function cusolverSpXcsrpermHost(handle, m, n, nnzA, &
                   descrA, csrRowPtrA, csrColIndA, p, q, map, buffer ) &
                   bind(C,name="cusolverSpXcsrpermHost")
      use iso_c_binding
      import cusolverSpHandle, cusparseMatDescr
      type(cusolverSpHandle), value :: handle
      type(cusparseMatDescr), value :: descrA
      integer(c_int), value         :: m, n, nnzA
      integer(c_int)                :: p(*), q(*)
      integer(c_int)                :: csrRowPtrA(*), csrColIndA(*)
      integer(c_int)                :: map(*)
      character(c_char)             :: buffer(*)
    end function cusolverSpXcsrpermHost
  end interface

! cusolverSpDcsrlsvcholHost
  interface
        integer(c_int) function cusolverSpDcsrlsvcholHost( handle, m, nnzA, &
                       descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) &
                       bind(C,name="cusolverSpDcsrlsvcholHost")
          use iso_c_binding
          import cusolverSpHandle, cusparseMatDescr
          type(cusolverSpHandle), value :: handle
          type(cusparseMatDescr), value :: descrA
          integer(c_int), value         :: m, nnzA
          real(c_double), intent(in)    :: csrValA(*)
          integer(c_int), intent(in)    :: csrRowPtrA(*), csrColIndA(*)
          real(c_double), intent(in)    :: b(*)
          real(c_double)                :: x(*)
          real(c_double), value         :: tol
          integer(c_int), value         :: reorder
          integer(c_int)                :: singularity
          end function cusolverSpDcsrlsvcholHost
        end interface

! cusolverSpDcsrlsvluHost
  interface
        integer(c_int) function cusolverSpDcsrlsvluHost( handle, m, nnzA, &
                       descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) &
                       bind(C,name="cusolverSpDcsrlsvluHost")
          use iso_c_binding
          import cusolverSpHandle, cusparseMatDescr
          type(cusolverSpHandle), value :: handle
          type(cusparseMatDescr), value, intent(in) :: descrA
          integer(c_int), value         :: m, nnzA
          real(c_double), intent(in)    :: csrValA(*)
          integer(c_int), intent(in)    :: csrRowPtrA(*), csrColIndA(*)
          real(c_double), intent(in)    :: b(*)
          real(c_double)                :: x(*)
          real(c_double), value         :: tol
          integer(c_int), value         :: reorder
          integer(c_int)                :: singularity
    end function cusolverSpDcsrlsvluHost
  end interface

! cusolverSpDcsrlsvqrHost
  interface
    integer(c_int) function cusolverSpDcsrlsvqrHost( handle, m, nnzA, &
                   descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) &
                   bind(C,name="cusolverSpDcsrlsvqrHost")
      use iso_c_binding
      import cusolverSpHandle, cusparseMatDescr
      type(cusolverSpHandle), value :: handle
      type(cusparseMatDescr), value, intent(in) :: descrA
      integer(c_int), value         :: m, nnzA
      real(c_double), intent(in)    :: csrValA(*)
      integer(c_int), intent(in)    :: csrRowPtrA(*), csrColIndA(*)
      real(c_double), intent(in)    :: b(*)
      real(c_double)                :: x(*)
      real(c_double), value         :: tol
      integer(c_int), value         :: reorder
      integer(c_int)                :: singularity
    end function cusolverSpDcsrlsvqrHost
  end interface

! cusolverSpDcsrlsvchol
  interface
        integer(c_int) function cusolverSpDcsrlsvchol( handle, m, nnzA, &
                       descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) &
                       bind(C,name="cusolverSpDcsrlsvchol")
          use iso_c_binding
          import cusolverSpHandle, cusparseMatDescr
          type(cusolverSpHandle), value       :: handle
          type(cusparseMatDescr), value       :: descrA
          integer(c_int), value               :: m, nnzA
          real(c_double), device, intent(in)  :: csrValA(*)
          integer(c_int), device, intent(in)  :: csrRowPtrA(*), csrColIndA(*)
          real(c_double), device, intent(in)  :: b(*)
          real(c_double), device              :: x(*)
          real(c_double), value               :: tol
          integer(c_int), value               :: reorder
          integer(c_int)                      :: singularity
        end function cusolverSpDcsrlsvchol
  end interface

! cusolverSpDcsrlsvqr
  interface
        integer(c_int) function cusolverSpDcsrlsvqr( handle, m, nnzA, &
                       descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) &
                       bind(C,name="cusolverSpDcsrlsvqr")
          use iso_c_binding
          import cusolverSpHandle, cusparseMatDescr
          type(cusolverSpHandle), value       :: handle
          type(cusparseMatDescr), value       :: descrA
          integer(c_int), value               :: m, nnzA
          real(c_double), device, intent(in)  :: csrValA(*)
          integer(c_int), device, intent(in)  :: csrRowPtrA(*), csrColIndA(*)
          real(c_double), device, intent(in)  :: b(*)
          real(c_double), device              :: x(*)
          real(c_double), value               :: tol
          integer(c_int), value               :: reorder
          integer(c_int)                      :: singularity
        end function cusolverSpDcsrlsvqr
  end interface

end module cuSOLVER_SP_Solver

 The main program is shown below. This program does not support complex coefficient matrices; only double-precision real data can be handled. First, specify the name of a Matrix Market file; next, if a right-hand-side (rhs) data file is available, specify it as well. If none exists, the rhs vector is set to all 1.0. After the input data are read, the coo2csr_convert routine is called to convert the COO-format data to CSR format. The conversion produces the csrRowPtrA, csrColIndA, and csrValA arrays, and this CSR-format matrix is passed as actual arguments to the cusolverSPs driver routine, which starts the computation. Independently of this program, if a user program can prepare its own csrRowPtrA, csrColIndA, and csrValA arrays, it can solve linear systems by calling the cusolverSPs driver routine.

main.f90

program main
  use precision
  use Matrix_data
  use mm_coo_cuda_csr_convert
  use SP_solvers
  implicit none

    integer                  :: i, j, istat
    integer,parameter        :: iunit1 = 10       ! for Matrix Market data (COO)
    integer,parameter        :: iunit2 = 11       ! for RHS, if available
    character*80             :: mm_file, rhs_file
    integer                  :: extendSymMatrix   ! 1 = symmetrize the pattern (transform from triangular to full)

    integer                  :: gpuonly           ! 0 = both CPU and GPU, 1 = GPU only
    integer                  :: solver            ! 1 = cholesky, 2 = LU, 3 = QR
    integer                  :: reorder           ! Reordering ; 0 = no, 1 = symrcm, 2 = symamd reordering
!   real(fp_kind), parameter :: tol = 0.0d0       ! tolerance to decide if singular or not
    real(fp_kind), parameter :: tol = 1.0d-12     ! tolerance to decide if singular or not
    integer, parameter       :: index_base = 1    ! CSR format -- one based index

! Kind of Solver, a scheme of reordering

    print *, "0 = CPU and GPU, 1 = GPU only?"
    read (5, '(i1)')  gpuonly
    print *, "1 = cholesky, 2 = LU, 3 = QR  ; What solver?"
    read (5, '(i1)')  solver
    if ( .not. (solver == 1 .or. solver == 2 .or. solver == 3) ) solver = 1

    print *, "0 = No, schemes 1 = symrcm, 2 = symamd ; What kind of reordering?"
    read (5, '(i1)')  reorder

    print '(/1x,a)', "================================================="
    if ( solver == 1) print *, "Sparse matrix linear solver schem : Cholesky"
    if ( solver == 2) print *, "Sparse matrix linear solver schem : LU"
    if ( solver == 3) print *, "Sparse matrix linear solver schem : QR"

    if ( reorder == 1 ) then
      print *, "Reordering scheme of matrix : symrcm"
    else if ( reorder == 2 ) then
      print *, "Reordering scheme of matrix : symamd"
    else
      reorder = 0
      print *, "No reordering of matrix"
    end if
    print '(1x,a/)', "================================================="

! Read matrix Market data file

    extendSymMatrix = 1        !  if =1 .and. symmetric matrix=yes , transform from triangular to full

    print *, "File name for Matrix Market?"
       read (5, '(a)')  mm_file
    !                                mm_file ="lap2D_5pt_n100.mtx"
    print '(/1x,a,a)', "Matrix Market file name = ",mm_file
    open (iunit1, file=mm_file, status='old')
    print *
    call mm_matrix_read(iunit1, extendSymMatrix)

! Read right hand side vectors, if available.

    print *, "File name for RHS vector? If not available, press ENTER and RHS is set to 1.0."
    print *
    read (5, '(a)')  rhs_file
    if (rhs_file == "") then
       allocate (rhs(rows))
       rhs= 1.0_fp_kind
    else
      print *, "rhs_file name = ",rhs_file
      open (iunit2, file=rhs_file, status='old')
      call mm_rhs_read(iunit2)
    end if

! Convert COO to CSR sparse matrix
!   INPUT   : Matrix COO data from Matrix Market format via mm_matrix_read routine
!   OUTPUT  : CSR data = csrRowPtrA(m+1), csrColIndA(nz), csrValA(nz) -- global data

    call coo2csr_convert

! CSR sparse matrix data are set. Below are the arguments for the cusolverSPs routine.
! Arrays for CUDA cuSPARSE CSR format
!   integer       :: m                 IN    ! rows size for square matrix
!   integer       :: n                 IN    ! columns size for square matrix (n=m)
!   integer       :: nz                IN    ! nonzero element of A
!   integer       :: csrRowPtrA (m+1)  IN    ! the array of offsets corresponding to the start of each row
!   integer       :: csrColIndA (nz)   IN    ! sorted by row and by column within each row
!   real(fp_kind) :: csrValA    (nz)   IN    ! sorted by row and by column within each row
!   real(fp_kind) :: x(m)              INOUT ! initial guess & result
!   real(fp_kind) :: rhs(m)            IN    ! right hand side

! Print CSR Matrix data
!    print *, "csrValA", csrValA
!    print *, "csrRowPtrA",csrRowPtrA
!    print *, "csrColIndA", csrColIndA


! Initialize X
    allocate (x(n))
    x = 0.0_fp_kind      ! or Initial guess X is set

! cuSOLVER SP (Cholesky, LU, QR)

    call cusolverSPs (gpuonly, solver, reorder, m, n, nz, csrRowPtrA, csrColIndA, csrValA, x, rhs, &
                      tol, index_base)

end program main

sp_solvers.f90

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!

module SP_solvers
    use precision
    implicit none

  contains

    real(fp_kind) function vec_max(n, x)
    implicit none
    integer       :: i, n
    real(fp_kind) :: x(n)
      vec_max = 0.0_fp_kind
      do i = 1, n
      vec_max = max( abs(x(i)), vec_max )
      end do
    end function

    real(fp_kind) function csr_mat_max(m, n, nz, csrValA, csrRowPtrA, csrColIndA)
    implicit none
    integer       :: m, n, nz
    integer       :: csrRowPtrA(m+1), csrColIndA(nz)
    real(fp_kind) :: csrValA(nz)
    real(fp_kind) :: sum, norminf
    integer       :: i, j, start, end
    norminf = 0.0_fp_kind

    do i = 1, m
      sum = 0.0_fp_kind
      start = csrRowPtrA(i  )
      end   = csrRowPtrA(i+1) - 1
        do j = start, end
          sum = sum + abs(csrValA(j))
        end do
      norminf = max (norminf, sum)
    end do
    csr_mat_max = norminf
    end function




!=============================================================
    subroutine cusolverSPs(gpuonly, solver, reorder, m, n, nz, csrRowPtrA, csrColIndA, csrValA, x, rhs, &
                          tol, index_base)
!-------------------------------------------------------------
!   Sparse direct solvers (Cholesky / LU / QR) from cuSOLVER SP
!   A is an n x n sparse matrix defined in CSR storage format
!-------------------------------------------------------------
    use measure_time
    use cusparse
    use cuSOLVER_SP_Solver
    implicit none

! Arrays for CUDA cuSPARSE CSR format
    integer       :: m                ! number of rows of matrix
    integer       :: n                ! number of columns of matrix
    integer       :: nz               ! nonzero elements of A
    integer       :: csrRowPtrA(m+1)  ! the array of offsets corresponding to the start of each row
    integer       :: csrColIndA (nz)  ! sorted by row and by column within each row
    real(fp_kind) :: csrValA    (nz)  ! sorted by row and by column within each row
                                      ! B = Q*A*Q^T
    real(fp_kind) :: x  (n)           ! initial guess & result
    real(fp_kind) :: rhs(n)           ! right hand side
    real(fp_kind) :: b  (n)           ! a copy of rhs
    character(c_char), allocatable :: buffer(:)
    integer       :: index_base

    integer       :: gpuonly          ! 0 = both CPU and GPU, 1 = GPU only
    integer       :: solver           ! 1 = cholesky, 2 = LU, 3 = QR
    integer       :: reorder          ! reordering ; 0 = no, 1 = symrcm, 2 = symamd reordering
    integer       :: singularity      ! -1 if A is invertible under tol
    integer       :: issym
    real(fp_kind) :: rsum, diff, error
    real(fp_kind) :: x_inf, r_inf, A_inf

    real(fp_kind) :: tol              !  to decide if singular or not

! Local variables
    type(cusolverSpHandle)   :: cusolver_H
    type(cusparseHandle)     :: cusparse_H
    type(cusparseMatDescr)   :: descrA
    integer                  :: cusolver_status
    integer                  :: status, ierr_code
    real(fp_kind),parameter::  one = 1.0, zero = 0.0, onem1 = -1.0

    integer                  :: i, j, k
    integer                  :: nprint
! Parameters
    integer, parameter       :: idbg = 0
    real(fp_kind),parameter  :: eps = 6.9d-310     ! tiny threshold (not the true double-precision machine epsilon)

! Initialize
    singularity = 0
    issym   = 0

    if (m /= n) then
      print *,"Error: only support square matrix."
      stop
    end if
!
    print '(/1x,a)',"=================================================================="
    if (solver == 1) then
      print '(1x,a)', "The linear system is solved by sparse CHOLESKY factorization."
    else if (solver == 2) then
      print '(1x,a)', "The linear system is solved by sparse LU factorization."
    else if (solver == 3) then
      print '(1x,a)', "The linear system is solved by sparse QR factorization."
    end if
    if (reorder == 0) then
      print '(1x,a)', "Reordering scheme : No reordering"
    else if (reorder == 1) then
      print '(1x,a)', "Reordering scheme : symrcm"
    else if (reorder == 2) then
      print '(1x,a)', "Reordering scheme : symamd"
    end if
    print '(/1x,a,d15.10/)',"Tolerance to decide if singular or not = ", tol

!   set b = rhs, where Ax = rhs (Ax = b)
    b(:) = rhs(:)

    k = 0

! Create cusolver/cusparse handle, and matrix descriptor

    status = cusolverSpCreate(cusolver_H)
    status = cusparseCreate(cusparse_H)
    status = cusparseCreateMatDescr(descrA)
    status = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)
    status = cusparseSetMatDiagType(descrA, CUSPARSE_DIAG_TYPE_NON_UNIT)
    status = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE) ! base=1

! Verify if A has symmetric pattern or not
    status = cusolverSpXcsrissymHost( &
        cusolver_H, m, nz, descrA, csrRowPtrA(1), csrRowPtrA(2), csrColIndA, issym)
    if (issym == 0) then
      print *, "A matrix has no symmetric pattern."
    else
      print *, "A matrix has symmetric pattern."
    end if
    print '(1x,a)',"=================================================================="
    if(idbg .eq. 1) print *, "cusolverSpXcsrissymHost-status=", status

! Clock init
    status = init_clock
    if (gpuonly == 1 ) goto 100

! Clock start
       status = start_clock

    if ( solver == 1 ) then                                                 ! cholesky
      status = cusolverSpDcsrlsvcholHost(cusolver_H, m, nz, &
               descrA, csrValA, csrRowPtrA, csrColIndA, &
               b, tol, reorder, x, singularity)
      if(idbg .ge. 1)  then
        print *
        print *,"singularity = ", singularity
        print *,"cusolverSpDcsrlsvcholHost-status = ", status
      end if

    else if ( solver == 2 ) then                                            ! LU
      status = cusolverSpDcsrlsvluHost(cusolver_H, m, nz, &
               descrA, csrValA, csrRowPtrA, csrColIndA, &
               b, tol, reorder, x, singularity)
      if(idbg .ge. 1)  then
        print *
        print *,"cusolverSpDcsrlsvluHost-status = ", status
        print *,"singularity = ", singularity
      end if

    else if ( solver == 3 ) then                                            ! QR
      status = cusolverSpDcsrlsvqrHost(cusolver_H, m, nz, &
               descrA, csrValA, csrRowPtrA, csrColIndA, &
               b, tol, reorder, x, singularity)
      if(idbg .eq. 1) then
        print *
        print *,"cusolverSpDcsrlsvqrHost-status = ", status
        print *,"singularity = ", singularity
      end if
    else
       print *,"Error: Solver",solver," is unknown solver"
    end if

       status = end_clock
       time_tag = "cuSOLVER SP solver result on [ CPU ]"
       status = print_time(time_tag)

!$acc data copy(b, X)  copyin( csrValA, csrRowPtrA, csrColIndA )
  !$acc host_data use_device(b, x, csrValA, csrRowPtrA, csrColIndA)
    ! evaluate residual b = b - A*x (result on CPU)
    status = cusparseDcsrmv(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
             m, n, nz, onem1, descrA, csrValA, csrRowPtrA, csrColIndA, x, one, b)
  !$acc end host_data
!$acc end data

    if ( singularity .ge. 0 ) &
    print '(/2x,a,i5,2x,a,d10.3/)', "WARNING!! : the matrix is singular at row", singularity+1, "under tolerance", tol
    r_inf = vec_max (m, b)
    x_inf = vec_max (n, x)
    A_inf =csr_mat_max(m, n, nz, csrValA, csrRowPtrA, csrColIndA)
    print '(2x,a,d20.13)',"|b - A*x| =", r_inf
    print '(2x,a,d20.13)',"      |A| =", A_inf
    print '(2x,a,d20.13)',"      |x| =", x_inf
    print '(2x,a,d20.13)',"|b - A*x|/(|A|*|x|) =", r_inf/(A_inf * x_inf)

! output result
    print *
    nprint = n
    if ( n > 10 ) nprint = 10
    do i = 1, nprint
       print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
    end do
    print *,""
    do i = n-nprint, n
       print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
    end do

!   reset b = rhs, where Ax = rhs (Ax = b)
    b(:) = rhs(:)
    x = 0.0_fp_kind  ! reset X

100 continue

! Clock start
      status = start_clock

!$acc data copy(b, X)  copyin( csrValA, csrRowPtrA, csrColIndA )
!$acc host_data use_device(b, x, csrValA, csrRowPtrA, csrColIndA)

    if ( solver == 1 ) then                                                 ! cholesky
      status = cusolverSpDcsrlsvchol(cusolver_H, m, nz, &
               descrA, csrValA, csrRowPtrA, csrColIndA, &
               b, tol, reorder, x, singularity)
      if(idbg .eq. 1)  then
        print *
        print *,"singularity = ", singularity
        print *,"cusolverSpDcsrlsvchol-status = ", status
      end if

    else if ( solver == 2 ) then                                            ! LU
      print '(/a)'," LU SP solver for GPU is not available now."
      print *,"-------------------------------------------"
      stop

    else if ( solver == 3 ) then                                            ! QR
      status = cusolverSpDcsrlsvqr(cusolver_H, m, nz, &
               descrA, csrValA, csrRowPtrA, csrColIndA, &
               b, tol, reorder, x, singularity)
      if(idbg .eq. 1) then
        print *
        print *,"cusolverSpDcsrlsvqr-status = ", status
        print *,"singularity = ", singularity
      end if
    else
       print *,"Error: Solver",solver," is unknown solver"
    end if

!$acc end host_data
!$acc end data

      status = end_clock
      time_tag = "cuSOLVER SP solver result on [ GPU ]"
      status = print_time(time_tag)

!$acc data copy(b, X)  copyin( csrValA, csrRowPtrA, csrColIndA )
  !$acc host_data use_device(b, x, csrValA, csrRowPtrA, csrColIndA)
    ! evaluate residual b = b - A*x (result on CPU)
    status = cusparseDcsrmv(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
             m, n, nz, onem1, descrA, csrValA, csrRowPtrA, csrColIndA, x, one, b)
  !$acc end host_data
!$acc end data

    if ( singularity .ge. 0 ) &
      print '(/2x,a,i5,2x,a,d10.3/)', "Warning!! : the matrix is singular at row", singularity+1, "under tolerance", tol
    r_inf = vec_max (m, b)
    x_inf = vec_max (n, x)
    A_inf =csr_mat_max(m, n, nz, csrValA, csrRowPtrA, csrColIndA)
    print '(2x,a,d20.13)',"|b - A*x| =", r_inf
    print '(2x,a,d20.13)',"      |A| =", A_inf
    print '(2x,a,d20.13)',"      |x| =", x_inf
    print '(2x,a,d20.13)',"|b - A*x|/(|A|*|x|) =", r_inf/(A_inf * x_inf)

! output result
    print *
    nprint = n
    if ( n > 10 ) nprint = 10
    do i = 1, nprint
       print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
    end do
    print *,""
    do i = n-nprint, n
       print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
    end do

    status = cusparseDestroy(cusparse_H)
    status = cusparseDestroyMatDescr(descrA)

    end subroutine cusolverSPs

End module SP_solvers

Running with an MTX file

 Compile and link according to the Makefile provided in the tar file. This Makefile is set up to use the CUDA 9.0 toolkit libraries, but CUDA 8.0 may be used instead. The cusparse and cublas libraries used are the ones bundled with the PGI compiler.

Compiling with the Makefile

$ make
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  precision.f90
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mfixed  mmio.f
mmread:
    154, Loop not vectorized/parallelized: potential early exits
    199, Loop not vectorized/parallelized: potential early exits
    203, Loop not vectorized/parallelized: potential early exits
    207, Loop not vectorized/parallelized: potential early exits
    212, Loop not vectorized/parallelized: potential early exits
    251, Loop not vectorized/parallelized: potential early exits
    255, Loop not vectorized/parallelized: potential early exits
    259, Loop not vectorized/parallelized: potential early exits
mminfo:
    455, Loop not vectorized/parallelized: potential early exits
mmwrite:
    678, Loop not vectorized/parallelized: contains call
    682, Loop not vectorized/parallelized: contains call
    686, Loop not vectorized/parallelized: contains call
    692, Loop not vectorized/parallelized: contains call
    709, Loop not vectorized/parallelized: contains call
    713, Loop not vectorized/parallelized: contains call
    718, Loop not vectorized/parallelized: contains call
lowerc:
    771, Loop not vectorized/parallelized: contains call
getwd:
    795, Loop not vectorized/parallelized: potential early exits
countwd:
    826, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mcuda  cusolverSP_mod.cuf
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  measure_time.f90
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  mm_convert_csr.f90
matrix_printout:
     70, Loop not vectorized/parallelized: contains call
     79, Loop not vectorized/parallelized: contains call
set_matrix:
    140, Unrolled inner loop 8 times
         Generated 2 prefetch instructions for the loop
    150, Loop not vectorized/parallelized: contains call
    171, Memory copy idiom, loop replaced by call to __c_mcopy4
    172, Memory copy idiom, loop replaced by call to __c_mcopy4
    175, Memory copy idiom, loop replaced by call to __c_mcopy8
    179, Memory copy idiom, loop replaced by call to __c_mcopy4
    183, Memory copy idiom, loop replaced by call to __c_mcopy16
coo2csr_convert:
    361, Memory zero idiom, loop replaced by call to __c_mzero8
         Memory zero idiom, loop replaced by call to __c_mzero4
    364, Generated an alternate version of the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
    374, Generating copy(csrvala(:),indx(:),rval(:),jndx(:),csrrowptra(:))
    392, Generating enter data copyin(p(:),pbuffer(:))
    411, Generating exit data delete(p(:),pbuffer(:))
    415, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        415, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    431, Memory copy idiom, loop replaced by call to __c_mcopy4
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  sp_solvers.f90
vec_max:
     33, Generated an alternate version of the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
csr_mat_max:
     51, Generated an alternate version of the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
cusolversps:
    139, Memory copy idiom, loop replaced by call to __c_mcopy8
    207, Generating copyin(csrvala(:))
         Generating copy(x(:))
         Generating copyin(csrrowptra(:),csrcolinda(:))
         Generating copy(b(:))
    229, Loop not vectorized/parallelized: contains call
    233, Loop not vectorized/parallelized: contains call
    238, Memory copy idiom, loop replaced by call to __c_mcopy8
    245, Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:))
         Generating copy(b(:),x(:))
    283, Generating copyin(csrvala(:))
         Generating copy(x(:))
         Generating copyin(csrrowptra(:),csrcolinda(:))
         Generating copy(b(:))
    305, Loop not vectorized/parallelized: contains call
    309, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  main.f90
main:
     66, Memory set idiom, loop replaced by call to __c_mset8
     98, Memory zero idiom, loop replaced by call to __c_mzero8
pgf90 -o  a.out precision.o mmio.o cusolverSP_mod.o measure_time.o mm_convert_csr.o sp_solvers.o main.o
 -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda -Mcudalib=cusparse,cusolver

 For the sample coefficient matrices (mtx data), see the explanation in the previous chapter. The provided tar file contains mtx-format symmetric matrix files such as lap3D_7pt_n20.mtx; please use them as test data.

 Execution starts with the make run command. You are prompted for an input mtx file, so specify the name of the mtx file you obtained. Next, if an rhs vector file exists, specify its file name; if not, press Enter to proceed with the run. The example below shows the result of applying symrcm reordering and solving by Cholesky factorization. On an Intel Haswell CPU the run takes 47 seconds with 4 threads and 160 seconds with 1 thread, whereas on the GPU it finishes in about 19 seconds even on a GeForce GTX 1080 Ti, a board with very few double-precision units.

Execution

[kato@photon32]$ export OMP_NUM_THREADS=4    (run the CPU version with 4 OpenMP threads)
[kato@photon32]$ make run
===== Program run =====
a.out
 0 = CPU and GPU, 1 = GPU only?                   Compute on CPU+GPU, or on GPU only?
0
 1 = cholesky, 2 = LU, 3 = QR  ; What solver?     Choice of solver
1
 0 = No, schemes 1 = symrcm, 2 = symamd ; What kind of reordering? Choice of reordering scheme
1

 =================================================
 Sparse matrix linear solver schem : Cholesky
 Reordering scheme of matrix : symrcm
 =================================================

 File name for Matrix Market?                     Input file name of the matrix
/home/kato/GPGPU/cuLIBs/CG/Matrix_Market/DATA/parabolic_fem/parabolic_fem.mtx

 Matrix Market file name = /home/kato/GPGPU/cuLIBs/CG/Matrix_Market/DATA/parabolic_fem/parabolic_fem.mtx

 Properties of Matrix_Market matrix
 ====================================
 representation = coordinate
 type           = real
 matrix type    = symmetric
 row (m)        =  525825
 cols(n)        =  525825
 Non-zero elm.  = 3674625
 extendSymMatrix=       1


 =======================================================================================
 NOTICE: Triangular Symmetric matrix has been transformed to full matrix for SP solvers.
 =======================================================================================

 File name for LHS vector? if no available, press ENTER and set as RHS=1.0.  
                                    ! No RHS (right-hand-side) vector file is provided, so press ENTER

 i,indx,jndx            1            1            1   0.1000003178914388
 i,indx,jndx            2            1       523522  -4.9999841054280604E-002
 i,indx,jndx            3            1       523523  -4.9999841054280604E-002
 i,indx,jndx            4            2            2   0.2000009536743164
 i,indx,jndx            5            2       523905  -4.9999841054280604E-002
 i,indx,jndx            6            2       523906   3.1789143880208332E-007
 i,indx,jndx            7            2       524290  -9.9999682108561208E-002
 i,indx,jndx            8            2       524802  -4.9999841054280604E-002
 i,indx,jndx            9            3            3   0.1000006357828777
 i,indx,jndx           10            3       525057  -4.9999841054280646E-002
 i,indx,jndx      3674615       525823       525823   0.2000009536743135
 i,indx,jndx      3674616       525824        33153  -4.9999841054280535E-002
 i,indx,jndx      3674617       525824       131841  -4.9999841054276260E-002
 i,indx,jndx      3674618       525824       328447  -9.9999682108561472E-002
 i,indx,jndx      3674619       525824       426751   3.1789143884642877E-007
 i,indx,jndx      3674620       525824       525824   0.2000009536743123
 i,indx,jndx      3674621       525825            5  -4.9999841054280604E-002
 i,indx,jndx      3674622       525825       131841  -4.9999841054280694E-002
 i,indx,jndx      3674623       525825       328448   3.1789143862453213E-007
 i,indx,jndx      3674624       525825       525313  -9.9999682108558099E-002
 i,indx,jndx      3674625       525825       525825   0.2000009536743136

 == CSR data check : Start from 1  to 10 elements ==
 csrRowPtrA=
          1          4          9         13         16         21         25
         30         37         42
 csrColIndA=
          1     523522     523523          2     523905     523906     524290
     524802          3     525057
 csrValA  =
 0.1000D+00 -.5000D-01 -.5000D-01 0.2000D+00 -.5000D-01 0.3179D-06 -.1000D+00
 -.5000D-01 0.1000D+00 -.5000D-01

 == Start from endpoint-10  to endpoint ==
 csrRowPtrA=
    3674576    3674581    3674586    3674591    3674596    3674601    3674606
    3674611    3674616    3674621    3674626
 csrColIndA=
     525823      33153     131841     328447     426751     525824          5
     131841     328448     525313     525825
 csrValA=
 0.2000D+00 -.5000D-01 -.5000D-01 -.1000D+00 0.3179D-06 0.2000D+00 -.5000D-01
 -.5000D-01 0.3179D-06 -.1000D+00 0.2000D+00

 ==================================================================
 The linear system is solved by sparse CHOLESKY factorization.
 Reordering scheme : symrcm

 Tolerance to decide if singular or not = .1000000000D-11
 A matrix has symmmetric pattern.
 ==================================================================
                                                                                    (4-thread parallel run)
 cuSOLVER SP solver result on [ CPU ]                                               47.714441 (sec) 
 ==================================================================================================
  |b - A*x| = 0.3055902197957D-09
        |A| = 0.8000012715659D+00
        |x| = 0.2638521162680D+06
  |b - A*x|/(|A|*|x|) = 0.1447732059137D-14

 x(      1)=0.263852116268E+06
 x(      2)=0.263093899026E+06
 x(      3)=0.263844899226E+06
 x(      4)=0.263852116268E+06
 x(      5)=0.263093899027E+06
 x(      6)=0.263844899225E+06
 x(      7)=0.263199487139E+06
 x(      8)=0.262666049525E+06
 x(      9)=0.263304916648E+06
 x(     10)=0.262560622769E+06

 x( 525815)=0.263094481758E+06
 x( 525816)=0.263094376198E+06
 x( 525817)=0.263094281209E+06
 x( 525818)=0.263094196778E+06
 x( 525819)=0.263094122892E+06
 x( 525820)=0.263094059539E+06
 x( 525821)=0.263094006710E+06
 x( 525822)=0.263093964396E+06
 x( 525823)=0.263093932591E+06
 x( 525824)=0.263093911290E+06
 x( 525825)=0.263093900489E+06

 cuSOLVER SP solver result on [ GPU ]                                               19.054326 (sec)
 ==================================================================================================
  |b - A*x| = 0.1527951098979D-09
        |A| = 0.8000012715659D+00
        |x| = 0.2638521162680D+06
  |b - A*x|/(|A|*|x|) = 0.7238660295685D-15

 x(      1)=0.263852116268E+06
 x(      2)=0.263093899027E+06
 x(      3)=0.263844899226E+06
 x(      4)=0.263852116268E+06
 x(      5)=0.263093899027E+06
 x(      6)=0.263844899226E+06
 x(      7)=0.263199487139E+06
 x(      8)=0.262666049525E+06
 x(      9)=0.263304916649E+06
 x(     10)=0.262560622769E+06

 x( 525815)=0.263094481758E+06
 x( 525816)=0.263094376198E+06
 x( 525817)=0.263094281209E+06
 x( 525818)=0.263094196778E+06
 x( 525819)=0.263094122892E+06
 x( 525820)=0.263094059539E+06
 x( 525821)=0.263094006710E+06
 x( 525822)=0.263093964396E+06
 x( 525823)=0.263093932591E+06
 x( 525824)=0.263093911290E+06
 x( 525825)=0.263093900490E+06

2. Program Source for Windows

 The program set shown above assumes execution on Linux. To run the same program on Windows, a version with a modified Makefile, together with the full set of source files, is provided below. Please download the following file. The difference from the Linux environment is that, in PGI 17.10, the Windows distribution does not bundle a PGI build of the cuSOLVER library, so the original cuSOLVER library installed with the NVIDIA CUDA Toolkit 9.0 (8.0) is linked instead.

 Program zip file: cuSOLVERsp_solvers.zip

 In addition, the Windows build of the cuSOLVER SP routines does not provide OpenMP multithreaded parallelism for the CPU-side solver routines; the solvers on the CPU therefore run single-threaded.
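The practical consequence for the Windows Makefile is an extra library search path on the link line; a hypothetical excerpt (the variable names are assumptions, and the path matches the link command in the build log below — adjust it to your CUDA Toolkit installation):

```makefile
# Hypothetical excerpt: on Windows, link the CUDA Toolkit's own cuSOLVER,
# since PGI 17.10 for Windows does not bundle a PGI build of cuSOLVER.
CUDALIBDIR = -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\lib\x64"
LDLIBS     = $(CUDALIBDIR) -Mcudalib=cusparse,cusolver
```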

 The following results were obtained in a PGI command prompt (terminal) of the PGI 17.10 release. To check which version you are using, specify the -V command-line option as shown below.

$  pgfortran -V

pgfortran 17.10-0 64-bit target on x86-64 Windows -tp sandybridge
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

 The GPU used is as follows.

$ pgaccelinfo

CUDA Driver Version:           9000

Device Number:                 0
Device Name:                   GeForce GTX 780
Device Revision Number:        3.5
Global Memory Size:            3221225472
Number of Multiprocessors:     12
Number of SP Cores:            2304
Number of DP Cores:            768
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    941 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             3004 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 1572864 bytes
Max Threads Per SMP:           2048
Async Engines:                 1
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc35
PGI$ make
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  precision.f90
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mfixed  mmio.f
mmread:
    154, Loop not vectorized/parallelized: potential early exits
    199, Loop not vectorized/parallelized: potential early exits
    203, Loop not vectorized/parallelized: potential early exits
    207, Loop not vectorized/parallelized: potential early exits
    212, Loop not vectorized/parallelized: potential early exits
    251, Loop not vectorized/parallelized: potential early exits
    255, Loop not vectorized/parallelized: potential early exits
    259, Loop not vectorized/parallelized: potential early exits
mminfo:
    455, Loop not vectorized/parallelized: potential early exits
mmwrite:
    678, Loop not vectorized/parallelized: contains call
    682, Loop not vectorized/parallelized: contains call
    686, Loop not vectorized/parallelized: contains call
    692, Loop not vectorized/parallelized: contains call
    709, Loop not vectorized/parallelized: contains call
    713, Loop not vectorized/parallelized: contains call
    718, Loop not vectorized/parallelized: contains call
lowerc:
    771, Loop not vectorized/parallelized: contains call
getwd:
    795, Loop not vectorized/parallelized: potential early exits
countwd:
    826, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mcuda  cusolverSP_mod.cuf
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  measure_time.f90
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  mm_convert_csr.f90
matrix_printout:
     70, Loop not vectorized/parallelized: contains call
     79, Loop not vectorized/parallelized: contains call
set_matrix:
    140, Unrolled inner loop 8 times
         Generated 2 prefetch instructions for the loop
    150, Loop not vectorized/parallelized: contains call
    171, Memory copy idiom, loop replaced by call to __c_mcopy4
    172, Memory copy idiom, loop replaced by call to __c_mcopy4
    175, Memory copy idiom, loop replaced by call to __c_mcopy8
    179, Memory copy idiom, loop replaced by call to __c_mcopy4
    183, Memory copy idiom, loop replaced by call to __c_mcopy16
coo2csr_convert:
    361, Memory zero idiom, loop replaced by call to __c_mzero8
         Memory zero idiom, loop replaced by call to __c_mzero4
    364, Generated 3 alternate versions of the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
    374, Generating copy(csrvala(:),indx(:),jndx(:),rval(:),csrrowptra(:))
    392, Generating enter data copyin(p(:),pbuffer(:))
    411, Generating exit data delete(pbuffer(:),p(:))
    415, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        415, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    431, Memory copy idiom, loop replaced by call to __c_mcopy4
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  sp_solvers.f90
vec_max:
     33, Generated 2 alternate versions of the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
csr_mat_max:
     51, Generated 2 alternate versions of the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
cusolversps:
    139, Memory copy idiom, loop replaced by call to __c_mcopy8
    207, Generating copy(x(:))
         Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:))
         Generating copy(b(:))
    229, Loop not vectorized/parallelized: contains call
    233, Loop not vectorized/parallelized: contains call
    238, Memory copy idiom, loop replaced by call to __c_mcopy8
    239, Memory zero idiom, loop replaced by call to __c_mzero8
    246, Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:))
         Generating copy(b(:),x(:))
    284, Generating copy(x(:))
         Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:))
         Generating copy(b(:))
    306, Loop not vectorized/parallelized: contains call
    310, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  main.f90
main:
     66, Memory set idiom, loop replaced by call to __c_mset8
     98, Memory zero idiom, loop replaced by call to __c_mzero8
pgf90 -o  a.out precision.obj mmio.obj cusolverSP_mod.obj measure_time.obj mm_convert_csr.obj sp_solvers.obj main.obj -acc -ta=tesla,
cuda9.0,cc35,cc60 -Mcuda -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\lib\x64"   -Mcudalib=cusparse,cusolver
PGI$ make run
===== Program run =====
a.out
 0 = CPU and GPU, 1 = GPU only?
0
 1 = cholesky, 2 = LU, 3 = QR  ; What solver?
3
 0 = No, schemes 1 = symrcm, 2 = symamd ; What kind of reordering?
1

 =================================================
 Sparse matrix linear solver schem : QR
 Reordering scheme of matrix : symrcm
 =================================================

 File name for Matrix Market?
lap3D_7pt_n20.mtx

 Matrix Market file name = lap3D_7pt_n20.mtx

 Properties of Matrix_Market matrix
 ====================================
 representation = coordinate
 type           = real
 matrix type    = symmetric
 row (m)        =    8000
 cols(n)        =    8000
 Non-zero elm.  =   53600
 extendSymMatrix=       1


 =======================================================================================
 NOTICE: Triangular Symmetric matrix has been transformed to full matrix for SP solvers.
 =======================================================================================

 File name for LHS vector? if no available, press ENTER and set as RHS=1.0.


 i,indx,jndx            1            1            1    6.000000000000000
 i,indx,jndx            2            1            2   -1.000000000000000
 i,indx,jndx            3            1           21   -1.000000000000000
 i,indx,jndx            4            1          401   -1.000000000000000
 i,indx,jndx            5            2            1   -1.000000000000000
 i,indx,jndx            6            2            2    6.000000000000000
 i,indx,jndx            7            2            3   -1.000000000000000
 i,indx,jndx            8            2           22   -1.000000000000000
 i,indx,jndx            9            2          402   -1.000000000000000
 i,indx,jndx           10            3            2   -1.000000000000000
 i,indx,jndx        53590         7998         7998    6.000000000000000
 i,indx,jndx        53591         7998         7999   -1.000000000000000
 i,indx,jndx        53592         7999         7599   -1.000000000000000
 i,indx,jndx        53593         7999         7979   -1.000000000000000
 i,indx,jndx        53594         7999         7998   -1.000000000000000
 i,indx,jndx        53595         7999         7999    6.000000000000000
 i,indx,jndx        53596         7999         8000   -1.000000000000000
 i,indx,jndx        53597         8000         7600   -1.000000000000000
 i,indx,jndx        53598         8000         7980   -1.000000000000000
 i,indx,jndx        53599         8000         7999   -1.000000000000000
 i,indx,jndx        53600         8000         8000    6.000000000000000

 == CSR data check : Start from 1  to 10 elements ==
 csrRowPtrA=
          1          5         10         15         20         25         30
         35         40         45
 csrColIndA=
          1          2         21        401          1          2          3
         22        402          2
 csrValA  =
 0.6000D+01 -.1000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.6000D+01 -.1000D+01
 -.1000D+01 -.1000D+01 -.1000D+01

 == Start from endpoint-10  to endpoint ==
 csrRowPtrA=
      53552      53557      53562      53567      53572      53577      53582
      53587      53592      53597      53601
 csrColIndA=
       7998       7999       7599       7979       7998       7999       8000
       7600       7980       7999       8000
 csrValA=
 0.6000D+01 -.1000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.6000D+01 -.1000D+01
 -.1000D+01 -.1000D+01 -.1000D+01 0.6000D+01

 ==================================================================
 The linear system is solved by sparse QR factorization.
 Reordering scheme : symrcm

 Tolerance to decide if singular or not = .1000000000D-11
 A matrix has symmmetric pattern.
 ==================================================================

 cuSOLVER SP solver result on [ CPU ]                                                2.355999 (sec)
 ==================================================================================================
  |b - A*x| = 0.5115907697473D-12
        |A| = 0.1200000000000D+02
        |x| = 0.2458019372584D+02
  |b - A*x|/(|A|*|x|) = 0.1734427507818D-14

 x(      1)=0.666997357463E+00
 x(      2)=0.100066138159E+01
 x(      3)=0.119691025971E+01
 x(      4)=0.132458774155E+01
 x(      5)=0.141248710662E+01
 x(      6)=0.147457380480E+01
 x(      7)=0.151841489418E+01
 x(      8)=0.154840911171E+01
 x(      9)=0.156717865908E+01
 x(     10)=0.157622105444E+01

 x(   7990)=0.157622105444E+01
 x(   7991)=0.157622105444E+01
 x(   7992)=0.156717865908E+01
 x(   7993)=0.154840911171E+01
 x(   7994)=0.151841489418E+01
 x(   7995)=0.147457380480E+01
 x(   7996)=0.141248710662E+01
 x(   7997)=0.132458774155E+01
 x(   7998)=0.119691025971E+01
 x(   7999)=0.100066138159E+01
 x(   8000)=0.666997357463E+00

 cuSOLVER SP solver result on [ GPU ]                                                0.683000 (sec)
 ==================================================================================================
  |b - A*x| = 0.4547473508865D-12
        |A| = 0.1200000000000D+02
        |x| = 0.2458019372584D+02
  |b - A*x|/(|A|*|x|) = 0.1541713340283D-14

 x(      1)=0.666997357463E+00
 x(      2)=0.100066138159E+01
 x(      3)=0.119691025971E+01
 x(      4)=0.132458774155E+01
 x(      5)=0.141248710662E+01
 x(      6)=0.147457380480E+01
 x(      7)=0.151841489418E+01
 x(      8)=0.154840911171E+01
 x(      9)=0.156717865908E+01
 x(     10)=0.157622105444E+01

 x(   7990)=0.157622105444E+01
 x(   7991)=0.157622105444E+01
 x(   7992)=0.156717865908E+01
 x(   7993)=0.154840911171E+01
 x(   7994)=0.151841489418E+01
 x(   7995)=0.147457380480E+01
 x(   7996)=0.141248710662E+01
 x(   7997)=0.132458774155E+01
 x(   7998)=0.119691025971E+01
 x(   7999)=0.100066138159E+01
 x(   8000)=0.666997357463E+00
