PGIアクセラレータコンパイラ > OpenACCサイト・トップ > インデックス

OpenACC ディレクティブによるプログラミング

12 - 4 章　スパース行列用 Linear Solver（直接法 cuSOLVERライブラリ）の利用

1. Sparse LAPACK ルーチン(cuSOLVER SP) with OpenACC プログラム例

　前章 12-3 では、OpenACC + cuSPARSE を利用した Bi-Conjugate Gradient Stabilized（BiCGStab）プログラムを用いて、スパース行列に対する「反復法による線形方程式」を解く例を示した。この章では、スパース行列を対象とした「直接法による線形方程式」を解くプログラム例を示す。いわゆる、スパース行列用の LAPACK ルーチンである。これは、cuSOLVER SP ライブラリの関数を使用することにより実現する。OpenACC を利用しているため、極めて少ない工数で、GPUアクセラレータを使用した計算が可能となる。ここでは、Fortran プログラムの例を示すこととする。ここで例示するソース・ファイルと Makefile 一式をアーカイブした tar ファイルを提供する。以下のファイルをダウンロードしていただきたい。

　プログラム tar ファイル：openacc_cuSOLVER-SP.tar.gz

　以下の例は、PGI 17.10 バージョンを使用しての結果である。使用しているバージョンを確かめたい場合は、以下のように -V コマンド・オプションを指定する。

$  pgfortran -V
pgfortran 17.10-0 64-bit target on x86-64 Linux -tp haswell
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

　また、使用した GPU は以下のとおりである。

$ pgaccelinfo

CUDA Driver Version:           9000
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  384.59  Wed Jul 19 23:53:34 PDT 2017

Device Number:                 0
Device Name:                   GeForce GTX 1080 Ti
Device Revision Number:        6.1
Global Memory Size:            11714691072
Number of Multiprocessors:     28
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1657 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             5505 MHz
Memory Bus Width:              352 bits
L2 Cache Size:                 2883584 bytes
Max Threads Per SMP:           2048
Async Engines:                 2
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc60

COO フォーマットから CSR フォーマットへの変換

　COO フォーマットから CSR フォーマットへの変換プログラムの概要に関しては、前章の説明を参照していただきたい。

　cuSOLVER SP ルーチンを使用する上での注意点を述べる。一般に対称行列として、行列要素が上三角行列のみデータとして収納されているような場合は、必ず、下三角行列要素も収納する形態に拡張して、cuSOLVER SP ルーチンに入力する必要がある。そこで、mm_convert_csr.f90 プログラムは、Matrix Market フォーマットの「symmetric」属性を有する対称行列要素を「一般行列」に拡張するようにしている。プログラム内で、対称行列を一般行列に拡張するためのフラグ変数は extendSymMatrix であり、これを１にセットすることで制御している。

　入力データとして必要なスパース行列の CSR フォーマットの詳細は、NVIDIA cuSPARSE ドキュメントを参考にして欲しい。係数行列を CSR 形式で表現する際に必要な入力用データ配列は、以下のとおりである。

nz	整数	行列内の非ゼロ要素の数
csrValA	配列	行メジャー形式で A のすべての非ゼロ値を保持するための長さ nz のデータ配列 in row-major format.
csrRowPtrA	配列	配列 csrColIndA および csrValA にインデックスを保持するための長さ m + 1 の整数配列。この配列の最初の m 個のエントリは、 i = i、...、m の i 行目の最初の非ゼロ要素のインデックスを含み、最後のエントリは nnz + csrRowPtrA（0）を含みます。一般に、csrRowPtrA（0）は、 0 あるいは 1 から開始されることを表すインデックスベースを表し、これは、それぞれ 0 または 1 となる。ここでは 1 を使用する。
csrColIndA	配列	配列 csrValA の対応する要素の列インデックスを含む長さ nz の整数配列

cuSOLVER SP ルーチン概要

　NVIDIA cuSOLVER SP ルーチンの技術的なドキュメントは、こちらのページを参照して欲しい。cuSolverSP は、スパース QR 分解に基づく直接法による新しいスパースルーチンのセットを提供している。また、その他に、コレスキー分解による解法、LU分解による解法ルーチンも提供している。今回のプログラムでは、この3種のソルバーが動作するように作成してある。数値計算シミュレーションにおいては、全ての行列が factorization の並列性に優れたスパース性のパターンを有している訳ではないため、cuSolverSP ライブラリは、シーケンシャルのような行列を扱うことができるように CPU 上で動作するルーチンも提供している（なお、LU 分解用のルーチンは CPU 用のみであり、GPU用のルーチンは CUDA 9.0 の段階ではまだ提供されていない）。高い並列性を持つマトリックスを対象とした場合は、GPU 用のルーチンはより高いパフォーマンスを提供することができる。なお、LU 分解を除く CPU 用のコレスキー分解と QR 分解のルーチンは、OpenMP による並列プログラムとなっており、マルチコアによる並列実行が可能である。

　cuSOLVER SP ルーチンには、行列要素のリオーダリング（再順序付け）機能を備えており、関数の引数により制御可能となっている。分解におけるパフォーマンスに大きな影響を与える fill-in の発生を減らすために、symrcm とsymamd の2つの並べ替えスキームを提供している。引数である入力パラメータ reorder は、reorder が 1　の場合に symrcm（Symmetric Reverse Cuthill-McKee置換）、2 の場合 symamd (Symmetric Approximate Minimum Degree Algorithm based on Quotient Graph) を有効にする。それ以外の場合、並べ替えは実行されない。

　分解時に行列が singular の状態になった場合は、それが発生した最小の行インデックスを出力し、WARNING を出す。その計算結果は不正となる。

　ここでは、今回使用しているスパース行列用の線形方程式解法ルーチンのサマライズを以下に纏める。提供プログラムの sp_solvers.f90 ルーチンがドライバールーチンとなる。

solver No. (solver)	cuSOLVER関数名	CPU用	GPU用	Reordering (reorder)
1	コレスキー分解 cusolverSpDcsrlsvchol[Host]	有り OMP_NUM_THREADSで並列スレッド数指定	有り	No reordering = 0 symrcm reordering = 1 symamd reordering = 2
2	LU分解 cusolverSpDcsrlsvlu[Host]	有りシングルスレッドのみ	なし
3	QR分解 cusolverSpDcsrlsvqr[Host]	有り OMP_NUM_THREADSで並列スレッド数指定	有り

メインプログラムと cuSOLVER SP ルーチンを利用するインタフェース

　PGI 17.10 時点では CUDA C/C++ 言語で開発されている cuSOLVER SP(スパース）系のライブラリへの Fortran Interface Module が提供されていないため、以下のように自作(cusolverSP_mod.cuf)した。C と Fortran のバインディングを取り持つためのインタフェースである。将来的に、PGI コンパイルシステムの中に、cuSOLVER SP の Fortran MODULE (多分 use cusplverSP で使用できるはず）が提供された場合は、以下の cusolverSP_mod.cuf ルーチンは必要なくなる。

cusolverSP_mod.cuf

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
module cuSOLVER_SP_Solver
  use iso_c_binding
  use cudafor
  use cusparse
  use cusolver_common
  implicit none

! enum, bind(C) ! cusolverStatus_t
!   enumerator :: CUSOLVER_STATUS_SUCCESS=0
!   enumerator :: CUSOLVER_STATUS_NOT_INITIALIZED=1
!   enumerator :: CUSOLVER_STATUS_ALLOC_FAILED=2
!   enumerator :: CUSOLVER_STATUS_INVALID_VALUE=3
!   enumerator :: CUSOLVER_STATUS_ARCH_MISMATCH=4
!   enumerator :: CUSOLVER_STATUS_MAPPING_ERROR=5
!   enumerator :: CUSOLVER_STATUS_EXECUTION_FAILED=6
!   enumerator :: CUSOLVER_STATUS_INTERNAL_ERROR=7
!   enumerator :: CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED=8
!   enumerator :: CUSOLVER_STATUS_NOT_SUPPORTED = 9
!   enumerator :: CUSOLVER_STATUS_ZERO_PIVOT=10
!   enumerator :: CUSOLVER_STATUS_INVALID_LICENSE=11
! end enum

  type cusolverSpHandle
    type(c_ptr) :: handle
  end type cusolverSpHandle

! cusolverSpCreate
  interface
     integer(c_int) function cusolverSpCreate(handle) bind(C,name='cusolverSpCreate')
       import cusolverSpHandle
       type(cusolverSpHandle) :: handle
     end function cusolverSpCreate
  end interface
! cusolverSpDestroy
  interface
     integer(c_int) function cusolverSpDestroy(handle) bind(C,name='cusolverSpDestroy')
       import cusolverSpHandle
       type(cusolverSpHandle), value :: handle
     end function cusolverSpDestroy
  end interface

! cusolverSpXcsrissymHost
  interface
     integer(c_int) function cusolverSpXcsrissymHost(handle, m, nnzA, descrA, &
                    csrRowPtrA, csrEndPtrA, csrColIndA, issym) bind(C,name='cusolverSpXcsrissymHost')
       use iso_c_binding
       import cusolverSpHandle, cusparseMatDescr
       type(cusparseMatDescr), value :: descrA
       type(cusolverSpHandle), value :: handle
       integer(c_int), value  :: m, nnzA
       integer(c_int)         :: issym
       integer(c_int)         :: csrRowPtrA(*), csrEndPtrA(*), csrColIndA(*)
     end function cusolverSpXcsrissymHost
  end interface

! cusolverSpXcsrsymrcmHost
  interface
     integer(c_int) function cusolverSpXcsrsymrcmHost(handle, m, nnzA, descrA, &
                    csrRowPtrA, csrColIndA, q) bind(C,name='cusolverSpXcsrsymrcmHost')
       use iso_c_binding
       import cusolverSpHandle, cusparseMatDescr
       type(cusparseMatDescr), value :: descrA
       type(cusolverSpHandle), value :: handle
       integer(c_int), value  :: m, nnzA
       integer(c_int)         :: q(*)
       integer(c_int)         :: csrRowPtrA(*), csrColIndA(*)
     end function cusolverSpXcsrsymrcmHost
  end interface

! cusolverSpXcsrsymmdqHost
  interface
     integer(c_int) function cusolverSpXcsrsymmdqHost(handle, m, nnzA, descrA, &
                    csrRowPtrA, csrColIndA, q) bind(C,name='cusolverSpXcsrsymmdqHost')
       use iso_c_binding
       import cusolverSpHandle, cusparseMatDescr
       type(cusparseMatDescr), value :: descrA
       type(cusolverSpHandle), value :: handle
       integer(c_int), value         :: m, nnzA
       integer(c_int)                :: q(*)
       integer(c_int)                :: csrRowPtrA(*), csrColIndA(*)
     end function cusolverSpXcsrsymmdqHost
  end interface

! cusolverSpXcsrperm_bufferSizeHost
  interface
     integer(c_int) function cusolverSpXcsrperm_bufferSizeHost(handle, m, n, nnzA, &
                    descrA, csrRowPtrA, csrColIndA, p, q, bufferSizeInBytes ) &
                    bind(C,name="cusolverSpXcsrperm_bufferSizeHost")
       use iso_c_binding
       import cusolverSpHandle, cusparseMatDescr
       type(cusolverSpHandle), value :: handle
       type(cusparseMatDescr), value :: descrA
       integer(c_int), value         :: m, n, nnzA
       integer(c_int)                :: p(*), q(*)
       integer(c_int)                :: csrRowPtrA(*), csrColIndA(*)
       integer(c_long)                  :: bufferSizeInBytes
     end function cusolverSpXcsrperm_bufferSizeHost
  end interface

! cusolverSpXcsrpermHost
  interface
    integer(c_int) function cusolverSpXcsrpermHost(handle, m, n, nnzA, &
                   descrA, csrRowPtrA, csrColIndA, p, q, map, buffer ) &
                   bind(C,name="cusolverSpXcsrpermHost")
      use iso_c_binding
      import cusolverSpHandle, cusparseMatDescr
      type(cusolverSpHandle), value :: handle
      type(cusparseMatDescr), value :: descrA
      integer(c_int), value         :: m, n, nnzA
      integer(c_int)                :: p(*), q(*)
      integer(c_int)                :: csrRowPtrA(*), csrColIndA(*)
      integer(c_int)                :: map(*)
      character(c_char)             :: buffer(*)
    end function cusolverSpXcsrpermHost
  end interface

! cusolverSpDcsrlsvcholHost
  interface
        integer(c_int) function cusolverSpDcsrlsvcholHost( handle, m, nnzA, &
                       descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) &
                       bind(C,name="cusolverSpDcsrlsvcholHost")
          use iso_c_binding
          import cusolverSpHandle, cusparseMatDescr
          type(cusolverSpHandle), value :: handle
          type(cusparseMatDescr), value :: descrA
          integer(c_int), value         :: m, nnzA
          real(c_double), intent(in)    :: csrValA(:)
          integer(c_int), intent(in)    :: csrRowPtrA(:), csrColIndA(:)
          real(c_double), intent(in)    :: b(*)
          real(c_double)                :: x(*)
          real(c_double), value         :: tol
          integer(c_int), value         :: reorder
          integer(c_int)                :: singularity
          end function cusolverSpDcsrlsvcholHost
        end interface

! cusolverSpDcsrlsvluHost
  interface
        integer(c_int) function cusolverSpDcsrlsvluHost( handle, m, nnzA, &
                       descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) &
                       bind(C,name="cusolverSpDcsrlsvluHost")
          use iso_c_binding
          import cusolverSpHandle, cusparseMatDescr
          type(cusolverSpHandle), value :: handle
          type(cusparseMatDescr), value, intent(in) :: descrA
          integer(c_int), value         :: m, nnzA
          real(c_double), intent(in)    :: csrValA(*)
          integer(c_int), intent(in)    :: csrRowPtrA(*), csrColIndA(*)
          real(c_double), intent(in)    :: b(*)
          real(c_double)                :: x(*)
          real(c_double), value         :: tol
          integer(c_int), value         :: reorder
          integer(c_int)                :: singularity
    end function cusolverSpDcsrlsvluHost
  end interface

! cusolverSpDcsrlsvqrHost
  interface
    integer(c_int) function cusolverSpDcsrlsvqrHost( handle, m, nnzA, &
                   descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) &
                   bind(C,name="cusolverSpDcsrlsvqrHost")
      use iso_c_binding
      import cusolverSpHandle, cusparseMatDescr
      type(cusolverSpHandle), value :: handle
      type(cusparseMatDescr), value, intent(in) :: descrA
      integer(c_int), value         :: m, nnzA
      real(c_double), intent(in)    :: csrValA(*)
      integer(c_int), intent(in)    :: csrRowPtrA(*), csrColIndA(*)
      real(c_double), intent(in)    :: b(*)
      real(c_double)                :: x(*)
      real(c_double), value         :: tol
      integer(c_int), value         :: reorder
      integer(c_int)                :: singularity
    end function cusolverSpDcsrlsvqrHost
  end interface

! cusolverSpDcsrlsvchol
  interface
        integer(c_int) function cusolverSpDcsrlsvchol( handle, m, nnzA, &
                       descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) &
                       bind(C,name="cusolverSpDcsrlsvchol")
          use iso_c_binding
          import cusolverSpHandle, cusparseMatDescr
          type(cusolverSpHandle), value       :: handle
          type(cusparseMatDescr), value       :: descrA
          integer(c_int), value               :: m, nnzA
          real(c_double), device, intent(in)  :: csrValA(*)
          integer(c_int), device, intent(in)  :: csrRowPtrA(*), csrColIndA(*)
          real(c_double), device, intent(in)  :: b(*)
          real(c_double), device              :: x(*)
          real(c_double), value               :: tol
          integer(c_int), value               :: reorder
          integer(c_int)                      :: singularity
        end function cusolverSpDcsrlsvchol
  end interface

! cusolverSpDcsrlsvqr
  interface
        integer(c_int) function cusolverSpDcsrlsvqr( handle, m, nnzA, &
                       descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) &
                       bind(C,name="cusolverSpDcsrlsvqr")
          use iso_c_binding
          import cusolverSpHandle, cusparseMatDescr
          type(cusolverSpHandle), value       :: handle
          type(cusparseMatDescr), value       :: descrA
          integer(c_int), value               :: m, nnzA
          real(c_double), device, intent(in)  :: csrValA(*)
          integer(c_int), device, intent(in)  :: csrRowPtrA(*), csrColIndA(*)
          real(c_double), device, intent(in)  :: b(*)
          real(c_double), device              :: x(*)
          real(c_double), value               :: tol
          integer(c_int), value               :: reorder
          integer(c_int)                      :: singularity
        end function cusolverSpDcsrlsvqr
  end interface

end module cuSOLVER_SP_Solver

　以下に、メインプログラムを示す。このプログラムは、複素数の係数行列は対応していない。倍精度実数型のデータのみを扱える。最初に、Matrix Market 形式のファイル名を指定し、次に right hand side(rhs) データファイルがあれば、それを指定する。もし、存在しなければ、rhs ベクトルは全て 1.0 として計算する。入力データの読み込み後、coo2csr_convert ルーチンを call して、COO 形式データからCSR 形式データに変換する。変換後、csrRowPtrA、csrColIndA、csrValA 配列が作成され、この CSR 形式の行列を実引数として、cusolverSPs ドライバールーチンに渡し、計算が始まる。このプログラムとは別に、ユーザ側のプログラムで、csrRowPtrA、csrColIndA、csrValA 配列を用意できれば、cusolverSPs ドライバールーチンを使用して線形方程式を解くことができる。

main.f90

program main
  use precision
  use Matrix_data
  use mm_coo_cuda_csr_convert
  use SP_solvers
  implicit none

    integer                  :: i, j, istat
    integer,parameter        :: iunit1 = 10       ! for Matrix Market data (COO)
    integer,parameter        :: iunit2 = 11       ! for LHS, if available
    character*80             :: mm_file, rhs_file
    integer                  :: extendSymMatrix   ! 1 = symmetrize the pattern (transform from triangular to full)

    integer                  :: gpuonly           ! 0 = both CPU and GPU, 1 = GPU only
    integer                  :: solver            ! 1 = cholesky, 2 = LU, 3 = QR
    integer                  :: reorder           ! Reordering ; 0 = no, 1 = symrcm, 2 = symamd reordering
!   real(fp_kind), parameter :: tol = 0.0d0       ! tolerance to decide if singular or not
    real(fp_kind), parameter :: tol = 1.0d-12     ! tolerance to decide if singular or not
    integer, parameter       :: index_base = 1    ! CSR format -- one based index

! Kind of Solver, a scheme of reordering

    print *, "0 = CPU and GPU, 1 = GPU only?"
    read (5, '(i1)')  gpuonly
    print *, "1 = cholesky, 2 = LU, 3 = QR  ; What solver?"
    read (5, '(i1)')  solver
    if ( .not. (solver == 1 .or. solver == 2 .or. solver == 3) ) solver = 1

    print *, "0 = No, schemes 1 = symrcm, 2 = symamd ; What kind of reordering?"
    read (5, '(i1)')  reorder

    print '(/1x,a)', "================================================="
    if ( solver == 1) print *, "Sparse matrix linear solver schem : Cholesky"
    if ( solver == 2) print *, "Sparse matrix linear solver schem : LU"
    if ( solver == 3) print *, "Sparse matrix linear solver schem : QR"

    if ( reorder == 1 ) then
      print *, "Reordering scheme of matrix : symrcm"
    else if ( reorder == 2 ) then
      print *, "Reordering scheme of matrix : symamd"
    else
      reorder = 0
      print *, "No reordering of matrix"
    end if
    print '(1x,a/)', "================================================="

! Read matrix Market data file

    extendSymMatrix = 1        !  if =1 .and. symmetric matrix=yes , transform from triangular to full

    print *, "File name for Matrix Market?"
       read (5, '(a)')  mm_file
    !                                mm_file ="lap2D_5pt_n100.mtx"
    print '(/1x,a,a)', "Matrix Market file name = ",mm_file
    open (iunit1, file=mm_file, status='old')
    print *
    call mm_matrix_read(iunit1, extendSymMatrix)

! Read right hand side vectors, if available.

    print *, "File name for LHS vector? if no available, press ENTER and set as RHS=1.0."
    print *
    read (5, '(a)')  rhs_file
    if (rhs_file == "") then
       allocate (rhs(rows))
       rhs= 1.0_fp_kind
    else
      print *, "rhs_file name = ",rhs_file
      open (iunit2, file=rhs_file, status='old')
      call mm_rhs_read(iunit2)
    end if

! Convert COO to CSR sparse matrix
!   INPUT   : Matrix COO data from Matrix Market format via mm_rhs_read routine
!   OUTPUT  : CSR data = csrRowPtrA(m+1), csrColIndA(nz), csrValA(nz) -- global data

    call coo2csr_convert

! CSR sparse matrix data are set. The belows are arguments for cusplverSPs routine.
! Arrays for CUDA cuSPARSE CSR format
!   integer       :: m                 IN    ! rows size for square matrix
!   integer       :: n                 IN    ! columns size for square matrix (n=m)
!   integer       :: nz                IN    ! nonzero element of A
!   integer       :: csrRowPtrA (m+1)  IN    ! the array of offsets corresponding to the start of each row
!   integer       :: csrColIndA (nz)   IN    ! sorted by row and by column within each row
!   real(fp_kind) :: csrValA    (nz)   IN    ! sorted by row and by column within each row
!   real(fp_kind) :: x(m)              INOUT ! initial guess & result
!   real(fp_kind) :: rhs(m)            IN    ! right hand side

! Print CSR Matrix data
!    print *, "csrValA", csrValA
!    print *, "csrRowPtrA",csrRowPtrA
!    print *, "csrColIndA", csrColIndA


! Initilize X
    allocate (x(n))
    x = 0.0_fp_kind      ! or Initial guess X is set

! cuSOLVER SP (Cholesky, LU, QR)

    call cusolverSPs (gpuonly, solver, reorder, m, n, nz, csrRowPtrA, csrColIndA, csrValA, x, rhs, &
                      tol, index_base)

end program main

sp_solvers.f90

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!

module SP_solvers
    use precision
    implicit none

  contains

    real(fp_kind) function vec_max(n, x)
    implicit none
    integer       :: i, n
    real(fp_kind) :: x(n)
      vec_max = 0.0_fp_kind
      do i = 1, n
      vec_max = max( abs(x(i)), vec_max )
      end do
    end function

    real(fp_kind) function csr_mat_max(m, n, nz, csrValA, csrRowPtrA, csrColIndA)
    implicit none
    integer       :: m, n, nz
    integer       :: csrRowPtrA(m+1), csrColIndA(nz)
    real(fp_kind) :: csrValA(nz)
    real(fp_kind) :: sum, norminf
    integer       :: i, j, start, end
    norminf = 0.0_fp_kind

    do i = 1, m
      sum = 0.0_fp_kind
      start = csrRowPtrA(i  )
      end   = csrRowPtrA(i+1) - 1
        do j = start, end
          sum = sum + abs(csrValA(j))
        end do
      norminf = max (norminf, sum)
    end do
    csr_mat_max = norminf
    end function




!=============================================================
    subroutine cusolverSPs(gpuonly, solver, reorder, m, n, nz, csrRowPtrA, csrColIndA, csrValA, x, rhs, &
                          tol, index_base)
!-------------------------------------------------------------
!   Bi-Conjugate Gradient Stabilized (BiCGStab)
!   A is an n × n sparse matrix defined in CSR storage format
!-------------------------------------------------------------
    use measure_time
    use cusparse
    use cuSOLVER_SP_Solver
    implicit none

! Arrays for CUDA cuSPARSE CSR format
    integer       :: m                ! number of rows of matrix
    integer       :: n                ! number of columns of matrix
    integer       :: nz               ! nonzero elements of A
    integer       :: csrRowPtrA(n+1)  ! the array of offsets corresponding to the start of each row
    integer       :: csrColIndA (nz)  ! sorted by row and by column within each row
    real(fp_kind) :: csrValA    (nz)  ! sorted by row and by column within each row
                                      ! B = Q*A*Q^T
    real(fp_kind) :: x  (n)           ! initial guess & result
    real(fp_kind) :: rhs(n)           ! right hand side
    real(fp_kind) :: b  (n)           ! a copy of rhs
    character(c_char), allocatable :: buffer(:)
    integer       :: index_base

    integer       :: gpuonly          ! 0 = both CPU and GPU, 1 = GPU only
    integer       :: solver           ! 1 = cholesky, 2 = LU, 3 = QR
    integer       :: reorder          ! manula reordering 1 = symrcm, 2 = symamd reordering
    integer       :: singularity      ! -1 if A is invertible under tol
    integer       :: issym
    real(fp_kind) :: rsum, diff, error
    real(fp_kind) :: x_inf, r_inf, A_inf

    real(fp_kind) :: tol              !  to decide if singular or not

! Local variables
    type(cusolverSpHandle)   :: cusolver_H
    type(cusparseHandle)     :: cusparse_H
    type(cusparseMatDescr)   :: descrA
    integer                  :: cusolver_status
    integer                  :: status, ierr_code
    real(fp_kind),parameter::  one = 1.0, zero = 0.0, onem1 = -1.0

    integer                  :: i, j, k
    integer                  :: nprint
! Parameters
    integer, parameter       :: idbg = 0
    real(fp_kind),parameter  :: eps = 6.9d-310     ! machine epsilon

! Initialize
    singularity = 0
    issym   = 0

    if (m /= n) then
      print *,"Error: only support square matrix."
      stop
    end if
!
    print '(/1x,a)',"=================================================================="
    if (solver == 1) then
      print '(1x,a)', "The linear system is solved by sparse CHOLESKY factorization."
    else if (solver == 2) then
      print '(1x,a)', "The linear system is solved by sparse LU factorization."
    else if (solver == 3) then
      print '(1x,a)', "The linear system is solved by sparse QR factorization."
    end if
    if (reorder == 0) then
      print '(1x,a)', "Reordering scheme : No reordering"
    else if (reorder == 1) then
      print '(1x,a)', "Reordering scheme : symrcm"
    else if (reorder == 2) then
      print '(1x,a)', "Reordering scheme : symamd"
    end if
    print '(/1x,a,d15.10,2x,a,i8/)',"Tolerance to decide if singular or not = ", tol

!   set b = rhs, where Ax = rhs (Ax = b)
    b(:) = rhs(:)

    k = 0

! Create cusolver/cusparse handle, and matrix descriptor

    status = cusolverSpCreate(cusolver_H)
    status = cusparseCreate(cusparse_H)
    status = cusparseCreateMatDescr(descrA)
    status = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)
    status = cusparseSetMatDiagType(descrA, CUSPARSE_DIAG_TYPE_NON_UNIT)
    status = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE) ! base=1

! Verify if A has symmetric pattern or not
    status = cusolverSpXcsrissymHost( &
        cusolver_H, m, nz, descrA, csrRowPtrA(1), csrRowPtrA(2), csrColIndA, issym)
    if (issym == 0) then
      print *, "A matrix has no symmmetric pattern."
    else
      print *, "A matrix has symmmetric pattern."
    end if
    print '(1x,a)',"=================================================================="
    if(idbg .eq. 1) print *, "cusolverSpXcsrissymHost-status=", status

! Clock init
    status = init_clock
    if (gpuonly == 1 ) goto 100

! Clock start
       status = start_clock

    if ( solver == 1 ) then                                                 ! cholesky
      status = cusolverSpDcsrlsvcholHost(cusolver_H, m, nz, &
               descrA, csrValA, csrRowPtrA, csrColIndA, &
               b, tol, reorder, x, singularity)
      if(idbg .ge. 1)  then
        print *
        print *,"singularity = ", singularity
        print *,"cusolverSpDcsrlsvcholHost-status = ", status
      end if

    else if ( solver == 2 ) then                                            ! LU
      status = cusolverSpDcsrlsvluHost(cusolver_H, m, nz, &
               descrA, csrValA, csrRowPtrA, csrColIndA, &
               b, tol, reorder, x, singularity)
      if(idbg .ge. 1)  then
        print *
        print *,"cusolverSpDcsrlsvluHost-status = ", status
        print *,"singularity = ", singularity
      end if

    else if ( solver == 3 ) then                                            ! QR
      status = cusolverSpDcsrlsvqrHost(cusolver_H, m, nz, &
               descrA, csrValA, csrRowPtrA, csrColIndA, &
               b, tol, reorder, x, singularity)
      if(idbg .eq. 1) then
        print *
        print *,"cusolverSpDcsrlsvqrHost-status = ", status
        print *,"singularity = ", singularity
      end if
    else
       print *,"Error: Solver",solver," is unknown solver"
    end if

       status = end_clock
       time_tag = "cuSOLVER SP solver result on [ CPU ]"
       status = print_time(time_tag)

!$acc data copy(b, X)  copyin( csrValA, csrRowPtrA, csrColIndA )
  !$acc host_data use_device(b, x, csrValA, csrRowPtrA, csrColIndA)
    ! evaluate residual b = b - A*x (result on CPU)
    status = cusparseDcsrmv(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
             m, n, nz, onem1, descrA, csrValA, csrRowPtrA, csrColIndA, x, one, b)
  !$acc end host_data
!$acc end data

    if ( singularity .ge. 0 ) &
    print '(/2x,a,i5,,2x,a, d10.3/)', "WARNING!! : the matrix is singular at row", singularity+1, "under tolerance", tol
    r_inf = vec_max (m, b)
    x_inf = vec_max (n, x)
    A_inf =csr_mat_max(m, n, nz, csrValA, csrRowPtrA, csrColIndA)
    print '(2x,a,d20.13)',"|b - A*x| =", r_inf
    print '(2x,a,d20.13)',"      |A| =", A_inf
    print '(2x,a,d20.13)',"      |x| =", x_inf
    print '(2x,a,d20.13)',"|b - A*x|/(|A|*|x|) =", r_inf/(A_inf * x_inf)

! output result
    print *
    nprint = n
    if ( n > 10 ) nprint = 10
    do i = 1, nprint
       print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
    end do
    print *,""
    do i = n-nprint, n
       print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
    end do

!   reset b = rhs, where Ax = rhs (Ax = b)
    b(:) = rhs(:)
    x = 0.0_fp_kind  ! reset X

100 continue

! Clock start
      status = start_clock

!$acc data copy(b, X)  copyin( csrValA, csrRowPtrA, csrColIndA )
!$acc host_data use_device(b, x, csrValA, csrRowPtrA, csrColIndA)

    if ( solver == 1 ) then                                                 ! cholesky
      status = cusolverSpDcsrlsvchol(cusolver_H, m, nz, &
               descrA, csrValA, csrRowPtrA, csrColIndA, &
               b, tol, reorder, x, singularity)
      if(idbg .eq. 1)  then
        print *
        print *,"singularity = ", singularity
        print *,"cusolverSpDcsrlsvchol-status = ", status
      end if

    else if ( solver == 2 ) then                                            ! LU
      print '(/a)'," LU SP solver for GPU is not available now."
      print *,"-------------------------------------------"
      stop

    else if ( solver == 3 ) then                                            ! QR
      status = cusolverSpDcsrlsvqr(cusolver_H, m, nz, &
               descrA, csrValA, csrRowPtrA, csrColIndA, &
               b, tol, reorder, x, singularity)
      if(idbg .eq. 1) then
        print *
        print *,"cusolverSpDcsrlsvqr-status = ", status
        print *,"singularity = ", singularity
      end if
    else
       print *,"Error: Solver",solver," is unknown solver"
    end if

!$acc end host_data
!$acc end data

      status = end_clock
      time_tag = "cuSOLVER SP solver result on [ GPU ]"
      status = print_time(time_tag)

!$acc data copy(b, X)  copyin( csrValA, csrRowPtrA, csrColIndA )
  !$acc host_data use_device(b, x, csrValA, csrRowPtrA, csrColIndA)
    ! evaluate residual b = b - A*x (result on CPU)
    status = cusparseDcsrmv(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
             m, n, nz, onem1, descrA, csrValA, csrRowPtrA, csrColIndA, x, one, b)
  !$acc end host_data
!$acc end data

    if ( singularity .ge. 0 ) &
      print '(/2x,a,i5,,2x,a, d10.3/)', "Warning!! : the matrix is singular at row", singularity+1, "under tolerance", tol
    r_inf = vec_max (m, b)
    x_inf = vec_max (n, x)
    A_inf =csr_mat_max(m, n, nz, csrValA, csrRowPtrA, csrColIndA)
    print '(2x,a,d20.13)',"|b - A*x| =", r_inf
    print '(2x,a,d20.13)',"      |A| =", A_inf
    print '(2x,a,d20.13)',"      |x| =", x_inf
    print '(2x,a,d20.13)',"|b - A*x|/(|A|*|x|) =", r_inf/(A_inf * x_inf)

! output result
    print *
    nprint = n
    if ( n > 10 ) nprint = 10
    do i = 1, nprint
       print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
    end do
    print *,""
    do i = n-nprint, n
       print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
    end do

    status = cusparseDestroy(cusparse_H)
    status = cusparseDestroyMatDescr(descrA)

    end subroutine cusolverSPs

End module SP_solvers

MTXファイルを使用して実行

　tar file 内に提供した Makefile の記述に沿って、コンパイル＆リンクを行う。なお、この Makefile では、cuda9.0 ツールキットのライブラリを使用するように設定しているが、cuda8.0 を使用しても良い。cusparse と cublas ライブラリは、PGIコンパイラにバンドルしているルーチンを使用する形になっている。

Makefile によるコンパイル

$ make
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  precision.f90
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mfixed  mmio.f
mmread:
    154, Loop not vectorized/parallelized: potential early exits
    199, Loop not vectorized/parallelized: potential early exits
    203, Loop not vectorized/parallelized: potential early exits
    207, Loop not vectorized/parallelized: potential early exits
    212, Loop not vectorized/parallelized: potential early exits
    251, Loop not vectorized/parallelized: potential early exits
    255, Loop not vectorized/parallelized: potential early exits
    259, Loop not vectorized/parallelized: potential early exits
mminfo:
    455, Loop not vectorized/parallelized: potential early exits
mmwrite:
    678, Loop not vectorized/parallelized: contains call
    682, Loop not vectorized/parallelized: contains call
    686, Loop not vectorized/parallelized: contains call
    692, Loop not vectorized/parallelized: contains call
    709, Loop not vectorized/parallelized: contains call
    713, Loop not vectorized/parallelized: contains call
    718, Loop not vectorized/parallelized: contains call
lowerc:
    771, Loop not vectorized/parallelized: contains call
getwd:
    795, Loop not vectorized/parallelized: potential early exits
countwd:
    826, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mcuda  cusolverSP_mod.cuf
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  measure_time.f90
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  mm_convert_csr.f90
matrix_printout:
     70, Loop not vectorized/parallelized: contains call
     79, Loop not vectorized/parallelized: contains call
set_matrix:
    140, Unrolled inner loop 8 times
         Generated 2 prefetch instructions for the loop
    150, Loop not vectorized/parallelized: contains call
    171, Memory copy idiom, loop replaced by call to __c_mcopy4
    172, Memory copy idiom, loop replaced by call to __c_mcopy4
    175, Memory copy idiom, loop replaced by call to __c_mcopy8
    179, Memory copy idiom, loop replaced by call to __c_mcopy4
    183, Memory copy idiom, loop replaced by call to __c_mcopy16
coo2csr_convert:
    361, Memory zero idiom, loop replaced by call to __c_mzero8
         Memory zero idiom, loop replaced by call to __c_mzero4
    364, Generated an alternate version of the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
    374, Generating copy(csrvala(:),indx(:),rval(:),jndx(:),csrrowptra(:))
    392, Generating enter data copyin(p(:),pbuffer(:))
    411, Generating exit data delete(p(:),pbuffer(:))
    415, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        415, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    431, Memory copy idiom, loop replaced by call to __c_mcopy4
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  sp_solvers.f90
vec_max:
     33, Generated an alternate version of the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
csr_mat_max:
     51, Generated an alternate version of the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
cusolversps:
    139, Memory copy idiom, loop replaced by call to __c_mcopy8
    207, Generating copyin(csrvala(:))
         Generating copy(x(:))
         Generating copyin(csrrowptra(:),csrcolinda(:))
         Generating copy(b(:))
    229, Loop not vectorized/parallelized: contains call
    233, Loop not vectorized/parallelized: contains call
    238, Memory copy idiom, loop replaced by call to __c_mcopy8
    245, Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:))
         Generating copy(b(:),x(:))
    283, Generating copyin(csrvala(:))
         Generating copy(x(:))
         Generating copyin(csrrowptra(:),csrcolinda(:))
         Generating copy(b(:))
    305, Loop not vectorized/parallelized: contains call
    309, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  main.f90
main:
     66, Memory set idiom, loop replaced by call to __c_mset8
     98, Memory zero idiom, loop replaced by call to __c_mzero8
pgf90 -o  a.out precision.o mmio.o cusolverSP_mod.o measure_time.o mm_convert_csr.o sp_solvers.o main.o
 -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda -Mcudalib=cusparse,cusolver

　サンプル係数行列（mtxデータ）については、前章の説明を参照して欲しい。提供した tar file の中には、サンプルとして lap3D_7pt_n20.mtx 等の mtx 形式の対称行列ファイルを含んでいるので、これをテスト用データして使用してほしい。

　make run コマンドで実行が開始される。入力 mtx ファイルが求められるので、入手した mtx ファイル名を指定する。次に rhs ベクトルファイルが存在する場合は、そのファイル名を指定する。存在しない場合は、enter を押下し、実行に進む。以下の例では、symrcm リオーダリングを施して、コレスキー分解にて計算した結果である。Intel Haswell CPU 上の実行では 4スレッドで 47 秒、1スレッドでは 160 秒の時間が掛かるが、GPU 上では geforce 1080ti という倍精度演算器の極めて少ないボードを使用しても 19 秒程度で終了する。

実行

[kato@photon32]$ export OMP_NUM_THREADS=4　（CPU版の実行では4スレッド並列で実行する）
[kato@photon32]$ make run
$ make run
===== Program run =====
a.out
 0 = CPU and GPU, 1 = GPU only?                   CPU＋ GPU 計算か、 GPUのみの計算？
0
 1 = cholesky, 2 = LU, 3 = QR  ; What solver?     ソルバーの種類選択
1
 0 = No, schemes 1 = symrcm, 2 = symamd ; What kind of reordering?　Reordering scheme の選択
1

 =================================================
 Sparse matrix linear solver schem : Cholesky
 Reordering scheme of matrix : symrcm
 =================================================

 File name for Matrix Market?                     行列の入力ファイル名
/home/kato/GPGPU/cuLIBs/CG/Matrix_Market/DATA/parabolic_fem/parabolic_fem.mtx

 Matrix Market file name = /home/kato/GPGPU/cuLIBs/CG/Matrix_Market/DATA/parabolic_fem/parabolic_fem.mtx

 Properties of Matrix_Market matrix
 ====================================
 representation = coordinate
 type           = real
 matrix type    = symmetric
 row (m)        =  525825
 cols(n)        =  525825
 Non-zero elm.  = 3674625
 extendSymMatrix=       1


 =======================================================================================
 NOTICE: Triangular Symmetric matrix has been transformed to full matrix for SP solvers.
 =======================================================================================

 File name for LHS vector? if no available, press ENTER and set as RHS=1.0.  
                                    ! RHS(右辺ベクトル）の用意がないため、enter

 i,indx,jndx            1            1            1   0.1000003178914388
 i,indx,jndx            2            1       523522  -4.9999841054280604E-002
 i,indx,jndx            3            1       523523  -4.9999841054280604E-002
 i,indx,jndx            4            2            2   0.2000009536743164
 i,indx,jndx            5            2       523905  -4.9999841054280604E-002
 i,indx,jndx            6            2       523906   3.1789143880208332E-007
 i,indx,jndx            7            2       524290  -9.9999682108561208E-002
 i,indx,jndx            8            2       524802  -4.9999841054280604E-002
 i,indx,jndx            9            3            3   0.1000006357828777
 i,indx,jndx           10            3       525057  -4.9999841054280646E-002
 i,indx,jndx      3674615       525823       525823   0.2000009536743135
 i,indx,jndx      3674616       525824        33153  -4.9999841054280535E-002
 i,indx,jndx      3674617       525824       131841  -4.9999841054276260E-002
 i,indx,jndx      3674618       525824       328447  -9.9999682108561472E-002
 i,indx,jndx      3674619       525824       426751   3.1789143884642877E-007
 i,indx,jndx      3674620       525824       525824   0.2000009536743123
 i,indx,jndx      3674621       525825            5  -4.9999841054280604E-002
 i,indx,jndx      3674622       525825       131841  -4.9999841054280694E-002
 i,indx,jndx      3674623       525825       328448   3.1789143862453213E-007
 i,indx,jndx      3674624       525825       525313  -9.9999682108558099E-002
 i,indx,jndx      3674625       525825       525825   0.2000009536743136

 == CSR data check : Start from 1  to 10 elements ==
 csrRowPtrA=
          1          4          9         13         16         21         25
         30         37         42
 csrColIndA=
          1     523522     523523          2     523905     523906     524290
     524802          3     525057
 csrValA  =
 0.1000D+00 -.5000D-01 -.5000D-01 0.2000D+00 -.5000D-01 0.3179D-06 -.1000D+00
 -.5000D-01 0.1000D+00 -.5000D-01

 == Start from endpoint-10  to endpoint ==
 csrRowPtrA=
    3674576    3674581    3674586    3674591    3674596    3674601    3674606
    3674611    3674616    3674621    3674626
 csrColIndA=
     525823      33153     131841     328447     426751     525824          5
     131841     328448     525313     525825
 csrValA=
 0.2000D+00 -.5000D-01 -.5000D-01 -.1000D+00 0.3179D-06 0.2000D+00 -.5000D-01
 -.5000D-01 0.3179D-06 -.1000D+00 0.2000D+00

 ==================================================================
 The linear system is solved by sparse CHOLESKY factorization.
 Reordering scheme : symrcm

 Tolerance to decide if singular or not = .1000000000D-11
 A matrix has symmmetric pattern.
 ==================================================================
                                                                                    4スレッド並列時
 cuSOLVER SP solver result on [ CPU ]                                               47.714441 (sec)　
 ==================================================================================================
  |b - A*x| = 0.3055902197957D-09
        |A| = 0.8000012715659D+00
        |x| = 0.2638521162680D+06
  |b - A*x|/(|A|*|x|) = 0.1447732059137D-14

 x(      1)=0.263852116268E+06
 x(      2)=0.263093899026E+06
 x(      3)=0.263844899226E+06
 x(      4)=0.263852116268E+06
 x(      5)=0.263093899027E+06
 x(      6)=0.263844899225E+06
 x(      7)=0.263199487139E+06
 x(      8)=0.262666049525E+06
 x(      9)=0.263304916648E+06
 x(     10)=0.262560622769E+06

 x( 525815)=0.263094481758E+06
 x( 525816)=0.263094376198E+06
 x( 525817)=0.263094281209E+06
 x( 525818)=0.263094196778E+06
 x( 525819)=0.263094122892E+06
 x( 525820)=0.263094059539E+06
 x( 525821)=0.263094006710E+06
 x( 525822)=0.263093964396E+06
 x( 525823)=0.263093932591E+06
 x( 525824)=0.263093911290E+06
 x( 525825)=0.263093900489E+06

 cuSOLVER SP solver result on [ GPU ]                                               19.054326 (sec)
 ==================================================================================================
  |b - A*x| = 0.1527951098979D-09
        |A| = 0.8000012715659D+00
        |x| = 0.2638521162680D+06
  |b - A*x|/(|A|*|x|) = 0.7238660295685D-15

 x(      1)=0.263852116268E+06
 x(      2)=0.263093899027E+06
 x(      3)=0.263844899226E+06
 x(      4)=0.263852116268E+06
 x(      5)=0.263093899027E+06
 x(      6)=0.263844899226E+06
 x(      7)=0.263199487139E+06
 x(      8)=0.262666049525E+06
 x(      9)=0.263304916649E+06
 x(     10)=0.262560622769E+06

 x( 525815)=0.263094481758E+06
 x( 525816)=0.263094376198E+06
 x( 525817)=0.263094281209E+06
 x( 525818)=0.263094196778E+06
 x( 525819)=0.263094122892E+06
 x( 525820)=0.263094059539E+06
 x( 525821)=0.263094006710E+06
 x( 525822)=0.263093964396E+06
 x( 525823)=0.263093932591E+06
 x( 525824)=0.263093911290E+06
 x( 525825)=0.263093900490E+06

2. Windows 版プログラムソース

　上記で示したプログラム一式は、Linux 上で動作させることを前提としたものである。同じプログラムを Windows 上で動作させるために Makefile を変更したものとソース・ファイル一式を以下に提供する。以下のファイルをダウンロードしていただきたい。Linux 版の環境と異なる点は、PGI 17.10 において、Windows 版には PGI 用の cuSOLVER ライブラリのバンドルがないため、NVIDIA CUDA toolkit 9.0(8.0) のインストールで実装されたオリジナル cuSOLVER ライブラリをリンクするようにしている点である。

　プログラム tar ファイル：cuSOLVERsp_solvers.zip

　Linux 版の環境と異なる点は、PGI 17.10 において、Windows 版には PGI が提供する cuSOLVER ライブラリのバンドルがないため、NVIDIA CUDA toolkit 9.0(8.0) のインストールで実装されたオリジナル cuSOLVER ライブラリをリンクするようにしている点である。また、cuSOLVER SP ルーチンの Windows 版は、CPU 実行用のソルバー・ルーチンに対して、OpenMP 並列マルチスレッド対応がなされていないことである。CPU上のソルバーはシングル・スレッドで実行される。

　以下の例は、PGI 17.10 バージョンの PGI コマンドプロンプト（端末）を使用しての結果である。使用しているバージョンを確かめたい場合は、以下のように -V コマンド・オプションを指定する。

$  pgfortran -V

pgfortran 17.10-0 64-bit target on x86-64 Windows -tp sandybridge
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

　また、使用した GPU は以下のとおりである。

$ pgaccelinfo

CUDA Driver Version:           9000

Device Number:                 0
Device Name:                   GeForce GTX 780
Device Revision Number:        3.5
Global Memory Size:            3221225472
Number of Multiprocessors:     12
Number of SP Cores:            2304
Number of DP Cores:            768
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    941 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             3004 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 1572864 bytes
Max Threads Per SMP:           2048
Async Engines:                 1
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc35

PGI$ make
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  precision.f90
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mfixed  mmio.f
mmread:
    154, Loop not vectorized/parallelized: potential early exits
    199, Loop not vectorized/parallelized: potential early exits
    203, Loop not vectorized/parallelized: potential early exits
    207, Loop not vectorized/parallelized: potential early exits
    212, Loop not vectorized/parallelized: potential early exits
    251, Loop not vectorized/parallelized: potential early exits
    255, Loop not vectorized/parallelized: potential early exits
    259, Loop not vectorized/parallelized: potential early exits
mminfo:
    455, Loop not vectorized/parallelized: potential early exits
mmwrite:
    678, Loop not vectorized/parallelized: contains call
    682, Loop not vectorized/parallelized: contains call
    686, Loop not vectorized/parallelized: contains call
    692, Loop not vectorized/parallelized: contains call
    709, Loop not vectorized/parallelized: contains call
    713, Loop not vectorized/parallelized: contains call
    718, Loop not vectorized/parallelized: contains call
lowerc:
    771, Loop not vectorized/parallelized: contains call
getwd:
    795, Loop not vectorized/parallelized: potential early exits
countwd:
    826, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mcuda  cusolverSP_mod.cuf
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  measure_time.f90
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  mm_convert_csr.f90
matrix_printout:
     70, Loop not vectorized/parallelized: contains call
     79, Loop not vectorized/parallelized: contains call
set_matrix:
    140, Unrolled inner loop 8 times
         Generated 2 prefetch instructions for the loop
    150, Loop not vectorized/parallelized: contains call
    171, Memory copy idiom, loop replaced by call to __c_mcopy4
    172, Memory copy idiom, loop replaced by call to __c_mcopy4
    175, Memory copy idiom, loop replaced by call to __c_mcopy8
    179, Memory copy idiom, loop replaced by call to __c_mcopy4
    183, Memory copy idiom, loop replaced by call to __c_mcopy16
coo2csr_convert:
    361, Memory zero idiom, loop replaced by call to __c_mzero8
         Memory zero idiom, loop replaced by call to __c_mzero4
    364, Generated 3 alternate versions of the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
    374, Generating copy(csrvala(:),indx(:),jndx(:),rval(:),csrrowptra(:))
    392, Generating enter data copyin(p(:),pbuffer(:))
    411, Generating exit data delete(pbuffer(:),p(:))
    415, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        415, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    431, Memory copy idiom, loop replaced by call to __c_mcopy4
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  sp_solvers.f90
vec_max:
     33, Generated 2 alternate versions of the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
csr_mat_max:
     51, Generated 2 alternate versions of the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
cusolversps:
    139, Memory copy idiom, loop replaced by call to __c_mcopy8
    207, Generating copy(x(:))
         Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:))
         Generating copy(b(:))
    229, Loop not vectorized/parallelized: contains call
    233, Loop not vectorized/parallelized: contains call
    238, Memory copy idiom, loop replaced by call to __c_mcopy8
    239, Memory zero idiom, loop replaced by call to __c_mzero8
    246, Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:))
         Generating copy(b(:),x(:))
    284, Generating copy(x(:))
         Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:))
         Generating copy(b(:))
    306, Loop not vectorized/parallelized: contains call
    310, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda  main.f90
main:
     66, Memory set idiom, loop replaced by call to __c_mset8
     98, Memory zero idiom, loop replaced by call to __c_mzero8
pgf90 -o  a.out precision.obj mmio.obj cusolverSP_mod.obj measure_time.obj mm_convert_csr.obj sp_solvers.obj main.obj -acc -ta=tesla,
cuda9.0,cc35,cc60 -Mcuda -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\lib\x64"   -Mcudalib=cusparse,cusolver

PGI$ make run
===== Program run =====
a.out
 0 = CPU and GPU, 1 = GPU only?
0
 1 = cholesky, 2 = LU, 3 = QR  ; What solver?
3
 0 = No, schemes 1 = symrcm, 2 = symamd ; What kind of reordering?
1

 =================================================
 Sparse matrix linear solver schem : QR
 Reordering scheme of matrix : symrcm
 =================================================

 File name for Matrix Market?
lap3D_7pt_n20.mtx

 Matrix Market file name = lap3D_7pt_n20.mtx

 Properties of Matrix_Market matrix
 ====================================
 representation = coordinate
 type           = real
 matrix type    = symmetric
 row (m)        =    8000
 cols(n)        =    8000
 Non-zero elm.  =   53600
 extendSymMatrix=       1


 =======================================================================================
 NOTICE: Triangular Symmetric matrix has been transformed to full matrix for SP solvers.
 =======================================================================================

 File name for LHS vector? if no available, press ENTER and set as RHS=1.0.


 i,indx,jndx            1            1            1    6.000000000000000
 i,indx,jndx            2            1            2   -1.000000000000000
 i,indx,jndx            3            1           21   -1.000000000000000
 i,indx,jndx            4            1          401   -1.000000000000000
 i,indx,jndx            5            2            1   -1.000000000000000
 i,indx,jndx            6            2            2    6.000000000000000
 i,indx,jndx            7            2            3   -1.000000000000000
 i,indx,jndx            8            2           22   -1.000000000000000
 i,indx,jndx            9            2          402   -1.000000000000000
 i,indx,jndx           10            3            2   -1.000000000000000
 i,indx,jndx        53590         7998         7998    6.000000000000000
 i,indx,jndx        53591         7998         7999   -1.000000000000000
 i,indx,jndx        53592         7999         7599   -1.000000000000000
 i,indx,jndx        53593         7999         7979   -1.000000000000000
 i,indx,jndx        53594         7999         7998   -1.000000000000000
 i,indx,jndx        53595         7999         7999    6.000000000000000
 i,indx,jndx        53596         7999         8000   -1.000000000000000
 i,indx,jndx        53597         8000         7600   -1.000000000000000
 i,indx,jndx        53598         8000         7980   -1.000000000000000
 i,indx,jndx        53599         8000         7999   -1.000000000000000
 i,indx,jndx        53600         8000         8000    6.000000000000000

 == CSR data check : Start from 1  to 10 elements ==
 csrRowPtrA=
          1          5         10         15         20         25         30
         35         40         45
 csrColIndA=
          1          2         21        401          1          2          3
         22        402          2
 csrValA  =
 0.6000D+01 -.1000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.6000D+01 -.1000D+01
 -.1000D+01 -.1000D+01 -.1000D+01

 == Start from endpoint-10  to endpoint ==
 csrRowPtrA=
      53552      53557      53562      53567      53572      53577      53582
      53587      53592      53597      53601
 csrColIndA=
       7998       7999       7599       7979       7998       7999       8000
       7600       7980       7999       8000
 csrValA=
 0.6000D+01 -.1000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.6000D+01 -.1000D+01
 -.1000D+01 -.1000D+01 -.1000D+01 0.6000D+01

 ==================================================================
 The linear system is solved by sparse QR factorization.
 Reordering scheme : symrcm

 Tolerance to decide if singular or not = .1000000000D-11
 A matrix has symmmetric pattern.
 ==================================================================

 cuSOLVER SP solver result on [ CPU ]                                                2.355999 (sec)
 ==================================================================================================
  |b - A*x| = 0.5115907697473D-12
        |A| = 0.1200000000000D+02
        |x| = 0.2458019372584D+02
  |b - A*x|/(|A|*|x|) = 0.1734427507818D-14

 x(      1)=0.666997357463E+00
 x(      2)=0.100066138159E+01
 x(      3)=0.119691025971E+01
 x(      4)=0.132458774155E+01
 x(      5)=0.141248710662E+01
 x(      6)=0.147457380480E+01
 x(      7)=0.151841489418E+01
 x(      8)=0.154840911171E+01
 x(      9)=0.156717865908E+01
 x(     10)=0.157622105444E+01

 x(   7990)=0.157622105444E+01
 x(   7991)=0.157622105444E+01
 x(   7992)=0.156717865908E+01
 x(   7993)=0.154840911171E+01
 x(   7994)=0.151841489418E+01
 x(   7995)=0.147457380480E+01
 x(   7996)=0.141248710662E+01
 x(   7997)=0.132458774155E+01
 x(   7998)=0.119691025971E+01
 x(   7999)=0.100066138159E+01
 x(   8000)=0.666997357463E+00

 cuSOLVER SP solver result on [ GPU ]                                                0.683000 (sec)
 ==================================================================================================
  |b - A*x| = 0.4547473508865D-12
        |A| = 0.1200000000000D+02
        |x| = 0.2458019372584D+02
  |b - A*x|/(|A|*|x|) = 0.1541713340283D-14

 x(      1)=0.666997357463E+00
 x(      2)=0.100066138159E+01
 x(      3)=0.119691025971E+01
 x(      4)=0.132458774155E+01
 x(      5)=0.141248710662E+01
 x(      6)=0.147457380480E+01
 x(      7)=0.151841489418E+01
 x(      8)=0.154840911171E+01
 x(      9)=0.156717865908E+01
 x(     10)=0.157622105444E+01

 x(   7990)=0.157622105444E+01
 x(   7991)=0.157622105444E+01
 x(   7992)=0.156717865908E+01
 x(   7993)=0.154840911171E+01
 x(   7994)=0.151841489418E+01
 x(   7995)=0.147457380480E+01
 x(   7996)=0.141248710662E+01
 x(   7997)=0.132458774155E+01
 x(   7998)=0.119691025971E+01
 x(   7999)=0.100066138159E+01
 x(   8000)=0.666997357463E+00

ページの先頭へ