In the previous chapter (12-3), we showed how to solve a linear system for a sparse matrix with an iterative method, using a Bi-Conjugate Gradient Stabilized (BiCGStab) program built on OpenACC + cuSPARSE. This chapter presents an example program that solves a sparse linear system with a direct method, in other words the sparse-matrix counterpart of the LAPACK routines. This is achieved by using functions from the cuSOLVER SP library. Because OpenACC is used, computation on a GPU accelerator is possible with very little programming effort. A Fortran program is used as the example. A tar file archiving the source files and the Makefile shown here is provided; please download the following file.
Program tar file: openacc_cuSOLVER-SP.tar.gz
The following results were obtained with the PGI 17.10 release. To check which version you are using, specify the -V command option as shown below.
$ pgfortran -V

pgfortran 17.10-0 64-bit target on x86-64 Linux -tp haswell
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.
The GPU used is as follows.
$ pgaccelinfo
CUDA Driver Version:           9000
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  384.59  Wed Jul 19 23:53:34 PDT 2017

Device Number:                 0
Device Name:                   GeForce GTX 1080 Ti
Device Revision Number:        6.1
Global Memory Size:            11714691072
Number of Multiprocessors:     28
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1657 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             5505 MHz
Memory Bus Width:              352 bits
L2 Cache Size:                 2883584 bytes
Max Threads Per SMP:           2048
Async Engines:                 2
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc60
For an overview of the program that converts the COO format to the CSR format, see the explanation in the previous chapter. A minimal conversion sketch is shown below.
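For reference, the following is a minimal sketch of such a COO-to-CSR conversion. It is not the coo2csr_convert routine from the provided program; the subroutine name and the assumption that the COO entries are already sorted by row and, within a row, by column are mine.

! A minimal COO -> CSR conversion sketch. Assumes one-based indices and that the
! COO entries (indx, jndx, rval) are already sorted by row and, within a row,
! by column. The provided coo2csr_convert routine is more general than this.
subroutine coo_to_csr(m, nz, indx, jndx, rval, csrRowPtrA, csrColIndA, csrValA)
  implicit none
  integer,          intent(in)  :: m, nz
  integer,          intent(in)  :: indx(nz), jndx(nz)     ! COO row / column indices
  double precision, intent(in)  :: rval(nz)               ! COO values
  integer,          intent(out) :: csrRowPtrA(m+1), csrColIndA(nz)
  double precision, intent(out) :: csrValA(nz)
  integer :: i, k

  ! count the nonzeros of each row into csrRowPtrA(2:m+1)
  csrRowPtrA(:) = 0
  do k = 1, nz
     csrRowPtrA(indx(k)+1) = csrRowPtrA(indx(k)+1) + 1
  end do

  ! prefix sum turns the per-row counts into one-based row start offsets
  csrRowPtrA(1) = 1
  do i = 1, m
     csrRowPtrA(i+1) = csrRowPtrA(i+1) + csrRowPtrA(i)
  end do

  ! because the entries are already sorted, values and column indices copy over
  csrColIndA(1:nz) = jndx(1:nz)
  csrValA(1:nz)    = rval(1:nz)
end subroutine coo_to_csr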
One note on using the cuSOLVER SP routines: when a symmetric matrix is stored with only its upper (or lower) triangular elements, it must be expanded so that the other triangle is stored as well before it is passed to the cuSOLVER SP routines. For this reason, the mm_convert_csr.f90 program expands symmetric matrices carrying the Matrix Market "symmetric" attribute into "general" matrices. Inside the program, the flag variable extendSymMatrix controls this expansion from a symmetric to a general matrix; setting it to 1 enables the expansion. A sketch of the idea follows.
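The expansion simply mirrors every off-diagonal entry (i,j) of the triangular COO data to (j,i). A minimal sketch is shown below; this is illustration only, not the actual mm_convert_csr.f90 code, and the subroutine and array names are made up for this example.

! Minimal sketch of expanding a triangular "symmetric" COO matrix into a full
! (general) matrix: every off-diagonal entry (i,j) is mirrored to (j,i).
subroutine expand_symmetric_coo(nz, indx, jndx, rval, nz_full, findx, fjndx, frval)
  implicit none
  integer,          intent(in)  :: nz                      ! nonzeros of the stored triangle
  integer,          intent(in)  :: indx(nz), jndx(nz)      ! COO row / column indices
  double precision, intent(in)  :: rval(nz)                ! COO values
  integer,          intent(out) :: nz_full                 ! nonzeros after expansion
  integer,          allocatable, intent(out) :: findx(:), fjndx(:)
  double precision, allocatable, intent(out) :: frval(:)
  integer :: k, pos

  nz_full = nz + count(indx /= jndx)        ! each off-diagonal entry gains a mirror
  allocate (findx(nz_full), fjndx(nz_full), frval(nz_full))

  findx(1:nz) = indx
  fjndx(1:nz) = jndx
  frval(1:nz) = rval

  pos = nz
  do k = 1, nz
     if (indx(k) /= jndx(k)) then           ! mirror (i,j) -> (j,i)
        pos = pos + 1
        findx(pos) = jndx(k)
        fjndx(pos) = indx(k)
        frval(pos) = rval(k)
     end if
  end do
end subroutine expand_symmetric_coo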
For details of the CSR sparse-matrix format required for the input data, refer to the NVIDIA cuSPARSE documentation. The input arrays needed to express the coefficient matrix in CSR form are listed below (a worked example follows the table).
| Argument | Type | Description |
|---|---|---|
| nz | integer | Number of nonzero elements in the matrix. |
| csrValA | array | Data array of length nz holding all the nonzero values of A in row-major order. |
| csrRowPtrA | array | Integer array of length m + 1 holding offsets into the arrays csrColIndA and csrValA. The first m entries contain the index of the first nonzero element of row i, for i = 1, ..., m, and the last entry contains nz plus the index base. In general, the first entry reflects whether indexing starts from 0 or 1 and is therefore 0 or 1, respectively; here, 1 is used. |
| csrColIndA | array | Integer array of length nz containing the column indices of the corresponding elements of array csrValA. |
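As a concrete illustration of the table above, the small program below builds the CSR arrays for a 4 x 4 example matrix with one-based indexing (index_base = 1), as used throughout this article. The matrix and names are chosen only for this example.

program csr_example
  ! A worked CSR example (one-based indexing, index_base = 1) for the 4 x 4 matrix
  !       | 1  0  2  0 |
  !   A = | 0  3  0  0 |
  !       | 4  0  5  6 |
  !       | 0  0  0  7 |
  ! m = 4 rows, nz = 7 nonzero elements.
  implicit none
  integer, parameter :: m = 4, nz = 7
  double precision, parameter :: csrValA(nz)    = [1d0, 2d0, 3d0, 4d0, 5d0, 6d0, 7d0]
  integer,          parameter :: csrColIndA(nz) = [1, 3, 2, 1, 3, 4, 4]
  ! csrRowPtrA(i) is the position in csrValA/csrColIndA where row i starts;
  ! the last entry csrRowPtrA(m+1) = nz + 1 closes row m.
  integer,          parameter :: csrRowPtrA(m+1) = [1, 3, 4, 7, 8]
  integer :: i
  do i = 1, m
     print '(a,i0,a,i0,a)', "row ", i, " holds ", csrRowPtrA(i+1) - csrRowPtrA(i), " nonzeros"
  end do
end program csr_example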
For the technical documentation of the NVIDIA cuSOLVER SP routines, see this page. cuSolverSP provides a new set of sparse direct-solver routines based on sparse QR factorization, and in addition provides solver routines based on Cholesky factorization and on LU factorization. The program presented here is written so that all three solvers can be used. In numerical simulation, not every matrix has a sparsity pattern whose factorization parallelizes well, so the cuSolverSP library also provides routines that run on the CPU for matrices that behave essentially sequentially (the LU-factorization routine exists only in a CPU version; as of CUDA 9.0 no GPU version is provided). For matrices with high parallelism, the GPU routines deliver higher performance. Except for LU, the CPU routines for Cholesky and QR factorization are parallelized with OpenMP and can run multi-threaded on a multicore CPU.
The cuSOLVER SP routines include a matrix-reordering capability, controlled through a function argument. To reduce the fill-in that strongly affects factorization performance, two reordering schemes, symrcm and symamd, are provided. The input parameter reorder enables symrcm (symmetric reverse Cuthill-McKee permutation) when reorder = 1 and symamd (symmetric approximate minimum degree ordering based on the quotient graph) when reorder = 2; for any other value, no reordering is performed.
If the matrix turns out to be singular during factorization, the routine reports the smallest row index at which this was detected and the program prints a WARNING; the computed result is then invalid.
The sparse linear-equation solver routines used in this example are summarized in the table below. The sp_solvers.f90 file in the provided program contains the driver routine; a minimal call sketch follows the table.
| solver No. (solver) | cuSOLVER function name | CPU version | GPU version | Reordering (reorder) |
|---|---|---|---|---|
| 1 | Cholesky factorization: cusolverSpDcsrlsvchol[Host] | Yes (number of parallel threads set via OMP_NUM_THREADS) | Yes | No reordering = 0, symrcm reordering = 1, symamd reordering = 2 (common to all solvers) |
| 2 | LU factorization: cusolverSpDcsrlsvlu[Host] | Yes (single thread only) | No | |
| 3 | QR factorization: cusolverSpDcsrlsvqr[Host] | Yes (number of parallel threads set via OMP_NUM_THREADS) | Yes | |
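As a minimal sketch of how these routines are called, the subroutine below solves A*x = b on the CPU with the Cholesky host routine and symrcm reordering, using the cusolverSP_mod.cuf interfaces shown later in this article. The subroutine itself is not part of the provided program, and error handling of the returned status values is omitted.

! Minimal call sketch: Cholesky host solver with symrcm reordering.
! The caller supplies the CSR data (m, nz, csrValA, csrRowPtrA, csrColIndA) and b.
subroutine solve_chol_host(m, nz, csrValA, csrRowPtrA, csrColIndA, b, x)
  use cusparse
  use cuSOLVER_SP_Solver
  implicit none
  integer, intent(in)  :: m, nz
  real(8), intent(in)  :: csrValA(nz), b(m)
  integer, intent(in)  :: csrRowPtrA(m+1), csrColIndA(nz)
  real(8), intent(out) :: x(m)
  type(cusolverSpHandle) :: sp_handle
  type(cusparseMatDescr) :: descrA
  integer :: status, singularity
  integer, parameter :: reorder = 1            ! 0 = none, 1 = symrcm, 2 = symamd
  real(8), parameter :: tol     = 1.0d-12      ! tolerance used to flag singularity

  status = cusolverSpCreate(sp_handle)
  status = cusparseCreateMatDescr(descrA)
  status = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)
  status = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE)

  status = cusolverSpDcsrlsvcholHost(sp_handle, m, nz, descrA, &
           csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity)

  ! singularity = -1 means A is non-singular under tol; otherwise it holds the
  ! smallest row index at which the factorization broke down and x is invalid.
  if (singularity >= 0) print *, "WARNING: matrix is singular at row", singularity + 1

  status = cusparseDestroyMatDescr(descrA)
  status = cusolverSpDestroy(sp_handle)
end subroutine solve_chol_host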
As of PGI 17.10, no Fortran interface module is provided for the cuSOLVER SP (sparse) library, which is developed in CUDA C/C++, so I wrote my own (cusolverSP_mod.cuf), shown below. It is the interface layer that mediates the C/Fortran binding. If a future PGI compiler release ships a Fortran module for cuSOLVER SP (presumably usable via use cusolverSP), the cusolverSP_mod.cuf module below will no longer be necessary.
! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
module cuSOLVER_SP_Solver
use iso_c_binding
use cudafor
use cusparse
use cusolver_common
implicit none
! enum, bind(C) ! cusolverStatus_t
! enumerator :: CUSOLVER_STATUS_SUCCESS=0
! enumerator :: CUSOLVER_STATUS_NOT_INITIALIZED=1
! enumerator :: CUSOLVER_STATUS_ALLOC_FAILED=2
! enumerator :: CUSOLVER_STATUS_INVALID_VALUE=3
! enumerator :: CUSOLVER_STATUS_ARCH_MISMATCH=4
! enumerator :: CUSOLVER_STATUS_MAPPING_ERROR=5
! enumerator :: CUSOLVER_STATUS_EXECUTION_FAILED=6
! enumerator :: CUSOLVER_STATUS_INTERNAL_ERROR=7
! enumerator :: CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED=8
! enumerator :: CUSOLVER_STATUS_NOT_SUPPORTED = 9
! enumerator :: CUSOLVER_STATUS_ZERO_PIVOT=10
! enumerator :: CUSOLVER_STATUS_INVALID_LICENSE=11
! end enum
type cusolverSpHandle
type(c_ptr) :: handle
end type cusolverSpHandle
! cusolverSpCreate
interface
integer(c_int) function cusolverSpCreate(handle) bind(C,name='cusolverSpCreate')
import cusolverSpHandle
type(cusolverSpHandle) :: handle
end function cusolverSpCreate
end interface
! cusolverSpDestroy
interface
integer(c_int) function cusolverSpDestroy(handle) bind(C,name='cusolverSpDestroy')
import cusolverSpHandle
type(cusolverSpHandle), value :: handle
end function cusolverSpDestroy
end interface
! cusolverSpXcsrissymHost
interface
integer(c_int) function cusolverSpXcsrissymHost(handle, m, nnzA, descrA, &
csrRowPtrA, csrEndPtrA, csrColIndA, issym) bind(C,name='cusolverSpXcsrissymHost')
use iso_c_binding
import cusolverSpHandle, cusparseMatDescr
type(cusparseMatDescr), value :: descrA
type(cusolverSpHandle), value :: handle
integer(c_int), value :: m, nnzA
integer(c_int) :: issym
integer(c_int) :: csrRowPtrA(*), csrEndPtrA(*), csrColIndA(*)
end function cusolverSpXcsrissymHost
end interface
! cusolverSpXcsrsymrcmHost
interface
integer(c_int) function cusolverSpXcsrsymrcmHost(handle, m, nnzA, descrA, &
csrRowPtrA, csrColIndA, q) bind(C,name='cusolverSpXcsrsymrcmHost')
use iso_c_binding
import cusolverSpHandle, cusparseMatDescr
type(cusparseMatDescr), value :: descrA
type(cusolverSpHandle), value :: handle
integer(c_int), value :: m, nnzA
integer(c_int) :: q(*)
integer(c_int) :: csrRowPtrA(*), csrColIndA(*)
end function cusolverSpXcsrsymrcmHost
end interface
! cusolverSpXcsrsymmdqHost
interface
integer(c_int) function cusolverSpXcsrsymmdqHost(handle, m, nnzA, descrA, &
csrRowPtrA, csrColIndA, q) bind(C,name='cusolverSpXcsrsymmdqHost')
use iso_c_binding
import cusolverSpHandle, cusparseMatDescr
type(cusparseMatDescr), value :: descrA
type(cusolverSpHandle), value :: handle
integer(c_int), value :: m, nnzA
integer(c_int) :: q(*)
integer(c_int) :: csrRowPtrA(*), csrColIndA(*)
end function cusolverSpXcsrsymmdqHost
end interface
! cusolverSpXcsrperm_bufferSizeHost
interface
integer(c_int) function cusolverSpXcsrperm_bufferSizeHost(handle, m, n, nnzA, &
descrA, csrRowPtrA, csrColIndA, p, q, bufferSizeInBytes ) &
bind(C,name="cusolverSpXcsrperm_bufferSizeHost")
use iso_c_binding
import cusolverSpHandle, cusparseMatDescr
type(cusolverSpHandle), value :: handle
type(cusparseMatDescr), value :: descrA
integer(c_int), value :: m, n, nnzA
integer(c_int) :: p(*), q(*)
integer(c_int) :: csrRowPtrA(*), csrColIndA(*)
integer(c_long) :: bufferSizeInBytes
end function cusolverSpXcsrperm_bufferSizeHost
end interface
! cusolverSpXcsrpermHost
interface
integer(c_int) function cusolverSpXcsrpermHost(handle, m, n, nnzA, &
descrA, csrRowPtrA, csrColIndA, p, q, map, buffer ) &
bind(C,name="cusolverSpXcsrpermHost")
use iso_c_binding
import cusolverSpHandle, cusparseMatDescr
type(cusolverSpHandle), value :: handle
type(cusparseMatDescr), value :: descrA
integer(c_int), value :: m, n, nnzA
integer(c_int) :: p(*), q(*)
integer(c_int) :: csrRowPtrA(*), csrColIndA(*)
integer(c_int) :: map(*)
character(c_char) :: buffer(*)
end function cusolverSpXcsrpermHost
end interface
! cusolverSpDcsrlsvcholHost
interface
integer(c_int) function cusolverSpDcsrlsvcholHost( handle, m, nnzA, &
descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) &
bind(C,name="cusolverSpDcsrlsvcholHost")
use iso_c_binding
import cusolverSpHandle, cusparseMatDescr
type(cusolverSpHandle), value :: handle
type(cusparseMatDescr), value :: descrA
integer(c_int), value :: m, nnzA
real(c_double), intent(in) :: csrValA(*)
integer(c_int), intent(in) :: csrRowPtrA(*), csrColIndA(*)
real(c_double), intent(in) :: b(*)
real(c_double) :: x(*)
real(c_double), value :: tol
integer(c_int), value :: reorder
integer(c_int) :: singularity
end function cusolverSpDcsrlsvcholHost
end interface
! cusolverSpDcsrlsvluHost
interface
integer(c_int) function cusolverSpDcsrlsvluHost( handle, m, nnzA, &
descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) &
bind(C,name="cusolverSpDcsrlsvluHost")
use iso_c_binding
import cusolverSpHandle, cusparseMatDescr
type(cusolverSpHandle), value :: handle
type(cusparseMatDescr), value, intent(in) :: descrA
integer(c_int), value :: m, nnzA
real(c_double), intent(in) :: csrValA(*)
integer(c_int), intent(in) :: csrRowPtrA(*), csrColIndA(*)
real(c_double), intent(in) :: b(*)
real(c_double) :: x(*)
real(c_double), value :: tol
integer(c_int), value :: reorder
integer(c_int) :: singularity
end function cusolverSpDcsrlsvluHost
end interface
! cusolverSpDcsrlsvqrHost
interface
integer(c_int) function cusolverSpDcsrlsvqrHost( handle, m, nnzA, &
descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) &
bind(C,name="cusolverSpDcsrlsvqrHost")
use iso_c_binding
import cusolverSpHandle, cusparseMatDescr
type(cusolverSpHandle), value :: handle
type(cusparseMatDescr), value, intent(in) :: descrA
integer(c_int), value :: m, nnzA
real(c_double), intent(in) :: csrValA(*)
integer(c_int), intent(in) :: csrRowPtrA(*), csrColIndA(*)
real(c_double), intent(in) :: b(*)
real(c_double) :: x(*)
real(c_double), value :: tol
integer(c_int), value :: reorder
integer(c_int) :: singularity
end function cusolverSpDcsrlsvqrHost
end interface
! cusolverSpDcsrlsvchol
interface
integer(c_int) function cusolverSpDcsrlsvchol( handle, m, nnzA, &
descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) &
bind(C,name="cusolverSpDcsrlsvchol")
use iso_c_binding
import cusolverSpHandle, cusparseMatDescr
type(cusolverSpHandle), value :: handle
type(cusparseMatDescr), value :: descrA
integer(c_int), value :: m, nnzA
real(c_double), device, intent(in) :: csrValA(*)
integer(c_int), device, intent(in) :: csrRowPtrA(*), csrColIndA(*)
real(c_double), device, intent(in) :: b(*)
real(c_double), device :: x(*)
real(c_double), value :: tol
integer(c_int), value :: reorder
integer(c_int) :: singularity
end function cusolverSpDcsrlsvchol
end interface
! cusolverSpDcsrlsvqr
interface
integer(c_int) function cusolverSpDcsrlsvqr( handle, m, nnzA, &
descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) &
bind(C,name="cusolverSpDcsrlsvqr")
use iso_c_binding
import cusolverSpHandle, cusparseMatDescr
type(cusolverSpHandle), value :: handle
type(cusparseMatDescr), value :: descrA
integer(c_int), value :: m, nnzA
real(c_double), device, intent(in) :: csrValA(*)
integer(c_int), device, intent(in) :: csrRowPtrA(*), csrColIndA(*)
real(c_double), device, intent(in) :: b(*)
real(c_double), device :: x(*)
real(c_double), value :: tol
integer(c_int), value :: reorder
integer(c_int) :: singularity
end function cusolverSpDcsrlsvqr
end interface
end module cuSOLVER_SP_Solver
The main program is shown below. It does not support complex coefficient matrices; only double-precision real data can be handled. First, specify the file name of the Matrix Market data; next, if a right-hand-side (rhs) data file exists, specify it. If it does not exist, the rhs vector is set to all 1.0. After the input data are read, the coo2csr_convert routine is called to convert the COO-format data to CSR format. The conversion produces the csrRowPtrA, csrColIndA, and csrValA arrays, and this CSR matrix is passed as actual arguments to the cusolverSPs driver routine, which carries out the computation. Independently of this program, if you can prepare the csrRowPtrA, csrColIndA, and csrValA arrays in your own code, you can use the cusolverSPs driver routine to solve linear systems.
program main
use precision
use Matrix_data
use mm_coo_cuda_csr_convert
use SP_solvers
implicit none
integer :: i, j, istat
integer,parameter :: iunit1 = 10 ! for Matrix Market data (COO)
integer,parameter :: iunit2 = 11 ! for RHS data file, if available
character*80 :: mm_file, rhs_file
integer :: extendSymMatrix ! 1 = symmetrize the pattern (transform from triangular to full)
integer :: gpuonly ! 0 = both CPU and GPU, 1 = GPU only
integer :: solver ! 1 = cholesky, 2 = LU, 3 = QR
integer :: reorder ! Reordering ; 0 = no, 1 = symrcm, 2 = symamd reordering
! real(fp_kind), parameter :: tol = 0.0d0 ! tolerance to decide if singular or not
real(fp_kind), parameter :: tol = 1.0d-12 ! tolerance to decide if singular or not
integer, parameter :: index_base = 1 ! CSR format -- one based index
! Kind of Solver, a scheme of reordering
print *, "0 = CPU and GPU, 1 = GPU only?"
read (5, '(i1)') gpuonly
print *, "1 = cholesky, 2 = LU, 3 = QR ; What solver?"
read (5, '(i1)') solver
if ( .not. (solver == 1 .or. solver == 2 .or. solver == 3) ) solver = 1
print *, "0 = No, schemes 1 = symrcm, 2 = symamd ; What kind of reordering?"
read (5, '(i1)') reorder
print '(/1x,a)', "================================================="
if ( solver == 1) print *, "Sparse matrix linear solver schem : Cholesky"
if ( solver == 2) print *, "Sparse matrix linear solver schem : LU"
if ( solver == 3) print *, "Sparse matrix linear solver schem : QR"
if ( reorder == 1 ) then
print *, "Reordering scheme of matrix : symrcm"
else if ( reorder == 2 ) then
print *, "Reordering scheme of matrix : symamd"
else
reorder = 0
print *, "No reordering of matrix"
end if
print '(1x,a/)', "================================================="
! Read matrix Market data file
extendSymMatrix = 1 ! if =1 .and. symmetric matrix=yes , transform from triangular to full
print *, "File name for Matrix Market?"
read (5, '(a)') mm_file
! mm_file ="lap2D_5pt_n100.mtx"
print '(/1x,a,a)', "Matrix Market file name = ",mm_file
open (iunit1, file=mm_file, status='old')
print *
call mm_matrix_read(iunit1, extendSymMatrix)
! Read right hand side vectors, if available.
print *, "File name for LHS vector? if no available, press ENTER and set as RHS=1.0."
print *
read (5, '(a)') rhs_file
if (rhs_file == "") then
allocate (rhs(rows))
rhs= 1.0_fp_kind
else
print *, "rhs_file name = ",rhs_file
open (iunit2, file=rhs_file, status='old')
call mm_rhs_read(iunit2)
end if
! Convert COO to CSR sparse matrix
! INPUT : Matrix COO data from Matrix Market format via mm_rhs_read routine
! OUTPUT : CSR data = csrRowPtrA(m+1), csrColIndA(nz), csrValA(nz) -- global data
call coo2csr_convert
! CSR sparse matrix data are set. Below are the arguments for the cusolverSPs routine.
! Arrays for CUDA cuSPARSE CSR format
! integer :: m IN ! rows size for square matrix
! integer :: n IN ! columns size for square matrix (n=m)
! integer :: nz IN ! nonzero element of A
! integer :: csrRowPtrA (m+1) IN ! the array of offsets corresponding to the start of each row
! integer :: csrColIndA (nz) IN ! sorted by row and by column within each row
! real(fp_kind) :: csrValA (nz) IN ! sorted by row and by column within each row
! real(fp_kind) :: x(m) INOUT ! initial guess & result
! real(fp_kind) :: rhs(m) IN ! right hand side
! Print CSR Matrix data
! print *, "csrValA", csrValA
! print *, "csrRowPtrA",csrRowPtrA
! print *, "csrColIndA", csrColIndA
! Initialize X
allocate (x(n))
x = 0.0_fp_kind ! or Initial guess X is set
! cuSOLVER SP (Cholesky, LU, QR)
call cusolverSPs (gpuonly, solver, reorder, m, n, nz, csrRowPtrA, csrColIndA, csrValA, x, rhs, &
tol, index_base)
end program main
! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
module SP_solvers
use precision
implicit none
contains
real(fp_kind) function vec_max(n, x)
implicit none
integer :: i, n
real(fp_kind) :: x(n)
vec_max = 0.0_fp_kind
do i = 1, n
vec_max = max( abs(x(i)), vec_max )
end do
end function
real(fp_kind) function csr_mat_max(m, n, nz, csrValA, csrRowPtrA, csrColIndA)
implicit none
integer :: m, n, nz
integer :: csrRowPtrA(m+1), csrColIndA(nz)
real(fp_kind) :: csrValA(nz)
real(fp_kind) :: sum, norminf
integer :: i, j, start, end
norminf = 0.0_fp_kind
do i = 1, m
sum = 0.0_fp_kind
start = csrRowPtrA(i )
end = csrRowPtrA(i+1) - 1
do j = start, end
sum = sum + abs(csrValA(j))
end do
norminf = max (norminf, sum)
end do
csr_mat_max = norminf
end function
!=============================================================
subroutine cusolverSPs(gpuonly, solver, reorder, m, n, nz, csrRowPtrA, csrColIndA, csrValA, x, rhs, &
tol, index_base)
!-------------------------------------------------------------
! Direct sparse solvers via cuSOLVER SP (Cholesky / LU / QR)
! A is an n × n sparse matrix defined in CSR storage format
!-------------------------------------------------------------
use measure_time
use cusparse
use cuSOLVER_SP_Solver
implicit none
! Arrays for CUDA cuSPARSE CSR format
integer :: m ! number of rows of matrix
integer :: n ! number of columns of matrix
integer :: nz ! nonzero elements of A
integer :: csrRowPtrA(n+1) ! the array of offsets corresponding to the start of each row
integer :: csrColIndA (nz) ! sorted by row and by column within each row
real(fp_kind) :: csrValA (nz) ! sorted by row and by column within each row
! B = Q*A*Q^T
real(fp_kind) :: x (n) ! initial guess & result
real(fp_kind) :: rhs(n) ! right hand side
real(fp_kind) :: b (n) ! a copy of rhs
character(c_char), allocatable :: buffer(:)
integer :: index_base
integer :: gpuonly ! 0 = both CPU and GPU, 1 = GPU only
integer :: solver ! 1 = cholesky, 2 = LU, 3 = QR
integer :: reorder ! reordering ; 0 = no, 1 = symrcm, 2 = symamd reordering
integer :: singularity ! -1 if A is invertible under tol
integer :: issym
real(fp_kind) :: rsum, diff, error
real(fp_kind) :: x_inf, r_inf, A_inf
real(fp_kind) :: tol ! to decide if singular or not
! Local variables
type(cusolverSpHandle) :: cusolver_H
type(cusparseHandle) :: cusparse_H
type(cusparseMatDescr) :: descrA
integer :: cusolver_status
integer :: status, ierr_code
real(fp_kind),parameter:: one = 1.0, zero = 0.0, onem1 = -1.0
integer :: i, j, k
integer :: nprint
! Parameters
integer, parameter :: idbg = 0
real(fp_kind),parameter :: eps = 6.9d-310 ! tiny threshold value (not the machine epsilon; unused in this routine)
! Initialize
singularity = 0
issym = 0
if (m /= n) then
print *,"Error: only support square matrix."
stop
end if
!
print '(/1x,a)',"=================================================================="
if (solver == 1) then
print '(1x,a)', "The linear system is solved by sparse CHOLESKY factorization."
else if (solver == 2) then
print '(1x,a)', "The linear system is solved by sparse LU factorization."
else if (solver == 3) then
print '(1x,a)', "The linear system is solved by sparse QR factorization."
end if
if (reorder == 0) then
print '(1x,a)', "Reordering scheme : No reordering"
else if (reorder == 1) then
print '(1x,a)', "Reordering scheme : symrcm"
else if (reorder == 2) then
print '(1x,a)', "Reordering scheme : symamd"
end if
print '(/1x,a,d15.10,2x,a,i8/)',"Tolerance to decide if singular or not = ", tol
! set b = rhs, where Ax = rhs (Ax = b)
b(:) = rhs(:)
k = 0
! Create cusolver/cusparse handle, and matrix descriptor
status = cusolverSpCreate(cusolver_H)
status = cusparseCreate(cusparse_H)
status = cusparseCreateMatDescr(descrA)
status = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)
status = cusparseSetMatDiagType(descrA, CUSPARSE_DIAG_TYPE_NON_UNIT)
status = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE) ! base=1
! Verify if A has symmetric pattern or not
status = cusolverSpXcsrissymHost( &
cusolver_H, m, nz, descrA, csrRowPtrA(1), csrRowPtrA(2), csrColIndA, issym)
if (issym == 0) then
print *, "A matrix has no symmmetric pattern."
else
print *, "A matrix has symmmetric pattern."
end if
print '(1x,a)',"=================================================================="
if(idbg .eq. 1) print *, "cusolverSpXcsrissymHost-status=", status
! Clock init
status = init_clock
if (gpuonly == 1 ) goto 100
! Clock start
status = start_clock
if ( solver == 1 ) then ! cholesky
status = cusolverSpDcsrlsvcholHost(cusolver_H, m, nz, &
descrA, csrValA, csrRowPtrA, csrColIndA, &
b, tol, reorder, x, singularity)
if(idbg .ge. 1) then
print *
print *,"singularity = ", singularity
print *,"cusolverSpDcsrlsvcholHost-status = ", status
end if
else if ( solver == 2 ) then ! LU
status = cusolverSpDcsrlsvluHost(cusolver_H, m, nz, &
descrA, csrValA, csrRowPtrA, csrColIndA, &
b, tol, reorder, x, singularity)
if(idbg .ge. 1) then
print *
print *,"cusolverSpDcsrlsvluHost-status = ", status
print *,"singularity = ", singularity
end if
else if ( solver == 3 ) then ! QR
status = cusolverSpDcsrlsvqrHost(cusolver_H, m, nz, &
descrA, csrValA, csrRowPtrA, csrColIndA, &
b, tol, reorder, x, singularity)
if(idbg .eq. 1) then
print *
print *,"cusolverSpDcsrlsvqrHost-status = ", status
print *,"singularity = ", singularity
end if
else
print *,"Error: Solver",solver," is unknown solver"
end if
status = end_clock
time_tag = "cuSOLVER SP solver result on [ CPU ]"
status = print_time(time_tag)
!$acc data copy(b, X) copyin( csrValA, csrRowPtrA, csrColIndA )
!$acc host_data use_device(b, x, csrValA, csrRowPtrA, csrColIndA)
! evaluate residual b = b - A*x (result on CPU)
status = cusparseDcsrmv(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
m, n, nz, onem1, descrA, csrValA, csrRowPtrA, csrColIndA, x, one, b)
!$acc end host_data
!$acc end data
if ( singularity .ge. 0 ) &
print '(/2x,a,i5,2x,a,d10.3/)', "WARNING!! : the matrix is singular at row", singularity+1, "under tolerance", tol
r_inf = vec_max (m, b)
x_inf = vec_max (n, x)
A_inf =csr_mat_max(m, n, nz, csrValA, csrRowPtrA, csrColIndA)
print '(2x,a,d20.13)',"|b - A*x| =", r_inf
print '(2x,a,d20.13)'," |A| =", A_inf
print '(2x,a,d20.13)'," |x| =", x_inf
print '(2x,a,d20.13)',"|b - A*x|/(|A|*|x|) =", r_inf/(A_inf * x_inf)
! output result
print *
nprint = n
if ( n > 10 ) nprint = 10
do i = 1, nprint
print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
end do
print *,""
do i = n-nprint, n
print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
end do
! reset b = rhs, where Ax = rhs (Ax = b)
b(:) = rhs(:)
x = 0.0_fp_kind ! reset X
100 continue
! Clock start
status = start_clock
!$acc data copy(b, X) copyin( csrValA, csrRowPtrA, csrColIndA )
!$acc host_data use_device(b, x, csrValA, csrRowPtrA, csrColIndA)
if ( solver == 1 ) then ! cholesky
status = cusolverSpDcsrlsvchol(cusolver_H, m, nz, &
descrA, csrValA, csrRowPtrA, csrColIndA, &
b, tol, reorder, x, singularity)
if(idbg .eq. 1) then
print *
print *,"singularity = ", singularity
print *,"cusolverSpDcsrlsvchol-status = ", status
end if
else if ( solver == 2 ) then ! LU
print '(/a)'," LU SP solver for GPU is not available now."
print *,"-------------------------------------------"
stop
else if ( solver == 3 ) then ! QR
status = cusolverSpDcsrlsvqr(cusolver_H, m, nz, &
descrA, csrValA, csrRowPtrA, csrColIndA, &
b, tol, reorder, x, singularity)
if(idbg .eq. 1) then
print *
print *,"cusolverSpDcsrlsvqr-status = ", status
print *,"singularity = ", singularity
end if
else
print *,"Error: Solver",solver," is unknown solver"
end if
!$acc end host_data
!$acc end data
status = end_clock
time_tag = "cuSOLVER SP solver result on [ GPU ]"
status = print_time(time_tag)
!$acc data copy(b, X) copyin( csrValA, csrRowPtrA, csrColIndA )
!$acc host_data use_device(b, x, csrValA, csrRowPtrA, csrColIndA)
! evaluate residual b = b - A*x (result on CPU)
status = cusparseDcsrmv(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
m, n, nz, onem1, descrA, csrValA, csrRowPtrA, csrColIndA, x, one, b)
!$acc end host_data
!$acc end data
if ( singularity .ge. 0 ) &
print '(/2x,a,i5,2x,a,d10.3/)', "Warning!! : the matrix is singular at row", singularity+1, "under tolerance", tol
r_inf = vec_max (m, b)
x_inf = vec_max (n, x)
A_inf =csr_mat_max(m, n, nz, csrValA, csrRowPtrA, csrColIndA)
print '(2x,a,d20.13)',"|b - A*x| =", r_inf
print '(2x,a,d20.13)'," |A| =", A_inf
print '(2x,a,d20.13)'," |x| =", x_inf
print '(2x,a,d20.13)',"|b - A*x|/(|A|*|x|) =", r_inf/(A_inf * x_inf)
! output result
print *
nprint = n
if ( n > 10 ) nprint = 10
do i = 1, nprint
print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
end do
print *,""
do i = n-nprint, n
print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
end do
status = cusolverSpDestroy(cusolver_H)
status = cusparseDestroy(cusparse_H)
status = cusparseDestroyMatDescr(descrA)
end subroutine cusolverSPs
End module SP_solvers
Compile and link according to the Makefile provided in the tar file. This Makefile is set up to use the CUDA 9.0 toolkit libraries, but CUDA 8.0 can also be used. The cusparse and cusolver libraries are the versions bundled with the PGI compiler, linked via -Mcudalib=cusparse,cusolver.
$ make
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda precision.f90
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mfixed mmio.f
mmread:
154, Loop not vectorized/parallelized: potential early exits
199, Loop not vectorized/parallelized: potential early exits
203, Loop not vectorized/parallelized: potential early exits
207, Loop not vectorized/parallelized: potential early exits
212, Loop not vectorized/parallelized: potential early exits
251, Loop not vectorized/parallelized: potential early exits
255, Loop not vectorized/parallelized: potential early exits
259, Loop not vectorized/parallelized: potential early exits
mminfo:
455, Loop not vectorized/parallelized: potential early exits
mmwrite:
678, Loop not vectorized/parallelized: contains call
682, Loop not vectorized/parallelized: contains call
686, Loop not vectorized/parallelized: contains call
692, Loop not vectorized/parallelized: contains call
709, Loop not vectorized/parallelized: contains call
713, Loop not vectorized/parallelized: contains call
718, Loop not vectorized/parallelized: contains call
lowerc:
771, Loop not vectorized/parallelized: contains call
getwd:
795, Loop not vectorized/parallelized: potential early exits
countwd:
826, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mcuda cusolverSP_mod.cuf
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda measure_time.f90
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda mm_convert_csr.f90
matrix_printout:
70, Loop not vectorized/parallelized: contains call
79, Loop not vectorized/parallelized: contains call
set_matrix:
140, Unrolled inner loop 8 times
Generated 2 prefetch instructions for the loop
150, Loop not vectorized/parallelized: contains call
171, Memory copy idiom, loop replaced by call to __c_mcopy4
172, Memory copy idiom, loop replaced by call to __c_mcopy4
175, Memory copy idiom, loop replaced by call to __c_mcopy8
179, Memory copy idiom, loop replaced by call to __c_mcopy4
183, Memory copy idiom, loop replaced by call to __c_mcopy16
coo2csr_convert:
361, Memory zero idiom, loop replaced by call to __c_mzero8
Memory zero idiom, loop replaced by call to __c_mzero4
364, Generated an alternate version of the loop
Generated vector simd code for the loop
Generated 2 prefetch instructions for the loop
Generated vector simd code for the loop
Generated 2 prefetch instructions for the loop
374, Generating copy(csrvala(:),indx(:),rval(:),jndx(:),csrrowptra(:))
392, Generating enter data copyin(p(:),pbuffer(:))
411, Generating exit data delete(p(:),pbuffer(:))
415, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
415, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
431, Memory copy idiom, loop replaced by call to __c_mcopy4
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda sp_solvers.f90
vec_max:
33, Generated an alternate version of the loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
csr_mat_max:
51, Generated an alternate version of the loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
cusolversps:
139, Memory copy idiom, loop replaced by call to __c_mcopy8
207, Generating copyin(csrvala(:))
Generating copy(x(:))
Generating copyin(csrrowptra(:),csrcolinda(:))
Generating copy(b(:))
229, Loop not vectorized/parallelized: contains call
233, Loop not vectorized/parallelized: contains call
238, Memory copy idiom, loop replaced by call to __c_mcopy8
245, Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:))
Generating copy(b(:),x(:))
283, Generating copyin(csrvala(:))
Generating copy(x(:))
Generating copyin(csrrowptra(:),csrcolinda(:))
Generating copy(b(:))
305, Loop not vectorized/parallelized: contains call
309, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda main.f90
main:
66, Memory set idiom, loop replaced by call to __c_mset8
98, Memory zero idiom, loop replaced by call to __c_mzero8
pgf90 -o a.out precision.o mmio.o cusolverSP_mod.o measure_time.o mm_convert_csr.o sp_solvers.o main.o
-acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda -Mcudalib=cusparse,cusolver
For the sample coefficient matrices (mtx data), see the explanation in the previous chapter. The provided tar file contains symmetric matrices in mtx format, such as lap3D_7pt_n20.mtx, which can be used as test data.
Execution is started with the make run command. You are prompted for the input mtx file, so specify the name of the mtx file you obtained. Next, if an rhs vector file exists, specify its file name; if not, press Enter and the run proceeds. The example below shows the result of applying symrcm reordering and solving with Cholesky factorization. On an Intel Haswell CPU the run takes 47 seconds with 4 threads and 160 seconds with 1 thread, while on the GPU it finishes in about 19 seconds even on a GeForce GTX 1080 Ti, a board with very few double-precision units.
[kato@photon32]$ export OMP_NUM_THREADS=4    (run the CPU version with 4 parallel threads)
[kato@photon32]$ make run
$ make run
===== Program run =====
a.out
0 = CPU and GPU, 1 = GPU only?                          (CPU + GPU computation, or GPU only?)
0
1 = cholesky, 2 = LU, 3 = QR ; What solver?             (choice of solver)
1
0 = No, schemes 1 = symrcm, 2 = symamd ; What kind of reordering?   (choice of reordering scheme)
1
=================================================
Sparse matrix linear solver schem : Cholesky
Reordering scheme of matrix : symrcm
=================================================
File name for Matrix Market?                            (input file name of the matrix)
/home/kato/GPGPU/cuLIBs/CG/Matrix_Market/DATA/parabolic_fem/parabolic_fem.mtx
Matrix Market file name = /home/kato/GPGPU/cuLIBs/CG/Matrix_Market/DATA/parabolic_fem/parabolic_fem.mtx
Properties of Matrix_Market matrix
====================================
representation = coordinate
type = real
matrix type = symmetric
row (m) = 525825
cols(n) = 525825
Non-zero elm. = 3674625
extendSymMatrix= 1
=======================================================================================
NOTICE: Triangular Symmetric matrix has been transformed to full matrix for SP solvers.
=======================================================================================
File name for LHS vector? if no available, press ENTER and set as RHS=1.0.
! no RHS (right-hand-side vector) file is prepared, so just press Enter
i,indx,jndx 1 1 1 0.1000003178914388
i,indx,jndx 2 1 523522 -4.9999841054280604E-002
i,indx,jndx 3 1 523523 -4.9999841054280604E-002
i,indx,jndx 4 2 2 0.2000009536743164
i,indx,jndx 5 2 523905 -4.9999841054280604E-002
i,indx,jndx 6 2 523906 3.1789143880208332E-007
i,indx,jndx 7 2 524290 -9.9999682108561208E-002
i,indx,jndx 8 2 524802 -4.9999841054280604E-002
i,indx,jndx 9 3 3 0.1000006357828777
i,indx,jndx 10 3 525057 -4.9999841054280646E-002
i,indx,jndx 3674615 525823 525823 0.2000009536743135
i,indx,jndx 3674616 525824 33153 -4.9999841054280535E-002
i,indx,jndx 3674617 525824 131841 -4.9999841054276260E-002
i,indx,jndx 3674618 525824 328447 -9.9999682108561472E-002
i,indx,jndx 3674619 525824 426751 3.1789143884642877E-007
i,indx,jndx 3674620 525824 525824 0.2000009536743123
i,indx,jndx 3674621 525825 5 -4.9999841054280604E-002
i,indx,jndx 3674622 525825 131841 -4.9999841054280694E-002
i,indx,jndx 3674623 525825 328448 3.1789143862453213E-007
i,indx,jndx 3674624 525825 525313 -9.9999682108558099E-002
i,indx,jndx 3674625 525825 525825 0.2000009536743136
== CSR data check : Start from 1 to 10 elements ==
csrRowPtrA=
1 4 9 13 16 21 25
30 37 42
csrColIndA=
1 523522 523523 2 523905 523906 524290
524802 3 525057
csrValA =
0.1000D+00 -.5000D-01 -.5000D-01 0.2000D+00 -.5000D-01 0.3179D-06 -.1000D+00
-.5000D-01 0.1000D+00 -.5000D-01
== Start from endpoint-10 to endpoint ==
csrRowPtrA=
3674576 3674581 3674586 3674591 3674596 3674601 3674606
3674611 3674616 3674621 3674626
csrColIndA=
525823 33153 131841 328447 426751 525824 5
131841 328448 525313 525825
csrValA=
0.2000D+00 -.5000D-01 -.5000D-01 -.1000D+00 0.3179D-06 0.2000D+00 -.5000D-01
-.5000D-01 0.3179D-06 -.1000D+00 0.2000D+00
==================================================================
The linear system is solved by sparse CHOLESKY factorization.
Reordering scheme : symrcm
Tolerance to decide if singular or not = .1000000000D-11
A matrix has symmmetric pattern.
==================================================================
With 4-thread parallel execution:
cuSOLVER SP solver result on [ CPU ] 47.714441 (sec)
==================================================================================================
|b - A*x| = 0.3055902197957D-09
|A| = 0.8000012715659D+00
|x| = 0.2638521162680D+06
|b - A*x|/(|A|*|x|) = 0.1447732059137D-14
x( 1)=0.263852116268E+06
x( 2)=0.263093899026E+06
x( 3)=0.263844899226E+06
x( 4)=0.263852116268E+06
x( 5)=0.263093899027E+06
x( 6)=0.263844899225E+06
x( 7)=0.263199487139E+06
x( 8)=0.262666049525E+06
x( 9)=0.263304916648E+06
x( 10)=0.262560622769E+06
x( 525815)=0.263094481758E+06
x( 525816)=0.263094376198E+06
x( 525817)=0.263094281209E+06
x( 525818)=0.263094196778E+06
x( 525819)=0.263094122892E+06
x( 525820)=0.263094059539E+06
x( 525821)=0.263094006710E+06
x( 525822)=0.263093964396E+06
x( 525823)=0.263093932591E+06
x( 525824)=0.263093911290E+06
x( 525825)=0.263093900489E+06
cuSOLVER SP solver result on [ GPU ] 19.054326 (sec)
==================================================================================================
|b - A*x| = 0.1527951098979D-09
|A| = 0.8000012715659D+00
|x| = 0.2638521162680D+06
|b - A*x|/(|A|*|x|) = 0.7238660295685D-15
x( 1)=0.263852116268E+06
x( 2)=0.263093899027E+06
x( 3)=0.263844899226E+06
x( 4)=0.263852116268E+06
x( 5)=0.263093899027E+06
x( 6)=0.263844899226E+06
x( 7)=0.263199487139E+06
x( 8)=0.262666049525E+06
x( 9)=0.263304916649E+06
x( 10)=0.262560622769E+06
x( 525815)=0.263094481758E+06
x( 525816)=0.263094376198E+06
x( 525817)=0.263094281209E+06
x( 525818)=0.263094196778E+06
x( 525819)=0.263094122892E+06
x( 525820)=0.263094059539E+06
x( 525821)=0.263094006710E+06
x( 525822)=0.263093964396E+06
x( 525823)=0.263093932591E+06
x( 525824)=0.263093911290E+06
x( 525825)=0.263093900490E+06
The program set shown above is intended to run on Linux. To run the same program on Windows, a version with a modified Makefile together with the full set of source files is provided below; please download the following file.
Program archive file: cuSOLVERsp_solvers.zip
The difference from the Linux environment is that, as of PGI 17.10, the Windows distribution does not bundle a cuSOLVER library for PGI, so the build links against the original cuSOLVER library installed with the NVIDIA CUDA toolkit 9.0 (or 8.0). In addition, the Windows version of the cuSOLVER SP routines does not provide OpenMP multithreading for the CPU solver routines; the CPU-side solvers run single-threaded.
The following results were obtained in the PGI 17.10 command prompt (terminal) on Windows. To check which version you are using, specify the -V command option as shown below.
$ pgfortran -V

pgfortran 17.10-0 64-bit target on x86-64 Windows -tp sandybridge
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.
The GPU used is as follows.
$ pgaccelinfo
CUDA Driver Version:           9000

Device Number:                 0
Device Name:                   GeForce GTX 780
Device Revision Number:        3.5
Global Memory Size:            3221225472
Number of Multiprocessors:     12
Number of SP Cores:            2304
Number of DP Cores:            768
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    941 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             3004 MHz
Memory Bus Width:              384 bits
L2 Cache Size:                 1572864 bytes
Max Threads Per SMP:           2048
Async Engines:                 1
Unified Addressing:            Yes
Managed Memory:                Yes
PGI Compiler Option:           -ta=tesla:cc35
PGI$ make
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda precision.f90
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mfixed mmio.f
mmread:
154, Loop not vectorized/parallelized: potential early exits
199, Loop not vectorized/parallelized: potential early exits
203, Loop not vectorized/parallelized: potential early exits
207, Loop not vectorized/parallelized: potential early exits
212, Loop not vectorized/parallelized: potential early exits
251, Loop not vectorized/parallelized: potential early exits
255, Loop not vectorized/parallelized: potential early exits
259, Loop not vectorized/parallelized: potential early exits
mminfo:
455, Loop not vectorized/parallelized: potential early exits
mmwrite:
678, Loop not vectorized/parallelized: contains call
682, Loop not vectorized/parallelized: contains call
686, Loop not vectorized/parallelized: contains call
692, Loop not vectorized/parallelized: contains call
709, Loop not vectorized/parallelized: contains call
713, Loop not vectorized/parallelized: contains call
718, Loop not vectorized/parallelized: contains call
lowerc:
771, Loop not vectorized/parallelized: contains call
getwd:
795, Loop not vectorized/parallelized: potential early exits
countwd:
826, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mcuda cusolverSP_mod.cuf
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda measure_time.f90
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda mm_convert_csr.f90
matrix_printout:
70, Loop not vectorized/parallelized: contains call
79, Loop not vectorized/parallelized: contains call
set_matrix:
140, Unrolled inner loop 8 times
Generated 2 prefetch instructions for the loop
150, Loop not vectorized/parallelized: contains call
171, Memory copy idiom, loop replaced by call to __c_mcopy4
172, Memory copy idiom, loop replaced by call to __c_mcopy4
175, Memory copy idiom, loop replaced by call to __c_mcopy8
179, Memory copy idiom, loop replaced by call to __c_mcopy4
183, Memory copy idiom, loop replaced by call to __c_mcopy16
coo2csr_convert:
361, Memory zero idiom, loop replaced by call to __c_mzero8
Memory zero idiom, loop replaced by call to __c_mzero4
364, Generated 3 alternate versions of the loop
Generated vector simd code for the loop
Generated 2 prefetch instructions for the loop
Generated vector simd code for the loop
Generated 2 prefetch instructions for the loop
Generated vector simd code for the loop
Generated 2 prefetch instructions for the loop
Generated vector simd code for the loop
Generated 2 prefetch instructions for the loop
374, Generating copy(csrvala(:),indx(:),jndx(:),rval(:),csrrowptra(:))
392, Generating enter data copyin(p(:),pbuffer(:))
411, Generating exit data delete(pbuffer(:),p(:))
415, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
415, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
431, Memory copy idiom, loop replaced by call to __c_mcopy4
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda sp_solvers.f90
vec_max:
33, Generated 2 alternate versions of the loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
csr_mat_max:
51, Generated 2 alternate versions of the loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
Generated vector simd code for the loop containing reductions
Generated a prefetch instruction for the loop
cusolversps:
139, Memory copy idiom, loop replaced by call to __c_mcopy8
207, Generating copy(x(:))
Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:))
Generating copy(b(:))
229, Loop not vectorized/parallelized: contains call
233, Loop not vectorized/parallelized: contains call
238, Memory copy idiom, loop replaced by call to __c_mcopy8
239, Memory zero idiom, loop replaced by call to __c_mzero8
246, Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:))
Generating copy(b(:),x(:))
284, Generating copy(x(:))
Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:))
Generating copy(b(:))
306, Loop not vectorized/parallelized: contains call
310, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda main.f90
main:
66, Memory set idiom, loop replaced by call to __c_mset8
98, Memory zero idiom, loop replaced by call to __c_mzero8
pgf90 -o a.out precision.obj mmio.obj cusolverSP_mod.obj measure_time.obj mm_convert_csr.obj sp_solvers.obj main.obj -acc -ta=tesla,
cuda9.0,cc35,cc60 -Mcuda -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\lib\x64" -Mcudalib=cusparse,cusolver
PGI$ make run
===== Program run =====
a.out
0 = CPU and GPU, 1 = GPU only?
0
1 = cholesky, 2 = LU, 3 = QR ; What solver?
3
0 = No, schemes 1 = symrcm, 2 = symamd ; What kind of reordering?
1
=================================================
Sparse matrix linear solver schem : QR
Reordering scheme of matrix : symrcm
=================================================
File name for Matrix Market?
lap3D_7pt_n20.mtx
Matrix Market file name = lap3D_7pt_n20.mtx
Properties of Matrix_Market matrix
====================================
representation = coordinate
type = real
matrix type = symmetric
row (m) = 8000
cols(n) = 8000
Non-zero elm. = 53600
extendSymMatrix= 1
=======================================================================================
NOTICE: Triangular Symmetric matrix has been transformed to full matrix for SP solvers.
=======================================================================================
File name for LHS vector? if no available, press ENTER and set as RHS=1.0.
i,indx,jndx 1 1 1 6.000000000000000
i,indx,jndx 2 1 2 -1.000000000000000
i,indx,jndx 3 1 21 -1.000000000000000
i,indx,jndx 4 1 401 -1.000000000000000
i,indx,jndx 5 2 1 -1.000000000000000
i,indx,jndx 6 2 2 6.000000000000000
i,indx,jndx 7 2 3 -1.000000000000000
i,indx,jndx 8 2 22 -1.000000000000000
i,indx,jndx 9 2 402 -1.000000000000000
i,indx,jndx 10 3 2 -1.000000000000000
i,indx,jndx 53590 7998 7998 6.000000000000000
i,indx,jndx 53591 7998 7999 -1.000000000000000
i,indx,jndx 53592 7999 7599 -1.000000000000000
i,indx,jndx 53593 7999 7979 -1.000000000000000
i,indx,jndx 53594 7999 7998 -1.000000000000000
i,indx,jndx 53595 7999 7999 6.000000000000000
i,indx,jndx 53596 7999 8000 -1.000000000000000
i,indx,jndx 53597 8000 7600 -1.000000000000000
i,indx,jndx 53598 8000 7980 -1.000000000000000
i,indx,jndx 53599 8000 7999 -1.000000000000000
i,indx,jndx 53600 8000 8000 6.000000000000000
== CSR data check : Start from 1 to 10 elements ==
csrRowPtrA=
1 5 10 15 20 25 30
35 40 45
csrColIndA=
1 2 21 401 1 2 3
22 402 2
csrValA =
0.6000D+01 -.1000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.6000D+01 -.1000D+01
-.1000D+01 -.1000D+01 -.1000D+01
== Start from endpoint-10 to endpoint ==
csrRowPtrA=
53552 53557 53562 53567 53572 53577 53582
53587 53592 53597 53601
csrColIndA=
7998 7999 7599 7979 7998 7999 8000
7600 7980 7999 8000
csrValA=
0.6000D+01 -.1000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.6000D+01 -.1000D+01
-.1000D+01 -.1000D+01 -.1000D+01 0.6000D+01
==================================================================
The linear system is solved by sparse QR factorization.
Reordering scheme : symrcm
Tolerance to decide if singular or not = .1000000000D-11
A matrix has symmmetric pattern.
==================================================================
cuSOLVER SP solver result on [ CPU ] 2.355999 (sec)
==================================================================================================
|b - A*x| = 0.5115907697473D-12
|A| = 0.1200000000000D+02
|x| = 0.2458019372584D+02
|b - A*x|/(|A|*|x|) = 0.1734427507818D-14
x( 1)=0.666997357463E+00
x( 2)=0.100066138159E+01
x( 3)=0.119691025971E+01
x( 4)=0.132458774155E+01
x( 5)=0.141248710662E+01
x( 6)=0.147457380480E+01
x( 7)=0.151841489418E+01
x( 8)=0.154840911171E+01
x( 9)=0.156717865908E+01
x( 10)=0.157622105444E+01
x( 7990)=0.157622105444E+01
x( 7991)=0.157622105444E+01
x( 7992)=0.156717865908E+01
x( 7993)=0.154840911171E+01
x( 7994)=0.151841489418E+01
x( 7995)=0.147457380480E+01
x( 7996)=0.141248710662E+01
x( 7997)=0.132458774155E+01
x( 7998)=0.119691025971E+01
x( 7999)=0.100066138159E+01
x( 8000)=0.666997357463E+00
cuSOLVER SP solver result on [ GPU ] 0.683000 (sec)
==================================================================================================
|b - A*x| = 0.4547473508865D-12
|A| = 0.1200000000000D+02
|x| = 0.2458019372584D+02
|b - A*x|/(|A|*|x|) = 0.1541713340283D-14
x( 1)=0.666997357463E+00
x( 2)=0.100066138159E+01
x( 3)=0.119691025971E+01
x( 4)=0.132458774155E+01
x( 5)=0.141248710662E+01
x( 6)=0.147457380480E+01
x( 7)=0.151841489418E+01
x( 8)=0.154840911171E+01
x( 9)=0.156717865908E+01
x( 10)=0.157622105444E+01
x( 7990)=0.157622105444E+01
x( 7991)=0.157622105444E+01
x( 7992)=0.156717865908E+01
x( 7993)=0.154840911171E+01
x( 7994)=0.151841489418E+01
x( 7995)=0.147457380480E+01
x( 7996)=0.141248710662E+01
x( 7997)=0.132458774155E+01
x( 7998)=0.119691025971E+01
x( 7999)=0.100066138159E+01
x( 8000)=0.666997357463E+00