前章 12-3 では、OpenACC + cuSPARSE を利用した Bi-Conjugate Gradient Stabilized(BiCGStab)プログラムを用いて、スパース行列に対する「反復法による線形方程式」を解く例を示した。この章では、スパース行列を対象とした「直接法による線形方程式」を解くプログラム例を示す。いわゆる、スパース行列用の LAPACK ルーチンである。これは、cuSOLVER SP ライブラリの関数を使用することにより実現する。OpenACC を利用しているため、極めて少ない工数で、GPUアクセラレータを使用した計算が可能となる。ここでは、Fortran プログラムの例を示すこととする。ここで例示するソース・ファイルと Makefile 一式をアーカイブした tar ファイルを提供する。以下のファイルをダウンロードしていただきたい。
プログラム tar ファイル:openacc_cuSOLVER-SP.tar.gz
以下の例は、PGI 17.10 バージョンを使用しての結果である。使用しているバージョンを確かめたい場合は、以下のように -V コマンド・オプションを指定する。
$ pgfortran -V pgfortran 17.10-0 64-bit target on x86-64 Linux -tp haswell PGI Compilers and Tools Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
また、使用した GPU は以下のとおりである。
$ pgaccelinfo CUDA Driver Version: 9000 NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.59 Wed Jul 19 23:53:34 PDT 2017 Device Number: 0 Device Name: GeForce GTX 1080 Ti Device Revision Number: 6.1 Global Memory Size: 11714691072 Number of Multiprocessors: 28 Concurrent Copy and Execution: Yes Total Constant Memory: 65536 Total Shared Memory per Block: 49152 Registers per Block: 65536 Warp Size: 32 Maximum Threads per Block: 1024 Maximum Block Dimensions: 1024, 1024, 64 Maximum Grid Dimensions: 2147483647 x 65535 x 65535 Maximum Memory Pitch: 2147483647B Texture Alignment: 512B Clock Rate: 1657 MHz Execution Timeout: Yes Integrated Device: No Can Map Host Memory: Yes Compute Mode: default Concurrent Kernels: Yes ECC Enabled: No Memory Clock Rate: 5505 MHz Memory Bus Width: 352 bits L2 Cache Size: 2883584 bytes Max Threads Per SMP: 2048 Async Engines: 2 Unified Addressing: Yes Managed Memory: Yes PGI Compiler Option: -ta=tesla:cc60
COO フォーマットから CSR フォーマットへの変換プログラムの概要に関しては、前章の説明を参照していただきたい。
cuSOLVER SP ルーチンを使用する上での注意点を述べる。一般に対称行列として、行列要素が上三角行列のみデータとして収納されているような場合は、必ず、下三角行列要素も収納する形態に拡張して、cuSOLVER SP ルーチンに入力する必要がある。そこで、mm_convert_csr.f90 プログラムは、Matrix Market フォーマットの「symmetric」属性を有する対称行列要素を「一般行列」に拡張するようにしている。プログラム内で、対称行列を一般行列に拡張するためのフラグ変数は extendSymMatrix であり、これを1にセットすることで制御している。
入力データとして必要なスパース行列の CSR フォーマットの詳細は、NVIDIA cuSPARSE ドキュメントを参考にして欲しい。係数行列を CSR 形式で表現する際に必要な入力用データ配列は、以下のとおりである。
nz | 整数 | 行列内の非ゼロ要素の数 |
csrValA | 配列 | 行メジャー形式で A のすべての非ゼロ値を保持するための長さ nz のデータ配列 in row-major format. |
csrRowPtrA | 配列 | 配列 csrColIndA および csrValA にインデックスを保持するための長さ m + 1 の整数配列。 この配列の最初の m 個のエントリは、 i = i、...、m の i 行目の最初の非ゼロ要素のインデックスを含み、最後のエントリは nnz + csrRowPtrA(0)を含みます。 一般に、csrRowPtrA(0)は、 0 あるいは 1 から開始されることを表すインデックスベースを表し、これは、それぞれ 0 または 1 となる。ここでは 1 を使用する。 |
csrColIndA | 配列 | 配列 csrValA の対応する要素の列インデックスを含む長さ nz の整数配列 |
NVIDIA cuSOLVER SP ルーチンの技術的なドキュメントは、こちらのページを参照して欲しい。cuSolverSP は、スパース QR 分解に基づく直接法による新しいスパースルーチンのセットを提供している。また、その他に、コレスキー分解による解法、LU分解による解法ルーチンも提供している。今回のプログラムでは、この3種のソルバーが動作するように作成してある。数値計算シミュレーションにおいては、全ての行列が factorization の並列性に優れたスパース性のパターンを有している訳ではないため、cuSolverSP ライブラリは、シーケンシャルのような行列を扱うことができるように CPU 上で動作するルーチンも提供している(なお、LU 分解用のルーチンは CPU 用のみであり、GPU用のルーチンは CUDA 9.0 の段階ではまだ提供されていない)。 高い並列性を持つマトリックスを対象とした場合は、GPU 用のルーチンはより高いパフォーマンスを提供することができる。なお、LU 分解を除く CPU 用のコレスキー分解と QR 分解のルーチンは、OpenMP による並列プログラムとなっており、マルチコアによる並列実行が可能である。
cuSOLVER SP ルーチンには、行列要素のリオーダリング(再順序付け)機能を備えており、関数の引数により制御可能となっている。分解におけるパフォーマンスに大きな影響を与える fill-in の発生を減らすために、symrcm とsymamd の2つの並べ替えスキームを提供している。 引数である入力パラメータ reorder は、reorder が 1 の場合に symrcm(Symmetric Reverse Cuthill-McKee置換)、2 の場合 symamd (Symmetric Approximate Minimum Degree Algorithm based on Quotient Graph) を有効にする。それ以外の場合、並べ替えは実行されない。
分解時に行列が singular の状態になった場合は、それが発生した最小の行インデックスを出力し、WARNING を出す。その計算結果は不正となる。
ここでは、今回使用しているスパース行列用の線形方程式解法ルーチンのサマライズを以下に纏める。提供プログラムの sp_solvers.f90 ルーチンがドライバールーチンとなる。
solver No. (solver) |
cuSOLVER関数名 | CPU用 | GPU用 | Reordering (reorder) |
1 | コレスキー分解 cusolverSpDcsrlsvchol[Host] |
有り OMP_NUM_THREADSで 並列スレッド数指定 |
有り | No reordering = 0 symrcm reordering = 1 symamd reordering = 2 |
2 | LU分解 cusolverSpDcsrlsvlu[Host] |
有り シングルスレッドのみ |
なし | |
3 | QR分解 cusolverSpDcsrlsvqr[Host] |
有り OMP_NUM_THREADSで 並列スレッド数指定 |
有り |
PGI 17.10 時点では CUDA C/C++ 言語で開発されている cuSOLVER SP(スパース)系のライブラリへの Fortran Interface Module が提供されていないため、以下のように自作(cusolverSP_mod.cuf)した。C と Fortran のバインディングを取り持つためのインタフェースである。将来的に、PGI コンパイルシステムの中に、cuSOLVER SP の Fortran MODULE (多分 use cusplverSP で使用できるはず)が提供された場合は、以下の cusolverSP_mod.cuf ルーチンは必要なくなる。
! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved. ! ! Permission is hereby granted, free of charge, to any person obtaining a ! copy of this software and associated documentation files (the "Software"), ! to deal in the Software without restriction, including without limitation ! the rights to use, copy, modify, merge, publish, distribute, sublicense, ! and/or sell copies of the Software, and to permit persons to whom the ! Software is furnished to do so, subject to the following conditions: ! ! The above copyright notice and this permission notice shall be included in ! all copies or substantial portions of the Software. ! ! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR ! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, ! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL ! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER ! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING ! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER ! DEALINGS IN THE SOFTWARE. ! module cuSOLVER_SP_Solver use iso_c_binding use cudafor use cusparse use cusolver_common implicit none ! enum, bind(C) ! cusolverStatus_t ! enumerator :: CUSOLVER_STATUS_SUCCESS=0 ! enumerator :: CUSOLVER_STATUS_NOT_INITIALIZED=1 ! enumerator :: CUSOLVER_STATUS_ALLOC_FAILED=2 ! enumerator :: CUSOLVER_STATUS_INVALID_VALUE=3 ! enumerator :: CUSOLVER_STATUS_ARCH_MISMATCH=4 ! enumerator :: CUSOLVER_STATUS_MAPPING_ERROR=5 ! enumerator :: CUSOLVER_STATUS_EXECUTION_FAILED=6 ! enumerator :: CUSOLVER_STATUS_INTERNAL_ERROR=7 ! enumerator :: CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED=8 ! enumerator :: CUSOLVER_STATUS_NOT_SUPPORTED = 9 ! enumerator :: CUSOLVER_STATUS_ZERO_PIVOT=10 ! enumerator :: CUSOLVER_STATUS_INVALID_LICENSE=11 ! end enum type cusolverSpHandle type(c_ptr) :: handle end type cusolverSpHandle ! cusolverSpCreate interface integer(c_int) function cusolverSpCreate(handle) bind(C,name='cusolverSpCreate') import cusolverSpHandle type(cusolverSpHandle) :: handle end function cusolverSpCreate end interface ! cusolverSpDestroy interface integer(c_int) function cusolverSpDestroy(handle) bind(C,name='cusolverSpDestroy') import cusolverSpHandle type(cusolverSpHandle), value :: handle end function cusolverSpDestroy end interface ! cusolverSpXcsrissymHost interface integer(c_int) function cusolverSpXcsrissymHost(handle, m, nnzA, descrA, & csrRowPtrA, csrEndPtrA, csrColIndA, issym) bind(C,name='cusolverSpXcsrissymHost') use iso_c_binding import cusolverSpHandle, cusparseMatDescr type(cusparseMatDescr), value :: descrA type(cusolverSpHandle), value :: handle integer(c_int), value :: m, nnzA integer(c_int) :: issym integer(c_int) :: csrRowPtrA(*), csrEndPtrA(*), csrColIndA(*) end function cusolverSpXcsrissymHost end interface ! cusolverSpXcsrsymrcmHost interface integer(c_int) function cusolverSpXcsrsymrcmHost(handle, m, nnzA, descrA, & csrRowPtrA, csrColIndA, q) bind(C,name='cusolverSpXcsrsymrcmHost') use iso_c_binding import cusolverSpHandle, cusparseMatDescr type(cusparseMatDescr), value :: descrA type(cusolverSpHandle), value :: handle integer(c_int), value :: m, nnzA integer(c_int) :: q(*) integer(c_int) :: csrRowPtrA(*), csrColIndA(*) end function cusolverSpXcsrsymrcmHost end interface ! cusolverSpXcsrsymmdqHost interface integer(c_int) function cusolverSpXcsrsymmdqHost(handle, m, nnzA, descrA, & csrRowPtrA, csrColIndA, q) bind(C,name='cusolverSpXcsrsymmdqHost') use iso_c_binding import cusolverSpHandle, cusparseMatDescr type(cusparseMatDescr), value :: descrA type(cusolverSpHandle), value :: handle integer(c_int), value :: m, nnzA integer(c_int) :: q(*) integer(c_int) :: csrRowPtrA(*), csrColIndA(*) end function cusolverSpXcsrsymmdqHost end interface ! cusolverSpXcsrperm_bufferSizeHost interface integer(c_int) function cusolverSpXcsrperm_bufferSizeHost(handle, m, n, nnzA, & descrA, csrRowPtrA, csrColIndA, p, q, bufferSizeInBytes ) & bind(C,name="cusolverSpXcsrperm_bufferSizeHost") use iso_c_binding import cusolverSpHandle, cusparseMatDescr type(cusolverSpHandle), value :: handle type(cusparseMatDescr), value :: descrA integer(c_int), value :: m, n, nnzA integer(c_int) :: p(*), q(*) integer(c_int) :: csrRowPtrA(*), csrColIndA(*) integer(c_long) :: bufferSizeInBytes end function cusolverSpXcsrperm_bufferSizeHost end interface ! cusolverSpXcsrpermHost interface integer(c_int) function cusolverSpXcsrpermHost(handle, m, n, nnzA, & descrA, csrRowPtrA, csrColIndA, p, q, map, buffer ) & bind(C,name="cusolverSpXcsrpermHost") use iso_c_binding import cusolverSpHandle, cusparseMatDescr type(cusolverSpHandle), value :: handle type(cusparseMatDescr), value :: descrA integer(c_int), value :: m, n, nnzA integer(c_int) :: p(*), q(*) integer(c_int) :: csrRowPtrA(*), csrColIndA(*) integer(c_int) :: map(*) character(c_char) :: buffer(*) end function cusolverSpXcsrpermHost end interface ! cusolverSpDcsrlsvcholHost interface integer(c_int) function cusolverSpDcsrlsvcholHost( handle, m, nnzA, & descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) & bind(C,name="cusolverSpDcsrlsvcholHost") use iso_c_binding import cusolverSpHandle, cusparseMatDescr type(cusolverSpHandle), value :: handle type(cusparseMatDescr), value :: descrA integer(c_int), value :: m, nnzA real(c_double), intent(in) :: csrValA(:) integer(c_int), intent(in) :: csrRowPtrA(:), csrColIndA(:) real(c_double), intent(in) :: b(*) real(c_double) :: x(*) real(c_double), value :: tol integer(c_int), value :: reorder integer(c_int) :: singularity end function cusolverSpDcsrlsvcholHost end interface ! cusolverSpDcsrlsvluHost interface integer(c_int) function cusolverSpDcsrlsvluHost( handle, m, nnzA, & descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) & bind(C,name="cusolverSpDcsrlsvluHost") use iso_c_binding import cusolverSpHandle, cusparseMatDescr type(cusolverSpHandle), value :: handle type(cusparseMatDescr), value, intent(in) :: descrA integer(c_int), value :: m, nnzA real(c_double), intent(in) :: csrValA(*) integer(c_int), intent(in) :: csrRowPtrA(*), csrColIndA(*) real(c_double), intent(in) :: b(*) real(c_double) :: x(*) real(c_double), value :: tol integer(c_int), value :: reorder integer(c_int) :: singularity end function cusolverSpDcsrlsvluHost end interface ! cusolverSpDcsrlsvqrHost interface integer(c_int) function cusolverSpDcsrlsvqrHost( handle, m, nnzA, & descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) & bind(C,name="cusolverSpDcsrlsvqrHost") use iso_c_binding import cusolverSpHandle, cusparseMatDescr type(cusolverSpHandle), value :: handle type(cusparseMatDescr), value, intent(in) :: descrA integer(c_int), value :: m, nnzA real(c_double), intent(in) :: csrValA(*) integer(c_int), intent(in) :: csrRowPtrA(*), csrColIndA(*) real(c_double), intent(in) :: b(*) real(c_double) :: x(*) real(c_double), value :: tol integer(c_int), value :: reorder integer(c_int) :: singularity end function cusolverSpDcsrlsvqrHost end interface ! cusolverSpDcsrlsvchol interface integer(c_int) function cusolverSpDcsrlsvchol( handle, m, nnzA, & descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) & bind(C,name="cusolverSpDcsrlsvchol") use iso_c_binding import cusolverSpHandle, cusparseMatDescr type(cusolverSpHandle), value :: handle type(cusparseMatDescr), value :: descrA integer(c_int), value :: m, nnzA real(c_double), device, intent(in) :: csrValA(*) integer(c_int), device, intent(in) :: csrRowPtrA(*), csrColIndA(*) real(c_double), device, intent(in) :: b(*) real(c_double), device :: x(*) real(c_double), value :: tol integer(c_int), value :: reorder integer(c_int) :: singularity end function cusolverSpDcsrlsvchol end interface ! cusolverSpDcsrlsvqr interface integer(c_int) function cusolverSpDcsrlsvqr( handle, m, nnzA, & descrA, csrValA, csrRowPtrA, csrColIndA, b, tol, reorder, x, singularity ) & bind(C,name="cusolverSpDcsrlsvqr") use iso_c_binding import cusolverSpHandle, cusparseMatDescr type(cusolverSpHandle), value :: handle type(cusparseMatDescr), value :: descrA integer(c_int), value :: m, nnzA real(c_double), device, intent(in) :: csrValA(*) integer(c_int), device, intent(in) :: csrRowPtrA(*), csrColIndA(*) real(c_double), device, intent(in) :: b(*) real(c_double), device :: x(*) real(c_double), value :: tol integer(c_int), value :: reorder integer(c_int) :: singularity end function cusolverSpDcsrlsvqr end interface end module cuSOLVER_SP_Solver
以下に、メインプログラムを示す。このプログラムは、複素数の係数行列は対応していない。倍精度実数型のデータのみを扱える。最初に、Matrix Market 形式のファイル名を指定し、次に right hand side(rhs) データファイルがあれば、それを指定する。もし、存在しなければ、rhs ベクトルは全て 1.0 として計算する。入力データの読み込み後、coo2csr_convert ルーチンを call して、COO 形式データからCSR 形式データに変換する。変換後、csrRowPtrA、csrColIndA、csrValA 配列が作成され、この CSR 形式の行列を実引数として、cusolverSPs ドライバールーチンに渡し、計算が始まる。このプログラムとは別に、ユーザ側のプログラムで、csrRowPtrA、csrColIndA、csrValA 配列を用意できれば、cusolverSPs ドライバールーチンを使用して線形方程式を解くことができる。
program main
use precision
use Matrix_data
use mm_coo_cuda_csr_convert
use SP_solvers
implicit none
integer :: i, j, istat
integer,parameter :: iunit1 = 10 ! for Matrix Market data (COO)
integer,parameter :: iunit2 = 11 ! for LHS, if available
character*80 :: mm_file, rhs_file
integer :: extendSymMatrix ! 1 = symmetrize the pattern (transform from triangular to full)
integer :: gpuonly ! 0 = both CPU and GPU, 1 = GPU only
integer :: solver ! 1 = cholesky, 2 = LU, 3 = QR
integer :: reorder ! Reordering ; 0 = no, 1 = symrcm, 2 = symamd reordering
! real(fp_kind), parameter :: tol = 0.0d0 ! tolerance to decide if singular or not
real(fp_kind), parameter :: tol = 1.0d-12 ! tolerance to decide if singular or not
integer, parameter :: index_base = 1 ! CSR format -- one based index
! Kind of Solver, a scheme of reordering
print *, "0 = CPU and GPU, 1 = GPU only?"
read (5, '(i1)') gpuonly
print *, "1 = cholesky, 2 = LU, 3 = QR ; What solver?"
read (5, '(i1)') solver
if ( .not. (solver == 1 .or. solver == 2 .or. solver == 3) ) solver = 1
print *, "0 = No, schemes 1 = symrcm, 2 = symamd ; What kind of reordering?"
read (5, '(i1)') reorder
print '(/1x,a)', "================================================="
if ( solver == 1) print *, "Sparse matrix linear solver schem : Cholesky"
if ( solver == 2) print *, "Sparse matrix linear solver schem : LU"
if ( solver == 3) print *, "Sparse matrix linear solver schem : QR"
if ( reorder == 1 ) then
print *, "Reordering scheme of matrix : symrcm"
else if ( reorder == 2 ) then
print *, "Reordering scheme of matrix : symamd"
else
reorder = 0
print *, "No reordering of matrix"
end if
print '(1x,a/)', "================================================="
! Read matrix Market data file
extendSymMatrix = 1 ! if =1 .and. symmetric matrix=yes , transform from triangular to full
print *, "File name for Matrix Market?"
read (5, '(a)') mm_file
! mm_file ="lap2D_5pt_n100.mtx"
print '(/1x,a,a)', "Matrix Market file name = ",mm_file
open (iunit1, file=mm_file, status='old')
print *
call mm_matrix_read(iunit1, extendSymMatrix)
! Read right hand side vectors, if available.
print *, "File name for LHS vector? if no available, press ENTER and set as RHS=1.0."
print *
read (5, '(a)') rhs_file
if (rhs_file == "") then
allocate (rhs(rows))
rhs= 1.0_fp_kind
else
print *, "rhs_file name = ",rhs_file
open (iunit2, file=rhs_file, status='old')
call mm_rhs_read(iunit2)
end if
! Convert COO to CSR sparse matrix
! INPUT : Matrix COO data from Matrix Market format via mm_rhs_read routine
! OUTPUT : CSR data = csrRowPtrA(m+1), csrColIndA(nz), csrValA(nz) -- global data
call coo2csr_convert
! CSR sparse matrix data are set. The belows are arguments for cusplverSPs routine.
! Arrays for CUDA cuSPARSE CSR format
! integer :: m IN ! rows size for square matrix
! integer :: n IN ! columns size for square matrix (n=m)
! integer :: nz IN ! nonzero element of A
! integer :: csrRowPtrA (m+1) IN ! the array of offsets corresponding to the start of each row
! integer :: csrColIndA (nz) IN ! sorted by row and by column within each row
! real(fp_kind) :: csrValA (nz) IN ! sorted by row and by column within each row
! real(fp_kind) :: x(m) INOUT ! initial guess & result
! real(fp_kind) :: rhs(m) IN ! right hand side
! Print CSR Matrix data
! print *, "csrValA", csrValA
! print *, "csrRowPtrA",csrRowPtrA
! print *, "csrColIndA", csrColIndA
! Initilize X
allocate (x(n))
x = 0.0_fp_kind ! or Initial guess X is set
! cuSOLVER SP (Cholesky, LU, QR)
call cusolverSPs (gpuonly, solver, reorder, m, n, nz, csrRowPtrA, csrColIndA, csrValA, x, rhs, &
tol, index_base)
end program main
! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved. ! ! Permission is hereby granted, free of charge, to any person obtaining a ! copy of this software and associated documentation files (the "Software"), ! to deal in the Software without restriction, including without limitation ! the rights to use, copy, modify, merge, publish, distribute, sublicense, ! and/or sell copies of the Software, and to permit persons to whom the ! Software is furnished to do so, subject to the following conditions: ! ! The above copyright notice and this permission notice shall be included in ! all copies or substantial portions of the Software. ! ! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR ! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, ! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL ! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER ! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING ! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER ! DEALINGS IN THE SOFTWARE. ! module SP_solvers use precision implicit none contains real(fp_kind) function vec_max(n, x) implicit none integer :: i, n real(fp_kind) :: x(n) vec_max = 0.0_fp_kind do i = 1, n vec_max = max( abs(x(i)), vec_max ) end do end function real(fp_kind) function csr_mat_max(m, n, nz, csrValA, csrRowPtrA, csrColIndA) implicit none integer :: m, n, nz integer :: csrRowPtrA(m+1), csrColIndA(nz) real(fp_kind) :: csrValA(nz) real(fp_kind) :: sum, norminf integer :: i, j, start, end norminf = 0.0_fp_kind do i = 1, m sum = 0.0_fp_kind start = csrRowPtrA(i ) end = csrRowPtrA(i+1) - 1 do j = start, end sum = sum + abs(csrValA(j)) end do norminf = max (norminf, sum) end do csr_mat_max = norminf end function !============================================================= subroutine cusolverSPs(gpuonly, solver, reorder, m, n, nz, csrRowPtrA, csrColIndA, csrValA, x, rhs, & tol, index_base) !------------------------------------------------------------- ! Bi-Conjugate Gradient Stabilized (BiCGStab) ! A is an n × n sparse matrix defined in CSR storage format !------------------------------------------------------------- use measure_time use cusparse use cuSOLVER_SP_Solver implicit none ! Arrays for CUDA cuSPARSE CSR format integer :: m ! number of rows of matrix integer :: n ! number of columns of matrix integer :: nz ! nonzero elements of A integer :: csrRowPtrA(n+1) ! the array of offsets corresponding to the start of each row integer :: csrColIndA (nz) ! sorted by row and by column within each row real(fp_kind) :: csrValA (nz) ! sorted by row and by column within each row ! B = Q*A*Q^T real(fp_kind) :: x (n) ! initial guess & result real(fp_kind) :: rhs(n) ! right hand side real(fp_kind) :: b (n) ! a copy of rhs character(c_char), allocatable :: buffer(:) integer :: index_base integer :: gpuonly ! 0 = both CPU and GPU, 1 = GPU only integer :: solver ! 1 = cholesky, 2 = LU, 3 = QR integer :: reorder ! manula reordering 1 = symrcm, 2 = symamd reordering integer :: singularity ! -1 if A is invertible under tol integer :: issym real(fp_kind) :: rsum, diff, error real(fp_kind) :: x_inf, r_inf, A_inf real(fp_kind) :: tol ! to decide if singular or not ! Local variables type(cusolverSpHandle) :: cusolver_H type(cusparseHandle) :: cusparse_H type(cusparseMatDescr) :: descrA integer :: cusolver_status integer :: status, ierr_code real(fp_kind),parameter:: one = 1.0, zero = 0.0, onem1 = -1.0 integer :: i, j, k integer :: nprint ! Parameters integer, parameter :: idbg = 0 real(fp_kind),parameter :: eps = 6.9d-310 ! machine epsilon ! Initialize singularity = 0 issym = 0 if (m /= n) then print *,"Error: only support square matrix." stop end if ! print '(/1x,a)',"==================================================================" if (solver == 1) then print '(1x,a)', "The linear system is solved by sparse CHOLESKY factorization." else if (solver == 2) then print '(1x,a)', "The linear system is solved by sparse LU factorization." else if (solver == 3) then print '(1x,a)', "The linear system is solved by sparse QR factorization." end if if (reorder == 0) then print '(1x,a)', "Reordering scheme : No reordering" else if (reorder == 1) then print '(1x,a)', "Reordering scheme : symrcm" else if (reorder == 2) then print '(1x,a)', "Reordering scheme : symamd" end if print '(/1x,a,d15.10,2x,a,i8/)',"Tolerance to decide if singular or not = ", tol ! set b = rhs, where Ax = rhs (Ax = b) b(:) = rhs(:) k = 0 ! Create cusolver/cusparse handle, and matrix descriptor status = cusolverSpCreate(cusolver_H) status = cusparseCreate(cusparse_H) status = cusparseCreateMatDescr(descrA) status = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL) status = cusparseSetMatDiagType(descrA, CUSPARSE_DIAG_TYPE_NON_UNIT) status = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE) ! base=1 ! Verify if A has symmetric pattern or not status = cusolverSpXcsrissymHost( & cusolver_H, m, nz, descrA, csrRowPtrA(1), csrRowPtrA(2), csrColIndA, issym) if (issym == 0) then print *, "A matrix has no symmmetric pattern." else print *, "A matrix has symmmetric pattern." end if print '(1x,a)',"==================================================================" if(idbg .eq. 1) print *, "cusolverSpXcsrissymHost-status=", status ! Clock init status = init_clock if (gpuonly == 1 ) goto 100 ! Clock start status = start_clock if ( solver == 1 ) then ! cholesky status = cusolverSpDcsrlsvcholHost(cusolver_H, m, nz, & descrA, csrValA, csrRowPtrA, csrColIndA, & b, tol, reorder, x, singularity) if(idbg .ge. 1) then print * print *,"singularity = ", singularity print *,"cusolverSpDcsrlsvcholHost-status = ", status end if else if ( solver == 2 ) then ! LU status = cusolverSpDcsrlsvluHost(cusolver_H, m, nz, & descrA, csrValA, csrRowPtrA, csrColIndA, & b, tol, reorder, x, singularity) if(idbg .ge. 1) then print * print *,"cusolverSpDcsrlsvluHost-status = ", status print *,"singularity = ", singularity end if else if ( solver == 3 ) then ! QR status = cusolverSpDcsrlsvqrHost(cusolver_H, m, nz, & descrA, csrValA, csrRowPtrA, csrColIndA, & b, tol, reorder, x, singularity) if(idbg .eq. 1) then print * print *,"cusolverSpDcsrlsvqrHost-status = ", status print *,"singularity = ", singularity end if else print *,"Error: Solver",solver," is unknown solver" end if status = end_clock time_tag = "cuSOLVER SP solver result on [ CPU ]" status = print_time(time_tag) !$acc data copy(b, X) copyin( csrValA, csrRowPtrA, csrColIndA ) !$acc host_data use_device(b, x, csrValA, csrRowPtrA, csrColIndA) ! evaluate residual b = b - A*x (result on CPU) status = cusparseDcsrmv(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, & m, n, nz, onem1, descrA, csrValA, csrRowPtrA, csrColIndA, x, one, b) !$acc end host_data !$acc end data if ( singularity .ge. 0 ) & print '(/2x,a,i5,,2x,a, d10.3/)', "WARNING!! : the matrix is singular at row", singularity+1, "under tolerance", tol r_inf = vec_max (m, b) x_inf = vec_max (n, x) A_inf =csr_mat_max(m, n, nz, csrValA, csrRowPtrA, csrColIndA) print '(2x,a,d20.13)',"|b - A*x| =", r_inf print '(2x,a,d20.13)'," |A| =", A_inf print '(2x,a,d20.13)'," |x| =", x_inf print '(2x,a,d20.13)',"|b - A*x|/(|A|*|x|) =", r_inf/(A_inf * x_inf) ! output result print * nprint = n if ( n > 10 ) nprint = 10 do i = 1, nprint print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i) end do print *,"" do i = n-nprint, n print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i) end do ! reset b = rhs, where Ax = rhs (Ax = b) b(:) = rhs(:) x = 0.0_fp_kind ! reset X 100 continue ! Clock start status = start_clock !$acc data copy(b, X) copyin( csrValA, csrRowPtrA, csrColIndA ) !$acc host_data use_device(b, x, csrValA, csrRowPtrA, csrColIndA) if ( solver == 1 ) then ! cholesky status = cusolverSpDcsrlsvchol(cusolver_H, m, nz, & descrA, csrValA, csrRowPtrA, csrColIndA, & b, tol, reorder, x, singularity) if(idbg .eq. 1) then print * print *,"singularity = ", singularity print *,"cusolverSpDcsrlsvchol-status = ", status end if else if ( solver == 2 ) then ! LU print '(/a)'," LU SP solver for GPU is not available now." print *,"-------------------------------------------" stop else if ( solver == 3 ) then ! QR status = cusolverSpDcsrlsvqr(cusolver_H, m, nz, & descrA, csrValA, csrRowPtrA, csrColIndA, & b, tol, reorder, x, singularity) if(idbg .eq. 1) then print * print *,"cusolverSpDcsrlsvqr-status = ", status print *,"singularity = ", singularity end if else print *,"Error: Solver",solver," is unknown solver" end if !$acc end host_data !$acc end data status = end_clock time_tag = "cuSOLVER SP solver result on [ GPU ]" status = print_time(time_tag) !$acc data copy(b, X) copyin( csrValA, csrRowPtrA, csrColIndA ) !$acc host_data use_device(b, x, csrValA, csrRowPtrA, csrColIndA) ! evaluate residual b = b - A*x (result on CPU) status = cusparseDcsrmv(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, & m, n, nz, onem1, descrA, csrValA, csrRowPtrA, csrColIndA, x, one, b) !$acc end host_data !$acc end data if ( singularity .ge. 0 ) & print '(/2x,a,i5,,2x,a, d10.3/)', "Warning!! : the matrix is singular at row", singularity+1, "under tolerance", tol r_inf = vec_max (m, b) x_inf = vec_max (n, x) A_inf =csr_mat_max(m, n, nz, csrValA, csrRowPtrA, csrColIndA) print '(2x,a,d20.13)',"|b - A*x| =", r_inf print '(2x,a,d20.13)'," |A| =", A_inf print '(2x,a,d20.13)'," |x| =", x_inf print '(2x,a,d20.13)',"|b - A*x|/(|A|*|x|) =", r_inf/(A_inf * x_inf) ! output result print * nprint = n if ( n > 10 ) nprint = 10 do i = 1, nprint print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i) end do print *,"" do i = n-nprint, n print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i) end do status = cusparseDestroy(cusparse_H) status = cusparseDestroyMatDescr(descrA) end subroutine cusolverSPs End module SP_solvers
tar file 内に提供した Makefile の記述に沿って、コンパイル&リンクを行う。なお、この Makefile では、cuda9.0 ツールキットのライブラリを使用するように設定しているが、cuda8.0 を使用しても良い。cusparse と cublas ライブラリは、PGIコンパイラにバンドルしているルーチンを使用する形になっている。
$ make pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda precision.f90 pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mfixed mmio.f mmread: 154, Loop not vectorized/parallelized: potential early exits 199, Loop not vectorized/parallelized: potential early exits 203, Loop not vectorized/parallelized: potential early exits 207, Loop not vectorized/parallelized: potential early exits 212, Loop not vectorized/parallelized: potential early exits 251, Loop not vectorized/parallelized: potential early exits 255, Loop not vectorized/parallelized: potential early exits 259, Loop not vectorized/parallelized: potential early exits mminfo: 455, Loop not vectorized/parallelized: potential early exits mmwrite: 678, Loop not vectorized/parallelized: contains call 682, Loop not vectorized/parallelized: contains call 686, Loop not vectorized/parallelized: contains call 692, Loop not vectorized/parallelized: contains call 709, Loop not vectorized/parallelized: contains call 713, Loop not vectorized/parallelized: contains call 718, Loop not vectorized/parallelized: contains call lowerc: 771, Loop not vectorized/parallelized: contains call getwd: 795, Loop not vectorized/parallelized: potential early exits countwd: 826, Loop not vectorized/parallelized: contains call pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mcuda cusolverSP_mod.cuf pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda measure_time.f90 pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda mm_convert_csr.f90 matrix_printout: 70, Loop not vectorized/parallelized: contains call 79, Loop not vectorized/parallelized: contains call set_matrix: 140, Unrolled inner loop 8 times Generated 2 prefetch instructions for the loop 150, Loop not vectorized/parallelized: contains call 171, Memory copy idiom, loop replaced by call to __c_mcopy4 172, Memory copy idiom, loop replaced by call to __c_mcopy4 175, Memory copy idiom, loop replaced by call to __c_mcopy8 179, Memory copy idiom, loop replaced by call to __c_mcopy4 183, Memory copy idiom, loop replaced by call to __c_mcopy16 coo2csr_convert: 361, Memory zero idiom, loop replaced by call to __c_mzero8 Memory zero idiom, loop replaced by call to __c_mzero4 364, Generated an alternate version of the loop Generated vector simd code for the loop Generated 2 prefetch instructions for the loop Generated vector simd code for the loop Generated 2 prefetch instructions for the loop 374, Generating copy(csrvala(:),indx(:),rval(:),jndx(:),csrrowptra(:)) 392, Generating enter data copyin(p(:),pbuffer(:)) 411, Generating exit data delete(p(:),pbuffer(:)) 415, Loop is parallelizable Accelerator kernel generated Generating Tesla code 415, !$acc loop gang, vector(128) ! blockidx%x threadidx%x 431, Memory copy idiom, loop replaced by call to __c_mcopy4 pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda sp_solvers.f90 vec_max: 33, Generated an alternate version of the loop Generated vector simd code for the loop containing reductions Generated a prefetch instruction for the loop Generated vector simd code for the loop containing reductions Generated a prefetch instruction for the loop csr_mat_max: 51, Generated an alternate version of the loop Generated vector simd code for the loop containing reductions Generated a prefetch instruction for the loop Generated vector simd code for the loop containing reductions Generated a prefetch instruction for the loop cusolversps: 139, Memory copy idiom, loop replaced by call to __c_mcopy8 207, Generating copyin(csrvala(:)) Generating copy(x(:)) Generating copyin(csrrowptra(:),csrcolinda(:)) Generating copy(b(:)) 229, Loop not vectorized/parallelized: contains call 233, Loop not vectorized/parallelized: contains call 238, Memory copy idiom, loop replaced by call to __c_mcopy8 245, Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:)) Generating copy(b(:),x(:)) 283, Generating copyin(csrvala(:)) Generating copy(x(:)) Generating copyin(csrrowptra(:),csrcolinda(:)) Generating copy(b(:)) 305, Loop not vectorized/parallelized: contains call 309, Loop not vectorized/parallelized: contains call pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda main.f90 main: 66, Memory set idiom, loop replaced by call to __c_mset8 98, Memory zero idiom, loop replaced by call to __c_mzero8 pgf90 -o a.out precision.o mmio.o cusolverSP_mod.o measure_time.o mm_convert_csr.o sp_solvers.o main.o -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda -Mcudalib=cusparse,cusolver
サンプル係数行列(mtxデータ)については、前章の説明 を参照して欲しい。提供した tar file の中には、サンプルとして lap3D_7pt_n20.mtx 等の mtx 形式の対称行列ファイルを含んでいるので、これをテスト用データして使用してほしい。
make run コマンドで実行が開始される。入力 mtx ファイルが求められるので、入手した mtx ファイル名を指定する。次に rhs ベクトルファイルが存在する場合は、そのファイル名を指定する。存在しない場合は、enter を押下し、実行に進む。以下の例では、symrcm リオーダリングを施して、コレスキー分解にて計算した結果である。Intel Haswell CPU 上の実行では 4スレッドで 47 秒、1スレッドでは 160 秒の時間が掛かるが、GPU 上では geforce 1080ti という倍精度演算器の極めて少ない ボードを使用しても 19 秒程度で終了する。
[kato@photon32]$ export OMP_NUM_THREADS=4 (CPU版の実行では4スレッド並列で実行する) [kato@photon32]$ make run $ make run ===== Program run ===== a.out 0 = CPU and GPU, 1 = GPU only? CPU+ GPU 計算か、 GPUのみの計算? 0 1 = cholesky, 2 = LU, 3 = QR ; What solver? ソルバーの種類選択 1 0 = No, schemes 1 = symrcm, 2 = symamd ; What kind of reordering? Reordering scheme の選択 1 ================================================= Sparse matrix linear solver schem : Cholesky Reordering scheme of matrix : symrcm ================================================= File name for Matrix Market? 行列の入力ファイル名 /home/kato/GPGPU/cuLIBs/CG/Matrix_Market/DATA/parabolic_fem/parabolic_fem.mtx Matrix Market file name = /home/kato/GPGPU/cuLIBs/CG/Matrix_Market/DATA/parabolic_fem/parabolic_fem.mtx Properties of Matrix_Market matrix ==================================== representation = coordinate type = real matrix type = symmetric row (m) = 525825 cols(n) = 525825 Non-zero elm. = 3674625 extendSymMatrix= 1 ======================================================================================= NOTICE: Triangular Symmetric matrix has been transformed to full matrix for SP solvers. ======================================================================================= File name for LHS vector? if no available, press ENTER and set as RHS=1.0. ! RHS(右辺ベクトル)の用意がないため、enter i,indx,jndx 1 1 1 0.1000003178914388 i,indx,jndx 2 1 523522 -4.9999841054280604E-002 i,indx,jndx 3 1 523523 -4.9999841054280604E-002 i,indx,jndx 4 2 2 0.2000009536743164 i,indx,jndx 5 2 523905 -4.9999841054280604E-002 i,indx,jndx 6 2 523906 3.1789143880208332E-007 i,indx,jndx 7 2 524290 -9.9999682108561208E-002 i,indx,jndx 8 2 524802 -4.9999841054280604E-002 i,indx,jndx 9 3 3 0.1000006357828777 i,indx,jndx 10 3 525057 -4.9999841054280646E-002 i,indx,jndx 3674615 525823 525823 0.2000009536743135 i,indx,jndx 3674616 525824 33153 -4.9999841054280535E-002 i,indx,jndx 3674617 525824 131841 -4.9999841054276260E-002 i,indx,jndx 3674618 525824 328447 -9.9999682108561472E-002 i,indx,jndx 3674619 525824 426751 3.1789143884642877E-007 i,indx,jndx 3674620 525824 525824 0.2000009536743123 i,indx,jndx 3674621 525825 5 -4.9999841054280604E-002 i,indx,jndx 3674622 525825 131841 -4.9999841054280694E-002 i,indx,jndx 3674623 525825 328448 3.1789143862453213E-007 i,indx,jndx 3674624 525825 525313 -9.9999682108558099E-002 i,indx,jndx 3674625 525825 525825 0.2000009536743136 == CSR data check : Start from 1 to 10 elements == csrRowPtrA= 1 4 9 13 16 21 25 30 37 42 csrColIndA= 1 523522 523523 2 523905 523906 524290 524802 3 525057 csrValA = 0.1000D+00 -.5000D-01 -.5000D-01 0.2000D+00 -.5000D-01 0.3179D-06 -.1000D+00 -.5000D-01 0.1000D+00 -.5000D-01 == Start from endpoint-10 to endpoint == csrRowPtrA= 3674576 3674581 3674586 3674591 3674596 3674601 3674606 3674611 3674616 3674621 3674626 csrColIndA= 525823 33153 131841 328447 426751 525824 5 131841 328448 525313 525825 csrValA= 0.2000D+00 -.5000D-01 -.5000D-01 -.1000D+00 0.3179D-06 0.2000D+00 -.5000D-01 -.5000D-01 0.3179D-06 -.1000D+00 0.2000D+00 ================================================================== The linear system is solved by sparse CHOLESKY factorization. Reordering scheme : symrcm Tolerance to decide if singular or not = .1000000000D-11 A matrix has symmmetric pattern. ================================================================== 4スレッド並列時 cuSOLVER SP solver result on [ CPU ] 47.714441 (sec) ================================================================================================== |b - A*x| = 0.3055902197957D-09 |A| = 0.8000012715659D+00 |x| = 0.2638521162680D+06 |b - A*x|/(|A|*|x|) = 0.1447732059137D-14 x( 1)=0.263852116268E+06 x( 2)=0.263093899026E+06 x( 3)=0.263844899226E+06 x( 4)=0.263852116268E+06 x( 5)=0.263093899027E+06 x( 6)=0.263844899225E+06 x( 7)=0.263199487139E+06 x( 8)=0.262666049525E+06 x( 9)=0.263304916648E+06 x( 10)=0.262560622769E+06 x( 525815)=0.263094481758E+06 x( 525816)=0.263094376198E+06 x( 525817)=0.263094281209E+06 x( 525818)=0.263094196778E+06 x( 525819)=0.263094122892E+06 x( 525820)=0.263094059539E+06 x( 525821)=0.263094006710E+06 x( 525822)=0.263093964396E+06 x( 525823)=0.263093932591E+06 x( 525824)=0.263093911290E+06 x( 525825)=0.263093900489E+06 cuSOLVER SP solver result on [ GPU ] 19.054326 (sec) ================================================================================================== |b - A*x| = 0.1527951098979D-09 |A| = 0.8000012715659D+00 |x| = 0.2638521162680D+06 |b - A*x|/(|A|*|x|) = 0.7238660295685D-15 x( 1)=0.263852116268E+06 x( 2)=0.263093899027E+06 x( 3)=0.263844899226E+06 x( 4)=0.263852116268E+06 x( 5)=0.263093899027E+06 x( 6)=0.263844899226E+06 x( 7)=0.263199487139E+06 x( 8)=0.262666049525E+06 x( 9)=0.263304916649E+06 x( 10)=0.262560622769E+06 x( 525815)=0.263094481758E+06 x( 525816)=0.263094376198E+06 x( 525817)=0.263094281209E+06 x( 525818)=0.263094196778E+06 x( 525819)=0.263094122892E+06 x( 525820)=0.263094059539E+06 x( 525821)=0.263094006710E+06 x( 525822)=0.263093964396E+06 x( 525823)=0.263093932591E+06 x( 525824)=0.263093911290E+06 x( 525825)=0.263093900490E+06
上記で示したプログラム一式は、Linux 上で動作させることを前提としたものである。同じプログラムを Windows 上で動作させるために Makefile を変更したものとソース・ファイル一式を以下に提供する。以下のファイルをダウンロードしていただきたい。Linux 版の環境と異なる点は、PGI 17.10 において、Windows 版には PGI 用の cuSOLVER ライブラリのバンドルがないため、NVIDIA CUDA toolkit 9.0(8.0) のインストールで実装されたオリジナル cuSOLVER ライブラリをリンクするようにしている点である。
プログラム tar ファイル:cuSOLVERsp_solvers.zip
Linux 版の環境と異なる点は、PGI 17.10 において、Windows 版には PGI が提供する cuSOLVER ライブラリのバンドルがないため、NVIDIA CUDA toolkit 9.0(8.0) のインストールで実装されたオリジナル cuSOLVER ライブラリをリンクするようにしている点である。また、cuSOLVER SP ルーチンの Windows 版は、CPU 実行用のソルバー・ルーチンに対して、OpenMP 並列マルチスレッド対応がなされていないことである。CPU上のソルバーはシングル・スレッドで実行される。
以下の例は、PGI 17.10 バージョンの PGI コマンドプロンプト(端末)を使用しての結果である。使用しているバージョンを確かめたい場合は、以下のように -V コマンド・オプションを指定する。
$ pgfortran -V pgfortran 17.10-0 64-bit target on x86-64 Windows -tp sandybridge PGI Compilers and Tools Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
また、使用した GPU は以下のとおりである。
$ pgaccelinfo CUDA Driver Version: 9000 Device Number: 0 Device Name: GeForce GTX 780 Device Revision Number: 3.5 Global Memory Size: 3221225472 Number of Multiprocessors: 12 Number of SP Cores: 2304 Number of DP Cores: 768 Concurrent Copy and Execution: Yes Total Constant Memory: 65536 Total Shared Memory per Block: 49152 Registers per Block: 65536 Warp Size: 32 Maximum Threads per Block: 1024 Maximum Block Dimensions: 1024, 1024, 64 Maximum Grid Dimensions: 2147483647 x 65535 x 65535 Maximum Memory Pitch: 2147483647B Texture Alignment: 512B Clock Rate: 941 MHz Execution Timeout: Yes Integrated Device: No Can Map Host Memory: Yes Compute Mode: default Concurrent Kernels: Yes ECC Enabled: No Memory Clock Rate: 3004 MHz Memory Bus Width: 384 bits L2 Cache Size: 1572864 bytes Max Threads Per SMP: 2048 Async Engines: 1 Unified Addressing: Yes Managed Memory: Yes PGI Compiler Option: -ta=tesla:cc35
PGI$ make pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda precision.f90 pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mfixed mmio.f mmread: 154, Loop not vectorized/parallelized: potential early exits 199, Loop not vectorized/parallelized: potential early exits 203, Loop not vectorized/parallelized: potential early exits 207, Loop not vectorized/parallelized: potential early exits 212, Loop not vectorized/parallelized: potential early exits 251, Loop not vectorized/parallelized: potential early exits 255, Loop not vectorized/parallelized: potential early exits 259, Loop not vectorized/parallelized: potential early exits mminfo: 455, Loop not vectorized/parallelized: potential early exits mmwrite: 678, Loop not vectorized/parallelized: contains call 682, Loop not vectorized/parallelized: contains call 686, Loop not vectorized/parallelized: contains call 692, Loop not vectorized/parallelized: contains call 709, Loop not vectorized/parallelized: contains call 713, Loop not vectorized/parallelized: contains call 718, Loop not vectorized/parallelized: contains call lowerc: 771, Loop not vectorized/parallelized: contains call getwd: 795, Loop not vectorized/parallelized: potential early exits countwd: 826, Loop not vectorized/parallelized: contains call pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -Mcuda cusolverSP_mod.cuf pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda measure_time.f90 pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda mm_convert_csr.f90 matrix_printout: 70, Loop not vectorized/parallelized: contains call 79, Loop not vectorized/parallelized: contains call set_matrix: 140, Unrolled inner loop 8 times Generated 2 prefetch instructions for the loop 150, Loop not vectorized/parallelized: contains call 171, Memory copy idiom, loop replaced by call to __c_mcopy4 172, Memory copy idiom, loop replaced by call to __c_mcopy4 175, Memory copy idiom, loop replaced by call to __c_mcopy8 179, Memory copy idiom, loop replaced by call to __c_mcopy4 183, Memory copy idiom, loop replaced by call to __c_mcopy16 coo2csr_convert: 361, Memory zero idiom, loop replaced by call to __c_mzero8 Memory zero idiom, loop replaced by call to __c_mzero4 364, Generated 3 alternate versions of the loop Generated vector simd code for the loop Generated 2 prefetch instructions for the loop Generated vector simd code for the loop Generated 2 prefetch instructions for the loop Generated vector simd code for the loop Generated 2 prefetch instructions for the loop Generated vector simd code for the loop Generated 2 prefetch instructions for the loop 374, Generating copy(csrvala(:),indx(:),jndx(:),rval(:),csrrowptra(:)) 392, Generating enter data copyin(p(:),pbuffer(:)) 411, Generating exit data delete(pbuffer(:),p(:)) 415, Loop is parallelizable Accelerator kernel generated Generating Tesla code 415, !$acc loop gang, vector(128) ! blockidx%x threadidx%x 431, Memory copy idiom, loop replaced by call to __c_mcopy4 pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda sp_solvers.f90 vec_max: 33, Generated 2 alternate versions of the loop Generated vector simd code for the loop containing reductions Generated a prefetch instruction for the loop Generated vector simd code for the loop containing reductions Generated a prefetch instruction for the loop Generated vector simd code for the loop containing reductions Generated a prefetch instruction for the loop csr_mat_max: 51, Generated 2 alternate versions of the loop Generated vector simd code for the loop containing reductions Generated a prefetch instruction for the loop Generated vector simd code for the loop containing reductions Generated a prefetch instruction for the loop Generated vector simd code for the loop containing reductions Generated a prefetch instruction for the loop cusolversps: 139, Memory copy idiom, loop replaced by call to __c_mcopy8 207, Generating copy(x(:)) Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:)) Generating copy(b(:)) 229, Loop not vectorized/parallelized: contains call 233, Loop not vectorized/parallelized: contains call 238, Memory copy idiom, loop replaced by call to __c_mcopy8 239, Memory zero idiom, loop replaced by call to __c_mzero8 246, Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:)) Generating copy(b(:),x(:)) 284, Generating copy(x(:)) Generating copyin(csrvala(:),csrrowptra(:),csrcolinda(:)) Generating copy(b(:)) 306, Loop not vectorized/parallelized: contains call 310, Loop not vectorized/parallelized: contains call pgf90 -c -Minfo -Kieee -O2 -Mlarge_arrays -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcuda main.f90 main: 66, Memory set idiom, loop replaced by call to __c_mset8 98, Memory zero idiom, loop replaced by call to __c_mzero8 pgf90 -o a.out precision.obj mmio.obj cusolverSP_mod.obj measure_time.obj mm_convert_csr.obj sp_solvers.obj main.obj -acc -ta=tesla, cuda9.0,cc35,cc60 -Mcuda -L"C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v9.0\lib\x64" -Mcudalib=cusparse,cusolver
PGI$ make run ===== Program run ===== a.out 0 = CPU and GPU, 1 = GPU only? 0 1 = cholesky, 2 = LU, 3 = QR ; What solver? 3 0 = No, schemes 1 = symrcm, 2 = symamd ; What kind of reordering? 1 ================================================= Sparse matrix linear solver schem : QR Reordering scheme of matrix : symrcm ================================================= File name for Matrix Market? lap3D_7pt_n20.mtx Matrix Market file name = lap3D_7pt_n20.mtx Properties of Matrix_Market matrix ==================================== representation = coordinate type = real matrix type = symmetric row (m) = 8000 cols(n) = 8000 Non-zero elm. = 53600 extendSymMatrix= 1 ======================================================================================= NOTICE: Triangular Symmetric matrix has been transformed to full matrix for SP solvers. ======================================================================================= File name for LHS vector? if no available, press ENTER and set as RHS=1.0. i,indx,jndx 1 1 1 6.000000000000000 i,indx,jndx 2 1 2 -1.000000000000000 i,indx,jndx 3 1 21 -1.000000000000000 i,indx,jndx 4 1 401 -1.000000000000000 i,indx,jndx 5 2 1 -1.000000000000000 i,indx,jndx 6 2 2 6.000000000000000 i,indx,jndx 7 2 3 -1.000000000000000 i,indx,jndx 8 2 22 -1.000000000000000 i,indx,jndx 9 2 402 -1.000000000000000 i,indx,jndx 10 3 2 -1.000000000000000 i,indx,jndx 53590 7998 7998 6.000000000000000 i,indx,jndx 53591 7998 7999 -1.000000000000000 i,indx,jndx 53592 7999 7599 -1.000000000000000 i,indx,jndx 53593 7999 7979 -1.000000000000000 i,indx,jndx 53594 7999 7998 -1.000000000000000 i,indx,jndx 53595 7999 7999 6.000000000000000 i,indx,jndx 53596 7999 8000 -1.000000000000000 i,indx,jndx 53597 8000 7600 -1.000000000000000 i,indx,jndx 53598 8000 7980 -1.000000000000000 i,indx,jndx 53599 8000 7999 -1.000000000000000 i,indx,jndx 53600 8000 8000 6.000000000000000 == CSR data check : Start from 1 to 10 elements == csrRowPtrA= 1 5 10 15 20 25 30 35 40 45 csrColIndA= 1 2 21 401 1 2 3 22 402 2 csrValA = 0.6000D+01 -.1000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.6000D+01 -.1000D+01 -.1000D+01 -.1000D+01 -.1000D+01 == Start from endpoint-10 to endpoint == csrRowPtrA= 53552 53557 53562 53567 53572 53577 53582 53587 53592 53597 53601 csrColIndA= 7998 7999 7599 7979 7998 7999 8000 7600 7980 7999 8000 csrValA= 0.6000D+01 -.1000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.6000D+01 -.1000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.6000D+01 ================================================================== The linear system is solved by sparse QR factorization. Reordering scheme : symrcm Tolerance to decide if singular or not = .1000000000D-11 A matrix has symmmetric pattern. ================================================================== cuSOLVER SP solver result on [ CPU ] 2.355999 (sec) ================================================================================================== |b - A*x| = 0.5115907697473D-12 |A| = 0.1200000000000D+02 |x| = 0.2458019372584D+02 |b - A*x|/(|A|*|x|) = 0.1734427507818D-14 x( 1)=0.666997357463E+00 x( 2)=0.100066138159E+01 x( 3)=0.119691025971E+01 x( 4)=0.132458774155E+01 x( 5)=0.141248710662E+01 x( 6)=0.147457380480E+01 x( 7)=0.151841489418E+01 x( 8)=0.154840911171E+01 x( 9)=0.156717865908E+01 x( 10)=0.157622105444E+01 x( 7990)=0.157622105444E+01 x( 7991)=0.157622105444E+01 x( 7992)=0.156717865908E+01 x( 7993)=0.154840911171E+01 x( 7994)=0.151841489418E+01 x( 7995)=0.147457380480E+01 x( 7996)=0.141248710662E+01 x( 7997)=0.132458774155E+01 x( 7998)=0.119691025971E+01 x( 7999)=0.100066138159E+01 x( 8000)=0.666997357463E+00 cuSOLVER SP solver result on [ GPU ] 0.683000 (sec) ================================================================================================== |b - A*x| = 0.4547473508865D-12 |A| = 0.1200000000000D+02 |x| = 0.2458019372584D+02 |b - A*x|/(|A|*|x|) = 0.1541713340283D-14 x( 1)=0.666997357463E+00 x( 2)=0.100066138159E+01 x( 3)=0.119691025971E+01 x( 4)=0.132458774155E+01 x( 5)=0.141248710662E+01 x( 6)=0.147457380480E+01 x( 7)=0.151841489418E+01 x( 8)=0.154840911171E+01 x( 9)=0.156717865908E+01 x( 10)=0.157622105444E+01 x( 7990)=0.157622105444E+01 x( 7991)=0.157622105444E+01 x( 7992)=0.156717865908E+01 x( 7993)=0.154840911171E+01 x( 7994)=0.151841489418E+01 x( 7995)=0.147457380480E+01 x( 7996)=0.141248710662E+01 x( 7997)=0.132458774155E+01 x( 7998)=0.119691025971E+01 x( 7999)=0.100066138159E+01 x( 8000)=0.666997357463E+00