前章 12-2 では、OpenACC + cuSPARSE を利用した Bi-Conjugate Gradient Stabilized(BiCGStab)実装プログラム例を示した。このルーチンを汎用的に使用できるように、疎行列のフォーマット変換も含んだ形でプログラムを組み直した。NVIDIA CUDA 線形ライブラリでは、計算に使用するスパース行列のフォーマットが CSR タイプを基本とするため、CUDA ライブラリを使用する場合は、事前に、CSR フォーマットに変換する必要がある。この項では、cuSPARSE ライブラリを使用してデータ・フォーマットの変換を行い、デバイス側で BiCGStab 反復法で線形方程式を解く Fortran プログラムの例を示すこととする。ここで例示するソース・ファイルと Makefile 一式をアーカイブした tar ファイルを提供する。以下のファイルをダウンロードしていただきたい。
プログラム tar ファイル:openacc_bicgstab.tar.gz (2017.11.14更新しました。プログラム修正)
以下の例は、PGI 17.10 バージョンを使用しての結果である。使用しているバージョンを確かめたい場合は、以下のように -V コマンド・オプションを指定する。
$ pgfortran -V pgfortran 17.10-0 64-bit target on x86-64 Linux -tp haswell PGI Compilers and Tools Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
また、使用した GPU は以下のとおりである。
$ pgaccelinfo CUDA Driver Version: 9000 NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.59 Wed Jul 19 23:53:34 PDT 2017 Device Number: 0 Device Name: GeForce GTX 1080 Ti Device Revision Number: 6.1 Global Memory Size: 11714691072 Number of Multiprocessors: 28 Concurrent Copy and Execution: Yes Total Constant Memory: 65536 Total Shared Memory per Block: 49152 Registers per Block: 65536 Warp Size: 32 Maximum Threads per Block: 1024 Maximum Block Dimensions: 1024, 1024, 64 Maximum Grid Dimensions: 2147483647 x 65535 x 65535 Maximum Memory Pitch: 2147483647B Texture Alignment: 512B Clock Rate: 1657 MHz Execution Timeout: Yes Integrated Device: No Can Map Host Memory: Yes Compute Mode: default Concurrent Kernels: Yes ECC Enabled: No Memory Clock Rate: 5505 MHz Memory Bus Width: 352 bits L2 Cache Size: 2883584 bytes Max Threads Per SMP: 2048 Async Engines: 2 Unified Addressing: Yes Managed Memory: Yes PGI Compiler Option: -ta=tesla:cc60
スパース行列を表現するフォーマットは、様々提案されているが、ここでは、サンプル係数行列の流通に使用されている Matrix Market の MTX フォーマットを入力として CSR フォーマットに変換するプログラムを紹介する。サンプル係数行列を提供しているサイトとして有名なのは、他に、The SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) があり、ここでのデータを利用して線形システムのテスト実行が可能である。これらのサイトでは、MTXフォーマットのデータの提供を行っている。
Matrix Market(MTX) フォーマットは、ヘッダー部分を除けば、非ゼロ要素の配列位置型(coordinate) で表す、いわゆる COO 形式となっている。COO フォーマットとは、(行, 列, 値) のタプルでリストを作る方法である。CUDA ライブラリ関数へ入力する係数行列のストレージ・フォーマットは CSR 形式としているため、COO 形式を CSR 形式に変換する必要がある。
提供プログラムの注意点を述べる。Matrix Market のフォーマットで「symmetric」属性を持つファイルは、対称行列の上三角行列のみ収納されている。一方、「general」属性を持つものは、「一般行列」を意味しノンゼロの配列要素が全て記述されている。mm_convert_csr.f90 プログラムは、デフォルトで「symmetric」属性を有する対称行列要素を「一般行列」に拡張するようにしている。プログラム内で、対称行列を一般行列に拡張するためのフラグ変数は extendSymMatrix であり、これにより制御している。「一般行列」に拡張する理由は、BiCGstab の前処理に不完全 LU 分解 (cusparseDcsrilu0) を施しているが、この cusparseDcsrilu0 ルーチンの入力仕様が当然ながら「一般行列」の入力を必要とするからである。
Matrix Market サイトでは、MTXフォーマットのファイルを読み込むための fortran ルーチン(mmio.f) が提供されている。今回、tar ファイル内に含めた mmio.f は、サブルーチンの仮引数にオプショナル引数を使用できるように変更し、F90 MODULE形式にしたものとした。当該 mmio.f ルーチンは、mtx ファイルのヘッダ情報を読み、COO 形式のデータを読み込むためのルーチンとなる。
以下に示すプログラム(mm_convert_csr.f90) は、MTX ファイルを CSR 形式のデータに変換するものである。データ形式の変換は、cuSPARSE ライブラリの関数を利用するため、デバイス側での処理を含む。このため、デバイス・データであることを指示するために OpenACC ディレクティブを利用する。以下の表は、mm_convert_csr.f90 内の主なルーチンの機能を示したものである。
module Matrix_data | COO形式の配列バッファ宣言とCSR形式のデータ配列宣言、結果ベクトル配列、RHS ベクトル配列宣言 |
subroutine mm_matrix_read | MTX形式のデータの読み込みルーチン |
subroutine mm_rhs_read | 線形方程式の右辺項ベクトルの読み込みルーチン |
subroutine coo2csr_convert | COO形式データをCSR形式データに変換するルーチン。配列の index base は 1 として変換する。cuSPARSE ライブラリを使用するため、デバイス側での処理も含むため、OpenACC 指示行を使用する。なお、このルーチンは、データ型として倍精度の対応のみ行なっており、複素数型を扱うには修正が必要である。なお、このルーチンの処理後、csrValA、csrRowPtrA、csrColIndA の CSR データの配列要素は、ホスト側にもデバイス側にも配置された状態にあなる。 |
CSR フォーマットの詳細は、NVIDIA cuSPARSE ドキュメントを参考にして欲しい。係数行列を CSR 形式で表現する際に必要な入力用データ配列は、以下のとおりである。
nz | 整数 | 行列内の非ゼロ要素の数 |
csrValA | 配列 | 行メジャー形式で A のすべての非ゼロ値を保持するための長さ nz のデータ配列 in row-major format. |
csrRowPtrA | 配列 | 配列 csrColIndA および csrValA にインデックスを保持するための長さ m + 1 の整数配列。 この配列の最初の m 個のエントリは、 i = i、...、m の i 行目の最初の非ゼロ要素のインデックスを含み、最後のエントリは nnz + csrRowPtrA(0)を含みます。 一般に、csrRowPtrA(0)は、 0 あるいは 1 から開始されることを表すインデックスベースを表し、これは、それぞれ 0 または 1 となる。ここでは 1 を使用する。 |
csrColIndA | 配列 | 配列 csrValA の対応する要素の列インデックスを含む長さ nz の整数配列 |
! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved. ! ! Permission is hereby granted, free of charge, to any person obtaining a ! copy of this software and associated documentation files (the "Software"), ! to deal in the Software without restriction, including without limitation ! the rights to use, copy, modify, merge, publish, distribute, sublicense, ! and/or sell copies of the Software, and to permit persons to whom the ! Software is furnished to do so, subject to the following conditions: ! ! The above copyright notice and this permission notice shall be included in ! all copies or substantial portions of the Software. ! ! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR ! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, ! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL ! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER ! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING ! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER ! DEALINGS IN THE SOFTWARE. ! module Matrix_data ! COO storage data and CSR data -- declare variables use precision implicit none ! Buffer for Matrix Market data format (as we say "COO" format) integer :: rows, cols, nnz, nnzmax character*10 :: rep character*7 :: field character*19 :: symm integer, allocatable, dimension(:) :: indx, jndx integer, allocatable, dimension(:) :: ival real(fp_kind), allocatable, dimension(:) :: rval complex*16, allocatable, dimension(:) :: cval ! Arrays for CUDA cuSPARSE CSR format (available for REAL data only) integer :: m ! rows size of A matrix integer :: n ! columns size of A matrix integer :: nz ! nonzero element of A integer, allocatable, dimension(:) :: csrRowPtrA ! m + 1 integer, allocatable, dimension(:) :: csrColIndA ! real(fp_kind), allocatable, dimension(:) :: csrValA ! real(fp_kind), allocatable, dimension(:) :: x ! initial guess & result vector real(fp_kind), allocatable, dimension(:) :: rhs ! right hand side vector end module Matrix_data module mm_coo_cuda_csr_convert use precision use Matrix_data use mmio implicit none contains subroutine matrix_printout(f, nonzero, i_ndx, j_ndx, rval1, ival1, cval1) implicit none integer :: i, nonzero character*7 :: f integer :: i_ndx(nonzero) integer :: j_ndx(nonzero) integer, optional :: ival1(nonzero) double precision, optional :: rval1(nonzero) complex*16, optional :: cval1(nonzero) do i = 1, min(nonzero,10) if (f == "complex") then print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),cval1(i) elseif (f == "real") then print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),rval1(i) elseif (f == "integer") then print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),ival1(i) end if end do do i = max(nonzero-10, 1), nonzero if (f == "complex") then print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),cval1(i) elseif (f == "real") then print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),rval1(i) elseif (f == "integer") then print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),ival1(i) end if end do end subroutine matrix_printout subroutine csr_printout implicit none integer :: i print * print *,"== CSR data check : Start from 1 to 10 elements ==" if (field == "complex") then print '(1x, a/, 7(1x,I10 ))',"csrRowPtrA=",(csrRowPtrA(i), i=1,min(m+1,10)) print '(1x, a/, 7(1x,I10 ))',"csrColIndA=",(csrColIndA(i), i=1,min(nz,10)) print '(1x a/, 7(1x,2d10.4))',"csrValA =",(csrValA(i), i=1,min(nz,10)) elseif (field == "real") then print '(1x, a/, 7(1x,I10 ))',"csrRowPtrA=",(csrRowPtrA(i), i=1,min(m+1,10)) print '(1x, a/, 7(1x,I10 ))',"csrColIndA=",(csrColIndA(i), i=1,min(nz,10)) print '(1x, a/, 7(1x,d10.4))',"csrValA =",(csrValA(i), i=1,min(nz,10)) end if print * print *,"== Start from endpoint-10 to endpoint ==" if (field == "complex") then print '(1x, a/, 7(1x,I10 ))',"csrRowPtrA=",(csrRowPtrA(i), i=max(m+1-10,1),m+1) print '(1x, a/, 7(1x,I10 ))',"csrColIndA=",(csrColIndA(i), i=max(nz-10,1),nz) print '(1x, a/, 7(1x,2d10.4))',"csrValA=",(csrValA(i), i=max(nz-10,1),nz) elseif (field == "real") then print '(1x, a/, 7(1x,I10 ))',"csrRowPtrA=",(csrRowPtrA(i), i=max(m+1-10,1), m+1) print '(1x, a/, 7(1x,I10 ))',"csrColIndA=",(csrColIndA(i), i=max(nz-10,1),nz) print '(1x, a/, 7(1x,d10.4))',"csrValA=",(csrValA(i), i=max(nz-10,1),nz) end if end subroutine csr_printout subroutine set_matrix(iunit1, extendSymMatrix, field, symm, t_nnz, t_indx, t_jndx, & t_ival, t_rval, t_cval) use precision use Matrix_data character*7 :: field character*19 :: symm integer :: t_nnz integer ,optional :: t_indx(t_nnz), t_jndx(t_nnz) integer, optional :: t_ival(t_nnz) real(fp_kind), optional :: t_rval(t_nnz) complex*16, optional :: t_cval(t_nnz) integer :: i, j, iunit1, count integer :: extendSymMatrix ! 1 = extend , 0 = no if (symm == "symmetric" .and. extendSymMatrix == 1) then ! count number of non-diagonal elements count = 0 do i = 1, t_nnz if (t_indx(i) /= t_jndx(i)) count=count+1 end do allocate(indx(count + t_nnz)) allocate(jndx(count + t_nnz)) if (field == "real" ) allocate(rval(count + t_nnz)) if (field == "integer") allocate(ival(count + t_nnz)) if (field == "complex") allocate(cval(count + t_nnz)) ! copy the elements regular and transposed locations j = 1 do i = 1, t_nnz indx(j) = t_indx(i) jndx(j) = t_jndx(i) if (field == "real" ) rval(j) = t_rval(i) if (field == "integer") ival(j) = t_ival(i) if (field == "complex") cval(j) = t_cval(i) j = j + 1 if (t_indx(i) /= t_jndx(i)) then indx(j) = t_jndx(i) jndx(j) = t_indx(i) if (field == "real" ) rval(j) = t_rval(i) if (field == "integer") ival(j) = t_ival(i) if (field == "complex") cval(j) = t_cval(i) j = j + 1 end if end do t_nnz = t_nnz + count else allocate(indx(t_nnz)) allocate(jndx(t_nnz)) allocate(cval(t_nnz)) indx(:) = t_indx(:) jndx(:) = t_jndx(:) if (field == "real" ) then allocate(rval(t_nnz)) rval(:) = t_rval(:) end if if (field == "integer") then allocate(ival(t_nnz)) ival(:) = t_ival(:) end if if (field == "complex") then allocate(cval(t_nnz)) cval(:) = t_cval(:) end if end if end subroutine set_matrix subroutine mm_matrix_read(iunit1, extendSymMatrix) ! Read Matrix Market format file implicit none integer, allocatable, dimension(:) :: t_indx, t_jndx integer, allocatable, dimension(:) :: t_ival real(fp_kind), allocatable, dimension(:) :: t_rval complex*16, allocatable, dimension(:) :: t_cval integer :: i, iunit1, count integer :: extendSymMatrix ! 1 = extend , 0 = none ! Read header infos call mminfo(iunit1, rep, field, symm, rows, cols, nnz) ! small test data ! ! ! | 1 2 0 | ! ! A = | 0 5 0 | base = 1 ! ! | 0 8 0 | ! rep = "coordinate" ! field = "real" ! symm ="general" ! rows = 3 ; cols = 3 ; nnz = 4 ! indx(1:4) = (/3, 2, 1, 1/) ! jndx(1:4) = (/2, 2, 1, 2/) ! rval(1:4) =(/8.0, 5.0, 1.0, 2.0/) ! for small test data nnzmax = nnz allocate (t_indx(nnz), t_jndx(nnz)) if (field == "complex") then allocate (t_cval(nnz)) call mmread(iunit1, rep, field, symm, rows, cols, nnz, nnzmax, & t_indx, t_jndx, cval=t_cval) call set_matrix(iunit1, extendSymMatrix, field, symm, nnz, t_indx, t_jndx, & t_cval=t_cval) elseif (field == "real") then allocate (t_rval(nnz)) call mmread(iunit1, rep, field, symm, rows, cols, nnz, nnzmax, & t_indx, t_jndx, rval=t_rval) call set_matrix(iunit1, extendSymMatrix, field, symm, nnz, t_indx, t_jndx, & t_rval=t_rval) elseif (field == "integer") then allocate (t_ival(nnz)) call mmread(iunit1, rep, field, symm, rows, cols, nnz, nnzmax, & t_indx, t_jndx, ival=t_ival) call set_matrix(iunit1, extendSymMatrix, field, symm, nnz, t_indx, t_jndx, & t_ival=t_ival) elseif (field == "pattern") then print *, "Pattern field is not available." ; stop "10" else print *, "Error : data type not found." , field ; stop "10" end if deallocate(t_indx,t_jndx) if (field == "real" ) deallocate(t_rval) if (field == "integer") deallocate(t_ival) if (field == "complex") deallocate(t_cval) print *,"Properties of Matrix_Market matrix" print *,"====================================" print '(3(1x,a,1x,a/),4(1x,a,i8/))', & "representation =",rep, & "type =",field,& "matrix type =",symm, & "row (m) =",rows, & "cols(n) =",cols, & "Non-zero elm. =",nnz, & "extendSymMatrix=",extendSymMatrix if (symm == "symmetric" .and. extendSymMatrix == 1) & print '(/a/)', " ==== Triangular Symmetric matrix has been transformed to full matrix form ====" if (field == "complex") then call matrix_printout (field, nnz, indx, jndx, cval1=cval) elseif (field == "real") then call matrix_printout (field, nnz, indx, jndx, rval1=rval) elseif (field == "integer") then call matrix_printout (field, nnz, indx, jndx, ival1=ival) print *, "No available in case of field = integer in this routine" stop "20" elseif (field == "pattern") then print *, "No available in case of field = pattern in this routine" stop "30" end if close(iunit1) end subroutine mm_matrix_read subroutine mm_rhs_read(iunit2) ! INPUT : vector file (Matrix Market format) ! OUTPUT : rhs(:) -- global array use mmio implicit none integer :: i, iunit2 character*40 :: rhs_file integer :: rows1, cols1, nnz1, nnzmax1 character*10 :: rep1 character*7 :: field1 character*19 :: symm1 integer, allocatable, dimension(:) :: indx1, jndx1 call mminfo(iunit2, rep1, field1, symm1, rows1, cols1, nnz1) if (rows1 /= rows) then print *, "Error : RHS vector size does match with Matrix row size." stop end if print *,"Properties of Matrix_Market RHS vector" print *,"====================================" print '(3(1x,a,1x,a/),3(1x,a,i8/))', & "representation =",rep1, & "type =",field1,& "matrix type =",symm1, & "row (m) =",rows1, & "cols(n) =",cols1, & "Non-zero elm. =",nnz1 nnzmax1 = nnz1 allocate (indx1(rows1), jndx1(cols1)) allocate (rhs(nnz1)) call mmread(iunit2, rep1, field1, symm1, rows1, cols1, nnz1, nnzmax1, & indx1, jndx1, rval=rhs) ! print *,"rhs=", rhs deallocate (indx1, jndx1) close(iunit2) end subroutine mm_rhs_read subroutine coo2csr_convert ! INPUT : Matrix COO data from Matrix Market format via mm_read routine ! OUTPUT : CSR data = csrRowPtrA(m+1), csrColIndA(nz), csrValA(nz) -- global data use cublas use cusparse !! fortran program as index_base = 1 type(cusparseHandle) :: cusparse_H type(cublasHandle) :: cublas_H integer(8) :: buff_size ! must be 8byte integer. so needs -Mlarge_arrays option character*1,allocatable,dimension(:) :: pBuffer integer, allocatable,dimension(:) :: P integer :: status integer :: ierr_code integer :: i if ( rows /= cols ) then print *, "Error : Matrix must be square (rows = cols) in this routine." stop end if ! set m, n, nz m = rows n = cols ! matrix size nz = nnz ! nz == nnz if ( m /= n ) then print *, "Error : Matrix A is not square matrix ",m,"x",n stop end if ! set CSR storage arrays allocate(csrRowPtrA(m+1)) allocate(csrColIndA(nz) ) allocate(csrValA(nz) ) ! Initilize csrRowPtrA =0 ; csrColIndA =0 ; csrValA = 0_fp_kind ! index base is changed from 1 to 0 for using cusparseXcoosortByRow routine do i = 1, nz indx(i) = indx(i) - 1 jndx(i) = jndx(i) - 1 end do !! call matrix_printout (field, nnz, indx, jndx, rval1=rval) ! create handle for CUSPARSE status = cusparseCreate(cusparse_H) !$acc data copy(indx, jndx) copy(rval, csrValA, csrRowPtrA) ! calculate buffer size to conver !$acc host_data use_device(indx, jndx) status = cusparseXcoosort_bufferSizeExt(cusparse_H, m, n, nz, indx, jndx, buff_size) !$acc end host_data print *, "Xcoosort_bufferSizeExt status =", status ! Prepare working space print *, "working space (bytes) =", buff_size allocate(pBuffer(buff_size), stat=ierr_code) print *, "allocation status code for pBuffer =", ierr_code ! setup permutation vector P allocate(P(nz), stat=ierr_code) print *, "allocation status code for P =", ierr_code !$acc enter data copyin(P, pBuffer) ! P, pBuffer enter in ! Creates an identity map !$acc host_data use_device(P) status = cusparseCreateIdentityPermutation(cusparse_H, nz, P) !$acc end host_data print *, "CreateIdentityPermutation status =", status ! sorts COO format by row !$acc host_data use_device(indx, jndx, P, pBuffer) status = cusparseXcoosortByRow(cusparse_H, m, n, nz, indx, jndx, P, pBuffer) !$acc end host_data print *, "XcoosortByRow status =", status ! gather sorted COO Vals = set "csrValA" !$acc host_data use_device(rval, csrValA, P) status = cusparseDgthr(cusparse_H, nz, rval, csrValA, P, CUSPARSE_INDEX_BASE_ZERO) !$acc end host_data !$acc exit data delete(P, pBuffer) ! P, Pbuffer deleted ! index base is changed from 0 to 1 for subsequent Fortran cusolver routines on device !$acc kernels do i = 1, nz indx(i) = indx(i) + 1 jndx(i) = jndx(i) + 1 end do !$acc end kernels ! COO to CSR format = set "csrRowPtrA" !$acc host_data use_device(indx, csrRowPtrA) status = cusparseXcoo2csr(cusparse_H, indx, nz, m, csrRowPtrA, CUSPARSE_INDEX_BASE_ONE) !$acc end host_data !$acc end data deallocate(P, pBuffer) ! copy jndx to csrColIndA = set "csrColIndA" csrColIndA = jndx if (field == "complex") then call matrix_printout (field, nz, indx, jndx, cval1=cval) elseif (field == "real") then call matrix_printout (field, nz, indx, jndx, rval1=csrValA) end if ! check CSR data call csr_printout end subroutine coo2csr_convert end module mm_coo_cuda_csr_convert
メインプログラムを以下に示す。このプログラムは、複素数の係数行列は対応していない。実数型のデータのみを扱える。最初に、Matrix Market 形式のファイル名を指定し、次に right hand side(rhs) データファイルがあれば、それを指定する。もし、存在しなければ、rhs ベクトルは全て 1.0 として計算する。入力データの読み込み後、coo2csr_convert ルーチンを call して、COO 形式データからCSR 形式データに変換する。変換後、csrRowPtrA、csrColIndA、csrValA 配列が作成され、この CSR 形式の行列を実引数として、BiCGStab ルーチンに渡し、計算が始まる。bicgstab.f90 ルーチンの内容に関しては、前章 12-2 を参照していただきたい。
program main use precision use Matrix_data use mm_coo_cuda_csr_convert use bicg_routine implicit none integer :: i, j, istat integer,parameter :: iunit1 = 10 ! for Matrix Market data (COO) integer,parameter :: iunit2 = 11 ! for LHS, if available character*50 :: mm_file, rhs_file integer :: extendSymMatrix ! 1 = symmetrize the pattern (transform from triangular to full) integer, parameter :: max_iter = 10000 ! max_iteration real(fp_kind), parameter :: tol = 1.0d-13 ! tolerance integer, parameter :: index_base = 1 ! CSR format -- one based index ! Read matrix Market data file extendSymMatrix = 1 ! if =1 .and. symmetric matrix=yes , transform from upper triangular to full print *, "File name for Matrix Market?" read (5, '(a)') mm_file !! mm_file ="DATA/bcsstk14.mtx" print *, "Matrix Market file name = ",mm_file open (iunit1, file=mm_file, status='old') call mm_matrix_read(iunit1, extendSymMatrix) ! Read right hand side vectors, if available. print * print *, "File name for LHS vector? if no available, press ENTER and set as RHS=1.0." read (5, '(a)') rhs_file if (rhs_file == "") then allocate (rhs(rows)) rhs= 1.0_fp_kind else print *, "rhs_file name = ",rhs_file open (iunit2, file=rhs_file, status='old') call mm_rhs_read(iunit2) end if ! Convert COO to CSR sparse matrix ! INPUT : Matrix COO data from Matrix Market format via mm_rhs_read routine ! OUTPUT : CSR data = csrRowPtrA(m+1), csrColIndA(nz), csrValA(nz) -- global data call coo2csr_convert ! CSR sparse matrix data are set. The belows are arguments for BiCGStab routine. ! Arrays for CUDA cuSPARSE CSR format ! integer :: m IN ! rows size for square matrix ! integer :: n IN ! columns size for square matrix (n=m) ! integer :: nz IN ! nonzero element of A ! integer :: csrRowPtrA (m+1) IN ! the array of offsets corresponding to the start of each row ! integer :: csrColIndA (nz) IN ! sorted by row and by column within each row ! real(fp_kind) :: csrValA (nz) IN ! sorted by row and by column within each row ! real(fp_kind) :: x(m) INOUT ! initial guess & result ! real(fp_kind) :: rhs(m) IN ! right hand side ! Print CSR Matrix data ! print *, "csrValA", csrValA ! print *, "csrRowPtrA",csrRowPtrA ! print *, "csrColIndA", csrColIndA ! Initilize X allocate (x(n)) x = 0.0_fp_kind ! or Initial guess X is set ! Compute by BiCGStab method ! Preconditioned Conjugate Gradient using Incomplete-LU ! factorization with 0 fill-in and no pivoting print '(/1x,a,d15.10,2x,a,i8/)'," tol = ", tol, "max_iteration =", max_iter call BiCGStab (n, nz, csrRowPtrA, csrColIndA, csrValA, x, rhs, & tol, max_iter, index_base) end program main
! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved. ! ! Permission is hereby granted, free of charge, to any person obtaining a ! copy of this software and associated documentation files (the "Software"), ! to deal in the Software without restriction, including without limitation ! the rights to use, copy, modify, merge, publish, distribute, sublicense, ! and/or sell copies of the Software, and to permit persons to whom the ! Software is furnished to do so, subject to the following conditions: ! ! The above copyright notice and this permission notice shall be included in ! all copies or substantial portions of the Software. ! ! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR ! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, ! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL ! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER ! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING ! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER ! DEALINGS IN THE SOFTWARE. ! module bicg_routine use precision implicit none contains !============================================================= subroutine BiCGStab(n, nz, csrRowPtrA, csrColIndA, csrValA, x, rhs, & tol, max_iter, index_base) !------------------------------------------------------------- ! Bi-Conjugate Gradient Stabilized (BiCGStab) ! A is an n × n sparse matrix defined in CSR storage format !------------------------------------------------------------- use precision use measure_time use cublas_v2 use cusparse implicit none ! Arrays for CUDA cuSPARSE CSR format integer :: n ! size of squre matrix integer :: nz ! nonzero element of A integer :: csrRowPtrA(n+1) ! the array of offsets corresponding to the start of each row integer :: csrColIndA (nz) ! sorted by row and by column within each row real(fp_kind) :: csrValA (nz) ! sorted by row and by column within each row real(fp_kind) :: x (n) ! initial guess & result real(fp_kind) :: rhs(n) ! right hand side integer :: index_base ! = 1 integer :: max_iter real(fp_kind) :: tol ! Local variables type(cusparseHandle) :: cusparse_H type(cublasHandle) :: cublas_H type(cusparseMatDescr) :: descrA type(cusparseMatDescr) :: descrL type(cusparseMatDescr) :: descrU type(cusparseSolveAnalysisInfo) :: infoA, infoL, infoU integer :: cusparse_status integer :: cublas_status integer :: status, ierr_code real(fp_kind) :: alpha, alpham1, beta, omega, omegam1 real(fp_kind) :: rho, rho_p real(fp_kind) :: nrmr, nrmr0, temp1, temp2, nrmCK1, nrmCK2 real(fp_kind) :: error, rsum, diff real(fp_kind), allocatable, dimension(:) :: valsILU0 real(fp_kind), allocatable, dimension(:) :: Lov, z real(fp_kind), allocatable, dimension(:) :: r, r_tlda real(fp_kind), allocatable, dimension(:) :: p real(fp_kind), allocatable, dimension(:) :: q real(fp_kind), allocatable, dimension(:) :: t, s real(fp_kind),parameter:: one = 1.0, zero = 0.0, onem1 = -1.0 integer :: i, j, k integer :: nprint integer, parameter :: idbg = 2 real(fp_kind),parameter :: eps = 6.9d-310 ! machine epsilon allocate( r(n), p(n), r_tlda(n) ) allocate( q(n) ) allocate( valsILU0(nz) ) allocate( Lov(n), z(n), t(n) , s(n) ) ! Bi-Conjugate Gradient Stabilized (BiCGStab) using incomplete-LU print '(/1x,a)',"-----------------------------------------------" print '(1x,a)', " Bi-Conjugate Gradient Stabilized (BiCGStab)" print *, "-----------------------------------------------" ! print *,"rhs=",rhs do i = 1, n r(i) = rhs(i) ! set r = rhs, where Ax = rhs (Ax = b) end do alpha = 1.0_fp_kind beta = 0.0_fp_kind rho = 0.0_fp_kind k = 0 ! Clock status = init_clock status = start_clock ! creater handle for cuBLAS status = cublasCreate(cublas_H) ! creater handle for CUSPARSE status = cusparseCreate(cusparse_H) ! matrix A descriptor(descrA) and properties status = cusparseCreateMatDescr(descrA) status = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL) ! must be GENERAL for using csrilu0 status = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE) ! base = 1 ! initiialize the analysis info for A matrix (infoA = sparsity pattern of A) status = cusparseCreateSolveAnalysisInfo(infoA) ! initiialize the L U preconditioner Info status = cusparseCreateSolveAnalysisInfo(infoL) status = cusparseCreateSolveAnalysisInfo(infoU) ! Setup for Lower Matrix status = cusparseCreateMatDescr (descrL) status = cusparseSetMatType (descrL, CUSPARSE_MATRIX_TYPE_GENERAL) status = cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ONE) status = cusparseSetMatFillMode (descrL, CUSPARSE_FILL_MODE_LOWER) status = cusparseSetMatDiagType (descrL, CUSPARSE_DIAG_TYPE_UNIT) ! Setup for Upper Matrix status = cusparseCreateMatDescr (descrU) status = cusparseSetMatType (descrU, CUSPARSE_MATRIX_TYPE_GENERAL) status = cusparseSetMatIndexBase(descrU, CUSPARSE_INDEX_BASE_ONE) status = cusparseSetMatFillMode (descrU, CUSPARSE_FILL_MODE_UPPER) status = cusparseSetMatDiagType (descrU, CUSPARSE_DIAG_TYPE_NON_UNIT) ! copy original vals to new ILU0-vals valsILU0 = csrValA !print *,"X=",X !print *,"r_init=",r !Print CSR Matrix data ! print *,"m,n,nz",m,n,nz ! print *, "csrValA", csrValA ! print *, "csrRowPtrA",csrRowPtrA ! print *, "csrColIndA", csrColIndA ! print *,"csrValA=",(csrValA(i),i=1,100) !$acc data copy(r, X) copyin(valsILU0, csrValA, csrRowPtrA, csrColIndA ) & !$acc create(p, q, Lov, z, r_tlda ) create(t, s) ! === Initial guess x0 and residual r = b - Ax0 where r = b !$acc host_data use_device( csrValA, csrRowPtrA, csrColIndA, X, r, r_tlda, p ) cusparse_status = cusparseDcsrmv(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, & n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, X, zero, q) ! q = A x0, q is temporary status = cublasDaxpy(cublas_H, n, onem1, q, 1, r, 1) ! r = r - A x0 ! set as r_tlda = r and p = r status = cublasDcopy(cublas_H, n, r, 1, r_tlda, 1) ! r_tlda = r status = cublasDcopy(cublas_H, n, r, 1, p, 1) ! p = r status = cublasDnrm2(cublas_H, n, r, 1, nrmr0) ! nrmr0 = ||r||2 !$acc end host_data print *,"nrmr0=",nrmr0, "tol*nrmr0=",tol*nrmr0 !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, valsILU0) ! the analysis of the sparsity pattern for the Non-Transpose case (triangular linear system) infoL/descrL cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, & n, nz, descrL, csrValA, csrRowPtrA, csrColIndA, infoL) !print*,"cusparseDcsrsv_analysis1=", cusparse_status ! the analysis phase of the solution of a sparse linear system for infoU/descrU cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, & n, nz, descrU, csrValA, csrRowPtrA, csrColIndA, infoU) !print*,"cusparseDcsrsv_analysis2=",cusparse_status ! Incomplete-LU factorization with 0 level fill-in and no pivoting for A matrix ! output : valsILU0 cusparse_status = cusparseDcsrilu0(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, descrA, valsILU0, & csrRowPtrA, csrColIndA, infoL); !$acc end host_data !$acc wait ! check for the solution using valsILU0 vector from Incomplete-LU factorization !$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, r, Lov) cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, & valsILU0, csrRowPtrA, csrColIndA, infoL, r, Lov) status = cublasDnrm2(cublas_H, n, Lov, 1, nrmCK1) cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, & valsILU0, csrRowPtrA, csrColIndA, infoU, Lov, z) ! solve z status = cublasDnrm2(cublas_H, n, z, 1, nrmCK2) !$acc end host_data print*,"nrmCK1, nrmCK2 =", nrmCK1, nrmCK2 if (nrmCK1 == 0.0 .or. nrmCK2 == 0.0 .or. nrmCK1 /= nrmCK1 .or. nrmCK2 /= nrmCK2) then print *, "ERROR : This matrix is not be able to solve using valsILU0 vector from Incomplete-LU factorization." stop end if status = end_clock time_tag = "The analysis and Incomplete-LU factorization phase of ILU(0) :" status = print_time(time_tag) status = start_clock do while ( k <= max_iter ) k = k + 1 rho_p = rho !$acc host_data use_device(r_tlda, r) status = cublasDdot(cublas_H, n, r_tlda, 1, r, 1, rho) ! rho = (r_tlda, r) !$acc end host_data if ( abs(rho) < eps .or. rho == 0.0 ) then print *,"ERROR! rho is zero", rho, "BiCG calculation condition does not hold. STOP!" exit end if if ( k > 1 ) then if ( abs(omega) < eps .or. rho == 0.0 ) then print *,"ERROR! omega is zero, BiCG calculation condition does not hold. STOP!" exit end if beta = (rho/rho_p) * (alpha/omega) omegam1 = -omega !$acc host_data use_device(q, p, r) cublas_status = cublasDaxpy(cublas_H, n, omegam1, q, 1, p, 1) ! p = p -omega * q cublas_status = cublasDscal(cublas_H, n, beta, p, 1) ! p = beta * p cublas_status = cublasDaxpy(cublas_H, n, one, r, 1, p, 1) ! p = p + r !$acc end host_data end if ! Solve using valsILU0 vector from Incomplete-LU factorization, Solve M z = p !$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, p, Lov, z) ! Forwadaard Solve, L part for infoA(the sparsity pattern of descL) cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, & valsILU0, csrRowPtrA, csrColIndA, infoL, p, Lov) ! Back Substitution (U part for infoU = descU) cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, & valsILU0, csrRowPtrA, csrColIndA, infoU, Lov, z) ! solve z !$acc end host_data ! Compute q = A z !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, r_tlda, z, q) cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, & n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, z, zero, q) ! q = A z status = cublasDdot(cublas_H, n, r_tlda, 1, q, 1, temp1) ! temp1 = (r_tlda, q) !$acc end host_data alpha = rho / temp1 ! alpha = rho / (r_tlda, q) alpham1 = -alpha !$acc host_data use_device(q, z, X, r) cublas_status = cublasDaxpy(cublas_H, n, alpham1, q, 1, r, 1) ! r = r - alpha * q cublas_status = cublasDaxpy(cublas_H, n, alpha, z, 1, X, 1) ! X = X + alpha * z cublas_status = cublasDnrm2(cublas_H, n, r, 1, nrmr) ! nrmr = || r ||2 !$acc end host_data ! check for convergence if ( nrmr < tol*nrmr0 ) exit ! solve M s = r !$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, r, Lov, s, t) ! Forward Solve, L part for infoA(the sparsity pattern of descL) cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, & valsILU0, csrRowPtrA, csrColIndA, infoL, r, Lov) ! Back Substitution (U part for infoU = descU) cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, & valsILU0, csrRowPtrA, csrColIndA, infoU, Lov, s) ! solve s ! Compute t = A s cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, & ! t = A s n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, s, zero, t) status = cublasDdot(cublas_H, n, t, 1, r, 1, temp1) ! temp1 = (t, r) status = cublasDdot(cublas_H, n, t, 1, t, 1, temp2) ! temp2 = (t, t) !$acc end host_data omega = temp1/temp2 omegam1 = -omega !$acc host_data use_device(s, t, X, r) status = cublasDaxpy(cublas_H, n, omega, s, 1, X, 1) ! X = X + omega * s status = cublasDaxpy(cublas_H, n, omegam1, t, 1, r, 1) ! r = r - omega * t status = cublasDnrm2(cublas_H, n, r, 1, nrmr) ! nrmr = || r ||2 !$acc end host_data ! check for convergence if ( nrmr < tol*nrmr0 ) exit if (idbg == 1 ) print *, "iteration=", k, " residual =", nrmr end do !$acc end data print *, "iteration=", k, " residual =", nrmr print *, "nrmr/nrmr0 =", nrmr/nrmr0 print * if ( nrmr < tol*nrmr0 ) then print *, "Bi-Conjugate Gradient Stabilized (BiCGStab) is PASSED." else print *, "Bi-Conjugate Gradient Stabilized (BiCGStab) is FAILED." end if status = end_clock time_tag = "Elapased time for BiGGstab loop with ILU(0) :" status = print_time(time_tag) ! check result error = 0.0 do i = 1, n rsum = 0.0 do j = csrRowPtrA(i), csrRowPtrA(i+1)-1 rsum = rsum + csrValA(j) * X(csrColIndA(j)) end do diff = abs(rsum - rhs(i)) if (diff > error) error = diff end do print '(1x,a,d20.13/)',"error =",error ! output result if (idbg >= 1) then nprint = n if ( n > 10 ) nprint = 10 do i = 1, nprint print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i) end do print *,"" do i = n-nprint, n print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i) end do end if status = cublasDestroy(cublas_H) status = cusparseDestroy(cusparse_H) status = cusparseDestroyMatDescr(descrA) status = cusparseDestroyMatDescr(descrL) status = cusparseDestroyMatDescr(descrU) status = cusparseDestroySolveAnalysisInfo(infoA) status = cusparseDestroySolveAnalysisInfo(infoL) status = cusparseDestroySolveAnalysisInfo(infoU) deallocate( q, r, r_tlda, p ) deallocate( Lov, z, valsILU0, t, s ) end subroutine BiCGStab End module bicg_routine
tar file 内に提供した Makefile の記述に沿って、コンパイル&リンクを行う。なお、この Makefile では、cuda9.0 ツールキットのライブラリを使用するように設定しているが、cuda8.0 を使用しても良い。cusparse と cublas ライブラリは、PGIコンパイラにバンドルしているルーチンを使用する形になっている。
$ make pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse,cublas -Mlarge_arrays precision.f90 pgf90 -c -Minfo -Kieee -O -Mfixed mmio.f pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse,cublas -Mlarge_arrays measure_time.f90 pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse,cublas -Mlarge_arrays mm_convert_csr.f90 matrix_printout: 68, Loop not vectorized/parallelized: contains call 77, Loop not vectorized/parallelized: contains call set_matrix: 138, Unrolled inner loop 8 times Generated 2 prefetch instructions for the loop 148, Loop not vectorized/parallelized: contains call 169, Memory copy idiom, loop replaced by call to __c_mcopy4 170, Memory copy idiom, loop replaced by call to __c_mcopy4 173, Memory copy idiom, loop replaced by call to __c_mcopy8 177, Memory copy idiom, loop replaced by call to __c_mcopy4 181, Memory copy idiom, loop replaced by call to __c_mcopy16 coo2csr_convert: 356, Memory zero idiom, loop replaced by call to __c_mzero8 Memory zero idiom, loop replaced by call to __c_mzero4 359, Generated an alternate version of the loop Generated vector simd code for the loop Generated 2 prefetch instructions for the loop Generated vector simd code for the loop Generated 2 prefetch instructions for the loop 369, Generating copy(csrvala(:),indx(:),rval(:),jndx(:),csrrowptra(:)) 387, Generating enter data copyin(p(:),pbuffer(:)) 406, Generating exit data delete(p(:),pbuffer(:)) 410, Loop is parallelizable Accelerator kernel generated Generating Tesla code 410, !$acc loop gang, vector(128) ! blockidx%x threadidx%x 426, Memory copy idiom, loop replaced by call to __c_mcopy4 pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse,cublas -Mlarge_arrays bicgstab.f90 bicgstab: 93, Memory copy idiom, loop replaced by call to __c_mcopy8 138, Memory copy idiom, loop replaced by call to __c_mcopy8 149, Generating create(q(:)) Generating copyin(valsilu0(:)) Generating create(t(:)) Generating copyin(csrrowptra(:),csrcolinda(:)) Generating create(lov(:),p(:)) Generating copyin(csrvala(:)) Generating create(z(:)) Generating copy(x(:),r(:)) Generating create(s(:),r_tlda(:)) 207, Loop not vectorized/parallelized: potential early exits 318, Generated an alternate version of the loop Generated vector simd code for the loop containing reductions Generated a prefetch instruction for the loop Generated vector simd code for the loop containing reductions Generated a prefetch instruction for the loop FMA (fused multiply-add) instruction(s) generated 330, Loop not vectorized/parallelized: contains call 334, Loop not vectorized/parallelized: contains call pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse,cublas -Mlarge_arrays main.f90 main: 36, Memory set idiom, loop replaced by call to __c_mset8 68, Memory zero idiom, loop replaced by call to __c_mzero8 pgf90 -o a.out precision.o mmio.o measure_time.o mm_convert_csr.o bicgstab.o main.o -Mcuda -acc -ta=tesla,cuda8.0,cc35,cc60
サンプル係数行列(mtxデータ)は、The SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) から入手する。NVIDIA の Incomplete-LU and Cholesky Preconditioned Iterative Methods Using cuSPARSE and cuBLAS というホワイトペーパーにテストで使用されている係数行列の名前が以下のようにリストアップされている。この名前を The SuiteSparse Matrix Collection 内の検索サイトで探すとテストデータが表示され、「Matrxi Market」形式ファイルをダウンロードすれば、mtx ファイルが入手できる。なお、係数行列の素性によっては解けない場合もある。
# | Matrix Name | m,n | nnz | s.p.d. | Application |
1. | offshore | 259,789 | 4,242,673 | yes | Geophysics:https://sparse.tamu.edu/Um/offshore |
2. | af_shell3 | 504,855 | 17,562,051 | yes | Mechanics:https://sparse.tamu.edu/Schenk_AFE/af_shell3 |
3. | parabolic_fem | 525,825 | 3,674,625 | yes | General:https://sparse.tamu.edu/Wissgott/parabolic_fem |
4. | apache2 | 715,176 | 4,817,870 | yes | Mechanics:https://sparse.tamu.edu/GHS_psdef/apache2 |
5. | ecology2 | 999,999 | 4,995,991 | yes | Biology:https://sparse.tamu.edu/McRae/ecology2 |
6. | thermal2 | 1,228,045 | 8,580,313 | yes | Thermal Simulation:https://sparse.tamu.edu/Schmid/thermal2 |
7. | G3_circuit | 1,585,478 | 7,660,826 | yes | Circuit Simulation:https://sparse.tamu.edu/AMD/G3_circuit |
8. | FEM_3D_thermal2 | 147,900 | 3,489,300 | no | Mechanics:https://sparse.tamu.edu/Botonakis/FEM_3D_thermal2 |
10. | ASIC_320ks | 321,671 | 1,316,08511 | no | Circuit Simulation:https://sparse.tamu.edu/Sandia/ASIC_320ks |
11. | cage13 | 445,315 | 7,479,343 | no | Biology:https://sparse.tamu.edu/vanHeukelum/cage13 |
12. | atmosmodd | 1,270,432 | 8,814,880 | no | Atmospheric Model:https://sparse.tamu.edu/Bourchtein/atmosmodd |
make run コマンドで実行が開始される。入力 mtx ファイルが求められるので、入手した mtx ファイル名を指定する。次に rhs ベクトルファイルが存在する場合は、そのファイル名を指定する。存在しない場合は、enter を押下し、実行に進む。もし、反復法の計算が不能となった場合は、cuSOLVER SP ルーチンを利用した直接解法で計算してみることもお勧めする。
[kato@photon32]$ make run ===== Program run ===== a.out File name for Matrix Market? lap2D_5pt_n100.mtx Matrix Market file name = lap2D_5pt_n100.mtx Properties of Matrix_Market matrix ==================================== representation = coordinate type = real matrix type = symmetric row (m) = 10000 cols(n) = 10000 Non-zero elm. = 49600 extendSymMatrix= 1 ==== Triangular Symmetric matrix has been transformed to full matrix form ==== i,indx,jndx 1 1 1 4.000000000000000 i,indx,jndx 2 2 1 -1.000000000000000 i,indx,jndx 3 1 2 -1.000000000000000 i,indx,jndx 4 101 1 -1.000000000000000 i,indx,jndx 5 1 101 -1.000000000000000 i,indx,jndx 6 2 2 4.000000000000000 i,indx,jndx 7 3 2 -1.000000000000000 i,indx,jndx 8 2 3 -1.000000000000000 i,indx,jndx 9 102 2 -1.000000000000000 i,indx,jndx 10 2 102 -1.000000000000000 i,indx,jndx 49590 9996 9997 -1.000000000000000 i,indx,jndx 49591 9997 9997 4.000000000000000 i,indx,jndx 49592 9998 9997 -1.000000000000000 i,indx,jndx 49593 9997 9998 -1.000000000000000 i,indx,jndx 49594 9998 9998 4.000000000000000 i,indx,jndx 49595 9999 9998 -1.000000000000000 i,indx,jndx 49596 9998 9999 -1.000000000000000 i,indx,jndx 49597 9999 9999 4.000000000000000 i,indx,jndx 49598 10000 9999 -1.000000000000000 i,indx,jndx 49599 9999 10000 -1.000000000000000 i,indx,jndx 49600 10000 10000 4.000000000000000 File name for LHS vector? if no available, press ENTER and set as RHS=1.0. Xcoosort_bufferSizeExt status = 0 working space (bytes) = 794880 allocation status code for pBuffer = 0 allocation status code for P = 0 CreateIdentityPermutation status = 0 XcoosortByRow status = 0 i,indx,jndx 1 1 1 4.000000000000000 i,indx,jndx 2 1 2 -1.000000000000000 i,indx,jndx 3 1 101 -1.000000000000000 i,indx,jndx 4 2 1 -1.000000000000000 i,indx,jndx 5 2 2 4.000000000000000 i,indx,jndx 6 2 3 -1.000000000000000 i,indx,jndx 7 2 102 -1.000000000000000 i,indx,jndx 8 3 2 -1.000000000000000 i,indx,jndx 9 3 3 4.000000000000000 i,indx,jndx 10 3 4 -1.000000000000000 i,indx,jndx 49590 9998 9898 -1.000000000000000 i,indx,jndx 49591 9998 9997 -1.000000000000000 i,indx,jndx 49592 9998 9998 4.000000000000000 i,indx,jndx 49593 9998 9999 -1.000000000000000 i,indx,jndx 49594 9999 9899 -1.000000000000000 i,indx,jndx 49595 9999 9998 -1.000000000000000 i,indx,jndx 49596 9999 9999 4.000000000000000 i,indx,jndx 49597 9999 10000 -1.000000000000000 i,indx,jndx 49598 10000 9900 -1.000000000000000 i,indx,jndx 49599 10000 9999 -1.000000000000000 i,indx,jndx 49600 10000 10000 4.000000000000000 == CSR data check : Start from 1 to 10 elements == csrRowPtrA= 1 4 8 12 16 20 24 28 32 36 csrColIndA= 1 2 101 1 2 3 102 2 3 4 csrValA = 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 == Start from endpoint-10 to endpoint == csrRowPtrA= 49562 49566 49570 49574 49578 49582 49586 49590 49594 49598 49601 csrColIndA= 9898 9997 9998 9999 9899 9998 9999 10000 9900 9999 10000 csrValA= -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01 tol = .1000000000D-12 max_iteration = 10000 ----------------------------------------------- Bi-Conjugate Gradient Stabilized (BiCGStab) ----------------------------------------------- nrmr0= 100.0000000000000 tol*nrmr0= 1.0000000000000001E-011 nrmCK1, nrmCK2 = 238.2158421093953 166.7314075873577 The analysis and Incomplete-LU factorization phase of ILU(0) : 0.212 (sec) iteration= 81 residual = 6.6059148572793421E-012 nrmr/nrmr0 = 6.6059148572793405E-014 Bi-Conjugate Gradient Stabilized (BiCGStab) is PASSED. Elapased time for BiGGstab loop with ILU(0) : 0.961 (sec) error = 0.6593836587854D-11 x( 1)=0.275607474398E+01 x( 2)=0.501214948795E+01 x( 3)=0.696587678467E+01 x( 4)=0.871021429238E+01 x( 5)=0.102960547760E+02 x( 6)=0.117547659017E+02 x( 7)=0.131074411873E+02 x( 8)=0.143691932892E+02 x( 9)=0.155513693040E+02 x( 10)=0.166627976257E+02 x( 9990)=0.177105403722E+02 x( 9991)=0.166627976257E+02 x( 9992)=0.155513693040E+02 x( 9993)=0.143691932892E+02 x( 9994)=0.131074411873E+02 x( 9995)=0.117547659017E+02 x( 9996)=0.102960547760E+02 x( 9997)=0.871021429238E+01 x( 9998)=0.696587678467E+01 x( 9999)=0.501214948795E+01 x( 10000)=0.275607474398E+01
上記で示したプログラム一式は、Linux 上で動作させることを前提としたものである。同じプログラムを Windows 上で動作させるために Makefile を変更したものとソース・ファイル一式を以下に提供する。以下のファイルをダウンロードしていただきたい。
プログラム tar ファイル:BiCGStab.zip
以下の例は、PGI 17.10 バージョンの PGI コマンドプロンプト(端末)を使用しての結果である。使用しているバージョンを確かめたい場合は、以下のように -V コマンド・オプションを指定する。
$ pgfortran -V pgfortran 17.10-0 64-bit target on x86-64 Windows -tp sandybridge PGI Compilers and Tools Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
また、使用した GPU は以下のとおりである。
$ pgaccelinfo CUDA Driver Version: 9000 Device Number: 0 Device Name: GeForce GTX 780 Device Revision Number: 3.5 Global Memory Size: 3221225472 Number of Multiprocessors: 12 Number of SP Cores: 2304 Number of DP Cores: 768 Concurrent Copy and Execution: Yes Total Constant Memory: 65536 Total Shared Memory per Block: 49152 Registers per Block: 65536 Warp Size: 32 Maximum Threads per Block: 1024 Maximum Block Dimensions: 1024, 1024, 64 Maximum Grid Dimensions: 2147483647 x 65535 x 65535 Maximum Memory Pitch: 2147483647B Texture Alignment: 512B Clock Rate: 941 MHz Execution Timeout: Yes Integrated Device: No Can Map Host Memory: Yes Compute Mode: default Concurrent Kernels: Yes ECC Enabled: No Memory Clock Rate: 3004 MHz Memory Bus Width: 384 bits L2 Cache Size: 1572864 bytes Max Threads Per SMP: 2048 Async Engines: 1 Unified Addressing: Yes Managed Memory: Yes PGI Compiler Option: -ta=tesla:cc35
PGI$ make pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 precision.f90 pgf90 -c -Minfo -Kieee -O -Mfixed mmio.f pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 measure_time.f90 pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 mm_convert_csr.f90 matrix_printout: 68, Loop not vectorized/parallelized: contains call 77, Loop not vectorized/parallelized: contains call set_matrix: 138, Unrolled inner loop 8 times Generated 2 prefetch instructions for the loop 148, Loop not vectorized/parallelized: contains call 169, Memory copy idiom, loop replaced by call to __c_mcopy4 170, Memory copy idiom, loop replaced by call to __c_mcopy4 173, Memory copy idiom, loop replaced by call to __c_mcopy8 177, Memory copy idiom, loop replaced by call to __c_mcopy4 181, Memory copy idiom, loop replaced by call to __c_mcopy16 coo2csr_convert: 356, Memory zero idiom, loop replaced by call to __c_mzero8 Memory zero idiom, loop replaced by call to __c_mzero4 359, Generated 3 alternate versions of the loop Generated vector simd code for the loop Generated 2 prefetch instructions for the loop Generated vector simd code for the loop Generated 2 prefetch instructions for the loop Generated vector simd code for the loop Generated 2 prefetch instructions for the loop Generated vector simd code for the loop Generated 2 prefetch instructions for the loop 369, Generating copy(csrvala(:),indx(:),jndx(:),rval(:),csrrowptra(:)) 387, Generating enter data copyin(p(:),pbuffer(:)) 406, Generating exit data delete(pbuffer(:),p(:)) 410, Loop is parallelizable Accelerator kernel generated Generating Tesla code 410, !$acc loop gang, vector(128) ! blockidx%x threadidx%x 426, Memory copy idiom, loop replaced by call to __c_mcopy4 pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 bicgstab.f90 bicgstab: 93, Memory copy idiom, loop replaced by call to __c_mcopy8 138, Memory copy idiom, loop replaced by call to __c_mcopy8 149, Generating create(q(:)) Generating copyin(valsilu0(:)) Generating create(t(:)) Generating copyin(csrrowptra(:),csrcolinda(:)) Generating create(lov(:),p(:)) Generating copyin(csrvala(:)) Generating create(z(:)) Generating copy(x(:),r(:)) Generating create(s(:),r_tlda(:)) 207, Loop not vectorized/parallelized: potential early exits 318, Generated 2 alternate versions of the loop Generated vector simd code for the loop containing reductions Generated a prefetch instruction for the loop Generated vector simd code for the loop containing reductions Generated a prefetch instruction for the loop Generated vector simd code for the loop containing reductions Generated a prefetch instruction for the loop 330, Loop not vectorized/parallelized: contains call 334, Loop not vectorized/parallelized: contains call pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 main.f90 main: 36, Memory set idiom, loop replaced by call to __c_mset8 68, Memory zero idiom, loop replaced by call to __c_mzero8 pgf90 -o a.out precision.obj mmio.obj measure_time.obj mm_convert_csr.obj bicgstab.obj main.obj -Mcuda -acc -ta=tesla,cuda9.0,cc35,cc60 -Mcudalib=cusparse,cublas
PGI$ make run ===== Program run ===== a.out File name for Matrix Market? lap2D_5pt_n100.mtx Matrix Market file name = lap2D_5pt_n100.mtx Properties of Matrix_Market matrix ==================================== representation = coordinate type = real matrix type = symmetric row (m) = 10000 cols(n) = 10000 Non-zero elm. = 49600 extendSymMatrix= 1 ==== Triangular Symmetric matrix has been transformed to full matrix form ==== i,indx,jndx 1 1 1 4.000000000000000 i,indx,jndx 2 2 1 -1.000000000000000 i,indx,jndx 3 1 2 -1.000000000000000 i,indx,jndx 4 101 1 -1.000000000000000 i,indx,jndx 5 1 101 -1.000000000000000 i,indx,jndx 6 2 2 4.000000000000000 i,indx,jndx 7 3 2 -1.000000000000000 i,indx,jndx 8 2 3 -1.000000000000000 i,indx,jndx 9 102 2 -1.000000000000000 i,indx,jndx 10 2 102 -1.000000000000000 i,indx,jndx 49590 9996 9997 -1.000000000000000 i,indx,jndx 49591 9997 9997 4.000000000000000 i,indx,jndx 49592 9998 9997 -1.000000000000000 i,indx,jndx 49593 9997 9998 -1.000000000000000 i,indx,jndx 49594 9998 9998 4.000000000000000 i,indx,jndx 49595 9999 9998 -1.000000000000000 i,indx,jndx 49596 9998 9999 -1.000000000000000 i,indx,jndx 49597 9999 9999 4.000000000000000 i,indx,jndx 49598 10000 9999 -1.000000000000000 i,indx,jndx 49599 9999 10000 -1.000000000000000 i,indx,jndx 49600 10000 10000 4.000000000000000 File name for LHS vector? if no available, press ENTER and set as RHS=1.0. Xcoosort_bufferSizeExt status = 0 working space (bytes) = 794880 allocation status code for pBuffer = 0 allocation status code for P = 0 CreateIdentityPermutation status = 0 XcoosortByRow status = 0 i,indx,jndx 1 1 1 4.000000000000000 i,indx,jndx 2 1 2 -1.000000000000000 i,indx,jndx 3 1 101 -1.000000000000000 i,indx,jndx 4 2 1 -1.000000000000000 i,indx,jndx 5 2 2 4.000000000000000 i,indx,jndx 6 2 3 -1.000000000000000 i,indx,jndx 7 2 102 -1.000000000000000 i,indx,jndx 8 3 2 -1.000000000000000 i,indx,jndx 9 3 3 4.000000000000000 i,indx,jndx 10 3 4 -1.000000000000000 i,indx,jndx 49590 9998 9898 -1.000000000000000 i,indx,jndx 49591 9998 9997 -1.000000000000000 i,indx,jndx 49592 9998 9998 4.000000000000000 i,indx,jndx 49593 9998 9999 -1.000000000000000 i,indx,jndx 49594 9999 9899 -1.000000000000000 i,indx,jndx 49595 9999 9998 -1.000000000000000 i,indx,jndx 49596 9999 9999 4.000000000000000 i,indx,jndx 49597 9999 10000 -1.000000000000000 i,indx,jndx 49598 10000 9900 -1.000000000000000 i,indx,jndx 49599 10000 9999 -1.000000000000000 i,indx,jndx 49600 10000 10000 4.000000000000000 == CSR data check : Start from 1 to 10 elements == csrRowPtrA= 1 4 8 12 16 20 24 28 32 36 csrColIndA= 1 2 101 1 2 3 102 2 3 4 csrValA = 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 == Start from endpoint-10 to endpoint == csrRowPtrA= 49562 49566 49570 49574 49578 49582 49586 49590 49594 49598 49601 csrColIndA= 9898 9997 9998 9999 9899 9998 9999 10000 9900 9999 10000 csrValA= -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01 tol = .1000000000D-12 max_iteration = 10000 ----------------------------------------------- Bi-Conjugate Gradient Stabilized (BiCGStab) ----------------------------------------------- nrmr0= 100.0000000000000 tol*nrmr0= 1.0000000000000001E-011 nrmCK1, nrmCK2 = 238.2158421093953 166.7314075873577 The analysis and Incomplete-LU factorization phase of ILU(0) : 0.253 (sec) iteration= 81 residual = 6.6059148572793421E-012 nrmr/nrmr0 = 6.6059148572793405E-014 Bi-Conjugate Gradient Stabilized (BiCGStab) is PASSED. Elapased time for BiGGstab loop with ILU(0) : 0.593 (sec) error = 0.6593836587854D-11 x( 1)=0.275607474398E+01 x( 2)=0.501214948795E+01 x( 3)=0.696587678467E+01 x( 4)=0.871021429238E+01 x( 5)=0.102960547760E+02 x( 6)=0.117547659017E+02 x( 7)=0.131074411873E+02 x( 8)=0.143691932892E+02 x( 9)=0.155513693040E+02 x( 10)=0.166627976257E+02 x( 9990)=0.177105403722E+02 x( 9991)=0.166627976257E+02 x( 9992)=0.155513693040E+02 x( 9993)=0.143691932892E+02 x( 9994)=0.131074411873E+02 x( 9995)=0.117547659017E+02 x( 9996)=0.102960547760E+02 x( 9997)=0.871021429238E+01 x( 9998)=0.696587678467E+01 x( 9999)=0.501214948795E+01 x( 10000)=0.275607474398E+01