前章 12-2 では、OpenACC + cuSPARSE を利用した Bi-Conjugate Gradient Stabilized(BiCGStab)実装プログラム例を示した。このルーチンを汎用的に使用できるように、疎行列のフォーマット変換も含んだ形でプログラムを組み直した。NVIDIA CUDA 線形ライブラリでは、計算に使用するスパース行列のフォーマットが CSR タイプを基本とするため、CUDA ライブラリを使用する場合は、事前に、CSR フォーマットに変換する必要がある。この項では、cuSPARSE ライブラリを使用してデータ・フォーマットの変換を行い、デバイス側で BiCGStab 反復法で線形方程式を解く Fortran プログラムの例を示すこととする。ここで例示するソース・ファイルと Makefile 一式をアーカイブした tar ファイルを提供する。以下のファイルをダウンロードしていただきたい。
プログラム tar ファイル:openacc_bicgstab.tar.gz (2017.11.14更新しました。プログラム修正)
以下の例は、PGI 17.10 バージョンを使用しての結果である。使用しているバージョンを確かめたい場合は、以下のように -V コマンド・オプションを指定する。
$ pgfortran -V pgfortran 17.10-0 64-bit target on x86-64 Linux -tp haswell PGI Compilers and Tools Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
また、使用した GPU は以下のとおりである。
$ pgaccelinfo CUDA Driver Version: 9000 NVRM version: NVIDIA UNIX x86_64 Kernel Module 384.59 Wed Jul 19 23:53:34 PDT 2017 Device Number: 0 Device Name: GeForce GTX 1080 Ti Device Revision Number: 6.1 Global Memory Size: 11714691072 Number of Multiprocessors: 28 Concurrent Copy and Execution: Yes Total Constant Memory: 65536 Total Shared Memory per Block: 49152 Registers per Block: 65536 Warp Size: 32 Maximum Threads per Block: 1024 Maximum Block Dimensions: 1024, 1024, 64 Maximum Grid Dimensions: 2147483647 x 65535 x 65535 Maximum Memory Pitch: 2147483647B Texture Alignment: 512B Clock Rate: 1657 MHz Execution Timeout: Yes Integrated Device: No Can Map Host Memory: Yes Compute Mode: default Concurrent Kernels: Yes ECC Enabled: No Memory Clock Rate: 5505 MHz Memory Bus Width: 352 bits L2 Cache Size: 2883584 bytes Max Threads Per SMP: 2048 Async Engines: 2 Unified Addressing: Yes Managed Memory: Yes PGI Compiler Option: -ta=tesla:cc60
スパース行列を表現するフォーマットは、様々提案されているが、ここでは、サンプル係数行列の流通に使用されている Matrix Market の MTX フォーマットを入力として CSR フォーマットに変換するプログラムを紹介する。サンプル係数行列を提供しているサイトとして有名なのは、他に、The SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) があり、ここでのデータを利用して線形システムのテスト実行が可能である。これらのサイトでは、MTXフォーマットのデータの提供を行っている。
Matrix Market(MTX) フォーマットは、ヘッダー部分を除けば、非ゼロ要素の配列位置型(coordinate) で表す、いわゆる COO 形式となっている。COO フォーマットとは、(行, 列, 値) のタプルでリストを作る方法である。CUDA ライブラリ関数へ入力する係数行列のストレージ・フォーマットは CSR 形式としているため、COO 形式を CSR 形式に変換する必要がある。
提供プログラムの注意点を述べる。Matrix Market のフォーマットで「symmetric」属性を持つファイルは、対称行列の上三角行列のみ収納されている。一方、「general」属性を持つものは、「一般行列」を意味しノンゼロの配列要素が全て記述されている。mm_convert_csr.f90 プログラムは、デフォルトで「symmetric」属性を有する対称行列要素を「一般行列」に拡張するようにしている。プログラム内で、対称行列を一般行列に拡張するためのフラグ変数は extendSymMatrix であり、これにより制御している。「一般行列」に拡張する理由は、BiCGstab の前処理に不完全 LU 分解 (cusparseDcsrilu0) を施しているが、この cusparseDcsrilu0 ルーチンの入力仕様が当然ながら「一般行列」の入力を必要とするからである。
Matrix Market サイトでは、MTXフォーマットのファイルを読み込むための fortran ルーチン(mmio.f) が提供されている。今回、tar ファイル内に含めた mmio.f は、サブルーチンの仮引数にオプショナル引数を使用できるように変更し、F90 MODULE形式にしたものとした。当該 mmio.f ルーチンは、mtx ファイルのヘッダ情報を読み、COO 形式のデータを読み込むためのルーチンとなる。
以下に示すプログラム(mm_convert_csr.f90) は、MTX ファイルを CSR 形式のデータに変換するものである。データ形式の変換は、cuSPARSE ライブラリの関数を利用するため、デバイス側での処理を含む。このため、デバイス・データであることを指示するために OpenACC ディレクティブを利用する。以下の表は、mm_convert_csr.f90 内の主なルーチンの機能を示したものである。
| module Matrix_data | COO形式の配列バッファ宣言とCSR形式のデータ配列宣言、結果ベクトル配列、RHS ベクトル配列宣言 | 
| subroutine mm_matrix_read | MTX形式のデータの読み込みルーチン | 
| subroutine mm_rhs_read | 線形方程式の右辺項ベクトルの読み込みルーチン | 
| subroutine coo2csr_convert | COO形式データをCSR形式データに変換するルーチン。配列の index base は 1 として変換する。cuSPARSE ライブラリを使用するため、デバイス側での処理も含むため、OpenACC 指示行を使用する。なお、このルーチンは、データ型として倍精度の対応のみ行なっており、複素数型を扱うには修正が必要である。なお、このルーチンの処理後、csrValA、csrRowPtrA、csrColIndA の CSR データの配列要素は、ホスト側にもデバイス側にも配置された状態にあなる。 | 
CSR フォーマットの詳細は、NVIDIA cuSPARSE ドキュメントを参考にして欲しい。係数行列を CSR 形式で表現する際に必要な入力用データ配列は、以下のとおりである。
| nz | 整数 | 行列内の非ゼロ要素の数 | 
| csrValA | 配列 | 行メジャー形式で A のすべての非ゼロ値を保持するための長さ nz のデータ配列 in row-major format. | 
| csrRowPtrA | 配列 | 配列 csrColIndA および csrValA にインデックスを保持するための長さ m + 1 の整数配列。 この配列の最初の m 個のエントリは、 i = i、...、m の i 行目の最初の非ゼロ要素のインデックスを含み、最後のエントリは nnz + csrRowPtrA(0)を含みます。 一般に、csrRowPtrA(0)は、 0 あるいは 1 から開始されることを表すインデックスベースを表し、これは、それぞれ 0 または 1 となる。ここでは 1 を使用する。 | 
| csrColIndA | 配列 | 配列 csrValA の対応する要素の列インデックスを含む長さ nz の整数配列 | 
! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved. ! ! Permission is hereby granted, free of charge, to any person obtaining a ! copy of this software and associated documentation files (the "Software"), ! to deal in the Software without restriction, including without limitation ! the rights to use, copy, modify, merge, publish, distribute, sublicense, ! and/or sell copies of the Software, and to permit persons to whom the ! Software is furnished to do so, subject to the following conditions: ! ! The above copyright notice and this permission notice shall be included in ! all copies or substantial portions of the Software. ! ! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR ! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, ! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL ! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER ! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING ! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER ! DEALINGS IN THE SOFTWARE. ! module Matrix_data ! COO storage data and CSR data -- declare variables use precision implicit none ! Buffer for Matrix Market data format (as we say "COO" format) integer :: rows, cols, nnz, nnzmax character*10 :: rep character*7 :: field character*19 :: symm integer, allocatable, dimension(:) :: indx, jndx integer, allocatable, dimension(:) :: ival real(fp_kind), allocatable, dimension(:) :: rval complex*16, allocatable, dimension(:) :: cval ! Arrays for CUDA cuSPARSE CSR format (available for REAL data only) integer :: m ! rows size of A matrix integer :: n ! columns size of A matrix integer :: nz ! nonzero element of A integer, allocatable, dimension(:) :: csrRowPtrA ! m + 1 integer, allocatable, dimension(:) :: csrColIndA ! real(fp_kind), allocatable, dimension(:) :: csrValA ! real(fp_kind), allocatable, dimension(:) :: x ! initial guess & result vector real(fp_kind), allocatable, dimension(:) :: rhs ! right hand side vector end module Matrix_data module mm_coo_cuda_csr_convert use precision use Matrix_data use mmio implicit none contains subroutine matrix_printout(f, nonzero, i_ndx, j_ndx, rval1, ival1, cval1) implicit none integer :: i, nonzero character*7 :: f integer :: i_ndx(nonzero) integer :: j_ndx(nonzero) integer, optional :: ival1(nonzero) double precision, optional :: rval1(nonzero) complex*16, optional :: cval1(nonzero) do i = 1, min(nonzero,10) if (f == "complex") then print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),cval1(i) elseif (f == "real") then print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),rval1(i) elseif (f == "integer") then print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),ival1(i) end if end do do i = max(nonzero-10, 1), nonzero if (f == "complex") then print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),cval1(i) elseif (f == "real") then print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),rval1(i) elseif (f == "integer") then print *,"i,indx,jndx",i, i_ndx(i),j_ndx(i),ival1(i) end if end do end subroutine matrix_printout subroutine csr_printout implicit none integer :: i print * print *,"== CSR data check : Start from 1 to 10 elements ==" if (field == "complex") then print '(1x, a/, 7(1x,I10 ))',"csrRowPtrA=",(csrRowPtrA(i), i=1,min(m+1,10)) print '(1x, a/, 7(1x,I10 ))',"csrColIndA=",(csrColIndA(i), i=1,min(nz,10)) print '(1x a/, 7(1x,2d10.4))',"csrValA =",(csrValA(i), i=1,min(nz,10)) elseif (field == "real") then print '(1x, a/, 7(1x,I10 ))',"csrRowPtrA=",(csrRowPtrA(i), i=1,min(m+1,10)) print '(1x, a/, 7(1x,I10 ))',"csrColIndA=",(csrColIndA(i), i=1,min(nz,10)) print '(1x, a/, 7(1x,d10.4))',"csrValA =",(csrValA(i), i=1,min(nz,10)) end if print * print *,"== Start from endpoint-10 to endpoint ==" if (field == "complex") then print '(1x, a/, 7(1x,I10 ))',"csrRowPtrA=",(csrRowPtrA(i), i=max(m+1-10,1),m+1) print '(1x, a/, 7(1x,I10 ))',"csrColIndA=",(csrColIndA(i), i=max(nz-10,1),nz) print '(1x, a/, 7(1x,2d10.4))',"csrValA=",(csrValA(i), i=max(nz-10,1),nz) elseif (field == "real") then print '(1x, a/, 7(1x,I10 ))',"csrRowPtrA=",(csrRowPtrA(i), i=max(m+1-10,1), m+1) print '(1x, a/, 7(1x,I10 ))',"csrColIndA=",(csrColIndA(i), i=max(nz-10,1),nz) print '(1x, a/, 7(1x,d10.4))',"csrValA=",(csrValA(i), i=max(nz-10,1),nz) end if end subroutine csr_printout subroutine set_matrix(iunit1, extendSymMatrix, field, symm, t_nnz, t_indx, t_jndx, & t_ival, t_rval, t_cval) use precision use Matrix_data character*7 :: field character*19 :: symm integer :: t_nnz integer ,optional :: t_indx(t_nnz), t_jndx(t_nnz) integer, optional :: t_ival(t_nnz) real(fp_kind), optional :: t_rval(t_nnz) complex*16, optional :: t_cval(t_nnz) integer :: i, j, iunit1, count integer :: extendSymMatrix ! 1 = extend , 0 = no if (symm == "symmetric" .and. extendSymMatrix == 1) then ! count number of non-diagonal elements count = 0 do i = 1, t_nnz if (t_indx(i) /= t_jndx(i)) count=count+1 end do allocate(indx(count + t_nnz)) allocate(jndx(count + t_nnz)) if (field == "real" ) allocate(rval(count + t_nnz)) if (field == "integer") allocate(ival(count + t_nnz)) if (field == "complex") allocate(cval(count + t_nnz)) ! copy the elements regular and transposed locations j = 1 do i = 1, t_nnz indx(j) = t_indx(i) jndx(j) = t_jndx(i) if (field == "real" ) rval(j) = t_rval(i) if (field == "integer") ival(j) = t_ival(i) if (field == "complex") cval(j) = t_cval(i) j = j + 1 if (t_indx(i) /= t_jndx(i)) then indx(j) = t_jndx(i) jndx(j) = t_indx(i) if (field == "real" ) rval(j) = t_rval(i) if (field == "integer") ival(j) = t_ival(i) if (field == "complex") cval(j) = t_cval(i) j = j + 1 end if end do t_nnz = t_nnz + count else allocate(indx(t_nnz)) allocate(jndx(t_nnz)) allocate(cval(t_nnz)) indx(:) = t_indx(:) jndx(:) = t_jndx(:) if (field == "real" ) then allocate(rval(t_nnz)) rval(:) = t_rval(:) end if if (field == "integer") then allocate(ival(t_nnz)) ival(:) = t_ival(:) end if if (field == "complex") then allocate(cval(t_nnz)) cval(:) = t_cval(:) end if end if end subroutine set_matrix subroutine mm_matrix_read(iunit1, extendSymMatrix) ! Read Matrix Market format file implicit none integer, allocatable, dimension(:) :: t_indx, t_jndx integer, allocatable, dimension(:) :: t_ival real(fp_kind), allocatable, dimension(:) :: t_rval complex*16, allocatable, dimension(:) :: t_cval integer :: i, iunit1, count integer :: extendSymMatrix ! 1 = extend , 0 = none ! Read header infos call mminfo(iunit1, rep, field, symm, rows, cols, nnz) ! small test data ! ! ! | 1 2 0 | ! ! A = | 0 5 0 | base = 1 ! ! | 0 8 0 | ! rep = "coordinate" ! field = "real" ! symm ="general" ! rows = 3 ; cols = 3 ; nnz = 4 ! indx(1:4) = (/3, 2, 1, 1/) ! jndx(1:4) = (/2, 2, 1, 2/) ! rval(1:4) =(/8.0, 5.0, 1.0, 2.0/) ! for small test data nnzmax = nnz allocate (t_indx(nnz), t_jndx(nnz)) if (field == "complex") then allocate (t_cval(nnz)) call mmread(iunit1, rep, field, symm, rows, cols, nnz, nnzmax, & t_indx, t_jndx, cval=t_cval) call set_matrix(iunit1, extendSymMatrix, field, symm, nnz, t_indx, t_jndx, & t_cval=t_cval) elseif (field == "real") then allocate (t_rval(nnz)) call mmread(iunit1, rep, field, symm, rows, cols, nnz, nnzmax, & t_indx, t_jndx, rval=t_rval) call set_matrix(iunit1, extendSymMatrix, field, symm, nnz, t_indx, t_jndx, & t_rval=t_rval) elseif (field == "integer") then allocate (t_ival(nnz)) call mmread(iunit1, rep, field, symm, rows, cols, nnz, nnzmax, & t_indx, t_jndx, ival=t_ival) call set_matrix(iunit1, extendSymMatrix, field, symm, nnz, t_indx, t_jndx, & t_ival=t_ival) elseif (field == "pattern") then print *, "Pattern field is not available." ; stop "10" else print *, "Error : data type not found." , field ; stop "10" end if deallocate(t_indx,t_jndx) if (field == "real" ) deallocate(t_rval) if (field == "integer") deallocate(t_ival) if (field == "complex") deallocate(t_cval) print *,"Properties of Matrix_Market matrix" print *,"====================================" print '(3(1x,a,1x,a/),4(1x,a,i8/))', & "representation =",rep, & "type =",field,& "matrix type =",symm, & "row (m) =",rows, & "cols(n) =",cols, & "Non-zero elm. =",nnz, & "extendSymMatrix=",extendSymMatrix if (symm == "symmetric" .and. extendSymMatrix == 1) & print '(/a/)', " ==== Triangular Symmetric matrix has been transformed to full matrix form ====" if (field == "complex") then call matrix_printout (field, nnz, indx, jndx, cval1=cval) elseif (field == "real") then call matrix_printout (field, nnz, indx, jndx, rval1=rval) elseif (field == "integer") then call matrix_printout (field, nnz, indx, jndx, ival1=ival) print *, "No available in case of field = integer in this routine" stop "20" elseif (field == "pattern") then print *, "No available in case of field = pattern in this routine" stop "30" end if close(iunit1) end subroutine mm_matrix_read subroutine mm_rhs_read(iunit2) ! INPUT : vector file (Matrix Market format) ! OUTPUT : rhs(:) -- global array use mmio implicit none integer :: i, iunit2 character*40 :: rhs_file integer :: rows1, cols1, nnz1, nnzmax1 character*10 :: rep1 character*7 :: field1 character*19 :: symm1 integer, allocatable, dimension(:) :: indx1, jndx1 call mminfo(iunit2, rep1, field1, symm1, rows1, cols1, nnz1) if (rows1 /= rows) then print *, "Error : RHS vector size does match with Matrix row size." stop end if print *,"Properties of Matrix_Market RHS vector" print *,"====================================" print '(3(1x,a,1x,a/),3(1x,a,i8/))', & "representation =",rep1, & "type =",field1,& "matrix type =",symm1, & "row (m) =",rows1, & "cols(n) =",cols1, & "Non-zero elm. =",nnz1 nnzmax1 = nnz1 allocate (indx1(rows1), jndx1(cols1)) allocate (rhs(nnz1)) call mmread(iunit2, rep1, field1, symm1, rows1, cols1, nnz1, nnzmax1, & indx1, jndx1, rval=rhs) ! print *,"rhs=", rhs deallocate (indx1, jndx1) close(iunit2) end subroutine mm_rhs_read subroutine coo2csr_convert ! INPUT : Matrix COO data from Matrix Market format via mm_read routine ! OUTPUT : CSR data = csrRowPtrA(m+1), csrColIndA(nz), csrValA(nz) -- global data use cublas use cusparse !! fortran program as index_base = 1 type(cusparseHandle) :: cusparse_H type(cublasHandle) :: cublas_H integer(8) :: buff_size ! must be 8byte integer. so needs -Mlarge_arrays option character*1,allocatable,dimension(:) :: pBuffer integer, allocatable,dimension(:) :: P integer :: status integer :: ierr_code integer :: i if ( rows /= cols ) then print *, "Error : Matrix must be square (rows = cols) in this routine." stop end if ! set m, n, nz m = rows n = cols ! matrix size nz = nnz ! nz == nnz if ( m /= n ) then print *, "Error : Matrix A is not square matrix ",m,"x",n stop end if ! set CSR storage arrays allocate(csrRowPtrA(m+1)) allocate(csrColIndA(nz) ) allocate(csrValA(nz) ) ! Initilize csrRowPtrA =0 ; csrColIndA =0 ; csrValA = 0_fp_kind ! index base is changed from 1 to 0 for using cusparseXcoosortByRow routine do i = 1, nz indx(i) = indx(i) - 1 jndx(i) = jndx(i) - 1 end do !! call matrix_printout (field, nnz, indx, jndx, rval1=rval) ! create handle for CUSPARSE status = cusparseCreate(cusparse_H) !$acc data copy(indx, jndx) copy(rval, csrValA, csrRowPtrA) ! calculate buffer size to conver !$acc host_data use_device(indx, jndx) status = cusparseXcoosort_bufferSizeExt(cusparse_H, m, n, nz, indx, jndx, buff_size) !$acc end host_data print *, "Xcoosort_bufferSizeExt status =", status ! Prepare working space print *, "working space (bytes) =", buff_size allocate(pBuffer(buff_size), stat=ierr_code) print *, "allocation status code for pBuffer =", ierr_code ! setup permutation vector P allocate(P(nz), stat=ierr_code) print *, "allocation status code for P =", ierr_code !$acc enter data copyin(P, pBuffer) ! P, pBuffer enter in ! Creates an identity map !$acc host_data use_device(P) status = cusparseCreateIdentityPermutation(cusparse_H, nz, P) !$acc end host_data print *, "CreateIdentityPermutation status =", status ! sorts COO format by row !$acc host_data use_device(indx, jndx, P, pBuffer) status = cusparseXcoosortByRow(cusparse_H, m, n, nz, indx, jndx, P, pBuffer) !$acc end host_data print *, "XcoosortByRow status =", status ! gather sorted COO Vals = set "csrValA" !$acc host_data use_device(rval, csrValA, P) status = cusparseDgthr(cusparse_H, nz, rval, csrValA, P, CUSPARSE_INDEX_BASE_ZERO) !$acc end host_data !$acc exit data delete(P, pBuffer) ! P, Pbuffer deleted ! index base is changed from 0 to 1 for subsequent Fortran cusolver routines on device !$acc kernels do i = 1, nz indx(i) = indx(i) + 1 jndx(i) = jndx(i) + 1 end do !$acc end kernels ! COO to CSR format = set "csrRowPtrA" !$acc host_data use_device(indx, csrRowPtrA) status = cusparseXcoo2csr(cusparse_H, indx, nz, m, csrRowPtrA, CUSPARSE_INDEX_BASE_ONE) !$acc end host_data !$acc end data deallocate(P, pBuffer) ! copy jndx to csrColIndA = set "csrColIndA" csrColIndA = jndx if (field == "complex") then call matrix_printout (field, nz, indx, jndx, cval1=cval) elseif (field == "real") then call matrix_printout (field, nz, indx, jndx, rval1=csrValA) end if ! check CSR data call csr_printout end subroutine coo2csr_convert end module mm_coo_cuda_csr_convert
メインプログラムを以下に示す。このプログラムは、複素数の係数行列は対応していない。実数型のデータのみを扱える。最初に、Matrix Market 形式のファイル名を指定し、次に right hand side(rhs) データファイルがあれば、それを指定する。もし、存在しなければ、rhs ベクトルは全て 1.0 として計算する。入力データの読み込み後、coo2csr_convert ルーチンを call して、COO 形式データからCSR 形式データに変換する。変換後、csrRowPtrA、csrColIndA、csrValA 配列が作成され、この CSR 形式の行列を実引数として、BiCGStab ルーチンに渡し、計算が始まる。bicgstab.f90 ルーチンの内容に関しては、前章 12-2 を参照していただきたい。
program main
  use precision
  use Matrix_data
  use mm_coo_cuda_csr_convert
  use bicg_routine
  implicit none
    integer                  :: i, j, istat
    integer,parameter        :: iunit1 = 10       ! for Matrix Market data (COO)
    integer,parameter        :: iunit2 = 11       ! for LHS, if available
    character*50             :: mm_file, rhs_file
    integer                  :: extendSymMatrix   ! 1 = symmetrize the pattern (transform from triangular to full)
    integer, parameter       :: max_iter = 10000  ! max_iteration
    real(fp_kind), parameter :: tol = 1.0d-13     ! tolerance
    integer, parameter       :: index_base = 1    ! CSR format -- one based index
! Read matrix Market data file
    extendSymMatrix = 1        !  if =1 .and. symmetric matrix=yes , transform from upper triangular to full
    print *, "File name for Matrix Market?"
    read (5, '(a)')  mm_file
    !! mm_file ="DATA/bcsstk14.mtx"
    print *, "Matrix Market file name = ",mm_file
    open (iunit1, file=mm_file, status='old')
    call mm_matrix_read(iunit1, extendSymMatrix)
! Read right hand side vectors, if available.
    print *
    print *, "File name for LHS vector? if no available, press ENTER and set as RHS=1.0."
    read (5, '(a)')  rhs_file
    if (rhs_file == "") then
       allocate (rhs(rows))
       rhs= 1.0_fp_kind
    else
      print *, "rhs_file name = ",rhs_file
      open (iunit2, file=rhs_file, status='old')
      call mm_rhs_read(iunit2)
    end if
! Convert COO to CSR sparse matrix
!   INPUT   : Matrix COO data from Matrix Market format via mm_rhs_read routine
!   OUTPUT  : CSR data = csrRowPtrA(m+1), csrColIndA(nz), csrValA(nz) -- global data
    call coo2csr_convert
! CSR sparse matrix data are set. The belows are arguments for BiCGStab routine.
! Arrays for CUDA cuSPARSE CSR format
!   integer       :: m                 IN    ! rows size for square matrix
!   integer       :: n                 IN    ! columns size for square matrix (n=m)
!   integer       :: nz                IN    ! nonzero element of A
!   integer       :: csrRowPtrA (m+1)  IN    ! the array of offsets corresponding to the start of each row
!   integer       :: csrColIndA (nz)   IN    ! sorted by row and by column within each row
!   real(fp_kind) :: csrValA    (nz)   IN    ! sorted by row and by column within each row
!   real(fp_kind) :: x(m)              INOUT ! initial guess & result
!   real(fp_kind) :: rhs(m)            IN    ! right hand side
! Print CSR Matrix data
!    print *, "csrValA", csrValA
!    print *, "csrRowPtrA",csrRowPtrA
!    print *, "csrColIndA", csrColIndA
! Initilize X
    allocate (x(n))
    x = 0.0_fp_kind      ! or Initial guess X is set
! Compute by BiCGStab method
! Preconditioned Conjugate Gradient using Incomplete-LU
! factorization with 0 fill-in and no pivoting
    print '(/1x,a,d15.10,2x,a,i8/)'," tol = ", tol, "max_iteration =", max_iter
    call BiCGStab (n, nz, csrRowPtrA, csrColIndA, csrValA, x, rhs, &
                   tol, max_iter, index_base)
end program main
		! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
module bicg_routine
    use precision
    implicit none
  contains
!=============================================================
    subroutine BiCGStab(n, nz, csrRowPtrA, csrColIndA, csrValA, x, rhs, &
                        tol, max_iter, index_base)
!-------------------------------------------------------------
!   Bi-Conjugate Gradient Stabilized (BiCGStab)
!   A is an n × n sparse matrix defined in CSR storage format
!-------------------------------------------------------------
    use precision
    use measure_time
    use cublas_v2
    use cusparse
    implicit none
! Arrays for CUDA cuSPARSE CSR format
    integer       :: n                ! size of squre matrix
    integer       :: nz               ! nonzero element of A
    integer       :: csrRowPtrA(n+1)  ! the array of offsets corresponding to the start of each row
    integer       :: csrColIndA (nz)  ! sorted by row and by column within each row
    real(fp_kind) :: csrValA    (nz)  ! sorted by row and by column within each row
    real(fp_kind) :: x  (n)           ! initial guess & result
    real(fp_kind) :: rhs(n)           ! right hand side
    integer       :: index_base       ! = 1
    integer       :: max_iter
    real(fp_kind) :: tol
! Local variables
    type(cusparseHandle)     :: cusparse_H
    type(cublasHandle)       :: cublas_H
    type(cusparseMatDescr)   :: descrA
    type(cusparseMatDescr)   :: descrL
    type(cusparseMatDescr)   :: descrU
    type(cusparseSolveAnalysisInfo) :: infoA, infoL, infoU
    integer                  :: cusparse_status
    integer                  :: cublas_status
    integer                  :: status, ierr_code
    real(fp_kind)            :: alpha, alpham1, beta, omega, omegam1
    real(fp_kind)            :: rho, rho_p
    real(fp_kind)            :: nrmr, nrmr0, temp1, temp2, nrmCK1, nrmCK2
    real(fp_kind)            :: error, rsum, diff
    real(fp_kind), allocatable, dimension(:) :: valsILU0
    real(fp_kind), allocatable, dimension(:) :: Lov, z
    real(fp_kind), allocatable, dimension(:) :: r, r_tlda
    real(fp_kind), allocatable, dimension(:) :: p
    real(fp_kind), allocatable, dimension(:) :: q
    real(fp_kind), allocatable, dimension(:) :: t, s
    real(fp_kind),parameter::  one = 1.0, zero = 0.0, onem1 = -1.0
    integer                  :: i, j, k
    integer                  :: nprint
    integer, parameter       :: idbg = 2
    real(fp_kind),parameter  :: eps = 6.9d-310     ! machine epsilon
    allocate( r(n), p(n), r_tlda(n) )
    allocate( q(n) )
    allocate( valsILU0(nz) )
    allocate( Lov(n), z(n), t(n) , s(n) )
! Bi-Conjugate Gradient Stabilized (BiCGStab) using incomplete-LU
    print '(/1x,a)',"-----------------------------------------------"
    print '(1x,a)', " Bi-Conjugate Gradient Stabilized (BiCGStab)"
    print *,        "-----------------------------------------------"
!   print *,"rhs=",rhs
    do  i = 1, n
       r(i) = rhs(i)     ! set r = rhs, where Ax = rhs (Ax = b)
    end do
    alpha   = 1.0_fp_kind
    beta    = 0.0_fp_kind
    rho     = 0.0_fp_kind
    k = 0
! Clock
      status = init_clock
      status = start_clock
! creater handle for cuBLAS
    status = cublasCreate(cublas_H)
! creater handle for CUSPARSE
    status = cusparseCreate(cusparse_H)
! matrix A descriptor(descrA) and properties
    status = cusparseCreateMatDescr(descrA)
    status = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)  ! must be GENERAL for using csrilu0
    status = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE)  ! base = 1
! initiialize the analysis info for A matrix (infoA = sparsity pattern of A)
    status = cusparseCreateSolveAnalysisInfo(infoA)
! initiialize the L U preconditioner Info
    status = cusparseCreateSolveAnalysisInfo(infoL)
    status = cusparseCreateSolveAnalysisInfo(infoU)
! Setup for Lower Matrix
    status = cusparseCreateMatDescr (descrL)
    status = cusparseSetMatType     (descrL, CUSPARSE_MATRIX_TYPE_GENERAL)
    status = cusparseSetMatIndexBase(descrL, CUSPARSE_INDEX_BASE_ONE)
    status = cusparseSetMatFillMode (descrL, CUSPARSE_FILL_MODE_LOWER)
    status = cusparseSetMatDiagType (descrL, CUSPARSE_DIAG_TYPE_UNIT)
! Setup for Upper Matrix
    status = cusparseCreateMatDescr (descrU)
    status = cusparseSetMatType     (descrU, CUSPARSE_MATRIX_TYPE_GENERAL)
    status = cusparseSetMatIndexBase(descrU, CUSPARSE_INDEX_BASE_ONE)
    status = cusparseSetMatFillMode (descrU, CUSPARSE_FILL_MODE_UPPER)
    status = cusparseSetMatDiagType (descrU, CUSPARSE_DIAG_TYPE_NON_UNIT)
! copy original vals to new ILU0-vals
    valsILU0 = csrValA
!print *,"X=",X
!print *,"r_init=",r
!Print CSR Matrix data
!   print *,"m,n,nz",m,n,nz
!   print *, "csrValA", csrValA
!   print *, "csrRowPtrA",csrRowPtrA
!   print *, "csrColIndA", csrColIndA
!   print *,"csrValA=",(csrValA(i),i=1,100)
!$acc data copy(r, X)  copyin(valsILU0, csrValA, csrRowPtrA, csrColIndA ) &
!$acc create(p, q, Lov, z, r_tlda ) create(t, s)
   ! === Initial guess x0 and residual r = b - Ax0 where r = b
   !$acc host_data use_device( csrValA, csrRowPtrA, csrColIndA, X, r, r_tlda, p )
     cusparse_status = cusparseDcsrmv(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
                       n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, X, zero, q) ! q = A x0, q is temporary
     status = cublasDaxpy(cublas_H, n, onem1, q, 1, r, 1)                                  ! r = r - A x0
     ! set as r_tlda = r and p = r
     status = cublasDcopy(cublas_H, n, r, 1, r_tlda, 1)                                    ! r_tlda = r
     status = cublasDcopy(cublas_H, n, r, 1, p, 1)                                         ! p      = r
     status = cublasDnrm2(cublas_H, n, r, 1, nrmr0)                                        ! nrmr0 = ||r||2
   !$acc end host_data
    print *,"nrmr0=",nrmr0, "tol*nrmr0=",tol*nrmr0
   !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, valsILU0)
    ! the analysis of the sparsity pattern for the Non-Transpose case (triangular linear system) infoL/descrL
     cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
                      n, nz, descrL, csrValA, csrRowPtrA, csrColIndA, infoL)
!print*,"cusparseDcsrsv_analysis1=", cusparse_status
    ! the analysis phase of the solution of a sparse linear system for infoU/descrU
     cusparse_status = cusparseDcsrsv_analysis(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, &
                      n, nz, descrU, csrValA, csrRowPtrA, csrColIndA, infoU)
    !print*,"cusparseDcsrsv_analysis2=",cusparse_status
    ! Incomplete-LU factorization with 0 level fill-in and no pivoting for A matrix
    ! output : valsILU0
     cusparse_status = cusparseDcsrilu0(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, descrA, valsILU0, &
                      csrRowPtrA, csrColIndA, infoL);
   !$acc end host_data
!$acc wait
    ! check for the solution using valsILU0 vector from Incomplete-LU factorization
    !$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, r, Lov)
     cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, &
                                            valsILU0, csrRowPtrA, csrColIndA, infoL, r, Lov)
     status = cublasDnrm2(cublas_H, n, Lov, 1, nrmCK1)
     cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, &
                                            valsILU0, csrRowPtrA, csrColIndA, infoU, Lov, z)  ! solve z
     status = cublasDnrm2(cublas_H, n, z, 1, nrmCK2)
    !$acc end host_data
     print*,"nrmCK1, nrmCK2 =", nrmCK1, nrmCK2
     if (nrmCK1 == 0.0  .or. nrmCK2 == 0.0 .or. nrmCK1 /= nrmCK1 .or. nrmCK2 /= nrmCK2) then
       print *, "ERROR : This matrix is not be able to solve using valsILU0 vector from Incomplete-LU factorization."
       stop
     end if
        status = end_clock
        time_tag = "The analysis and Incomplete-LU factorization phase of ILU(0) :"
        status = print_time(time_tag)
        status = start_clock
    do while (  k <= max_iter )
      k = k + 1
      rho_p = rho
   !$acc host_data use_device(r_tlda, r)
      status = cublasDdot(cublas_H, n, r_tlda, 1, r, 1, rho)               ! rho = (r_tlda, r)
   !$acc end host_data
      if ( abs(rho) < eps .or. rho == 0.0 ) then
         print *,"ERROR! rho is zero", rho, "BiCG calculation condition does not hold. STOP!"
         exit
       end if
      if ( k > 1 ) then
      if ( abs(omega) < eps .or. rho == 0.0 ) then
        print *,"ERROR! omega is zero, BiCG calculation condition does not hold. STOP!"
        exit
      end if
        beta = (rho/rho_p) * (alpha/omega)
        omegam1 = -omega
      !$acc host_data use_device(q, p, r)
        cublas_status = cublasDaxpy(cublas_H, n, omegam1, q, 1, p, 1)   ! p = p -omega * q
        cublas_status = cublasDscal(cublas_H, n, beta, p, 1)            ! p = beta * p
        cublas_status = cublasDaxpy(cublas_H, n, one, r, 1, p, 1)       ! p = p + r
      !$acc end host_data
      end if
      ! Solve using valsILU0 vector from Incomplete-LU factorization,  Solve M z = p
      !$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, p, Lov, z)
      ! Forwadaard Solve, L part for infoA(the sparsity pattern of descL)
        cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, &
                                              valsILU0, csrRowPtrA, csrColIndA, infoL, p, Lov)
      ! Back Substitution (U part for infoU = descU)
        cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, &
                                              valsILU0, csrRowPtrA, csrColIndA, infoU, Lov, z)  ! solve z
      !$acc end host_data
      ! Compute q = A z
      !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, r_tlda, z, q)
        cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &
                          n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, z, zero, q)   ! q = A z
        status = cublasDdot(cublas_H, n, r_tlda, 1, q, 1, temp1)        ! temp1 = (r_tlda, q)
      !$acc end host_data
        alpha   = rho / temp1                                           ! alpha = rho / (r_tlda, q)
        alpham1 = -alpha
      !$acc host_data use_device(q, z, X, r)
        cublas_status = cublasDaxpy(cublas_H, n, alpham1, q, 1, r, 1)   ! r = r - alpha * q
        cublas_status = cublasDaxpy(cublas_H, n, alpha,   z, 1, X, 1)   ! X = X + alpha * z
        cublas_status = cublasDnrm2(cublas_H, n, r, 1, nrmr)            ! nrmr = || r ||2
      !$acc end host_data
      ! check for convergence
        if ( nrmr < tol*nrmr0 ) exit
      ! solve M s = r
      !$acc host_data use_device(valsILU0, csrRowPtrA, csrColIndA, r, Lov, s, t)
      ! Forward Solve, L part for infoA(the sparsity pattern of descL)
        cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrL, &
                                              valsILU0, csrRowPtrA, csrColIndA, infoL, r, Lov)
      ! Back Substitution (U part for infoU = descU)
        cusparse_status = cusparseDcsrsv_solve(cusparse_H, CUSPARSE_OPERATION_NON_TRANSPOSE, n, one, descrU, &
                                              valsILU0, csrRowPtrA, csrColIndA, infoU, Lov, s)  ! solve s
      ! Compute t = A s
        cusparse_status = cusparseDcsrmv(cusparse_H,CUSPARSE_OPERATION_NON_TRANSPOSE, &         ! t = A s
                          n, n, nz, one, descrA, csrValA, csrRowPtrA, csrColIndA, s, zero, t)
        status = cublasDdot(cublas_H, n, t, 1, r, 1, temp1)             ! temp1 = (t, r)
        status = cublasDdot(cublas_H, n, t, 1, t, 1, temp2)             ! temp2 = (t, t)
      !$acc end host_data
        omega = temp1/temp2
        omegam1 = -omega
      !$acc host_data use_device(s, t, X, r)
        status = cublasDaxpy(cublas_H, n, omega, s, 1, X, 1)            ! X = X + omega * s
        status = cublasDaxpy(cublas_H, n, omegam1, t, 1, r, 1)          ! r = r - omega * t
        status = cublasDnrm2(cublas_H, n, r, 1, nrmr)                   ! nrmr = || r ||2
      !$acc end host_data
      ! check for convergence
        if ( nrmr < tol*nrmr0 ) exit
        if (idbg == 1 ) print *, "iteration=", k,  " residual =", nrmr
    end do
!$acc end data
    print *, "iteration=", k,  " residual =", nrmr
    print *, "nrmr/nrmr0 =", nrmr/nrmr0
    print *
    if ( nrmr < tol*nrmr0 ) then
      print *, "Bi-Conjugate Gradient Stabilized (BiCGStab) is PASSED."
    else
      print *, "Bi-Conjugate Gradient Stabilized (BiCGStab) is FAILED."
    end if
        status = end_clock
        time_tag = "Elapased time for BiGGstab loop with ILU(0) :"
        status = print_time(time_tag)
! check result
    error = 0.0
    do i = 1, n
      rsum = 0.0
      do  j =  csrRowPtrA(i), csrRowPtrA(i+1)-1
         rsum = rsum + csrValA(j) * X(csrColIndA(j))
      end do
      diff = abs(rsum - rhs(i))
      if (diff > error) error = diff
    end do
    print '(1x,a,d20.13/)',"error =",error
! output result
    if (idbg >= 1) then
      nprint = n
      if ( n > 10 ) nprint = 10
      do i = 1, nprint
         print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
      end do
      print *,""
      do i = n-nprint, n
         print '(1x, a,i7,a,e18.12)', "x(", i, ")=", X(i)
      end do
    end if
    status = cublasDestroy(cublas_H)
    status = cusparseDestroy(cusparse_H)
    status = cusparseDestroyMatDescr(descrA)
    status = cusparseDestroyMatDescr(descrL)
    status = cusparseDestroyMatDescr(descrU)
    status = cusparseDestroySolveAnalysisInfo(infoA)
    status = cusparseDestroySolveAnalysisInfo(infoL)
    status = cusparseDestroySolveAnalysisInfo(infoU)
    deallocate( q, r, r_tlda, p )
    deallocate( Lov, z, valsILU0, t, s )
    end subroutine BiCGStab
End module bicg_routine
	tar file 内に提供した Makefile の記述に沿って、コンパイル&リンクを行う。なお、この Makefile では、cuda9.0 ツールキットのライブラリを使用するように設定しているが、cuda8.0 を使用しても良い。cusparse と cublas ライブラリは、PGIコンパイラにバンドルしているルーチンを使用する形になっている。
$ make
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse,cublas -Mlarge_arrays  precision.f90
pgf90 -c -Minfo -Kieee -O  -Mfixed mmio.f
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse,cublas -Mlarge_arrays  measure_time.f90
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse,cublas -Mlarge_arrays  mm_convert_csr.f90
matrix_printout:
     68, Loop not vectorized/parallelized: contains call
     77, Loop not vectorized/parallelized: contains call
set_matrix:
    138, Unrolled inner loop 8 times
         Generated 2 prefetch instructions for the loop
    148, Loop not vectorized/parallelized: contains call
    169, Memory copy idiom, loop replaced by call to __c_mcopy4
    170, Memory copy idiom, loop replaced by call to __c_mcopy4
    173, Memory copy idiom, loop replaced by call to __c_mcopy8
    177, Memory copy idiom, loop replaced by call to __c_mcopy4
    181, Memory copy idiom, loop replaced by call to __c_mcopy16
coo2csr_convert:
    356, Memory zero idiom, loop replaced by call to __c_mzero8
         Memory zero idiom, loop replaced by call to __c_mzero4
    359, Generated an alternate version of the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
    369, Generating copy(csrvala(:),indx(:),rval(:),jndx(:),csrrowptra(:))
    387, Generating enter data copyin(p(:),pbuffer(:))
    406, Generating exit data delete(p(:),pbuffer(:))
    410, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        410, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    426, Memory copy idiom, loop replaced by call to __c_mcopy4
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse,cublas -Mlarge_arrays  bicgstab.f90
bicgstab:
     93, Memory copy idiom, loop replaced by call to __c_mcopy8
    138, Memory copy idiom, loop replaced by call to __c_mcopy8
    149, Generating create(q(:))
         Generating copyin(valsilu0(:))
         Generating create(t(:))
         Generating copyin(csrrowptra(:),csrcolinda(:))
         Generating create(lov(:),p(:))
         Generating copyin(csrvala(:))
         Generating create(z(:))
         Generating copy(x(:),r(:))
         Generating create(s(:),r_tlda(:))
    207, Loop not vectorized/parallelized: potential early exits
    318, Generated an alternate version of the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         FMA (fused multiply-add) instruction(s) generated
    330, Loop not vectorized/parallelized: contains call
    334, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse,cublas -Mlarge_arrays  main.f90
main:
     36, Memory set idiom, loop replaced by call to __c_mset8
     68, Memory zero idiom, loop replaced by call to __c_mzero8
pgf90 -o  a.out precision.o mmio.o measure_time.o mm_convert_csr.o bicgstab.o main.o -Mcuda -acc -ta=tesla,cuda8.0,cc35,cc60
サンプル係数行列(mtxデータ)は、The SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection) から入手する。NVIDIA の Incomplete-LU and Cholesky Preconditioned Iterative Methods Using cuSPARSE and cuBLAS というホワイトペーパーにテストで使用されている係数行列の名前が以下のようにリストアップされている。この名前を The SuiteSparse Matrix Collection 内の検索サイトで探すとテストデータが表示され、「Matrxi Market」形式ファイルをダウンロードすれば、mtx ファイルが入手できる。なお、係数行列の素性によっては解けない場合もある。
| # | Matrix Name | m,n | nnz | s.p.d. | Application | 
| 1. | offshore | 259,789 | 4,242,673 | yes | Geophysics:https://sparse.tamu.edu/Um/offshore | 
| 2. | af_shell3 | 504,855 | 17,562,051 | yes | Mechanics:https://sparse.tamu.edu/Schenk_AFE/af_shell3 | 
| 3. | parabolic_fem | 525,825 | 3,674,625 | yes | General:https://sparse.tamu.edu/Wissgott/parabolic_fem | 
| 4. | apache2 | 715,176 | 4,817,870 | yes | Mechanics:https://sparse.tamu.edu/GHS_psdef/apache2 | 
| 5. | ecology2 | 999,999 | 4,995,991 | yes | Biology:https://sparse.tamu.edu/McRae/ecology2 | 
| 6. | thermal2 | 1,228,045 | 8,580,313 | yes | Thermal Simulation:https://sparse.tamu.edu/Schmid/thermal2 | 
| 7. | G3_circuit | 1,585,478 | 7,660,826 | yes | Circuit Simulation:https://sparse.tamu.edu/AMD/G3_circuit | 
| 8. | FEM_3D_thermal2 | 147,900 | 3,489,300 | no | Mechanics:https://sparse.tamu.edu/Botonakis/FEM_3D_thermal2 | 
| 10. | ASIC_320ks | 321,671 | 1,316,08511 | no | Circuit Simulation:https://sparse.tamu.edu/Sandia/ASIC_320ks | 
| 11. | cage13 | 445,315 | 7,479,343 | no | Biology:https://sparse.tamu.edu/vanHeukelum/cage13 | 
| 12. | atmosmodd | 1,270,432 | 8,814,880 | no | Atmospheric Model:https://sparse.tamu.edu/Bourchtein/atmosmodd | 
make run コマンドで実行が開始される。入力 mtx ファイルが求められるので、入手した mtx ファイル名を指定する。次に rhs ベクトルファイルが存在する場合は、そのファイル名を指定する。存在しない場合は、enter を押下し、実行に進む。もし、反復法の計算が不能となった場合は、cuSOLVER SP ルーチンを利用した直接解法で計算してみることもお勧めする。
[kato@photon32]$ make run
===== Program run =====
a.out
File name for Matrix Market?
lap2D_5pt_n100.mtx
 Matrix Market file name = lap2D_5pt_n100.mtx
 Properties of Matrix_Market matrix
 ====================================
 representation = coordinate
 type           = real
 matrix type    = symmetric
 row (m)        =   10000
 cols(n)        =   10000
 Non-zero elm.  =   49600
 extendSymMatrix=       1
 ==== Triangular Symmetric matrix has been transformed to full matrix form ====
 i,indx,jndx            1            1            1    4.000000000000000
 i,indx,jndx            2            2            1   -1.000000000000000
 i,indx,jndx            3            1            2   -1.000000000000000
 i,indx,jndx            4          101            1   -1.000000000000000
 i,indx,jndx            5            1          101   -1.000000000000000
 i,indx,jndx            6            2            2    4.000000000000000
 i,indx,jndx            7            3            2   -1.000000000000000
 i,indx,jndx            8            2            3   -1.000000000000000
 i,indx,jndx            9          102            2   -1.000000000000000
 i,indx,jndx           10            2          102   -1.000000000000000
 i,indx,jndx        49590         9996         9997   -1.000000000000000
 i,indx,jndx        49591         9997         9997    4.000000000000000
 i,indx,jndx        49592         9998         9997   -1.000000000000000
 i,indx,jndx        49593         9997         9998   -1.000000000000000
 i,indx,jndx        49594         9998         9998    4.000000000000000
 i,indx,jndx        49595         9999         9998   -1.000000000000000
 i,indx,jndx        49596         9998         9999   -1.000000000000000
 i,indx,jndx        49597         9999         9999    4.000000000000000
 i,indx,jndx        49598        10000         9999   -1.000000000000000
 i,indx,jndx        49599         9999        10000   -1.000000000000000
 i,indx,jndx        49600        10000        10000    4.000000000000000
 File name for LHS vector? if no available, press ENTER and set as RHS=1.0.
 Xcoosort_bufferSizeExt status =            0
 working space (bytes) =                   794880
 allocation status code for pBuffer =            0
 allocation status code for P =            0
 CreateIdentityPermutation status =            0
 XcoosortByRow status =            0
 i,indx,jndx            1            1            1    4.000000000000000
 i,indx,jndx            2            1            2   -1.000000000000000
 i,indx,jndx            3            1          101   -1.000000000000000
 i,indx,jndx            4            2            1   -1.000000000000000
 i,indx,jndx            5            2            2    4.000000000000000
 i,indx,jndx            6            2            3   -1.000000000000000
 i,indx,jndx            7            2          102   -1.000000000000000
 i,indx,jndx            8            3            2   -1.000000000000000
 i,indx,jndx            9            3            3    4.000000000000000
 i,indx,jndx           10            3            4   -1.000000000000000
 i,indx,jndx        49590         9998         9898   -1.000000000000000
 i,indx,jndx        49591         9998         9997   -1.000000000000000
 i,indx,jndx        49592         9998         9998    4.000000000000000
 i,indx,jndx        49593         9998         9999   -1.000000000000000
 i,indx,jndx        49594         9999         9899   -1.000000000000000
 i,indx,jndx        49595         9999         9998   -1.000000000000000
 i,indx,jndx        49596         9999         9999    4.000000000000000
 i,indx,jndx        49597         9999        10000   -1.000000000000000
 i,indx,jndx        49598        10000         9900   -1.000000000000000
 i,indx,jndx        49599        10000         9999   -1.000000000000000
 i,indx,jndx        49600        10000        10000    4.000000000000000
 == CSR data check : Start from 1  to 10 elements ==
 csrRowPtrA=
          1          4          8         12         16         20         24
         28         32         36
 csrColIndA=
          1          2        101          1          2          3        102
          2          3          4
 csrValA  =
 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 -.1000D+01
 -.1000D+01 0.4000D+01 -.1000D+01
 == Start from endpoint-10  to endpoint ==
 csrRowPtrA=
      49562      49566      49570      49574      49578      49582      49586
      49590      49594      49598      49601
 csrColIndA=
       9898       9997       9998       9999       9899       9998       9999
      10000       9900       9999      10000
 csrValA=
 -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01
 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01
  tol = .1000000000D-12  max_iteration =   10000
 -----------------------------------------------
  Bi-Conjugate Gradient Stabilized (BiCGStab)
 -----------------------------------------------
 nrmr0=    100.0000000000000      tol*nrmr0=   1.0000000000000001E-011
 nrmCK1, nrmCK2 =    238.2158421093953         166.7314075873577
 The analysis and Incomplete-LU factorization phase of ILU(0) :                         0.212 (sec)
 iteration=           81  residual =   6.6059148572793421E-012
 nrmr/nrmr0 =   6.6059148572793405E-014
 Bi-Conjugate Gradient Stabilized (BiCGStab) is PASSED.
 Elapased time for BiGGstab loop with ILU(0) :                                          0.961 (sec)
 error = 0.6593836587854D-11
 x(      1)=0.275607474398E+01
 x(      2)=0.501214948795E+01
 x(      3)=0.696587678467E+01
 x(      4)=0.871021429238E+01
 x(      5)=0.102960547760E+02
 x(      6)=0.117547659017E+02
 x(      7)=0.131074411873E+02
 x(      8)=0.143691932892E+02
 x(      9)=0.155513693040E+02
 x(     10)=0.166627976257E+02
 x(   9990)=0.177105403722E+02
 x(   9991)=0.166627976257E+02
 x(   9992)=0.155513693040E+02
 x(   9993)=0.143691932892E+02
 x(   9994)=0.131074411873E+02
 x(   9995)=0.117547659017E+02
 x(   9996)=0.102960547760E+02
 x(   9997)=0.871021429238E+01
 x(   9998)=0.696587678467E+01
 x(   9999)=0.501214948795E+01
 x(  10000)=0.275607474398E+01
	上記で示したプログラム一式は、Linux 上で動作させることを前提としたものである。同じプログラムを Windows 上で動作させるために Makefile を変更したものとソース・ファイル一式を以下に提供する。以下のファイルをダウンロードしていただきたい。
プログラム tar ファイル:BiCGStab.zip
以下の例は、PGI 17.10 バージョンの PGI コマンドプロンプト(端末)を使用しての結果である。使用しているバージョンを確かめたい場合は、以下のように -V コマンド・オプションを指定する。
$ pgfortran -V pgfortran 17.10-0 64-bit target on x86-64 Windows -tp sandybridge PGI Compilers and Tools Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
また、使用した GPU は以下のとおりである。
$ pgaccelinfo CUDA Driver Version: 9000 Device Number: 0 Device Name: GeForce GTX 780 Device Revision Number: 3.5 Global Memory Size: 3221225472 Number of Multiprocessors: 12 Number of SP Cores: 2304 Number of DP Cores: 768 Concurrent Copy and Execution: Yes Total Constant Memory: 65536 Total Shared Memory per Block: 49152 Registers per Block: 65536 Warp Size: 32 Maximum Threads per Block: 1024 Maximum Block Dimensions: 1024, 1024, 64 Maximum Grid Dimensions: 2147483647 x 65535 x 65535 Maximum Memory Pitch: 2147483647B Texture Alignment: 512B Clock Rate: 941 MHz Execution Timeout: Yes Integrated Device: No Can Map Host Memory: Yes Compute Mode: default Concurrent Kernels: Yes ECC Enabled: No Memory Clock Rate: 3004 MHz Memory Bus Width: 384 bits L2 Cache Size: 1572864 bytes Max Threads Per SMP: 2048 Async Engines: 1 Unified Addressing: Yes Managed Memory: Yes PGI Compiler Option: -ta=tesla:cc35
PGI$ make
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60  precision.f90
pgf90 -c -Minfo -Kieee -O  -Mfixed mmio.f
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60  measure_time.f90
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60  mm_convert_csr.f90
matrix_printout:
     68, Loop not vectorized/parallelized: contains call
     77, Loop not vectorized/parallelized: contains call
set_matrix:
    138, Unrolled inner loop 8 times
         Generated 2 prefetch instructions for the loop
    148, Loop not vectorized/parallelized: contains call
    169, Memory copy idiom, loop replaced by call to __c_mcopy4
    170, Memory copy idiom, loop replaced by call to __c_mcopy4
    173, Memory copy idiom, loop replaced by call to __c_mcopy8
    177, Memory copy idiom, loop replaced by call to __c_mcopy4
    181, Memory copy idiom, loop replaced by call to __c_mcopy16
coo2csr_convert:
    356, Memory zero idiom, loop replaced by call to __c_mzero8
         Memory zero idiom, loop replaced by call to __c_mzero4
    359, Generated 3 alternate versions of the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
         Generated vector simd code for the loop
         Generated 2 prefetch instructions for the loop
    369, Generating copy(csrvala(:),indx(:),jndx(:),rval(:),csrrowptra(:))
    387, Generating enter data copyin(p(:),pbuffer(:))
    406, Generating exit data delete(pbuffer(:),p(:))
    410, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
        410, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
    426, Memory copy idiom, loop replaced by call to __c_mcopy4
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60  bicgstab.f90
bicgstab:
     93, Memory copy idiom, loop replaced by call to __c_mcopy8
    138, Memory copy idiom, loop replaced by call to __c_mcopy8
    149, Generating create(q(:))
         Generating copyin(valsilu0(:))
         Generating create(t(:))
         Generating copyin(csrrowptra(:),csrcolinda(:))
         Generating create(lov(:),p(:))
         Generating copyin(csrvala(:))
         Generating create(z(:))
         Generating copy(x(:),r(:))
         Generating create(s(:),r_tlda(:))
    207, Loop not vectorized/parallelized: potential early exits
    318, Generated 2 alternate versions of the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
         Generated vector simd code for the loop containing reductions
         Generated a prefetch instruction for the loop
    330, Loop not vectorized/parallelized: contains call
    334, Loop not vectorized/parallelized: contains call
pgf90 -c -Minfo -Kieee -O2 -ta=tesla,cuda8.0,cc35,cc60  main.f90
main:
     36, Memory set idiom, loop replaced by call to __c_mset8
     68, Memory zero idiom, loop replaced by call to __c_mzero8
pgf90 -o  a.out precision.obj mmio.obj measure_time.obj mm_convert_csr.obj bicgstab.obj main.obj -Mcuda -acc -ta=tesla,cuda9.0,cc35,cc60  -Mcudalib=cusparse,cublas
				PGI$ make run
===== Program run =====
a.out
 File name for Matrix Market?
lap2D_5pt_n100.mtx
 Matrix Market file name = lap2D_5pt_n100.mtx
 Properties of Matrix_Market matrix
 ====================================
 representation = coordinate
 type           = real
 matrix type    = symmetric
 row (m)        =   10000
 cols(n)        =   10000
 Non-zero elm.  =   49600
 extendSymMatrix=       1
 ==== Triangular Symmetric matrix has been transformed to full matrix form ====
 i,indx,jndx            1            1            1    4.000000000000000
 i,indx,jndx            2            2            1   -1.000000000000000
 i,indx,jndx            3            1            2   -1.000000000000000
 i,indx,jndx            4          101            1   -1.000000000000000
 i,indx,jndx            5            1          101   -1.000000000000000
 i,indx,jndx            6            2            2    4.000000000000000
 i,indx,jndx            7            3            2   -1.000000000000000
 i,indx,jndx            8            2            3   -1.000000000000000
 i,indx,jndx            9          102            2   -1.000000000000000
 i,indx,jndx           10            2          102   -1.000000000000000
 i,indx,jndx        49590         9996         9997   -1.000000000000000
 i,indx,jndx        49591         9997         9997    4.000000000000000
 i,indx,jndx        49592         9998         9997   -1.000000000000000
 i,indx,jndx        49593         9997         9998   -1.000000000000000
 i,indx,jndx        49594         9998         9998    4.000000000000000
 i,indx,jndx        49595         9999         9998   -1.000000000000000
 i,indx,jndx        49596         9998         9999   -1.000000000000000
 i,indx,jndx        49597         9999         9999    4.000000000000000
 i,indx,jndx        49598        10000         9999   -1.000000000000000
 i,indx,jndx        49599         9999        10000   -1.000000000000000
 i,indx,jndx        49600        10000        10000    4.000000000000000
 File name for LHS vector? if no available, press ENTER and set as RHS=1.0.
 Xcoosort_bufferSizeExt status =            0
 working space (bytes) =                   794880
 allocation status code for pBuffer =            0
 allocation status code for P =            0
 CreateIdentityPermutation status =            0
 XcoosortByRow status =            0
 i,indx,jndx            1            1            1    4.000000000000000
 i,indx,jndx            2            1            2   -1.000000000000000
 i,indx,jndx            3            1          101   -1.000000000000000
 i,indx,jndx            4            2            1   -1.000000000000000
 i,indx,jndx            5            2            2    4.000000000000000
 i,indx,jndx            6            2            3   -1.000000000000000
 i,indx,jndx            7            2          102   -1.000000000000000
 i,indx,jndx            8            3            2   -1.000000000000000
 i,indx,jndx            9            3            3    4.000000000000000
 i,indx,jndx           10            3            4   -1.000000000000000
 i,indx,jndx        49590         9998         9898   -1.000000000000000
 i,indx,jndx        49591         9998         9997   -1.000000000000000
 i,indx,jndx        49592         9998         9998    4.000000000000000
 i,indx,jndx        49593         9998         9999   -1.000000000000000
 i,indx,jndx        49594         9999         9899   -1.000000000000000
 i,indx,jndx        49595         9999         9998   -1.000000000000000
 i,indx,jndx        49596         9999         9999    4.000000000000000
 i,indx,jndx        49597         9999        10000   -1.000000000000000
 i,indx,jndx        49598        10000         9900   -1.000000000000000
 i,indx,jndx        49599        10000         9999   -1.000000000000000
 i,indx,jndx        49600        10000        10000    4.000000000000000
 == CSR data check : Start from 1  to 10 elements ==
 csrRowPtrA=
          1          4          8         12         16         20         24
         28         32         36
 csrColIndA=
          1          2        101          1          2          3        102
          2          3          4
 csrValA  =
 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 -.1000D+01
 -.1000D+01 0.4000D+01 -.1000D+01
 == Start from endpoint-10  to endpoint ==
 csrRowPtrA=
      49562      49566      49570      49574      49578      49582      49586
      49590      49594      49598      49601
 csrColIndA=
       9898       9997       9998       9999       9899       9998       9999
      10000       9900       9999      10000
 csrValA=
 -.1000D+01 -.1000D+01 0.4000D+01 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01
 -.1000D+01 -.1000D+01 -.1000D+01 0.4000D+01
  tol = .1000000000D-12  max_iteration =   10000
 -----------------------------------------------
  Bi-Conjugate Gradient Stabilized (BiCGStab)
 -----------------------------------------------
 nrmr0=    100.0000000000000      tol*nrmr0=   1.0000000000000001E-011
 nrmCK1, nrmCK2 =    238.2158421093953         166.7314075873577
 The analysis and Incomplete-LU factorization phase of ILU(0) :                         0.253 (sec)
 iteration=           81  residual =   6.6059148572793421E-012
 nrmr/nrmr0 =   6.6059148572793405E-014
 Bi-Conjugate Gradient Stabilized (BiCGStab) is PASSED.
 Elapased time for BiGGstab loop with ILU(0) :                                          0.593 (sec)
 error = 0.6593836587854D-11
 x(      1)=0.275607474398E+01
 x(      2)=0.501214948795E+01
 x(      3)=0.696587678467E+01
 x(      4)=0.871021429238E+01
 x(      5)=0.102960547760E+02
 x(      6)=0.117547659017E+02
 x(      7)=0.131074411873E+02
 x(      8)=0.143691932892E+02
 x(      9)=0.155513693040E+02
 x(     10)=0.166627976257E+02
 x(   9990)=0.177105403722E+02
 x(   9991)=0.166627976257E+02
 x(   9992)=0.155513693040E+02
 x(   9993)=0.143691932892E+02
 x(   9994)=0.131074411873E+02
 x(   9995)=0.117547659017E+02
 x(   9996)=0.102960547760E+02
 x(   9997)=0.871021429238E+01
 x(   9998)=0.696587678467E+01
 x(   9999)=0.501214948795E+01
 x(  10000)=0.275607474398E+01