OpenACC ディレクティブによるプログラミング

12 - 1 章 OpenACC 上での CUDA Library の利用

1. cuSOLVERライブラリの紹介 (Dense LAPACK, Sparse LAPACK)

 cuSOLVER ライブラリは、cuBLAS および cuSPARSE ライブラリに基づく高水準パッケージである。それぞれのライブラリは、単独で、または他のツールキットライブラリと連携して使用できる。cuSolver の目的は、密行列、スパース行列に対する最小二乗ソルバ、固有値ソルバのための一般的な行列分解や三角法のような便利な CUDA 版 LAPACK の機能を提供することである。さらに、cuSOLVER は、スパースパターンを共有した行列のシーケンスを解くのに役立つ新しいリファクタリングライブラリも提供する。

 PGI 17.10 現在、cuSOLVER ライブラリへの Fortran Interface (cusolverDn.mod) を提供しているライセンス製品は Linux 用のもののみとなります。Windows 版においてはこの時点では Fortran Interface Module が提供されていないため、ご注意下さい。なお、Fortran interface module を作成すると、Windows でも使用できます。

 cuSOLVER の最初の部分は cuSolverDN と呼ばれ、密行列を対象とした分解や、LU、QR、SVD、LDLT などの解法ルーチン、行列やベクトル置換などの有用なユーティリティを提供している。いわゆる LAPACK ルーチンと同様な機能を提供する。

 次に、cuSolverSP は、スパース QR 分解に基づく新しいスパースルーチンのセットを提供する。すべての行列分解が並列性に優れたスパース性のパターンを持つわけではないため、cuSolverSP ライブラリはシーケンシャルのような行列を扱う CPU パスも提供する。高い並列性を有するマトリックスの場合、GPU パスはより高いパフォーマンスを提供する。ライブラリは、C および C++ から呼び出されるように設計されているが、OpenACC + PGI Fortran Module を介して Fortran からも簡単に利用できる。

 三つ目は、cuSolverRF です。これは、係数のみが変更され、スパース性パターンは同じままである行列のシーケンスを解く際に、非常に優れたパフォーマンスを提供することができるスパース行列用 refactorization パッケージである。(このページでは説明していません)

cuSolverDN ライブラリ(Dense LAPACK)

 cuSolverDN ライブラリは、密行列の線形システムを解くように設計されている。

A x = b

 ここで、係数行列A∈Rnxn、右辺ベクトルb∈Rn、解ベクトルx∈Rn

 cuSolverDNライブラリは、非対称である可能性のある一般的な行列 A を処理するために、QR 分解と LU に部分ピボッティングを提供し、対称/エルミート行列に対してコレスキー分解ルーチンも提供されている。また、対称不定行列の場合、Bunch-Kaufman(LDL)分解を提供している。cuSolverDN ライブラリは、有用な二重対角化ルーチンと特異値分解(SVD)も提供している。

 cuSolverDN ライブラリは、LAPACK の計算集約的で一般的なルーチンを対象とし、LAPACK と互換性のある API を提供していますので、既存のLAPACKを使ったプログラムの移行が容易である。ユーザーは cuSolverDN を使用して、これらの時間の掛かるルーチンを加速し、既存のコードを大幅に変更することなく LAPACK 利用の互換性を保つことができる。

cuSolverSP ライブラリ(Sparse LAPACK)

 cuSolverSPライブラリは、主にスパースな線形システムと最小二乗問題を解くように設計されている。

A x = b

x = argmin || A * z - b ||

 ここで、係数行列A∈Rnxn、右辺ベクトルb∈Rn、解ベクトルx∈Rn。線形システムでは、m = n が必要である。

 コア・アルゴリズムは、スパース QR 分解に基づいている。行列 A は CSR 形式で入力します。行列 A が対称/エルミート行列である場合、ユーザーは完全な行列(full matrix) を提供する必要がある。つまり、欠落している下部または上部を埋める必要がある。もし、行列 A が対称正定値であり、ユーザーが Ax = b を解く必要があるだけであれば、コレスキー分解は機能するが、ユーザーは行列 A の下三角部分を提供することで代替できる。

 線形および最小二乗ソルバーの上に、cuSolverSP ライブラリは、 shift-inverse power method に基づく簡単な固有値ソルバーと、複素平面内のボックスに含まれる固有値の数を数える関数を提供する。

2. cuSolverDN: エルミート正定値行列のコレスキー分解 (LAPACK zpotrf相当)

 cuSOLVER ライブラリは cuBLAS および cuSPARSE ライブラリをベースにしており、密行列、スパース最小二乗ソルバおよび固有値ソルバの行列分解および三角解法ルーチンをサポートし、共分散行列を持つ行列のシーケンスを解くのに役立つリファクタリングライブラリを提供する。 PGI が提供するインターフェイスモジュール(Fortran MODULE)と、PGI 17.7 以降にバンドルされている cuSOLVER ライブラリの PGI コンパイル済みバージョンを使用して、CUDA Fortran および OpenACC Fortran から最適化された cuSolverDN type ルーチン群(cuSolverSP ならびに cuSolverRF type は今後のリリースで対応予定)を呼び出すことができるようになった。この同じ cuSolver ライブラリは、PGI コンパイラを使用して構築されており、また、PGI OpenMP ランタイムと互換性があるため、PGI OpenACC C/C++ からも呼び出すことができる。

 以下の例は、cuSolverDN: dense LAPACK Function を OpenACC Fortran/C++ ならびに CUDA Fortran から利用する場合の一例である。

 配列、変数のデバイスメモリのマネージメントをどちら側のプログラミングモデルで行うかによって、プログラミングの対応は異なるが、ここでは、 CUDA Fortran 側でデバイス配列の宣言を行った場合の方法、OpenACC 側で配列の宣言を行った場合を以下に例示する。

 なお、以下の例は、PGI 17.10 バージョンを使用しての結果である。使用しているバージョンを確かめたい場合は、以下のように -V コマンド・オプションを指定する。

$  pgfortran -V
pgfortran 17.10-0 64-bit target on x86-64 Linux -tp haswell
PGI Compilers and Tools
Copyright (c) 2017, NVIDIA CORPORATION.  All rights reserved.

Dense Linear Solver - cusolverDnZpotrf() with CUDA Fortran

 エルミートの正定値行列のコレスキー分解の一例を示す。LAPACK の ZPOTRF ルーチン相当である。A は n×n のエルミート行列であり、下位または上位の部分のみが意味を持つ。以下の例は CUDA Fortran から cuSOLVER を使用したプログラム例である。PGI Fortran の MODULE インタフェースとして、cublas_v2 と cusolverDn を USE 文で指定することが必要である。入力パラメータ CUBLAS_FILL_MODE_LOWER の場合、 Aの下三角部分のみが処理され、下三角のコレスキー係数 L で置き換えられます。(A = L * L H) 入力パラメータ Workspace で指し示す作業領域を提供する必要がある。入力パラメータ Lwork は作業領域のサイズであり、 potrf_bufferSize()によって返される。なお、密行列 A は、column-major order で入力する必要がある。

CUDA Fortran + cuSOLVER(cusolverDnZpotrf)

! Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
!
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
program main
  use cublas_v2
  use cusolverDn
  use cudafor
  implicit none
  integer, parameter :: n=3
  complex(8) :: a(n,n)
  complex(8), device :: a_d(n,n)
  complex(8), device, allocatable :: workspace_d(:)
  integer, device :: devInfo_d
  integer :: istat, Lwork
  type(cusolverDnHandle) :: h

  a(1,1) = 25.0;   a(1,2) = 15.0;   a(1,3) = -5.0
  a(2,1) = a(1,2); a(2,2) = 18.0;   a(2,3) = 0.0
  a(3,1) = a(1,3); a(3,2) = a(2,3); a(3,3) = 11.0

  a_d = a

! handle 作成
  istat = cusolverDnCreate(h)
  if (istat /= CUSOLVER_STATUS_SUCCESS) &
       write(*,*) 'handle creation failed'
! Working buffer サイズ計算
  istat = cusolverDnZpotrf_bufferSize(h, &
       CUBLAS_FILL_MODE_LOWER, n, a_d, n, Lwork)

  if (istat /= CUSOLVER_STATUS_SUCCESS) &
       write(*,*) 'cusolverDnZpotrf_buffersize failed'

  allocate(workspace_d(Lwork))
! コレスキー分解
  istat = cusolverDnZpotrf(h, CUBLAS_FILL_MODE_LOWER, &
       n, a_d, n, workspace_d, Lwork, devInfo_d)

  if (istat /= CUSOLVER_STATUS_SUCCESS) &
       write(*,*) 'cusolverDnZpotrf failed'

  istat = devInfo_d
  if (istat /= 0) write(*,*) 'Cholesky factorization failed'

  istat = cusolverDnDestroy(h)
  if (istat /= CUSOLVER_STATUS_SUCCESS) &
       write(*,*) 'handle destruction failed'

  a = a_d

  write(*,"(3(f0.0,SP,f0.0,'i',2x))") a(1,:)
  write(*,"(3(f0.0,SP,f0.0,'i',2x))") a(2,:)
  write(*,"(3(f0.0,SP,f0.0,'i',2x))") a(3,:)
end program main

 コンパイルオプションとして、-acc -Mcudalib=cusolver,cublas、-Mcuda オプションを付けてコンパイルリンクする必要がある。

コンパイル&実行

$ pgfortran -fast -Minfo -Mcuda=cc60,cc35,cuda8.0 -Mcudalib=cusolver,cublas  testDn.cuf
main:
     60, Loop unrolled 3 times (completely unrolled)
     61, Loop unrolled 3 times (completely unrolled)
     62, Loop unrolled 3 times (completely unrolled)
$ ./a.out
5.+0.i  +15.+0.i  -5.+0.i
3.+0.i  +3.+0.i  +0.+0.i
-1.+0.i  +1.+0.i  +3.+0.i

Dense Linear Solver - cusolverDnZpotrf() with OpenACC Fortran

 上記の CUDA Fortran プログラムを OpenACC を使って書き換える。PGI Fortran の MODULE インタフェースとして、cublas_v2 と cusolverDn を USE 文で指定することが必要である。デバイス側のメモリ割付を気にする必要なく、OpenACCのディレクティブのみで、CUDA 数学ライブラリが利用できる。既存の Fortran プログラムをポーティングすることも簡単にできるということである。

 OpenACC のディレクティブで留意すべきことは、CUDA Fortran or CUDA C で記述された(ライブラリ)ルーチンをコールする際に渡す実引数が「デバイスポインタ」である場合、これをコンパイルに伝えるために、host_data use_device() を使う。このディレクティブの使い方さえ誤りがなければ、既存のプログラムベースのものを OpenACC を使ってポーティングすることは大きな負担ではないはずだ。

OpenACC Fortran + cuSOLVER(cusolverDnZpotrf)

! Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved. 
!
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
program main
  use cublas_v2
  use cusolverDn
  use cudafor
  implicit none
  integer, parameter :: n=3
  complex(8) :: a(n,n)
!!! complex(8), device :: a_d(n,n)            ! デバイス変数の宣言必要なし
!!! complex(8), device, allocatable :: workspace_d(:)
!!! integer, device :: devInfo_d
  complex(8), allocatable :: workspace_d(:)
  integer :: devInfo_d
  integer :: istat, Lwork
  type(cusolverDnHandle) :: h

  a(1,1) = 25.0  ; a(1,2) = 15.0;   a(1,3) = -5.0
  a(2,1) = a(1,2); a(2,2) = 18.0;   a(2,3) = 0.0
  a(3,1) = a(1,3); a(3,2) = a(2,3); a(3,3) = 11.0

!!! a_d = a

  istat = cusolverDnCreate(h)
  if (istat /= CUSOLVER_STATUS_SUCCESS) &
       write(*,*) 'handle creation failed'

!$acc data copy(a) create(workspace_d)

!$acc host_data use_device(a)
  istat = cusolverDnZpotrf_bufferSize(h, &
       CUBLAS_FILL_MODE_LOWER, n, a, n, Lwork)
!$acc end host_data

  if (istat /= CUSOLVER_STATUS_SUCCESS) &
       write(*,*) 'cusolverDnZpotrf_buffersize failed'
  print *, "working space (bytes) =", Lwork
  allocate(workspace_d(Lwork))

!$acc enter data copyin(workspace_d) copyin(devInfo_d)

!$acc host_data use_device(a, workspace_d, devInfo_d)
  istat = cusolverDnZpotrf(h, CUBLAS_FILL_MODE_LOWER, &
       n, a, n, workspace_d, Lwork, devInfo_d)
!$acc end host_data

  if (istat /= CUSOLVER_STATUS_SUCCESS) &
       write(*,*) 'cusolverDnZpotrf failed'

!$acc exit data copyout(devInfo_d)
!$acc exit data delete(workspace_d)
!$acc end data

  istat = devInfo_d
  if (istat /= 0) write(*,*) 'Cholesky factorization failed'


  istat = cusolverDnDestroy(h)
  if (istat /= CUSOLVER_STATUS_SUCCESS) &
       write(*,*) 'handle destruction failed'

!!! a = a_d

  write(*,"(3(f0.0,SP,f0.0,'i',2x))") a(1,:)
  write(*,"(3(f0.0,SP,f0.0,'i',2x))") a(2,:)
  write(*,"(3(f0.0,SP,f0.0,'i',2x))") a(3,:)
end program main

 コンパイルオプションとして、-acc、-Mcudalib=cusolver,cublas、-Mcuda オプションを付けてコンパイルリンクする必要がある。OpenACC の -acc オプションの他に、-Mcuda オプションは、CUDA関係のシステムライブラリ・リストをリンカーに伝えるために必要である。

コンパイル&実行

$ pgf90 -fast -Minfo -acc -ta=tesla,cc60,cc35,cuda8.0 -Mcudalib=cusolver,cublas openacc.f90 -Mcuda
main:
     48, Generating copy(a(:,:))
         Generating create(workspace_d(:))
     60, Generating enter data copyin(devinfo_d,workspace_d(:))
     70, Generating exit data copyout(devinfo_d)
     71, Generating exit data delete(workspace_d(:))
     84, Loop unrolled 3 times (completely unrolled)
     85, Loop unrolled 3 times (completely unrolled)
     86, Loop unrolled 3 times (completely unrolled)
$ ./a.out
 working space (bytes) =            1
5.+0.i  +15.+0.i  -5.+0.i
3.+0.i  +3.+0.i  +0.+0.i
-1.+0.i  +1.+0.i  +3.+0.i

Dense Linear Solver - cusolverDnZpotrf with OpenACC C++

上記 Fortran プログラムを NVIDIA nvcc コンパイラではなく、PGI C++ OpenACC を使用して書き直す。C/C++ プログラムにおいても、OpenACC と NVIDIA CUDA ライブラリを使用するプログラムを作成できる。なお、以下の例は NVIDIA cuSOLVE ライブラリを C++ プログラムから参照する際に、複素数型は、CUDA 複素数型(例えば、cuDoubleComplex等)を使用して宣言する必要がある。C++の std:complex テンプレートのcomplex で宣言を行うと cuSOLVE ルーチンの引数の型マッチングができなかったので、CUDA 複素数型を使うのが無難である。また、cuSOLVER の「密行列」の数値の配置は column-major order(Fortranタイプ)である。

OpenACC + cusolverDnDsyevd (DnZpotrf.cpp)

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.

/*
 * How to compile (assume cuda is installed at /usr/local/cuda-9.0/)
 *   pgc++ -I/usr/local/cuda-9.0/include DnZpotrf.cpp -acc -ta=tesla,cc60,cc35 -Mcudalib=cusolver -Minfo=accel -O2 
 *  
 *
 */

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cusolverDn.h>

void printD_CMatrix(int m, int n, const cuDoubleComplex *A, int lda, const char* name)
{
    for(int row = 0 ; row < m ; row++){
        for(int col = 0 ; col < n ; col++){
            cuDoubleComplex Areg = A[row + col*lda];
            printf("%s(%d,%d) = %f  %f\n", name, row+1, col+1, cuCreal(Areg), cuCimag(Areg));
        }
    }
}

int main(int argc, char*argv[])
{
    cusolverDnHandle_t cusolverH = NULL;
    cusolverStatus_t cusolver_status = CUSOLVER_STATUS_SUCCESS;
    const int m = 3;
    const int lda = m;

/*       | 25 15 -5 |
 *   A = | 15 18  0 |
 *       | -5  0 11 |
 *
 */
    cuDoubleComplex *A;  // input

    int *devInfo = NULL;
    cuDoubleComplex *work = NULL;
    int  lwork = 0;

    A = (cuDoubleComplex *) malloc (sizeof(cuDoubleComplex) * lda * m);
    devInfo = (int *) malloc (sizeof(int));

// step 1: create cusolver/cublas handle
    cusolver_status = cusolverDnCreate(&cusolverH);

// step 2: set A
    A[0] = make_cuDoubleComplex(25.0,0.0);
    A[3] = make_cuDoubleComplex(15.0,0.0);
    A[6] = make_cuDoubleComplex(-5.0,0.0);
    A[1] = A[3];
    A[4] = make_cuDoubleComplex(18.0,0.0);
    A[7] = make_cuDoubleComplex(0.0,0.0);
    A[2] = A[6];
    A[5] = A[7];
    A[8] = make_cuDoubleComplex(11.0,0.0);

    printD_CMatrix(m, m, A, lda, "A");
    printf("=====\n");

// step 3: query working space of syevd
    cublasFillMode_t uplo = CUBLAS_FILL_MODE_LOWER;

#pragma acc data copy(A[0:lda*m], devInfo[0:1])
{
#pragma acc host_data use_device(A)
  {
    cusolver_status = cusolverDnZpotrf_bufferSize(
        cusolverH,
        uplo,
        m,
        A,
        lda,
        &lwork);
  }
    assert (cusolver_status == CUSOLVER_STATUS_SUCCESS);

    work = (cuDoubleComplex *) malloc(sizeof(cuDoubleComplex)*lwork);
    printf("Lwork = %d\n",lwork);

#pragma acc enter data create(work[0:lwork])

// step 4: compute spectrum
#pragma acc host_data use_device(A, work, devInfo)
  {
    cusolver_status = cusolverDnZpotrf(
        cusolverH,
        uplo,
        m,
        A,
        lda,
        work,
        lwork,
        devInfo);
  }
#pragma acc wait
#pragma acc exit data delete(work)
}
    printf("DevInfo = %d\n",devInfo[0]);

    printf("A = (base 1)\n");
    printD_CMatrix(m, m, A, lda, "A_result");
    printf("=====\n");

// free resources
    if ( A     ) free(A);
    if (devInfo) free(devInfo);
    if (work   ) free(work);

    if (cusolverH) cusolverDnDestroy(cusolverH);

    return 0;
}

 PGI C/C++ OpenACC コンパイラで cuSOLVER を使用する場合は、NVIDIA CUDA Toolkit にバンドルされている C/C++ 用 include ファイルを使用してコンパイルする(PGI コンパイラ内には、C/C++ 用の cuSOLVER/cuBLAS ライブラリ等の include ファイルはバンドルされていない) 。ここでの例は CUDA 9.0 が /usr/local/cuda-9.0 配下に実装されているものとして記述する。コンパイルオプションとして、 -I/usr/local/cuda-9.0/include -acc -ta=tesla -Mcudalib=cusolver オプションを付けてコンパイルリンクする必要がある。
(CUDA 8.0 の場合)
 pgc++ -I/usr/local/cuda-8.0/include DnZpotrf.cpp -acc -ta=tesla,cc60,cc35 -Mcudalib=cusolver -Minfo=accel -O2
(CUDA 9.0 の場合)
 pgc++ -I/usr/local/cuda-9.0/include DnZpotrf.cpp -acc -ta=tesla,cc60,cc35 -Mcudalib=cusolver -Minfo=accel -O2

コンパイル&実行

[kato@photon32 C]$ pgc++ -I/usr/local/cuda-9.0/include DnZpotrf.cpp -acc -ta=tesla,cc60,cc35 
-Mcudalib=cusolver -Minfo=accel -O2
main:
     64, Generating copy(A[:9],devInfo[:1])
     84, Generating enter data create(work[:lwork])
     97, Generating exit data delete(work[:1])
[kato@photon32 C]$ a.out
A(1,1) = 25.000000  0.000000
A(1,2) = 15.000000  0.000000
A(1,3) = -5.000000  0.000000
A(2,1) = 15.000000  0.000000
A(2,2) = 18.000000  0.000000
A(2,3) = 0.000000  0.000000
A(3,1) = -5.000000  0.000000
A(3,2) = 0.000000  0.000000
A(3,3) = 11.000000  0.000000
=====
Lwork = 1
DevInfo = 0
A = (base 1)
A(1,1) = 5.000000  0.000000
A(1,2) = 15.000000  0.000000
A(1,3) = -5.000000  0.000000
A(2,1) = 3.000000  0.000000
A(2,2) = 3.000000  0.000000
A(2,3) = 0.000000  0.000000
A(3,1) = -1.000000  0.000000
A(3,2) = 1.000000  0.000000
A(3,3) = 3.000000  0.000000
=====

3. cuSolverDN: 対称正定値行列のコレスキー分解を使った線形ソルバー (LAPACK Dpotrs相当)

 対称正定値行列をコレスキー分解を使用して解く倍精度型ソルバー(cusolverDnDpotrs)の利用例である。コレスキー分解は、cusolverDnDpotrf ルーチンを使用し、cusolverDnDpotrs を使用して連立一次方程式を解く。なお、R.H.S(右辺) は、複数のベクトルを与えて解く。なお、密行列 A は、 column-major order で入力する必要がある。

OpenACC Fortran + cusolverDnDpotrf, cusolverDnDpotrs (Solver.f90)

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
program main
  use cublas_v2
  use cusolverDn
  use cudafor
  implicit none
  integer, parameter :: n=5, nrhs=3
  integer, parameter :: lda=n, ldb=n
  integer ::  i
  real(8) :: a(lda,n)
  real(8) :: b(ldb, nrhs)
  real(8), allocatable :: workspace_d(:)
  integer :: devInfo_d
  integer :: istat, Lwork
  type(cusolverDnHandle) :: h

  data a /3.14, 0.00, 0.00, 0.00, 0.00, &
          0.17, 0.79, 0.00, 0.00, 0.00, &
         -0.90, 0.83, 4.53, 0.00, 0.00, &
          1.65,-0.65,-3.70, 5.32, 0.00, &
         -0.72, 0.28, 1.60,-1.37, 1.98/

  data b /-7.29, 9.25, 5.99,-1.94,-8.30, &
           6.11, 2.90,-5.05,-3.80, 9.66, &
           0.59, 8.88, 7.57, 5.57,-1.67/

  print *, "Matrix A"
  do i = 1, n
  write(*,"(5(f8.2,2x))") a(i,:)
  end do
  print *, "RHS "
  do i = 1, n
  write(*,"(5(f8.2,2x))") b(i,:)
  end do

  istat = cusolverDnCreate(h)
  if (istat /= CUSOLVER_STATUS_SUCCESS) &
       write(*,*) 'handle creation failed'

!$acc data copy(a,b) create(workspace_d)
  !$acc host_data use_device(a)
    istat = cusolverDnDpotrf_bufferSize(h, &
         CUBLAS_FILL_MODE_LOWER, n, a, n, Lwork)
  !$acc end host_data

    if (istat /= CUSOLVER_STATUS_SUCCESS) &
         write(*,*) 'cusolverDnDpotrf_buffersize failed'
    print *, "working space (bytes) =", Lwork
    allocate(workspace_d(Lwork))

  !$acc enter data copyin(workspace_d) copyin(devInfo_d)

  !$acc host_data use_device(a, b, workspace_d, devInfo_d)
    istat = cusolverDnDpotrf(h, CUBLAS_FILL_MODE_UPPER, &
             n, a, lda, workspace_d, Lwork, devInfo_d)
    istat = cusolverDnDpotrs(h, CUBLAS_FILL_MODE_UPPER, &
             n, nrhs, a, lda, b, ldb, devInfo_d)
  !$acc end host_data

    if (istat /= CUSOLVER_STATUS_SUCCESS) &
         write(*,*) 'cusolverDnDpotrs failed'

  !$acc exit data copyout(devInfo_d)
  !$acc exit data delete(workspace_d)
!$acc end data

  istat = devInfo_d
  if (istat /= 0) write(*,*) 'Cholesky Solver failed'

  istat = cusolverDnDestroy(h)
  if (istat /= CUSOLVER_STATUS_SUCCESS) &
       write(*,*) 'handle destruction failed'

  print *, "Cholesky factorization"
  do i = 1, n
  write(*,"(5(f8.2,2x))") a(i,:)
  end do
  print *, "Solution"
  do i = 1, n
  write(*,"(5(f8.2,2x))") b(i,:)
  end do
end program main

$ pgf90 -c -O2 -Minfo -ta=tesla,cuda9.0,cc35,cc60 -Mcudalib=cusolver,cublas  -Kieee solver.f90
main:
     47, Loop not vectorized/parallelized: contains call
     48, Loop unrolled 5 times (completely unrolled)
     51, Loop not vectorized/parallelized: contains call
     52, Loop unrolled 3 times (completely unrolled)
     59, Generating copy(a(:,:))
         Generating create(workspace_d(:))
         Generating copy(b(:,:))
     70, Generating enter data copyin(devinfo_d,workspace_d(:))
     82, Generating exit data copyout(devinfo_d)
     83, Generating exit data delete(workspace_d(:))
     94, Loop not vectorized/parallelized: contains call
     95, Loop unrolled 5 times (completely unrolled)
     98, Loop not vectorized/parallelized: contains call
     99, Loop unrolled 3 times (completely unrolled)
$ pgf90 -o  a.out solver.o -Mcuda -acc -ta=tesla,cuda9.0,cc35,cc60  -Mcudalib=cusolver,cublas
$ a.out
 Matrix A
    3.14      0.17     -0.90      1.65     -0.72
    0.00      0.79      0.83     -0.65      0.28
    0.00      0.00      4.53     -3.70      1.60
    0.00      0.00      0.00      5.32     -1.37
    0.00      0.00      0.00      0.00      1.98
 RHS
   -7.29      6.11      0.59
    9.25      2.90      8.88
    5.99     -5.05      7.57
   -1.94     -3.80      5.57
   -8.30      9.66     -1.67
 working space (bytes) =            2
 Cholesky factorization
    1.77      0.10     -0.51      0.93     -0.41
    0.00      0.88      0.99     -0.84      0.36
    0.00      0.00      1.81     -1.32      0.57
    0.00      0.00      0.00      1.42      0.05
    0.00      0.00      0.00      0.00      1.16
 Solution
   -6.02      3.95     -3.14
   15.62      4.32     13.05
    3.02     -8.25      4.91
    3.25     -4.83      6.11
   -8.78      9.04     -3.57

4. cuSolverDN: LU分解を使った線形システムソルバー (LAPACK Dgetrs 相当)

 Dense Matrix を対象とした LU 分解を使用して解く倍精度型線形ソルバー(cusolverDnDgetrs)の利用例である。コレスキー分解は、cusolverDngetrf ルーチンを使用し、cusolverDnSgetrs を使用して連立一次方程式を解く。なお、R.H.S(右辺) は、複数のベクトルを与えて解く。なお、密行列 A は、 column-major order で入力する必要がある。

OpenACC Fortran + cusolverDnDpotrf, cusolverDnDpotrs (Solver.f90)

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
program main
  use cublas_v2
  use cusolverDn
  use cudafor
  implicit none
  integer, parameter :: m=5, n=3, nrhs=2
  integer, parameter :: lda=m, ldb=m
  integer ::  i, max_mn
  real(8) :: a(lda,n)
  real(8) :: b(ldb, nrhs)
  integer, allocatable :: devIpiv(:)
  real(8), allocatable :: workspace_d(:)
  integer :: devInfo_d
  integer :: istat, Lwork
  type(cusolverDnHandle) :: h

  data a /1.00, 2.00, 3.00, 4.0, 5.0,&
          1.00, 3.00, 5.00, 2.0, 4.0,&
          1.00, 4.00, 2.00, 5.0, 3.0/

  data b /-10.0, 12.00, 14.0, 16.0, 18.0,&
           -3.0, 14.0, 12.0, 16.0,16.0/

  print *, "Matrix A"
  do i = 1, m
  write(*,"(5(f8.2,2x))") a(i,:)
  end do
  print *, "RHS "
  do i = 1, m
  write(*,"(5(f8.2,2x))") b(i,:)
  end do

  max_mn = max (m, n)
  allocate (devIpiv(max_mn))

  istat = cusolverDnCreate(h)
  if (istat /= CUSOLVER_STATUS_SUCCESS) &
       write(*,*) 'handle creation failed'

!$acc data copy(a,b) create(workspace_d) copyout(devIpiv)
  !$acc host_data use_device(a)
    istat = cusolverDnDgetrf_bufferSize(h, &
         m, n, a, lda, Lwork)
  !$acc end host_data

    if (istat /= CUSOLVER_STATUS_SUCCESS) &
         write(*,*) 'cusolverDnDgetrf_buffersize failed'
    print *, "working space (bytes) =", Lwork
    allocate(workspace_d(Lwork))

  !$acc enter data copyin(workspace_d) copyin(devInfo_d)

  !$acc host_data use_device(a, b, workspace_d, devInfo_d, devIpiv)
    istat = cusolverDnDgetrf(h, &
             m, n, a, lda, workspace_d, devIpiv, devInfo_d)
    istat = cusolverDnDgetrs(h, CUBLAS_OP_N, &
             n, nrhs, a, lda, devIpiv, b, ldb, devInfo_d)
  !$acc end host_data

    if (istat /= CUSOLVER_STATUS_SUCCESS) &
         write(*,*) 'cusolverDnDgetrs failed'

  !$acc exit data copyout(devInfo_d)
  !$acc exit data delete(workspace_d)
!$acc end data

  istat = devInfo_d
  if (istat /= 0) write(*,*) 'LU Solver failed'

  istat = cusolverDnDestroy(h)
  if (istat /= CUSOLVER_STATUS_SUCCESS) &
       write(*,*) 'handle destruction failed'

  print *, "LU factorization"
  do i = 1, m
  write(*,"(5(f8.2,2x))") a(i,:)
  end do
  print *, "Pivot Indcies"
  print *, devIpiv(:)
  print *, "Solution"
  do i = 1, m
  write(*,"(5(f8.2,2x))") b(i,:)
  end do
end program main

$ pgf90 -c -O2 -Minfo -ta=tesla,cuda9.0,cc35,cc60 -Mcudalib=cusolver,cublas -Kieee solver.f90
main:
     45, Loop not vectorized/parallelized: contains call
     46, Loop unrolled 3 times (completely unrolled)
     49, Loop not vectorized/parallelized: contains call
     50, Loop unrolled 2 times (completely unrolled)
     60, Generating copy(a(:,:))
         Generating copyout(devipiv(:))
         Generating create(workspace_d(:))
         Generating copy(b(:,:))
     71, Generating enter data copyin(devinfo_d,workspace_d(:))
     83, Generating exit data copyout(devinfo_d)
     84, Generating exit data delete(workspace_d(:))
     95, Loop not vectorized/parallelized: contains call
     96, Loop unrolled 3 times (completely unrolled)
    101, Loop not vectorized/parallelized: contains call
    102, Loop unrolled 2 times (completely unrolled)
pgf90 -o  a.out solver.o -Mcuda -acc -ta=tesla,cuda9.0,cc35,cc60  -Mcudalib=cusolver,cublas
$ a.out
 Matrix A
    1.00      1.00      1.00
    2.00      3.00      4.00
    3.00      5.00      2.00
    4.00      2.00      5.00
    5.00      4.00      3.00
 RHS
  -10.00     -3.00
   12.00     14.00
   14.00     12.00
   16.00     16.00
   18.00     16.00
 working space (bytes) =           18
 LU factorization
    5.00      4.00      3.00
    0.60      2.60      0.20
    0.40      0.54      2.69
    0.80     -0.46      1.00
    0.20      0.08      0.14
 Pivot Indcies
            5            3            3            0            0
 Solution
    2.00      1.20
    1.14      0.74
    1.14      2.34
   16.00     16.00
  -10.00     -3.00

5. cuSolverDN: 対称行列の固有値ソルバー (LAPACK Dsyevd相当)

 対称性のある密行列に対する固有値、固有ベクトルを求めるためのソルバー(cusolverDnDsyevd)を使用する。以下の例は、D.1. Standard Symmetric Dense Eigenvalue Solvern と同じ処理を Fortran OpenACC ならびに C++ OpenACC で書いたものである。

Dense Linear Solver - cusolverDnDsyevd with OpenACC Fortran

 対称(エルミート)n×n 行列 A の固有値と固有ベクトルを計算します。標準対称固有値問題は、以下の式で表される。

A * V = V * Λ

ここで、Λは実際の n×n 対角行列です。 V は n×n ユニタリ行列である。 Λの対角要素は、A の昇順の固有値です。cusolverDnDsyevd ソルバーは、予め作業空間(d_work)を提供する必要があるため、syevd_bufferSize 関数で、lwork サイズを求める必要がある。出力パラメータの devInfo = -i(ゼロより小さい)の場合、i 番目のパラメータは間違っていることを示す。 devInfo = i(零より大きい)場合は、内部計算時に三重対角の i 個のオフ対角要素がゼロに収束しなかったことを表す。なお、密行列 A は、 column-major order で入力する必要がある。

OpenACC + cusolverDnDsyevd (syevd.f90)

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.

! pgf90 -acc -O2 -Minfo -ta=tesla,cc60,cc35 -Mcuda -Mcudalib=cusolver syevd.f90

program main
  use cusolverDn
  implicit none
  integer,parameter :: prc = SELECTED_REAL_KIND(15,305)

  integer, parameter :: idbg = 1 ! debug write enable
  integer, parameter :: m = 3, lda =m
  type(cusolverDnHandle) :: cusolver_H
  integer                :: cusolver_status
  integer                :: status1, ierr_code
  integer                :: i, j

!       | 3.5 0.5 0 |
!   A = | 0.5 3.5 0 |
!       | 0   0   2 |
!
!   lambda = (/ 2.0, 3.0, 4.0 /)  exact eigenvalues

  real(prc), allocatable, dimension(:) :: A
  real(prc), allocatable, dimension(:) :: lambda


  real(prc), allocatable, dimension(:) ::  V  ! eigenvectors
  real(prc), allocatable, dimension(:) ::  W  ! eigenvaluse
  real(prc), allocatable, dimension(:) ::  d_work
  integer :: devInfo
  integer :: Lwork

  allocate ( A(lda*m), V(lda*m), W(m) )
  allocate ( lambda(m) )

  Lwork = 0

! set a data
  A(1:lda*m)  = (/3.5, 0.5, 0, 0.5, 3.5, 0, 0, 0, 2.0/)

! step 1 create cusolver/cublas handle
  cusolver_status = cusolverDnCreate(cusolver_H)
    if (idbg == 1) then
      if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) print *, "step1 cusolver_status = ", cusolver_status
    end if

!$acc data copy(A, devInfo) copyout(W)
!$acc host_data use_device(A, W)

! step 2 working space of syevd
  cusolver_status = cusolverDnDsyevd_bufferSize(cusolver_H, CUSOLVER_EIG_MODE_VECTOR, &
                    CUSOLVER_EIG_MODE_VECTOR, m, A, lda, W, Lwork)
!$acc end host_data

    if (idbg == 1) then
      if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) print *, "step2 cusolver_status = ", cusolver_status
    end if

! step 3 d_work buffer allocated
    allocate(d_work(Lwork), stat=ierr_code)  ! words
      if (idbg == 1) print *, "step5: alloc ierr_code = ", ierr_code

!$acc enter data copyin(d_work)

! step 4 compute spectrum = eigen{vectors, valuse}
!$acc host_data use_device(A, W, d_work, devInfo)
  cusolver_status = cusolverDnDsyevd(cusolver_H,  CUSOLVER_EIG_MODE_VECTOR, &
                    CUSOLVER_EIG_MODE_VECTOR, m, A, lda, W, d_work, Lwork, devInfo)
!$acc end host_data
!$acc exit data delete(d_work)
!$acc end data

  print *, "Output parameter devInfo =", devInfo

    if (idbg == 1) then
      if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) print *, "step4 cusolver_status = ", cusolver_status
    end if

! copy [A].output to [V] (eigenvectors)
  V = A

! atep 5 print out
  ! Eigenvalues
  print *, "Eigenvalue : "
  do i = 1, m
    print *, "W(",i,")  = ", w(i)
  end do

  ! Eigenvectors
  print *, "Eigenvector : "
  do i = 1, m
    do j = 1, m
       print '(1x,a,i5,a,i5,1x,a4,F15.7)', "(",i,",",j,") = ", V(i+(j-1)*lda)
    end do
  end do


  deallocate(A, V, W, d_work)

  end program main

 コンパイルオプションとして、-acc -Mcudalib=cusolver -Mcuda オプションを付けてコンパイルリンクする必要がある。

コンパイル&実行

[kato@photon32 Dens_Eigen]$ pgf90 -acc -O2 -Minfo -ta=tesla,cc60,cc35 -Mcuda -Mcudalib=cusolver syevd.f90
main:
     57, Generated vector simd code for the loop
         Residual loop unrolled 1 times (completely unrolled)
     65, Generating copy(a(:))
         Generating copyout(w(:))
         Generating copy(devinfo)
     81, Generating enter data copyin(d_work(:))
     88, Generating exit data delete(d_work(:))
     98, Generated vector simd code for the loop
         Residual loop unrolled 1 times (completely unrolled)
    103, Loop not vectorized/parallelized: contains call
    110, Loop not vectorized/parallelized: contains call
[kato@photon32 Dens_Eigen]$ a.out
 step5: alloc ierr_code =             0
 Output parameter devInfo =            0
 Eigenvalue :
 W(            1 )  =     2.000000000000000
 W(            2 )  =     3.000000000000000
 W(            3 )  =     4.000000000000000
 Eigenvector :
 (    1,    1 ) =       0.0000000
 (    1,    2 ) =      -0.7071068
 (    1,    3 ) =       0.7071068
 (    2,    1 ) =       0.0000000
 (    2,    2 ) =       0.7071068
 (    2,    3 ) =       0.7071068
 (    3,    1 ) =       1.0000000
 (    3,    2 ) =       0.0000000
 (    3,    3 ) =       0.0000000

Dense Linear Solver - cusolverDnDsyevd with OpenACC C++

上記 Fortran プログラムを PGI C++ OpenACC を使用して書き直す。

OpenACC + cusolverDnDsyevd (syevd.cpp)

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.

/*
 * How to compile (assume cuda is installed at /usr/local/cuda-9.0/)
 *  pgc++ -I/usr/local/cuda-9.0/include syevd.cpp -acc -ta=tesla,cc60,cc35 -Mcudalib=cusolver -Minfo=accel -O2
 *
 */

#include <stdio.h>
#include <stdlib.h>
#include <assert.h>
#include <cusolverDn.h>

void printMatrix(int m, int n, const double*A, int lda, const char* name)
{
    for(int row = 0 ; row < m ; row++){
        for(int col = 0 ; col < n ; col++){
            double Areg = A[row + col*lda];
            printf("%s(%d,%d) = %f\n", name, row+1, col+1, Areg);
        }
    }
}

int main(int argc, char*argv[])
{
    cusolverDnHandle_t cusolverH = NULL;
    cusolverStatus_t cusolver_status = CUSOLVER_STATUS_SUCCESS;
    const int m = 3;
    const int lda = m;

/*       | 3.5 0.5 0 |
 *   A = | 0.5 3.5 0 |
 *       | 0   0   2 |
 *
 */
//  double lambda[m] = { 2.0, 3.0, 4.0};   /* Exact eigenvalues */
    double *A;  // ibput
    double *V;  // eigenvectors
    double *W;  // eigenvalues

    int *devInfo = NULL;
    double *work = NULL;
    int  lwork = 0;

    A = (double *) malloc (sizeof(double) * lda * m);
    V = (double *) malloc (sizeof(double) * lda * m);
    W = (double *) malloc (sizeof(double) * m);
    devInfo = (int *) malloc (sizeof(int));

// step 1: create cusolver/cublas handle
    cusolver_status = cusolverDnCreate(&cusolverH);

// step 2: set A
    A[0] = 3.5 ; A[1] = 0.5 ; A[2] = 0.0;
    A[3] = 0.5;  A[4] = 3.5 ; A[5] = 0.0;
    A[6] = 0.0;  A[7] = 0.0 ; A[8] = 2.0;

    printMatrix(m, m, A, lda, "A");
    printf("=====\n");

// step 3: query working space of syevd
    cusolverEigMode_t jobz = CUSOLVER_EIG_MODE_VECTOR; // compute eigenvalues and eigenvectors.
    cublasFillMode_t uplo = CUBLAS_FILL_MODE_LOWER;

#pragma acc data copy(A[0:lda*m], devInfo[0:1]) copyout(W[0:m])
{
#pragma acc host_data use_device(A, W)
  {
    cusolver_status = cusolverDnDsyevd_bufferSize(
        cusolverH,
        jobz,
        uplo,
        m,
        A,
        lda,
        W,
        &lwork);
  }
    assert (cusolver_status == CUSOLVER_STATUS_SUCCESS);

    work = (double *) malloc(sizeof(double)*lwork);
    printf("Lwork = %d\n",lwork);

#pragma acc enter data create(work[0:lwork])

// step 4: compute spectrum
#pragma acc host_data use_device(A, W, work, devInfo)
  {
    cusolver_status = cusolverDnDsyevd(
        cusolverH,
        jobz,
        uplo,
        m,
        A,
        lda,
        W,
        work,
        lwork,
        devInfo);
  }
#pragma acc wait
#pragma acc exit data delete(work)
}
    printf("DevInfo = %d\n",devInfo[0]);
    printf("eigenvalue = (base 1), ascending order\n");
    for(int i = 0 ; i < m ; i++){
        printf("W[%d] = %E\n", i+1, W[i]);
    }

// print eigenvecvtors
    for(int i = 0 ; i < m*m ; i++){
      V[i] = A[i];
    }
    printf("V = (base 1)\n");
    printMatrix(m, m, V, lda, "V");
    printf("=====\n");

// free resources
    if (  A    ) free(A);
    if (  W    ) free(W);
    if (devInfo) free(devInfo);
    if (work )   free(work);

    if (cusolverH) cusolverDnDestroy(cusolverH);

//  cudaDeviceReset();

    return 0;
}

 PGI C/C++ OpenACC コンパイラで cuSOLVER を使用する場合は、NVIDIA CUDA Toolkit にバンドルされている C/C++ 用 include ファイルを使用してコンパイルする(PGI コンパイラ内には、C/C++ 用の cuSOLVER/cuBLAS ライブラリ等の include ファイルはバンドルされていない) 。ここでの例は CUDA 9.0 が /usr/local/cuda-9.0 配下に実装されているものとして記述する。コンパイルオプションとして、 -I/usr/local/cuda-9.0/include -acc -ta=tesla -Mcudalib=cusolver オプションを付けてコンパイルリンクする必要がある。
(CUDA 8.0 の場合)
 pgc++ -I/usr/local/cuda-8.0/include syevd.cpp -acc -ta=tesla,cc60,cc35 -Mcudalib=cusolver -Minfo=accel -O2
(CUDA 9.0 の場合)
 pgc++ -I/usr/local/cuda-9.0/include syevd.cpp -acc -ta=tesla,cc60,cc35 -Mcudalib=cusolver -Minfo=accel -O2

コンパイル&実行

[kato@photon32 C]$ pgc++ -I/usr/local/cuda-9.0/include syevd.cpp -acc -ta=tesla,cc60,cc35 
-Mcudalib=cusolver -Minfo=accel -O2
main:
     65, Generating copyout(W[:3])
         Generating copy(devInfo[:1],A[:9])
     87, Generating enter data create(work[:lwork])
    103, Generating update self(devInfo[:1])
         Generating exit data delete(work[:1])
[kato@photon32 C]$ a.out
A(1,1) = 3.500000
A(1,2) = 0.500000
A(1,3) = 0.000000
A(2,1) = 0.500000
A(2,2) = 3.500000
A(2,3) = 0.000000
A(3,1) = 0.000000
A(3,2) = 0.000000
A(3,3) = 2.000000
=====
Lwork = 83
DevInfo = 0
eigenvalue = (base 1), ascending order
W[1] = 2.000000E+00
W[2] = 3.000000E+00
W[3] = 4.000000E+00
V = (base 1)
V(1,1) = 0.000000
V(1,2) = -0.707107
V(1,3) = 0.707107
V(2,1) = 0.000000
V(2,2) = 0.707107
V(2,3) = 0.707107
V(3,1) = 1.000000
V(3,2) = 0.000000
V(3,3) = 0.000000
=====

6. cuSolverDN: QR分解による密行列線形ソルバー (LAPACK Dgeqrf 相当)

 密行列線形システムを解くために QR 分解を利用する例を示す。以下の例は、C.1. QR Factorization Dense Linear Solver と同じ処理を Fortran OpenACC で書いたものである。なお、密行列 A は、 column-major order で入力する必要がある。

QR Factorization Dense Linear Solver with OpenACC Fortran

 以下の線形システムを QR 分解を用いて解く。

A x = b

 以下のプログラムでは、3 x 3 密行列(正則行列)の例を示す。
A = 1.0 2.0 3.0 4.0 5.0 6.0 2.0 1.0 1.0

 cuSOLVEルーチン(geqrf,ormqr)、cuBLAS ルーチン(trsm)を利用する。

Step 1: A = Q*R by geqrf.
Step 2: B := Q^T*B by ormqr.
Step 3: solve R*X = B by trsm.

OpenACC + cuSOLVE-geqrf,ormqr,trsm (ormqr.f90)

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.

! pgf90 -acc -O2 -Minfo=accel -ta=tesla,cc60,cc35 -Mcudalib=cusolver,cublas
!       -Mcuda ormqr.f90

program main
  use cublas_v2
  use cusolverDn
  implicit none
  integer,parameter :: prc = SELECTED_REAL_KIND(15,305)

  integer, parameter :: idbg = 1 ! debug write enable
  integer, parameter :: m = 3, lda =m, ldb = m
  integer, parameter :: nrhs = 1 ! number of right hand side vectors
  real(prc)          :: one = 1.0d0
  type(cusolverDnHandle) :: cusolver_H
  type(cublasHandle)     :: cublas_H
  integer                :: cusolver_status
  integer                :: cublas_status
  integer                :: status1, ierr_code
  integer                :: i, j

!       | 1 2 3 |
!   A = | 4 5 6 |
!       | 2 1 1 |
!
!   x = (1 1 1)   exact solution
!   b = (6 15 4)

  real(prc), allocatable, dimension(:) :: A
  real(prc), allocatable, dimension(:) :: B
  real(prc), allocatable, dimension(:) :: Xc   ! solution matrix from GPU

  real(prc), allocatable, dimension(:) :: tau
  real(prc), allocatable, dimension(:) :: d_work
  integer :: devInfo
  integer :: Lwork

  allocate ( A(lda*m), B(ldb*nrhs), Xc(ldb*nrhs) )
  allocate ( tau(m) )

  Lwork = 0

! set a data
  A(1:lda*m)    = (/1.0, 4.0, 2.0, 2.0, 5.0, 1.0, 3.0, 6.0, 1.0/)
  B(1:ldb*nrhs) = (/6.0, 15.0, 4.0/)

! print A
  print *, "A ======"
  do i = 1, m
    do j = 1, m
       print '(1x,a,i5,a,i5,1x,a4,F15.7)', "(",i,",",j,") = ", A(i+(j-1)*lda)
    end do
  end do
! print B
  print *, "B ======"
  do i = 1, ldb
    do j = 1, 1
       print '(1x,a,i5,a,i5,1x,a4,F15.7)', "(",i,",",j,") = ", B(i)
    end do
  end do

! step 1 create cusolver/cublas handle
  cusolver_status = cusolverDnCreate(cusolver_H)
  cublas_status   = cublasCreate(cublas_H)
    if (idbg == 1) then
      if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) print *, "step1 cusolver_status = ", cusolver_status
      if (cublas_status   /= CUBLAS_STATUS_SUCCESS  ) print *, "step1 cublas_status = ", cublas_status
    end if

!$acc data copy(A, B, devInfo) create(tau)

! step 2 query working space of geqrf and ormqr
!$acc host_data use_device(A)
  cusolver_status = cusolverDnDgeqrf_bufferSize(cusolver_H, &
                    m, m, A, lda, Lwork)
!$acc end host_data

    if (idbg == 1) then
      if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) print *, "step2 cusolver_status = ", cusolver_status
    end if

! step 3 d_work buffer allocated
    allocate(d_work(Lwork), stat=ierr_code)  ! words
      if (idbg == 1) then
        if (ierr_code /= 0) print *, "step3: alloc ierr_code = ", ierr_code
      end if
    print *, "Lwork = ", Lwork

!$acc enter data copyin(d_work)

! step 4  compute QR factorization
!$acc host_data use_device(A, tau, d_work, devInfo)
  cusolver_status = cusolverDnDgeqrf(cusolver_H, &
                    m, m, A, lda, tau, d_work, Lwork, devInfo)
!$acc end host_data
!$acc update self(devInfo)
  print *, "after geqrf: devInfo =", devInfo

! step 5: compute Q^T*B
!$acc host_data use_device(A, B, tau, d_work, devInfo)
  cusolver_status = cusolverDnDormqr(cusolver_H, CUBLAS_SIDE_LEFT, CUBLAS_OP_T, &
                    m, nrhs, m, A, lda, tau, B, ldb, d_work, Lwork, devInfo)
!$acc end host_data

!$acc update self(devInfo)

    print *, "after ormqr: devInfo =", devInfo
    if (idbg == 1) then
      if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) print *, "step5 cusolver_status = ", cusolver_status
    end if

! step 6: compute x = R \ Q^T*B
!$acc host_data use_device(A, B)
  cublas_status = cublasDtrsm_v2(cublas_H, CUBLAS_SIDE_LEFT, CUBLAS_FILL_MODE_UPPER, CUBLAS_OP_N, &
                   CUBLAS_DIAG_NON_UNIT, m, nrhs, one, A, lda, B, ldb)
!$acc end host_data

!$acc exit data delete(tau, d_work)
!$acc end data

    if (idbg == 1) then
      if (cublas_status /= CUBLAS_STATUS_SUCCESS) print *, "step6 cublas_status = ", cublas_status
    end if

! copy [B].output to result [Xc]
  Xc = B

! atep 5 print out

  print *, "Result =====================  "
  do i = 1, m
    do j = 1, nrhs
       print '(1x,a,i5,a,i5,1x,a4,F15.7)', "(",i,",",j,") = ", Xc(i+(j-1)*ldb)
    end do
  end do

  deallocate(A, B, Xc, tau, d_work)

  end program main

 コンパイルオプションとして、-acc -Mcudalib=cusolver,cublas -Mcuda -ta=tesla オプションを付けてコンパイルリンクする必要がある。

コンパイル&実行

[kato@photon32 QRf]$ pgf90 -acc -O2 -Minfo=accel -ta=tesla,cc60,cc35,cuda8.0 
-Mcudalib=cusolver,cublas -Mcuda ormqr.f90
main:
     88, Generating copy(a(:),b(:))
         Generating create(tau(:))
         Generating copy(devinfo)
    107, Generating enter data copyin(d_work(:))
    114, Generating update self(devinfo)
    123, Generating update self(devinfo)
    136, Generating exit data delete(tau(:),d_work(:))
[kato@photon32 QRf]$ a.out
 A ======
 (    1,    1 ) =       1.0000000
 (    1,    2 ) =       2.0000000
 (    1,    3 ) =       3.0000000
 (    2,    1 ) =       4.0000000
 (    2,    2 ) =       5.0000000
 (    2,    3 ) =       6.0000000
 (    3,    1 ) =       2.0000000
 (    3,    2 ) =       1.0000000
 (    3,    3 ) =       1.0000000
 B ======
 (    1,    1 ) =       6.0000000
 (    2,    1 ) =      15.0000000
 (    3,    1 ) =       4.0000000
 Lwork =            15
 after geqrf: devInfo =            0
 after ormqr: devInfo =            0
 Result =====================
 (    1,    1 ) =       1.0000000
 (    2,    1 ) =       1.0000000
 (    3,    1 ) =       1.0000000

7. cuSolverDN: QR分解による直交化 (LAPACK Dgeqrf,Dorgqr 相当)

 QR 分解による直交化計算の例を示す。以下の例は、C.2. orthogonalization と同じ処理を Fortran OpenACC で書いたものである。なお、密行列 A は、 column-major order で入力する必要がある。

Orthgonalization by QR Factorization with OpenACC Fortran

CUDA Library を使用して QR 分解による「直交化」の計算を行う。

A = Q * R

 以下のプログラムでは、3 x 2 密行列の例を示す。
A = 1.0 2.0 4.0 5.0 2.0 1.0

 cuSOLVEルーチン(geqrf,orgqr)を利用する。

Step 1: A = Q*R by geqrf.
Step 2:form Q by orgqr.
Step 3: check if Q is unitary or not.

OpenACC + cuSOLVE-geqrf,orgqr (ortho.f90)

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.

! pgf90 -acc -O2 -Minfo=accel -ta=tesla,cc60,cuda8.0 -Mcudalib=cusolver,cublas 
! -Mcuda ormqr.f90 -Mcuda

program main
  use cublas_v2
  use cusolverDn
  implicit none
  integer,parameter :: prc = SELECTED_REAL_KIND(15,305)

  integer, parameter :: idbg = 1 ! debug write enable
  integer, parameter :: m = 3, n = 2, lda =m
  real(prc)          :: beta = 1.0d0, alpha = -1.0d0
  type(cusolverDnHandle) :: cusolver_H
  type(cublasHandle)     :: cublas_H
  integer                :: cusolver_status, cusolver_status2
  integer                :: cublas_status
  integer                :: status1, ierr_code
  integer                :: i, j

!       | 1 2 |
!   A = | 4 5 |
!       | 2 1 |
!
!   x = (1 1 1)   exact solution
!   b = (6 15 4)

  real(prc), allocatable, dimension(:) :: A
  real(prc), allocatable, dimension(:) :: Q   ! orthonormal columns
  real(prc), allocatable, dimension(:) :: R   ! R = I - Q**T*Q
  real(prc)                            :: R_nrm2

  real(prc), allocatable, dimension(:) :: tau
  real(prc), allocatable, dimension(:) :: d_work
  integer :: devInfo
  integer :: Lwork

  integer :: Lwork_geqrf
  integer :: Lwork_orgqr

  allocate ( A(lda*n), Q(lda*n), R(n*n) )
  allocate ( tau(m) )

  Lwork = 0
  Lwork_geqrf = 0
  Lwork_orgqr = 0

! set a data
  A(1:lda*n)  = (/1.0, 4.0, 2.0, 2.0, 5.0, 1.0/)

! print A
  print *, "A ======"
  do i = 1, m
    do j = 1, n
       print '(1x,a,i5,a,i5,1x,a4,F15.7)', "(",i,",",j,") = ", A(i+(j-1)*lda)
    end do
  end do

! step 1 create cusolver/cublas handle
  cusolver_status = cusolverDnCreate(cusolver_H)
  cublas_status   = cublasCreate(cublas_H)
    if (idbg == 1) then
      if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) print *, "step1 cusolver_status = ", cusolver_status
      if (cublas_status   /= CUBLAS_STATUS_SUCCESS  ) print *, "step1 cublas_status = ", cublas_status
    end if

!$acc data copy(A, devInfo) create(R, tau)

! step 2 query working space of geqrf and ormqr
!$acc host_data use_device(A)
  cusolver_status = cusolverDnDgeqrf_bufferSize(cusolver_H, &
                    m, n, A, lda, Lwork_geqrf)
  cusolver_status2= cusolverDnDorgqr_bufferSize(cusolver_H, &
                    m, n, n, A, lda, tau, Lwork_orgqr)
!$acc end host_data

    Lwork = max(Lwork_geqrf, Lwork_orgqr)
    print *, "Lwork_geqrf, Lwork_orgqr, Lwork = ", Lwork_geqrf, Lwork_orgqr, Lwork

    if (idbg == 1) then
      if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) print *, "step2 cusolver_status = ", cusolver_status
      if (cusolver_status2/= CUSOLVER_STATUS_SUCCESS) print *, "step2 cusolver_status2= ", cusolver_status
    end if

! step 3 d_work buffer allocated
    allocate(d_work(Lwork), stat=ierr_code)  ! words
    if (idbg == 1) then
      if (ierr_code /= 0) print *, "step3: alloc ierr_code = ", ierr_code
    end if

!$acc enter data copyin(d_work)

! step 4  compute QR factorization
!$acc host_data use_device(A, tau, d_work, devInfo)
    cusolver_status = cusolverDnDgeqrf(cusolver_H, &
                      m, n, A, lda, tau, d_work, Lwork, devInfo)
!$acc end host_data
!$acc update self(devInfo)
    print *, "after geqrf: devInfo =", devInfo

! step 5: compute Q
!$acc host_data use_device(A, tau, d_work, devInfo)
    cusolver_status = cusolverDnDorgqr(cusolver_H, &
                      m, n, n, A, lda, tau, d_work, Lwork, devInfo)
!$acc end host_data

!$acc update self(A, devInfo)

    print *, "after orgqr: devInfo =", devInfo
    if (idbg == 1) then
      if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) print *, "step5 cusolver_status = ", cusolver_status
    end if

! step 6: copy A to Q ... orthonormal columns
    Q = A

! print Q
    print *, "Q =========="
    do i = 1, m
      do j = 1, n
         print '(1x,a,i5,a,i5,1x,a4,F15.7)', "(",i,",",j,") = ", Q(i+(j-1)*lda)
      end do
    end do

! step 7:  measure R = I - Q**T*Q
    R = 0.0
    do i = 1, n
      R( i + n*(i-1) ) = 1.0   ! R(i,i) = 1
    end do
!$acc update device(R)

! R = -Q**T*Q + I  (Q**T = CUBLAS_OP_T, Q = CUBLAS_OP_T)
!$acc host_data use_device(A, R)
    cublas_status = cublasDgemm_v2(cublas_H, CUBLAS_OP_T, CUBLAS_OP_N, &
                  n, n, m, alpha, A, lda, A, lda, beta, R, n)
!$acc end host_data
!$acc update self(R)

    if (idbg == 1) then
      if (cublas_status /= CUBLAS_STATUS_SUCCESS) print *, "step7 cublas_status = ", cublas_status
    end if

    R_nrm2 = 0.0d0
!$acc host_data use_device(R)
    cublas_status = cublasDnrm2_v2(cublas_H, n*n, R, 1, R_nrm2)
!$acc end host_data

!$acc exit data delete(tau, d_work)
!$acc end data

! atep 5 print |I - Q**T*Q|

    print *, "Result ==== |I - Q**T*Q| ===  "
    print '(1x, a, 2x, D16.6)', '|I - Q**T*Q| =', R_nrm2

    deallocate(A, R, tau, d_work)

  end program main

 コンパイルオプションとして、-acc -Mcudalib=cusolver,cublas -Mcuda -ta=tesla オプションを付けてコンパイルリンクする必要がある。

コンパイル&実行

[kato@photon32 Ortho]$ pgf90 -acc -Minfo=accel -ta=tesla,cc60,cc35,cuda8.0 
-Mcudalib=cusolver,cublas ortho.f90 -Mcuda
main:
     85, Generating copy(a(:),devinfo)
         Generating create(r(:),tau(:))
    109, Generating enter data copyin(d_work(:))
    116, Generating update self(devinfo)
    125, Generating update self(a(:),devinfo)
    148, Generating update device(r(:))
    155, Generating update self(r(:))
    166, Generating exit data delete(tau(:),d_work(:))
[kato@photon32 Ortho]$ a.out
 A ======
 (    1,    1 ) =       1.0000000
 (    1,    2 ) =       2.0000000
 (    2,    1 ) =       4.0000000
 (    2,    2 ) =       5.0000000
 (    3,    1 ) =       2.0000000
 (    3,    2 ) =       1.0000000
 Lwork_geqrf, Lwork_orgqr, Lwork =            14            2           14
 after geqrf: devInfo =            0
 after orgqr: devInfo =            0
 Q ==========
 (    1,    1 ) =      -0.2182179
 (    1,    2 ) =       0.5345225
 (    2,    1 ) =      -0.8728716
 (    2,    2 ) =       0.2672612
 (    3,    1 ) =      -0.4364358
 (    3,    2 ) =      -0.8017837
 Result ==== |I - Q**T*Q| ===
 |I - Q**T*Q| =      0.450975D-15

8. cuSolverDN: 特異ベクトル計算を含めた特異値分解(by QR) (LAPACK Dgesvd 相当)

 複素数あるいは実数を成分とする行列に対する特異値分解を行う例である。ここでは、実数値を扱う例を示す。LAPACK の dgesvd ルーチンに相当する。特異値分解は、信号処理や統計学の分野で用いられる。以下の例は、F.1. SVD with singular vectors と同じ処理を Fortran OpenACC で書いたものである。なお、密行列 A は、 column-major order で入力する必要がある。

 以下のプログラムでは、3 x 2 密行列の例を示す。SVD は以下の式で表される。ここで、Σ はその min(m、n)対角要素を除いてゼロである m×n 行列であり、U は m×m ユニタリ行列であり、V は n×n ユニタリ行列である。 Σ の対角要素は A の特異値となる。 それらは実数で、負数ではなく、降順で返される。 U と V の最初の min(m、n)列は、A の左右の特異ベクトルとなる。cusolverDnDgesvd ルーチは、QR アルゴリズムを使用する。もう一つ cusolverDngesvdj も用意されているが、これも同じ機能を有している。ただし、後者はヤコビ法を使用して解く。中小の行列を対象とする場合は、ヤコビ法の並列性により高い GPU 性能を享受できるかもしれない。また、ヤコビ法を使用するとある精度までの近似解を得る場合に便利である。ここでは、QR アルゴリズムを用いて解く、特異値分解(SVD)の例を示すこととする。

A = U * Σ * V H

A = 1.0 2.0 4.0 5.0 2.0 1.0

 次の三つの計算ステップを実装する。

Step 1: A = U*S*VT 特異値分解を行う。
Step 2: 特異値の値を検証する。
Step 3: 残留誤差 A-U*S*VT を cuBLAS を使って計算する。

 Step 3の残留誤差を求める際に、BLAS の NVIDIA extention 関数を使用する。これは、PGI Fortran の Interface Modul には含まれていないため、module cuBLASextというモジュールを作成し、Fortran と cuBLAS 関数とのインタフェースを行なった。

OpenACC + cuSOLVER-cusolverDnDgesvd (svd_simgular.f90)

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
! Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
!
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
module cuBLASext
! NVIDIA extention : the BLAS-extension functions
! because PGI Fortran don't include NVIDIA extention BLAS module.
use cublas_v2
interface  cublasDdgmm
  integer(c_int) function  cublasDdgmm( handle, side_mode, m, n, &
                                        A, lda, x , incx , C, ldc) &
                                        bind(C,name="cublasDdgmm")
  use iso_c_binding
  import cublasHandle
  type(cublasHandle), value :: handle
  integer(c_int), value     :: side_mode, m, n, lda, ldc, incx
  real(c_double), device    :: A(*), C(*)
  real(c_double), device    :: x(*)
  end function  cublasDdgmm
end interface  cublasDdgmm
end module cuBLASext

module aux_routine
contains

subroutine print_D_matrix(m, n, a, lda)
integer :: m, n, i, j
real(8) :: a(*)

do i=1, m
  do j=1, n
    print "(5(,i5,1x,i5,2x,f10.6,2x))", i, j, a(i+(j-1)*lda)
  end do
end do
end subroutine print_D_matrix
end module aux_routine

program main
  use cublas_v2
  use cuBLASext
  use cusolverDn
  use aux_routine
  implicit none
  integer, parameter :: m=3, n=2
  integer, parameter :: lda = m
  integer, parameter :: idbg = 1
  real(8), parameter :: one = 1.d0, minus_one = -1.d0
  real(8) :: a (lda*n) ! column-major format with dimensions m x n
  real(8) :: u (lda*m) ! m by m unitary matrix
  real(8) :: vt(lda*n) ! n by n unitary matrix
  real(8) :: w (lda*n) ! w = s*vt
  real(8), allocatable, dimension(:) :: s      ! singular values
  real(8), allocatable, dimension(:) :: rwork ! min(m,n)-1
  real(8), allocatable :: workspace_d(:)
  real(8) :: S_exact(n)
  real(8) :: err, diff_s, diff_fro
  integer :: devInfo_d
  integer :: istat, Lwork
  integer :: i
  character*1 :: jobu, jobvt
  type(cusolverDnHandle) :: cusolver_H
  type(cublasHandle)     :: cublas_H
  integer                :: cusolver_status
  integer                :: cublas_status

!        | 1 2  |
!    a = | 4 5  |
!        | 2 1  |
!

  allocate(s(1:min(m,n)))
  allocate(rwork(1:min(m,n)-1))
  Lwork = 0

! set a data
  a(1:lda*n)  = (/1.0, 4.0, 2.0, 2.0, 5.0, 1.0/)

! Exact SV numbers
  S_exact(1:n) =(/7.065283497082729d0, 1.040081297712078d0/)

  cusolver_status = cusolverDnCreate(cusolver_H)
  if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) &
       print *,  'handle creation failed'
  cublas_status   = cublasCreate(cublas_H)
  if (cublas_status   /= CUBLAS_STATUS_SUCCESS  ) &
       print *, "cublas_status = ", cublas_status

!$acc data copy(a) create(s, u, vt, devInfo_d, workspace_d, rwork, w)

        ! Step 1: compute A = U*S*VT

  !$acc host_data use_device(a)
        istat = cusolverDnDgesvd_bufferSize(cusolver_H, m, n, Lwork)
  !$acc end host_data

        if (istat /= CUSOLVER_STATUS_SUCCESS) &
                 write(*,*) 'cusolverDnDgesvd_bufferSize failed'
        print *, "working space (words) =", Lwork
        allocate(workspace_d(Lwork))

  !$acc enter data copyin(workspace_d) copyin(devInfo_d)

        jobu = 'A'  ! all m columns of U
        jobvt= 'A'  ! all m columns of VT

        !$acc host_data use_device(a, s, u, vt, devInfo_d, workspace_d, rwork)
          istat = cusolverDnDgesvd (cusolver_H, jobu, jobvt, &
                   m, n, a, lda, s, u, lda, vt, lda, workspace_d, Lwork, rwork, devInfo_d)
        !$acc end host_data
        !$acc wait

          if (istat /= CUSOLVER_STATUS_SUCCESS) &
                   write(*,*) 'cusolverDnDgesvd failed', istat

        ! copyout u, vt, s, devInfo_d to host
        !$acc update self(u, vt, s, devInfo_d)

        ! print
        print *, "After gesvd:  devInfo_d =",  devInfo_d
        if (devInfo_d /= 0) write(*,*) 'SVD operation failed'
        print *, "S ="
        call  print_D_matrix(n, 1, s, 0)
        print *, "U ="
        call  print_D_matrix(m, m, u, lda)
        print *, "VT ="
        call  print_D_matrix(n, n, vt, lda)

        istat = cusolverDnDestroy(cusolver_H)
        if (istat /= CUSOLVER_STATUS_SUCCESS) &
                 write(*,*) 'handle destruction failed'

        ! Step 2: check accuracy of singular value

        diff_s   = 0.d0
        diff_fro = 0.d0
        do i = 1, n
          err = abs( s(i) - s_exact(i) )
          diff_s = max( err, diff_s )
         end do
         print '(1x, a, d17.5)',"|S - S_exact| = ", diff_s

        ! Step 3: measure residual A-U*S*VT

        ! w = s*vT
        !$acc host_data use_device(s, vt, w)
          cublas_status = cublasDdgmm(cublas_H, CUBLAS_SIDE_LEFT, n, n, &
                                                 vt, lda, s, 1, w, lda)
        !$acc end host_data

        if (idbg == 1) then
          !$acc update self(w)
          print *, "W ="
          call  print_D_matrix(n, n, w, lda)
        end if

    ! Original A matrix is copied to device
        !$acc update device(a)

        ! a := -u*w + a
        !$acc host_data use_device(a, u, w)
          cublas_status = cublasDgemm_v2(cublas_H, CUBLAS_OP_N, CUBLAS_OP_N, &
                      m, n, n, minus_one, u, lda, w, lda, one, a, lda)
          cublas_status = cublasDnrm2_v2(cublas_H, lda*n, a, 1, diff_fro)
        !$acc end host_data

        if (idbg == 1) then
          !$acc update self(a)
          print *, "A = "
          call  print_D_matrix(m, n, a, lda)
        end if

        print '(1x, a, d17.5)',"|A - U*S*VTt| = ", diff_fro

  !$acc exit data delete(workspace_d)
!$acc end data


end program main

 コンパイルオプションとして、-acc -Mcudalib=cusolver,cublas -Mcuda -ta=tesla オプションを付けてコンパイルリンクする必要がある。

コンパイル&実行

$ make run
pgf90 -c -O2 -Minfo -ta=tesla,cuda9.0,cc35,cc60 -Mcudalib=cusolver,cublas  -Kieee -acc -Mcuda svd_simgular.f90
print_d_matrix:
     47, Loop not vectorized/parallelized: contains call
main:
     92, Loop unrolled 6 times (completely unrolled)
         Loop not vectorized: loop count too small
     95, Loop unrolled 2 times (completely unrolled)
    104, Generating create(rwork(:),s(:),u(:))
         Generating copy(a(:))
         Generating create(w(:),workspace_d(:),devinfo_d,vt(:))
    117, Generating enter data copyin(workspace_d(:),devinfo_d)
    132, Generating update self(devinfo_d,vt(:),u(:),s(:))
    152, Loop unrolled 2 times (completely unrolled)
    167, Generating update self(w(:))
    173, Generating update device(a(:))
    183, Generating update self(a(:))
    190, Generating exit data delete(workspace_d(:))
===== Program run =====
a.out
 working space (words) =          131
 After gesvd:  devInfo_d =            0
 S =
    1     1    7.065283
    2     1    1.040081
 U =
    1     1   -0.308219
    1     2    0.488195
    1     3    0.816497
    2     1   -0.906133
    2     2    0.110706
    2     3   -0.408248
    3     1   -0.289695
    3     2   -0.865685
    3     3    0.408248
 VT =
    1     1   -0.638636
    1     2   -0.769509
    2     1   -0.769509
    2     2    0.638636
 |S - S_exact| =       0.22204D-15
 W =
    1     1   -4.512143
    1     2   -5.436800
    2     1   -0.800352
    2     2    0.664233
 A =
    1     1    0.000000
    1     2    0.000000
    2     1    0.000000
    2     2    0.000000
    3     1    0.000000
    3     2    0.000000
 |A - U*S*VTt| =       0.23604D-14

9. cuSolverDN: 特異ベクトル計算を含めた特異値分解(by Jacobi) (LAPACK Dgesvd 相当)

 前段では、QR アルゴリズムを使用して解く特異値分解を行う例を示した。ここでは、ヤコビ法を使用した特異値分解を行う例を示す。LAPACK の dgesvd ルーチンに相当する中小の行列を対象とする場合は、ヤコビ法の並列性により高い GPU 性能を享受できるかもしれない。また、ヤコビ法を使用するとある精度までの近似解を得る場合に便利である。以下の例は、F.2. SVD with singular vectors (via Jacobi method) と同じ処理を Fortran OpenACC で書いたものである。なお、密行列 A は、 column-major order で入力する必要がある。

 以下のプログラムでは、3 x 2 密行列の例を示す。SVD は以下の式で表される。ここで、Σ はその min(m、n)対角要素を除いてゼロである m×n 行列であり、U は m×m ユニタリ行列であり、V は n×n ユニタリ行列である。 Σ の対角要素は A の特異値となる。 それらは実数で、負数ではなく、降順で返される。 U と V の最初の min(m、n)列は、A の左右の特異ベクトルとなる。

A = U * Σ * V H

A = 1.0 2.0 4.0 5.0 2.0 1.0

 PGI 17.7 から提供されたcuSOLVER 用の Fortran interface module(cusolverDn.mod) の中に、ヤコビ法を使用した特異値分解ルーチン cusolverDnDgesvdj を使用するための CUDA C/C++ cuSOLVER ライブラリへの Interface が含まれていないことがわかった。そのため自作した。以下に示す module cusolverDn_gesvdj モジュールの中で C ライブラリを使用するための Interface を定義した。これによって、cuSOLVER の cusolverDnDgesvdj ルーチンを Fortran から使用可能となる。

 以下のプログラム例では、cusolverDnSetStream() で stream tag を作成し、CUDAストリームで cusolverDN ライブラリタスクを重複して動作させることができるようにしている。このプログラムは単一ストリームであるため、この行為自体は意味がないが、CUDAストリームを使用することで、多重のライブラリタスクを動作させるようにプログラムを組みことができる。ここでは、CUDAストリームの使い方を示したに過ぎない。CUDA ストリームを使用した際の注意点は、必ず、後続処理で同期を取らなければいけないポイントにおいて、cudaDeviceSynchronize() 関数を使い同期させることが必要である。この関数は、CUDA (Fortran) API を直接使うことが必要であり、OpenACC レベルでの !$acc wait では同期は取れないことに注意して欲しい。

OpenACC + cuSOLVER-cusolverDnDgesvdj (jacobi_svd.f90)

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
! Copyright (c) 2017, NVIDIA CORPORATION. All rights reserved.
!
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
module cusolverDn_gesvdj
  use cudafor
  use cusolverDN

  type gesvdjInfo
    type(c_ptr) :: svInfo
  end type gesvdjInfo

! cusolverDnCreateGesvdjInfo(info)
  interface
     integer(c_int) function cusolverDnCreateGesvdjInfo(info) bind(C,name='cusolverDnCreateGesvdjInfo')
       import gesvdjInfo
       type(gesvdjInfo) :: info
     end function cusolverDnCreateGesvdjInfo
  end interface

! cusolverDnXgesvdjSetTolerance(info, tolerance)
  interface
     integer(c_int) function cusolverDnXgesvdjSetTolerance(info, tolerance) &
     bind(C,name='cusolverDnXgesvdjSetTolerance')
       use iso_c_binding
       import gesvdjInfo
       type(gesvdjInfo)       :: info
       real(c_double), value  :: tolerance
     end function cusolverDnXgesvdjSetTolerance
  end interface

! cusolverDnXgesvdjSetMaxSweeps(info, max_sweeps)
  interface
     integer(c_int) function cusolverDnXgesvdjSetMaxSweeps(info, max_sweeps) &
     bind(C,name='cusolverDnXgesvdjSetMaxSweeps')
       use iso_c_binding
       import gesvdjInfo
       type(gesvdjInfo)       :: info
       integer(c_int),value   :: max_sweeps
     end function cusolverDnXgesvdjSetMaxSweeps
  end interface

! cusolverDnDgesvdj_bufferSize
  interface
     integer(c_int) function cusolverDnDgesvdj_bufferSize(cusolver_Hndl, jobz, econ, &
     m, n, a, lda, s, u, ldu, v, ldv, lwork, info) bind(C,name='cusolverDnDgesvdj_bufferSize')
       use iso_c_binding
       import cusolverDnHandle
       import gesvdjInfo
       type(cusolverDnHandle), value :: cusolver_Hndl
       integer(c_int), value  :: jobz
       integer(c_int),value          :: econ, m, n, lda, ldu, ldv
       real(c_double), device        :: a(*), s(*), u(*), v(*)
       type(gesvdjInfo)              :: info
     end function cusolverDnDgesvdj_bufferSize
  end interface

! cusolverDnDgesvdj
  interface
     integer(c_int) function cusolverDnDgesvdj(cusolver_Hndl, jobz, econ, &
     m, n, a, lda, s, u, ldu, v, ldv, work, lwork, devinfo, info ) bind(C,name='cusolverDnDgesvdj')
       use iso_c_binding
       import cusolverDnHandle
       import gesvdjInfo
       type(cusolverDnHandle), value :: cusolver_Hndl
       integer(c_int), value  :: jobz
       integer(c_int),value          :: econ, m, n, lda, ldu, ldv
       real(c_double), device        :: a(*), s(*), u(*), v(*), work(*)
       integer(c_int)                :: devinfo
       type(gesvdjInfo)              :: info
     end function cusolverDnDgesvdj
  end interface

! cusolverDnXgesvdjGetSweeps
  interface
     integer(c_int) function cusolverDnXgesvdjGetSweeps(cusolver_Hndl, info, executed_sweeps) &
     bind(C,name='cusolverDnXgesvdjGetSweeps')
       use iso_c_binding
       import cusolverDnHandle
       import gesvdjInfo
       type(cusolverDnHandle), value :: cusolver_Hndl
       type(gesvdjInfo)              :: info
       integer(c_int)                :: executed_sweeps
     end function cusolverDnXgesvdjGetSweeps
  end interface

! cusolverDnXgesvdjGetResidual
  interface
     integer(c_int) function cusolverDnXgesvdjGetResidual(cusolver_Hndl, info, residual) &
     bind(C,name='cusolverDnXgesvdjGetResidual')
       use iso_c_binding
       import cusolverDnHandle
       import gesvdjInfo
       type(cusolverDnHandle), value :: cusolver_Hndl
       type(gesvdjInfo)              :: info
       real(c_double)                :: residual
     end function cusolverDnXgesvdjGetResidual
  end interface

end module cusolverDn_gesvdj

module aux_routine
contains

subroutine print_D_matrix(m, n, a, lda)
integer :: m, n, i, j
real(8) :: a(*)

do i=1, m
  do j=1, n
    print "(5(,i5,1x,i5,2x,D20.14,2x))", i, j, a(i+(j-1)*lda)
  end do
end do
end subroutine print_D_matrix
end module aux_routine

program main
  use cudafor           ! cudaStreamCreateWithFlags in the cudafor.mod
  use cusolverDn
  use cusolverDn_gesvdj
  use aux_routine
  implicit none

  integer, parameter :: m=3, n=2
  integer, parameter :: lda = m
  integer, parameter :: idbg = 1
  real(8), parameter :: one = 1.d0, minus_one = -1.d0

  real(8) :: a (lda*n) ! column-major format with dimensions m x n
  real(8) :: u (lda*m) ! m by m unitary matrix
  real(8) :: vt(lda*n) ! n by n unitary matrix
  real(8), allocatable, dimension(:) :: s      ! singular value
  real(8), allocatable :: workspace_d(:)
  real(8) :: S_exact(n)
  real(8) :: err, diff_s
  integer :: devInfo_d
  integer :: istat, Lwork
  integer :: i

! configuration of GESVDJ
  real(8) :: tol
  integer :: max_sweeps
  integer, parameter :: econ=0 ! econ = 1 for economy size

! numerical results of GESVDJ
  real(8) :: residual
  integer :: executed_sweeps

! cuda library variables
  type(cusolverDnHandle)         :: cusolver_H
  type(gesvdjInfo)               :: gesvdj_params
  integer                        :: jobz
  integer(kind=cuda_stream_kind) :: stream
  integer                        :: cusolver_status
  integer                        :: cublas_status

!        | 1 2  |
!    a = | 4 5  |
!        | 2 1  |
!

! cusolverDnDgesvdj() argument list
!       jobz   CUSOLVER_EIG_MODE_NOVECTOR: compute singular values only
!              CUSOLVER_EIG_MODE_VECTOR: compute singular value and singular vectors
!       econ   econ = 1 for economy size
!       m      nubmer of rows of A, 0 <= m
!       n      number of columns of A, 0 <= n
!       a      m-by-n
!       lda   leading dimension of A
!       s      min(m,n)
!              the singular values in descending order
!       u      m-by-m if econ = 0
!              m-by-min(m,n) if econ = 1
!       ldu   leading dimension of U, ldu >= max(1,m)
!       vt    n-by-n if econ = 0
!             n-by-min(m,n) if econ = 1
!       ldv   leading dimension of V, ldv >= max(1,n)

! Initialize
  allocate(s(1:min(m,n)))
  Lwork      = 0
  tol        = 1.d-7
  max_sweeps = 15
  residual   = 0.d0
  executed_sweeps = 0
  jobz       = CUSOLVER_EIG_MODE_VECTOR
  print *,"Eigmode=", jobz

  print *, "tol =", tol
  print *, "max sweeps =", max_sweeps
  print *, "econ = ", econ

! set a data
  a(1:lda*n)  = (/1.0, 4.0, 2.0, 2.0, 5.0, 1.0/)
  print *, "A ="
  call  print_D_matrix(m, n, a, lda)

! Exact SV numbers
   S_exact(:) =(/7.065283497082729d0, 1.040081297712078d0/)

!$acc data copy(a) create(s, u, vt, devInfo_d)

        ! Step 1: create handle and stream

    cusolver_status = cusolverDnCreate(cusolver_H)
    if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) &
        print *,  'handle creation failed'

    ! set a CUDA Lib stream -- when performing several small independent computations(非同期モードで実行する)
    ! you explicitly need to set cudaDeviceSynchronize() at an appropriate place by CUDA-API

    istat = cudaStreamCreateWithFlags(stream, cudaStreamNonBlocking)
    istat = cusolverDnSetStream(cusolver_H, stream)

    ! step 2 : configure gesvdj, Tolerance, Max sweeps

    istat = cusolverDnCreateGesvdjInfo(gesvdj_params)
    istat = cusolverDnXgesvdjSetTolerance(gesvdj_params, tol)
    istat = cusolverDnXgesvdjSetMaxSweeps(gesvdj_params, max_sweeps)

        ! Step 3: query workspace of SVD, allocate workspace_d(Lwork)

  !$acc host_data use_device(a, s, u, vt)
        istat = cusolverDnDgesvdj_bufferSize(cusolver_H, jobz, econ, &
            m, n, a, lda, s, u, lda, vt, lda, Lwork, gesvdj_params)
  !$acc end host_data

        if (istat /= CUSOLVER_STATUS_SUCCESS) &
                 write(*,*) 'cusolverDnDgesvdj_bufferSize failed'
        print *, "working space (words) =", Lwork
        allocate(workspace_d(Lwork))

  !$acc enter data copyin(workspace_d) copyin(devInfo_d)

    ! step 4 : compute SVD

        !$acc host_data use_device(a, s, u, vt, devInfo_d, workspace_d)
          istat = cusolverDnDgesvdj (cusolver_H, jobz, econ, &
                   m, n, a, lda, s, u, lda, vt, lda, workspace_d, Lwork, devInfo_d, gesvdj_params)
        !$acc end host_data

    ! Synchronize for cusolverDnDgesvdj completion (必ず必要)
     istat =cudaDeviceSynchronize()                 ! CUDA-API, it can't use a OpenACC wait directive

      if (istat /= CUSOLVER_STATUS_SUCCESS) &
         write(*,*) 'cusolverDnDgesvd failed', istat
    ! copyout u, vt, s, devInfo_d to host
      !$acc update self(s, u, vt, devInfo_d)

        ! Step 5: measure residual |A - U*S*V**H|_F

    istat = cusolverDnXgesvdjGetSweeps(cusolver_H, gesvdj_params, executed_sweeps)
    istat = cusolverDnXgesvdjGetResidual(cusolver_H, gesvdj_params, residual)

  !$acc exit data delete(workspace_d)
!$acc end data

        ! print out

        print *, "After gesvd:  devInfo_d =",  devInfo_d
        if (devInfo_d == 0)  then
          write(*,*) 'SVD operation was converged'
        else if (devInfo_d < 0) then
          write(*,*) -devInfo_d, "d-th parameter is wrong"
          stop
        else
          write(*,*) "WARNING: gesvdj does not converge ", devInfo_d
        end if

        print *, "Singular values ="
        call  print_D_matrix(n, 1, s, 0)
        print *, "U = left singular vectors"
        call  print_D_matrix(m, m, u, lda)
        print *, "VT = right singular vectors"
        call  print_D_matrix(n, n, vt, lda)

        ! check accuracy of singular value

        diff_s = 0.d0
        do i = 1, n
          err = abs( s(i) - s_exact(i) )
          diff_s = max( err, diff_s )
         end do
         print '(1x, a, d17.5)',"|S - S_exact|  = ", diff_s

        ! measure residual A-U*S*VT**H

        print '(1x, a, d17.5)',"|A - U*S*V**H| = ", residual
        print '(1x, a, I5   )',"number of executed sweeps  = ", executed_sweeps


        istat = cusolverDnDestroy(cusolver_H)
        if (istat /= CUSOLVER_STATUS_SUCCESS) &
                 write(*,*) 'handle destruction failed'

end program main

 コンパイルオプションとして、-acc -Mcudalib=cusolver -Mcuda -ta=tesla オプションを付けてコンパイルリンクする必要がある。

コンパイル&実行

$ make run
pgf90 -c -O2 -Minfo -ta=tesla,cuda9.0,cc35,cc60 -Mcudalib=cusolver,cublas -Kieee -acc -Mcuda  jacobi_svd.f90
print_d_matrix:
    126, Loop not vectorized/parallelized: contains call
main:
    210, Loop unrolled 6 times (completely unrolled)
         Loop not vectorized: loop count too small
    215, Loop unrolled 2 times (completely unrolled)
    217, Generating create(devinfo_d,s(:),vt(:),u(:))
         Generating copy(a(:))
    249, Generating enter data copyin(workspace_d(:),devinfo_d)
    264, Generating update self(vt(:),u(:),s(:),devinfo_d)
    271, Generating exit data delete(workspace_d(:))
    296, Loop unrolled 2 times (completely unrolled)
pgf90 -o  a.out jacobi_svd.o -ta=tesla,cuda9.0,cc35,cc60 -Mcudalib=cusolver -acc -Mcuda
===== Program run =====
a.out
 Eigmode=            1
 tol =   9.9999999999999995E-008
 max sweeps =           15
 econ =             0
 A =
    1     1  0.10000000000000D+01
    1     2  0.20000000000000D+01
    2     1  0.40000000000000D+01
    2     2  0.50000000000000D+01
    3     1  0.20000000000000D+01
    3     2  0.10000000000000D+01
 working space (words) =         3168
 After gesvd:  devInfo_d =            0
 SVD operation was converged
 Singular values =
    1     1  0.70652834970827D+01
    2     1  0.10400812977121D+01
 U = left singular vectors
    1     1  0.30821892063279D+00
    1     2  -.48819507401990D+00
    1     3  0.81649658092773D+00
    2     1  0.90613333377729D+00
    2     2  -.11070553170904D+00
    2     3  -.40824829046386D+00
    3     1  0.28969549251172D+00
    3     2  0.86568461633075D+00
    3     3  0.40824829046386D+00
 VT = right singular vectors
    1     1  0.63863583713640D+00
    1     2  0.76950910814953D+00
    2     1  0.76950910814953D+00
    2     2  -.63863583713640D+00
 |S - S_exact|  =       0.88818D-15
 |A - U*S*V**H| =       0.36550D-14
 number of executed sweeps  =     1

10. cuSolverSP: 並列ダイレクト・スパース・ソルバー (CSR QR分解)

 PGI 17.7 においては、 cuSolverDN type ルーチン群に対する Fortran MODULE インタフェースが提供されたが、cuSolverSP ならびに cuSolverRF 型のライブラリに関してはまだ提供されていない。今後、順次提供されるものと思うが、ここでは、cuSolverSP 用の Fortran MODULE インタフェースを作成して、スパース QR 分解を行うプログラム例を提示する。ここで扱う例は、NVIDIA cuSOLVER の reference manual の中に例示している Batched Sparse QR example 1 を筆者が Fortran 言語で書き換えたものを示す。なお、ここでのライブラリは、複数の線形システムを解く「バッチ処理用ルーチン」を使用している。ここで使用されているバッチ・ソルバーに関する技術的な説明は、NVIDIA Parallel FORALL:Parallel Direct Solvers with cuSOLVER: Batched QR のブログでも紹介されている。

A i x i = b i

        | 1                |   Marices A (small perturbations 摂動) を複数用意する。
    A = |       2          |  バッチ処理を行うため
        |            3     |
        | 0.1  0.1  0.1  4 |

 このプログラムを、CUDA Fortran ならびに OpenACC Fortran から利用する場合を掲示する。

 配列、変数のデバイスメモリのマネージメントをどちら側のプログラミングモデルで行うかによって、プログラミングの対応は異なるが、ここでは、 CUDA Fortran 側でデバイス配列の宣言を行った場合の方法、OpenACC 側で配列の宣言を行った場合を以下に例示する。

Batched Sparse QR example with CUDA Fortran

 PGI 17.7 時点では CUDA C 言語で開発されている cuSOLVER SP(スパース)系のライブラリへの Fortran Module インタフェースが提供されていないため、以下のように自作(cusolver_mod.cuf)した。C と Fortran のバインディングを取り持つためのインタフェースである。将来的に、、PGI コンパイルシステムの中に、cuSOLVER SP の Fortran MODULE (多分 use cusplverSP で使用できるはず)が提供された場合は、以下の cusolver_mod.cuf ルーチンは必要なくなる。まず最初に、CUDA Fortran で coding した例を提示する。この後で、このプログラム変更し OpenACC を使って coding した例を示すことにする。なお、使用したバッチ用ライブラリルーチン(cusolverSpXcsrqrBatched())の仕様は、こちらを参考にして欲しい。この関数の引数仕様の説明の中で、host or device の区別があるが、これは、どちら側に置かれた変数かを示すものである。これを見ることによって、デバイス変数として対処する必要があるかどうかが分かる。

cuSOLVER SP QR 用の Fortran Module Interface(cusolver_mod.cuf)

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
module cuSOLVER_SP_QR
  use cudafor
  use cusparse

  enum, bind(C) ! cusolverStatus_t
    enumerator :: CUSOLVER_STATUS_SUCCESS=0
    enumerator :: CUSOLVER_STATUS_NOT_INITIALIZED=1
    enumerator :: CUSOLVER_STATUS_ALLOC_FAILED=2
    enumerator :: CUSOLVER_STATUS_INVALID_VALUE=3
    enumerator :: CUSOLVER_STATUS_ARCH_MISMATCH=4
    enumerator :: CUSOLVER_STATUS_MAPPING_ERROR=5
    enumerator :: CUSOLVER_STATUS_EXECUTION_FAILED=6
    enumerator :: CUSOLVER_STATUS_INTERNAL_ERROR=7
    enumerator :: CUSOLVER_STATUS_MATRIX_TYPE_NOT_SUPPORTED=8
    enumerator :: CUSOLVER_STATUS_NOT_SUPPORTED = 9
    enumerator :: CUSOLVER_STATUS_ZERO_PIVOT=10
    enumerator :: CUSOLVER_STATUS_INVALID_LICENSE=11
  end enum

  enum, bind(C) ! cusolverEigType_t
    enumerator :: CUSOLVER_EIG_TYPE_1=1
    enumerator :: CUSOLVER_EIG_TYPE_2=2
    enumerator :: CUSOLVER_EIG_TYPE_3=3
  end enum

  type cusolverSpContext
        type(c_ptr)     :: cusolverSpHandle
  end type cusolverSpContext
  type csrqrInfo
    type(c_ptr) :: Info
  end type csrqrInfo
  type cusolverSpHandle
    type(c_ptr) :: SpHandle
  end type cusolverSpHandle

! cusolverSpCreate
  interface
     integer(c_int) function cusolverSpCreate(handle) bind(C,name='cusolverSpCreate')
       import cusolverSpHandle
       type(cusolverSpHandle) :: handle
     end function cusolverSpCreate
  end interface
! cusolverSpDestroy
  interface
     integer(c_int) function cusolverSpDestroy(handle) bind(C,name='cusolverSpDestroy')
       import cusolverSpHandle
       type(cusolverSpHandle), value :: handle
     end function cusolverSpDestroy
  end interface

! cusolverSpCreateCsrqrInfo
  interface
     integer(c_int) function cusolverSpCreateCsrqrInfo(info) bind(C,name='cusolverSpCreateCsrqrInfo')
       import csrqrInfo
       type(csrqrInfo) :: info
     end function cusolverSpCreateCsrqrInfo
  end interface
! cusolverSpDestroyCsrqrInfo
  interface
     integer(c_int) function cusolverSpDestroyCsrqrInfo(info) bind(C,name='cusolverSpDestroyCsrqrInfo')
       import csrqrInfo
       type(csrqrInfo) :: info
     end function cusolverSpDestroyCsrqrInfo
  end interface

! cusolverSpDcsrqrBufferInfoBatched
interface cusolverSpDcsrqrBufferInfoBatched
  integer(c_int) function cusolverSpDcsrqrBufferInfoBatched( cusolver_Hndl, m, n, nnzA, &
     descrA, csrValA, csrRowPtrA, csrColIndA, BatchSize, info, internalDataInBytes, workspaceInBytes ) &
     bind(C,name="cusolverSpDcsrqrBufferInfoBatched")
  use iso_c_binding
  import cusolverSpHandle, cusparseMatDescr, csrqrInfo
  type(cusolverSpHandle), value :: cusolver_Hndl
  type(cusparseMatDescr), value :: descrA
  type(csrqrInfo),        value :: info
  integer(c_int), value         :: m, n, nnzA, batchSize
  real(c_double), device        :: csrValA(*)
!! !pgi$ ignore_tkr (d) csrRowPtrA,  (d) csrColIndA
  integer(c_int),device         :: csrRowPtrA(*), csrColIndA(*)
!! !pgi$ ignore_tkr (k) internalDataInBytes, (k) workspaceInBytes
  integer(8)                    :: internalDataInBytes, workspaceInBytes
  end function  cusolverSpDcsrqrBufferInfoBatched
end interface cusolverSpDcsrqrBufferInfoBatched


! cusolverSpXcsrqrAnalysis
!interface cusolverSpXcsrqrAnalysis
!  integer(c_int) function cusolverSpXcsrqrAnalysis( cusolver_Hndl, m, n, nnzA, &
!                 descrA, csrRowPtrA, csrColIndA,  info ) &
!                 bind(C,name="cusolverSpXcsrqrAnalysis")
!  use iso_c_binding
!  import cusolverSpHandle, cusparseMatDescr, csrqrInfo
!  type(cusolverSpHandle), value :: cusolver_Hndl
!  type(cusparseMatDescr), value :: descrA
!  type(csrqrInfo),        value :: info
!  integer(c_int), value         :: m, n, nnzA
!  !pgi$ ignore_tkr (d) csrRowPtrA,  (d) csrColIndA
!  integer(c_int),device         :: csrRowPtrA(*), csrColIndA(*)
!  end function  cusolverSpXcsrqrAnalysis
!end interface cusolverSpXcsrqrAnalysis

! cusolverSpXcsrqrAnalysisBatched
interface cusolverSpXcsrqrAnalysisBatched
  integer(c_int) function cusolverSpXcsrqrAnalysisBatched( cusolver_Hndl, m, n, nnzA, &
                 descrA, csrRowPtrA, csrColIndA,  info ) &
                 bind(C,name="cusolverSpXcsrqrAnalysisBatched")
  use iso_c_binding
  import cusolverSpHandle, cusparseMatDescr, csrqrInfo
  type(cusolverSpHandle), value :: cusolver_Hndl
  type(cusparseMatDescr), value :: descrA
  type(csrqrInfo),        value :: info
  integer(c_int), value         :: m, n, nnzA
!!  !pgi$ ignore_tkr (d) csrRowPtrA,  (d) csrColIndA
  integer(c_int),device         :: csrRowPtrA(*), csrColIndA(*)
  end function  cusolverSpXcsrqrAnalysisBatched
end interface cusolverSpXcsrqrAnalysisBatched

! cusolverSpDcsrqrsvBatched
interface cusolverSpDcsrqrsvBatched
  integer(c_int) function cusolverSpDcsrqrsvBatched( cusolver_Hndl, m, n, nnzA, &
                 descrA, csrValA, csrRowPtrA, csrColIndA, b, x, BatchSize, info, pBuffer ) &
                 bind(C,name="cusolverSpDcsrqrsvBatched")
  use iso_c_binding
  import cusolverSpHandle, cusparseMatDescr, csrqrInfo
  type(cusolverSpHandle), value :: cusolver_Hndl
  type(cusparseMatDescr), value :: descrA
  type(csrqrInfo),        value :: info
  integer(c_int), value         :: m, n, nnzA, Batchsize
  real(c_double), device        :: csrValA(*)
!!  !pgi$ ignore_tkr (d) csrRowPtrA,  (d) csrColIndA
  integer(c_int),device         :: csrRowPtrA(*), csrColIndA(*)
  real(c_double), device        :: b(*), x(*)
!!  !pgi$ ignore_tkr pBuffer
  character(c_char), device     :: pBuffer(*)

  end function cusolverSpDcsrqrsvBatched
end interface cusolverSpDcsrqrsvBatched

end module cuSOLVER_SP_QR

 次に、Batched Sparse QR 分解を行うためのドライバルーチン(main.cuf) を例示する。なお、行列の CSR フォーマットは、row-major order で入力される。

Batched Sparse QR メインプログラム(main.cuf)

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
program main
  use cudafor
  use cusparse
  use cusolver_SP_QR
  implicit none

  integer, parameter :: idbg = 2 ! debug write enable
  integer, parameter :: m = 4, nnzA = 7
  integer, parameter :: batchsize = 4

  integer                :: cusolver_status
  integer                :: status1, status2, status3, status4, status5
  type(cusolverSpHandle) :: cusolver_Hndl
  type(cusparseMatDescr) :: descrA
  type(csrqrInfo)        :: info

  integer                ::   csrRowPtrA(m+1),   csrColIndA(nnzA)
  integer, device        :: d_csrRowPtrA(m+1), d_csrColIndA(nnzA)

  double precision       ::  csrValA(nnzA),             b(m)
  double precision, device:: d_csrValA(nnzA*batchsize), d_b(m*batchsize), d_x(m*batchsize)

  double precision       :: csrValABatch(nnzA*batchsize), bBatch(m*batchsize), xbatch(m*batchsize)

  integer(8)             :: size_internal, size_qr
  character(c_char), device, allocatable :: pBuffer(:)

  ! locals
  integer                :: i, j, colidx, batchId
  integer                :: ierr_code
  double precision       :: eps, Areg, breg, xreg

  ! Result
  integer                :: row, baseA, start, end, col
  double precision       :: csrValAj
  double precision       :: sup_res, r, Ax

  ! Randam Number
  double precision :: rnd

  !      | 1                |
  !  A = |       2          |
  !      |            3     |
  !      | 0.1  0.1  0.1  4 |
  !
  ! CSR of A is based 1 indexing  <==== caution (CUSPARSE_INDEX_BASE_ONE)
  !
  !  b = [1 1 1 1]

! step 1: data setting
    csrRowPtrA(1:m+1)   = (/1, 2, 3, 4, 8/)
    csrColIndA(1:nnzA)  = (/1, 2, 3, 1, 2, 3, 4/)
    csrValA(1:nnzA)     = (/1.0d0, 2.0d0, 3.0d0, 0.1d0, 0.1d0, 0.1d0, 4.0d0/)
    b(1:m)              = (/1.0d0, 1.0d0, 1.0d0, 1.0d0/)

    print *, "sizeof(csrRowPtrA, d_csrRowPtrA,  csrColIndA, d_csrColIndA)", &
              sizeof(csrRowPtrA(1)), sizeof(d_csrRowPtrA(1)),  sizeof(csrColIndA(1)), sizeof(d_csrColIndA(1))

! prepare Aj and bj on host
   do colidx = 1, nnzA
        Areg = csrValA(colidx)
        do batchId = 1, batchSize
           call random_number(rnd)
           eps = dble( mod(rnd,100.d0) + 1.0d0 ) * 1.0D-4
           csrValABatch((batchId-1)*nnzA + colidx) = Areg + eps
        enddo
   enddo

   do j = 1, m
        breg = b(j)
        do batchId = 1, batchSize
           call random_number(rnd)
           eps = dble( mod(rnd,100.d0) + 1.0d0 ) * 1.0D-4
           bBatch((batchId-1)*m + j) = breg + eps;
        enddo
   enddo

    if (idbg == 2) print *,"csrValABatch", csrValABatch
    if (idbg == 2) print *,"bBatch", bBatch

! step 2: create cusolver handle, qr info and matrix descriptor

    status1 = cusolverSpCreate(cusolver_Hndl)
          if (idbg == 1) print *, "status1 = ", status1
    status2 = cusparseCreateMatDescr(descrA)
          if (idbg == 1) print *, "status2 = ", status2
    status3 = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)
          if (idbg == 1) print *, "status3 = ", status3
    status4 = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE) ! base=1
          if (idbg == 1) print *, "status4 = ", status4
    status5 = cusolverSpCreateCsrqrInfo(info)
          if (idbg == 1) print *, "status5 = ", status5

! step 3: copy Aj and bj to device
    d_csrValA   = csrValABatch
    d_csrColIndA= csrColIndA
    d_csrRowPtrA= csrRowPtrA
    d_b         = bBatch

! step 4: symbolic analysis

    cusolver_status = cusolverSpXcsrqrAnalysisBatched( &
                      cusolver_Hndl, m, m, nnzA, descrA, d_csrRowPtrA, d_csrColIndA, info )
    if (idbg == 1) then
      if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) print *, "step4 cusolver_status = ", cusolver_status
      print *, 'error code = ' ,cusolver_status, cudaGetErrorString( cudaGetLastError() )
    end if

! step 5: prepare working space

    cusolver_status = cusolverSpDcsrqrBufferInfoBatched( &
                      cusolver_Hndl, m, m, nnzA, descrA, d_csrValA, d_csrRowPtrA, d_csrColIndA, &
                      batchsize, info, size_internal, size_qr)

    if (idbg == 1) then
      if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) print *, "step5 cusolver_status = ", cusolver_status
    end if

    print *, "numerical factorization needs internal data (bytes) =", size_internal
    print *, "numerical factorization needs working space (bytes) =", size_qr

    allocate(pBuffer(size_qr), stat=ierr_code)  ! Bytes
      if (idbg == 1) print *, "step5: alloc ierr_code = ", ierr_code

! step 6: numerical factorization

    cusolver_status = cusolverSpDcsrqrsvBatched( &
                      cusolver_Hndl, m, m, nnzA, &
                      descrA, d_csrValA, d_csrRowPtrA, d_csrColIndA, &
                      d_b, d_x, &
                      batchSize, &
                      info, &
                      pBuffer)

    if (idbg == 1) then
      if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) print *, "step6 cusolver_status = ", cusolver_status
      print *, 'error code = ' ,cusolver_status, cudaGetErrorString( cudaGetLastError() )
    end if

! step 7: check residual
!         xBatch = [x0, x1, x2, ...]

!   copy into CPU mem
    xbatch = d_x

    if (idbg == 2) print *, "xbatch=",xbatch

    print *, "******* Fortran Index Base is adjusted : BaseA = 0 ********************* "
    baseA = 0

    print *,  " ============  Residual CHECK ==============="
    do batchId = 1 , batchsize
     ! measure |bj - Aj*Xj|
      sup_res = 0.d0
      do row = 1, m
        start = csrRowPtrA(row  ) - baseA
        end   = csrRowPtrA(row+1) - baseA
        Ax = 0.d0

        do colidx =  start, end-1
          col = csrColIndA(colidx) - baseA
          Areg = csrValAbatch((batchId-1)*nnzA + colidx)   ! Areg = csrValAj(colidx)
          Xreg = xBatch((batchId-1)*m + col)               ! Xreg = xj(col)
          Ax = Ax  + Areg * Xreg
        end do

        r = bBatch((batchID-1)*m + row) - Ax               ! r = bj(row) -Ax
        ! sup_res = (sup_res > fabs(r))? sup_res : fabs(r);
        sup_res = max( abs(r), sup_res )

!       if (sup_res > abs(r))  then
!         sup_res = sup_res
!       else
!         sup_res = abs(r)
!       end if
      end do
      print *, "batchId =",  batchId," sup|bj - Aj*xj| =", sup_res
    end do


    print *, " ============  Result X ==============="
    do batchId = 1 , batchsize
       do row = 1, m
          print *, "x(", batchId, ")= [", row,"]",  xBatch((batchId-1)*m + row)
       end do
       print *, ""
    end do
end

 以下の Makefile を copy & paste しただけでは make 実行エラーとなる。makefile 記述上の[ tab ]デリミタが空白に変換されているため、明示的に空白を[ tab ]に変更する必要がある。(以下、一例)

main   : main.f90
[ tab ]  pgf90 main.f90

Makefile(コンパイル方法)

CC      = nvcc
CFLAGS  =
FC      = pgf90
FFLAGS  = -O3 -Minfo -Mcuda=cuda8.0,cc35,cc60 -Mcudalib=cusparse
LDFLAGS = -Mcuda=cuda8.0,cc35,cc60

TARGET  = SPsolve
INCLUDE =
LIBS    =  -lcusolver      # -lgomp
# Ubuntu上では、以下のライブラリパスと-lgomp が必要となる場合あり
# NVIDIA cuSOLVER ライブラリは gcc の openmp ライブラリの一部を必要とする
#LDIR   =  -L/usr/lib/gcc/x86_64-linux-gnu/5


modfile =  cusolver_mod.cuf
SRC1    =  main.cuf

# サフィックスルール適用対象の拡張子の定義
.SUFFIXES: .cuf .cu .o

MODOBJ  = $(modfile:.cuf=.o)
OBJ1    = $(SRC1:.cuf=.o)
MOD1    = cusolver_sp_qr.mod

RUN     : $(TARGET)
                  @echo "===== Program run ====="
                  $(TARGET)
$(TARGET): $(MODOBJ) $(OBJ1)
                  $(FC) -o  $@ $^ $(LDFLAGS) $(LDIR) $(LIBS)


$(MOD1) : $(modfile)
                  $(FC) $(FFLAGS) -c $?
$(OBJ1): $(MOD1) $(SRC1)

.cuf.o :
                  $(FC) -c $(FFLAGS) $<
.cu.o :
                  $(CC) -c $(CFLAGS) $<

.PHONY: clean
clean:
                @echo 'Cleaning up...'
            rm -f *.o  *.mod $(TARGET) *~

コンパイル&実行結果

[kato@photon32 QRtest1]$ make
pgf90 -c -O3 -Minfo -Mcuda=cuda8.0,cc35,cc60 -Mcudalib=cusparse  cusolver_mod.cuf
pgf90 -c -O3 -Minfo -Mcuda=cuda8.0,cc35,cc60 -Mcudalib=cusparse  main.cuf
main:
     71, Loop unrolled 5 times (completely unrolled)
     72, Array assignment / Forall at line 73 fused
         Loop unrolled 7 times (completely unrolled)
         Loop not vectorized: loop count too small
     74, Loop unrolled 4 times (completely unrolled)
     82, Loop not vectorized/parallelized: contains call
         FMA (fused multiply-add) instruction(s) generated
     91, Loop not vectorized/parallelized: contains call
         FMA (fused multiply-add) instruction(s) generated
    172, Loop not vectorized/parallelized: contains call
    175, Outer loop unrolled 4 times (completely unrolled)
    180, Unrolled inner loop 4 times
         Generated 2 prefetch instructions for the loop
         Unrolled inner loop 4 times
         Generated 2 prefetch instructions for the loop
         Unrolled inner loop 4 times
         Generated 2 prefetch instructions for the loop
         Unrolled inner loop 4 times
         Generated 2 prefetch instructions for the loop
         FMA (fused multiply-add) instruction(s) generated
    203, Loop not vectorized/parallelized: contains call
pgf90 -o SPsolve cusolver_mod.o main.o -Mcuda=cuda8.0,cc35,cc60 -lcusolver 
===== Program run =====
SPsolve
 sizeof(csrRowPtrA, d_csrRowPtrA,  csrColIndA, d_csrColIndA)            4
            4            4            4
 csrValABatch    1.000190792295671         2.000179731460909
    3.000132064769846        0.1001762263696543        0.1001628016430586
   0.1001500500225553         4.000169009995976         1.000119069206783
    2.000163682999292         3.000144817609972        0.1001437479734487
   0.1001670186653253        0.1001425331039673         4.000182114792401
    1.000106716529557         2.000158878275024         3.000165795016470
   0.1001463869055930        0.1001628171802322        0.1001307016646178
    4.000187350713939         1.000180008450829         2.000115906560827
    3.000160754590159        0.1001700415717258        0.1001531034351497
   0.1001216954552562         4.000196496681088
 bBatch    1.000182450045416         1.000166538252814
    1.000186841054682         1.000110375401338         1.000145236365631
    1.000145251688474         1.000116588275369         1.000155851050659
    1.000125862765850         1.000112255030956         1.000187784792441
    1.000198703067770         1.000133737619532         1.000188679946082
    1.000142956695574         1.000175135506380
 numerical factorization needs internal data (bytes) =                      864
 numerical factorization needs working space (bytes) =                   246528
 xbatch=   0.9999916593410789        0.5000383328163729
   0.3333809377259567        0.2041067826368187         1.000026164043516
   0.5000317024998334        0.3333561041470322        0.2041256247608860
    1.000019144193291        0.5000164066433919        0.3333775041545496
   0.2041369446005358        0.9999537374963444        0.5000653595450339
   0.3333631223478223        0.2041274097506157
 ******* Fortran Index Base is adjusted : BaseA = 0 *********************
  ============  Residual CHECK ===============
 batchId =            1  sup|bj - Aj*xj| =   8.8817841970012523E-016
 batchId =            2  sup|bj - Aj*xj| =   6.6613381477509392E-016
 batchId =            3  sup|bj - Aj*xj| =   1.1102230246251565E-015
 batchId =            4  sup|bj - Aj*xj| =   4.4408920985006262E-016
  ============  Result X ===============
 x(            1 )= [            1 ]   0.9999916593410789
 x(            1 )= [            2 ]   0.5000383328163729
 x(            1 )= [            3 ]   0.3333809377259567
 x(            1 )= [            4 ]   0.2041067826368187

 x(            2 )= [            1 ]    1.000026164043516
 x(            2 )= [            2 ]   0.5000317024998334
 x(            2 )= [            3 ]   0.3333561041470322
 x(            2 )= [            4 ]   0.2041256247608860

 x(            3 )= [            1 ]    1.000019144193291
 x(            3 )= [            2 ]   0.5000164066433919
 x(            3 )= [            3 ]   0.3333775041545496
 x(            3 )= [            4 ]   0.2041369446005358

 x(            4 )= [            1 ]   0.9999537374963444
 x(            4 )= [            2 ]   0.5000653595450339
 x(            4 )= [            3 ]   0.3333631223478223
 x(            4 )= [            4 ]   0.2041274097506157

Batched Sparse QR example with OpenACC Fortran

 cuSOLVER SP(スパース)系のライブラリへの Fortran Module は、上記で使用した cusolver_mod.cuf をそのまま使用するが、ファイルの suffix を cusolver_mod.f90 に変更していただきたい。OpenACC を使用して Batched Sparse QR 分解を行うためのドライバルーチン(main.f90) を以下に例示する。

 OpenACC のディレクティブで留意すべきことは、CUDA Fortran or CUDA C で記述された(ライブラリ)ルーチンをコールする際に渡す実引数が「デバイスポインタ」である場合、これをコンパイルに伝えるために、host_data use_device() を使う。このディレクティブの使い方さえ誤りがなければ、既存のプログラムベースのものを OpenACC を使ってポーティングすることは大きな負担ではないはずだ。

Batched Sparse QR メインプログラム(main.f90)

! Copyright (C) HPC WORLD (Prometech Software, Inc. and GDEP Solutions, Inc.) All rights reserved.
!
! Permission is hereby granted, free of charge, to any person obtaining a
! copy of this software and associated documentation files (the "Software"),
! to deal in the Software without restriction, including without limitation
! the rights to use, copy, modify, merge, publish, distribute, sublicense,
! and/or sell copies of the Software, and to permit persons to whom the
! Software is furnished to do so, subject to the following conditions:
!
! The above copyright notice and this permission notice shall be included in
! all copies or substantial portions of the Software.
!
! THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
! IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
! FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
! THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
! LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
! FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
! DEALINGS IN THE SOFTWARE.
!
program main
  use cusparse
  use cusolver_SP_QR
  implicit none

  integer, parameter :: idbg = 2 ! debug write enable
  integer, parameter :: m = 4, nnzA = 7
  integer, parameter :: batchsize = 4

  integer                :: cusolver_status
  integer                :: status1, status2, status3, status4, status5
  type(cusolverSpHandle) :: cusolver_Hndl
  type(cusparseMatDescr) :: descrA
  type(csrqrInfo)        :: info

  integer                :: csrRowPtrA(m+1),   csrColIndA(nnzA)
  double precision       :: csrValA(nnzA),     b(m)

  double precision       :: csrValABatch(nnzA*batchsize), bBatch(m*batchsize), xbatch(m*batchsize)

  integer(8)             :: size_internal, size_qr
  character(c_char), allocatable :: pBuffer(:)

  ! locals
  integer                :: i, j, colidx, batchId
  integer                :: ierr_code
  double precision       :: eps, Areg, breg, xreg

  ! Result
  integer                :: row, baseA, start, end, col
  double precision       :: csrValAj
  double precision       :: sup_res, r, Ax

  ! Randam Number
  double precision :: rnd

  !      | 1                |
  !  A = |       2          |
  !      |            3     |
  !      | 0.1  0.1  0.1  4 |
  !
  ! CSR of A is based 1 indexing  <==== caution (CUSPARSE_INDEX_BASE_ONE)
  !
  !  b = [1 1 1 1]

! step 1: data setting
    csrRowPtrA(1:m+1)   = (/1, 2, 3, 4, 8/)
    csrColIndA(1:nnzA)  = (/1, 2, 3, 1, 2, 3, 4/)
    csrValA(1:nnzA)     = (/1.0d0, 2.0d0, 3.0d0, 0.1d0, 0.1d0, 0.1d0, 4.0d0/)
    b(1:m)              = (/1.0d0, 1.0d0, 1.0d0, 1.0d0/)

    print *, "sizeof(csrRowPtrA, csrRowPtrA,  csrColIndA, csrColIndA)", &
              sizeof(csrRowPtrA(1)), sizeof(csrRowPtrA(1)),  sizeof(csrColIndA(1)), sizeof(csrColIndA(1))

!!call RANDOM_SEED()
! prepare Aj and bj on host
   do colidx = 1, nnzA
        Areg = csrValA(colidx)
        do batchId = 1, batchSize
           call random_number(rnd)
           eps = dble( mod(rnd,100.d0) + 1.0d0 ) * 1.0D-4
           csrValABatch((batchId-1)*nnzA + colidx) = Areg + eps
        enddo
   enddo

   do j = 1, m
        breg = b(j)
        do batchId = 1, batchSize
           call random_number(rnd)
           eps = dble( mod(rnd,100.d0) + 1.0d0 ) * 1.0D-4
           bBatch((batchId-1)*m + j) = breg + eps;
        enddo
   enddo

   xbatch = 0.d0  ! initilize

    if (idbg == 2) print *,"csrValABatch", csrValABatch
    if (idbg == 2) print *,"bBatch", bBatch

! step 2: create cusolver handle, qr info and matrix descriptor

    status1 = cusolverSpCreate(cusolver_Hndl)
          if (idbg == 1) print *, "status1 = ", status1
    status2 = cusparseCreateMatDescr(descrA)
          if (idbg == 1) print *, "status2 = ", status2
    status3 = cusparseSetMatType(descrA, CUSPARSE_MATRIX_TYPE_GENERAL)
          if (idbg == 1) print *, "status3 = ", status3
    status4 = cusparseSetMatIndexBase(descrA, CUSPARSE_INDEX_BASE_ONE) ! base=1
          if (idbg == 1) print *, "status4 = ", status4
    status5 = cusolverSpCreateCsrqrInfo(info)
          if (idbg == 1) print *, "status5 = ", status5

! step 3: copy Aj and bj to device
!   d_csrValA   = csrValABatch
!   d_csrColIndA= csrColIndA
!   d_csrRowPtrA= csrRowPtrA
!   d_b         = bBatch


! step 4: symbolic analysis

!$acc data copyin(csrValABatch, csrColIndA, csrRowPtrA) copyin(bBatch) copyout(xbatch)
!$acc host_data use_device(csrRowPtrA, csrColIndA)
    cusolver_status = cusolverSpXcsrqrAnalysisBatched( &
                      cusolver_Hndl, m, m, nnzA, descrA, csrRowPtrA, csrColIndA, info )
!$acc end host_data

    if (idbg == 1) then
      if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) print *, "step4 cusolver_status = ", cusolver_status
    end if

! step 5: prepare working space

!$acc host_data use_device(csrValABatch, csrRowPtrA, csrColIndA)
    cusolver_status = cusolverSpDcsrqrBufferInfoBatched( &
                      cusolver_Hndl, m, m, nnzA, descrA, csrValABatch, csrRowPtrA, csrColIndA, &
                      batchsize, info, size_internal, size_qr)
!$acc end host_data

    if (idbg == 1) then
      if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) print *, "step5 cusolver_status = ", cusolver_status
    end if

    print *, "numerical factorization needs internal data (bytes) =", size_internal
    print *, "numerical factorization needs working space (bytes) =", size_qr

    allocate(pBuffer(size_qr), stat=ierr_code)  ! Bytes
      if (idbg == 1) print *, "step5: alloc ierr_code = ", ierr_code

!$acc enter data copyin(pBuffer)

! step 6: numerical factorization

!$acc host_data use_device(csrValABatch, csrRowPtrA, csrColIndA, bBatch, xbatch, pBuffer)
    cusolver_status = cusolverSpDcsrqrsvBatched( &
                      cusolver_Hndl, m, m, nnzA, &
                      descrA, csrValABatch, csrRowPtrA, csrColIndA, &
                      bBatch, xbatch, &
                      batchSize, &
                      info, &
                      pBuffer)
!$acc end host_data
!$acc exit data delete(pBuffer)
!$acc end data

    if (idbg == 1) then
      if (cusolver_status /= CUSOLVER_STATUS_SUCCESS) print *, "step6 cusolver_status = ", cusolver_status
    end if


! step 7: check residual
!         xBatch = [x0, x1, x2, ...]

!   copy into CPU mem

    if (idbg == 2) print *, "xbatch=",xbatch

    print *, "******* Fortran Index Base is adjusted : BaseA = 0 ********************* "
    baseA = 0

    print *,  " ============  Residual CHECK ==============="
    do batchId = 1 , batchsize
     ! measure |bj - Aj*Xj|
      sup_res = 0.d0
      do row = 1, m
        start = csrRowPtrA(row  ) - baseA
        end   = csrRowPtrA(row+1) - baseA
        Ax = 0.d0

        do colidx =  start, end-1
          col = csrColIndA(colidx) - baseA
          Areg = csrValAbatch((batchId-1)*nnzA + colidx)   ! Areg = csrValAj(colidx)
          Xreg = xBatch((batchId-1)*m + col)               ! Xreg = xj(col)
          Ax = Ax  + Areg * Xreg
        end do

        r = bBatch((batchID-1)*m + row) - Ax               ! r = bj(row) -Ax
        ! sup_res = (sup_res > fabs(r))? sup_res : fabs(r);
        sup_res = max( abs(r), sup_res )

      end do
      print *, "batchId =",  batchId," sup|bj - Aj*xj| =", sup_res
    end do


    print *, " ============  Result X ==============="
    do batchId = 1 , batchsize
       do row = 1, m
          print *, "x(", batchId, ")= [", row,"]",  xBatch((batchId-1)*m + row)
       end do
       print *, ""
    end do
end

 以下の Makefile を copy & paste しただけでは make 実行エラーとなる。makefile 記述上の[ tab ]デリミタが空白に変換されているため、明示的に空白を[ tab ]に変更する必要がある。(以下、一例)以下の makefile の中で -Mcuda オプションは、CUDA システムライブラリ・リストをリンカーに指示するために必ず指定して下さい。

main   : main.f90
[ tab ]  pgf90 main.f90

Makefile(コンパイル方法)

CC      = nvcc
CFLAGS  =
FC      = pgf90
FFLAGS  = -O2 -Minfo -acc -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse -Mcuda
LDFLAGS = -acc -ta=tesla,cuda8.0,cc35,cc60 -Mcuda

TARGET  = SPsolve
INCLUDE =
LIBS    =  -lcusolver       # -lgomp
# Ubuntu上では、以下のライブラリパスと-lgompが必要となる場合あり
#LDIR    =  -L/usr/lib/gcc/x86_64-linux-gnu/5

modfile =  cusolver_mod.f90
SRC1    =  main.f90

# サフィックスルール適用対象の拡張子の定義
.SUFFIXES: .f90 .cu .o

MODOBJ  = $(modfile:.f90=.o)
OBJ1    = $(SRC1:.f90=.o)
MOD1    = cusolver_sp_qr.mod

RUN     : $(TARGET)
                  @echo "===== Program run ====="
                  $(TARGET)
$(TARGET): $(MODOBJ) $(OBJ1)
                  $(FC) -o  $@ $^ $(LDFLAGS) $(LDIR) $(LIBS)


$(MOD1) : $(modfile)
                  $(FC) $(FFLAGS) -c $?
$(OBJ1): $(MOD1) $(SRC1)

.f90.o :
                  $(FC) -c $(FFLAGS) $<
.cu.o :
                  $(CC) -c $(CFLAGS) $<

.PHONY: clean
clean:
                @echo 'Cleaning up...'
            rm -f *.o  *.mod $(TARGET) *~

コンパイル&実行結果

[kato@photon32 OpenACC]$ make
pgf90 -c -O2 -Minfo -acc -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse -Mcuda cusolver_mod.f90
pgf90 -c -O2 -Minfo -acc -ta=tesla,cuda8.0,cc35,cc60 -Mcudalib=cusparse -Mcuda main.f90
main:
     67, Loop unrolled 5 times (completely unrolled)
     68, Array assignment / Forall at line 69 fused
         Loop unrolled 7 times (completely unrolled)
         Loop not vectorized: loop count too small
     70, Loop unrolled 4 times (completely unrolled)
     79, Loop not vectorized/parallelized: contains call
         FMA (fused multiply-add) instruction(s) generated
     88, Loop not vectorized/parallelized: contains call
         FMA (fused multiply-add) instruction(s) generated
     95, Loop unrolled 16 times (completely unrolled)
    122, Generating copyin(csrrowptra(:),bbatch(:),csrcolinda(:))
         Generating copyout(xbatch(:))
         Generating copyin(csrvalabatch(:))
    150, Generating enter data copyin(pbuffer(:))
    163, Generating exit data delete(pbuffer(:))
    182, Loop not vectorized/parallelized: contains call
    185, Outer loop unrolled 4 times (completely unrolled)
    190, Unrolled inner loop 4 times
         Generated 2 prefetch instructions for the loop
         Unrolled inner loop 4 times
         Generated 2 prefetch instructions for the loop
         Unrolled inner loop 4 times
         Generated 2 prefetch instructions for the loop
         Unrolled inner loop 4 times
         Generated 2 prefetch instructions for the loop
         FMA (fused multiply-add) instruction(s) generated
    208, Loop not vectorized/parallelized: contains call
pgf90 -o SPsolve cusolver_mod.o main.o -acc -ta=tesla,cuda8.0,cc35,cc60 -Mcuda -lcusolver
===== Program run =====
SPsolve
 sizeof(csrRowPtrA, csrRowPtrA,  csrColIndA, csrColIndA)            4
            4            4            4
 csrValABatch    1.000190792295671         2.000179731460909
    3.000132064769846        0.1001762263696543        0.1001628016430586
   0.1001500500225553         4.000169009995976         1.000119069206783
    2.000163682999292         3.000144817609972        0.1001437479734487
   0.1001670186653253        0.1001425331039673         4.000182114792401
    1.000106716529557         2.000158878275024         3.000165795016470
   0.1001463869055930        0.1001628171802322        0.1001307016646178
    4.000187350713939         1.000180008450829         2.000115906560827
    3.000160754590159        0.1001700415717258        0.1001531034351497
   0.1001216954552562         4.000196496681088
 bBatch    1.000182450045416         1.000166538252814
    1.000186841054682         1.000110375401338         1.000145236365631
    1.000145251688474         1.000116588275369         1.000155851050659
    1.000125862765850         1.000112255030956         1.000187784792441
    1.000198703067770         1.000133737619532         1.000188679946082
    1.000142956695574         1.000175135506380
 numerical factorization needs internal data (bytes) =                      864
 numerical factorization needs working space (bytes) =                   246528
 xbatch=   0.9999916593410789        0.5000383328163729
   0.3333809377259567        0.2041067826368187         1.000026164043516
   0.5000317024998334        0.3333561041470322        0.2041256247608860
    1.000019144193291        0.5000164066433919        0.3333775041545496
   0.2041369446005358        0.9999537374963444        0.5000653595450339
   0.3333631223478223        0.2041274097506157
 ******* Fortran Index Base is adjusted : BaseA = 0 *********************
  ============  Residual CHECK ===============
 batchId =            1  sup|bj - Aj*xj| =   8.8817841970012523E-016
 batchId =            2  sup|bj - Aj*xj| =   6.6613381477509392E-016
 batchId =            3  sup|bj - Aj*xj| =   1.1102230246251565E-015
 batchId =            4  sup|bj - Aj*xj| =   4.4408920985006262E-016
  ============  Result X ===============
 x(            1 )= [            1 ]   0.9999916593410789
 x(            1 )= [            2 ]   0.5000383328163729
 x(            1 )= [            3 ]   0.3333809377259567
 x(            1 )= [            4 ]   0.2041067826368187

 x(            2 )= [            1 ]    1.000026164043516
 x(            2 )= [            2 ]   0.5000317024998334
 x(            2 )= [            3 ]   0.3333561041470322
 x(            2 )= [            4 ]   0.2041256247608860

 x(            3 )= [            1 ]    1.000019144193291
 x(            3 )= [            2 ]   0.5000164066433919
 x(            3 )= [            3 ]   0.3333775041545496
 x(            3 )= [            4 ]   0.2041369446005358

 x(            4 )= [            1 ]   0.9999537374963444
 x(            4 )= [            2 ]   0.5000653595450339
 x(            4 )= [            3 ]   0.3333631223478223
 x(            4 )= [            4 ]   0.2041274097506157

11. cuFFT example with OpenACC Fortran

 OpenACC による cuFFT ライブラリを使用した 2 次元 FFT のプログラムの例を以下に示す。acc data データ領域の指定と acc host_data use_device でデバイスポインタを使用する変数を指定するだけで、FFT の処理が可能となる。

cuFFT example(cuFFT.f90)

program cufft2dTest
  use cufft
  use openacc
  integer, parameter :: m=768, n=512
  complex, allocatable  :: a(:,:),b(:,:),c(:,:)
  real, allocatable     :: r(:,:),q(:,:)
  integer :: iplan1, iplan2, iplan3, ierr

  allocate(a(m,n),b(m,n),c(m,n))
  allocate(r(m,n),q(m,n))

  a = 1; r = 1
  xmx = -99.0

  ierr = cufftPlan2D(iplan1,m,n,CUFFT_C2C)
!$acc data copyin(a)  create(b) copyout(c)
  !$acc host_data use_device(a,b,c)
  ierr = ierr + cufftExecC2C(iplan1,a,b,CUFFT_FORWARD)
  ierr = ierr + cufftExecC2C(iplan1,b,c,CUFFT_INVERSE)
  !$acc end host_data

  ! scale c
  !$acc kernels
  c = c / (m*n)
  !$acc end kernels
!$acc end data

  ! Check forward answer
  write(*,*) 'Max error C2C FWD: ', cmplx(maxval(real(b)) - sum(real(b)), &
                                          maxval(imag(b)))
  ! Check inverse answer
  write(*,*) 'Max error C2C INV: ', maxval(abs(a-c))

  ! Real transform
  ierr = ierr + cufftPlan2D(iplan2,m,n,CUFFT_R2C)
  ierr = ierr + cufftPlan2D(iplan3,m,n,CUFFT_C2R)

!$acc data copyin(r) copyout(q) create(b)
  !$acc host_data use_device(r,b,q)
  ierr = ierr + cufftExecR2C(iplan2,r,b)
  ierr = ierr + cufftExecC2R(iplan3,b,q)
  !$acc end host_data

  !$acc kernels
  xmx = maxval(abs(r-q/(m*n)))
  !$acc end kernels
!$acc end data

  ! Check R2C + C2R answer
  write(*,*) 'Max error R2C/C2R: ', xmx

  ierr = ierr + cufftDestroy(iplan1)
  ierr = ierr + cufftDestroy(iplan2)
  ierr = ierr + cufftDestroy(iplan3)

  if (ierr.eq.0) then
    print *,"test PASSED"
  else
    print *,"test FAILED"
  endif

end program cufft2dTest

 コンパイルとリンクオプションには、必ず、-acc -Mcudalib=cufft を指定することが必要である。

コンパイル&実行結果

[kato@photon32 Example]$ pgf90 -acc -Minfo=accel -O2 -ta=tesla,cc60,cc35,cuda8.0 -Mcudalib=cufft cuFFT.f90
cufft2dtest:
      0, Accelerator kernel generated
         Generating Tesla code
     16, Generating copyin(a(:,:))
         Generating create(b(:,:))
         Generating copyout(c(:,:))
     24, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         24, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
     38, Generating copyin(r(:,:))
         Generating copyout(q(:,:))
         Generating create(b(:,:))
     45, Loop is parallelizable
         Accelerator scalar kernel generated
         Accelerator kernel generated
         Generating Tesla code
         45, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
             Generating implicit reduction(max:r$r)

[kato@photon32 Example]$ a.out
 Max error C2C FWD:   (0.000000,0.000000)
 Max error C2C INV:     0.000000
 Max error R2C/C2R:     0.000000
 test PASSED

12. cuRAND example (乱数発生)

cuRAND example from OpenACC Host Code

 OpenACC による cuRAND ライブラリを使用したプログラムの例(XORWOW乱数) を以下に示す。ホストプログラムから CUDA ライブラリを利用するため、デバイス側で使用するデータのみを OpenACC の acc data で指定し、acc host_data use_device でデバイス側で使用する変数を指定するだけで利用可能となる。なお、curand 関数を使用するルーチンでは必ず use curand (インタフェース)を指定することが必要である。

cuRAND test from OpenACC Host Code(cuRand.f90)

program testcurand
! compile with the flags -ta=tesla -Mcuda -Mcudalib=curand
call cur1(1000, .true.); call cur1(1000, .false.)
call cur2(1000, .true.); call cur2(1000, .false.)
call cur3(1000, .true.); call cur3(1000, .false.)
end
!
subroutine cur1(n, onhost)
use curand   ! 必ず指定すること
integer :: a(n)     ! 整数
type(curandGenerator) :: g
integer(8) nbits
logical onhost, passing
a = 0
passing = .true.
if (onhost) then     ! host 側のライブラリも提供している
  istat = curandCreateGeneratorHost(g,CURAND_RNG_PSEUDO_XORWOW)
  istat = curandGenerate(g, a, n)
  istat = curandDestroyGenerator(g)
else                 ! GPU 側処理
  !$acc data copy(a)
  istat = curandCreateGenerator(g,CURAND_RNG_PSEUDO_XORWOW)
  !$acc host_data use_device(a)
  istat = curandGenerate(g, a, n)
  !$acc end host_data
  istat = curandDestroyGenerator(g)
  !$acc end data
endif
nbits = 0
do i = 1, n
  if (i.lt.10) print *,i,a(i)
  nbits = nbits + popcnt(a(i))
end do
print *,"Should be roughly half the bits set"
nbits = nbits / n
if ((nbits .lt. 12) .or. (nbits .gt. 20)) then
  passing = .false.
else
  print *,"nbits is ",nbits," which passes"
endif
if (passing) then
  print *,"Test PASSED"
else
  print *,"Test FAILED"
endif
end
!
subroutine cur2(n, onhost)
use curand
real :: a(n)  ! 単精度
type(curandGenerator) :: g
logical onhost, passing
a = 0.0
passing = .true.
if (onhost) then
  istat = curandCreateGeneratorHost(g,CURAND_RNG_PSEUDO_XORWOW)
  istat = curandGenerate(g, a, n)
  istat = curandDestroyGenerator(g)
else
  !$acc data copy(a)
  istat = curandCreateGenerator(g,CURAND_RNG_PSEUDO_XORWOW)
  !$acc host_data use_device(a)
  istat = curandGenerate(g, a, n)
  !$acc end host_data
  istat = curandDestroyGenerator(g)
  !$acc end data
endif
print *,"Should be uniform around 0.5"
do i = 1, n
  if (i.lt.10) print *,i,a(i)
  if ((a(i).lt.0.0) .or. (a(i).gt.1.0)) passing = .false.
end do
rmean = sum(a)/n
if ((rmean .lt. 0.4) .or. (rmean .gt. 0.6)) then
  passing = .false.
else
  print *,"Mean is ",rmean," which passes"
endif
if (passing) then
  print *,"Test PASSED"
else
  print *,"Test FAILED"
endif
end
!
subroutine cur3(n, onhost)
use curand
real(8) :: a(n)   ! 倍精度
type(curandGenerator) :: g
logical onhost, passing
a = 0.0d0
passing = .true.
if (onhost) then
  istat = curandCreateGeneratorHost(g,CURAND_RNG_PSEUDO_XORWOW)
  istat = curandGenerate(g, a, n)
  istat = curandDestroyGenerator(g)
else
  !$acc data copy(a)
  istat = curandCreateGenerator(g,CURAND_RNG_PSEUDO_XORWOW)
  !$acc host_data use_device(a)
  istat = curandGenerate(g, a, n)
  !$acc end host_data
  istat = curandDestroyGenerator(g)
  !$acc end data
endif
do i = 1, n
  if (i.lt.10) print *,i,a(i)
  if ((a(i).lt.0.0d0) .or. (a(i).gt.1.0d0)) passing = .false.
end do
rmean = sum(a)/n
if ((rmean .lt. 0.4d0) .or. (rmean .gt. 0.6d0)) then
  passing = .false.
else
  print *,"Mean is ",rmean," which passes"
endif
if (passing) then
  print *,"Test PASSED"
else
  print *,"Test FAILED"
endif
end

 コンパイルとリンクオプションには、必ず、-acc -Mcudalib=curand を指定することが必要である。また、システム CUDA ライブラリもリンクする必要があるため、-Mcuda オプションも必須となる。

コンパイル&実行結果

[kato@photon32]$ pgf90 -acc -Minfo=accel -O2 -ta=tesla,cc60,cc35,cuda8.0 -Mcudalib=curand -Mcuda cuRand.f90
cur1:
     21, Generating copy(a(:))
cur2:
     60, Generating copy(a(:))
cur3:
     98, Generating copy(a(:))
[kato@photon32 cuRAND]$ a.out
            1  -1115749450
            2   -339329097
            3    167591721
            4   -133303299
            5   -321518802
            6   1917059131
            7  -1428853312
            8    472148682
            9   2019573489
 Should be roughly half the bits set
 nbits is                        16  which passes
 Test PASSED
            1  -1115749450
            2   -339329097
            3    167591721
            4   -133303299
            5   -321518802
            6   1917059131
            7  -1428853312
            8    472148682
            9   2019573489
 Should be roughly half the bits set
 nbits is                        16  which passes
 Test PASSED
 Should be uniform around 0.5
            1   0.7402194
            2   0.9209938
            3   3.9020490E-02
            4   0.9689629
            5   0.9251406
            6   0.4463501
            7   0.6673192
            8   0.1099307
            9   0.4702186
 Mean is    0.5014752      which passes
 Test PASSED
 Should be uniform around 0.5
            1   0.7402194
            2   0.9209938
            3   3.9020490E-02
            4   0.9689629
            5   0.9251406
            6   0.4463501
            7   0.6673192
            8   0.1099307
            9   0.4702186
 Mean is    0.5014752      which passes
 Test PASSED
            1   0.4384508447184235
            2   0.4603647022031429
            3   0.2502147080176417
            4   0.4947437677616333
            5   5.3011123716805220E-002
            6   0.3376992580306317
            7   0.3967625188337749
            8   0.8744186605648221
            9   0.4821668323147305
 Mean is    0.5150273      which passes
 Test PASSED
            1   0.4384508447184235
            2   0.4603647022031429
            3   0.2502147080176417
            4   0.4947437677616333
            5   5.3011123716805220E-002
            6   0.3376992580306317
            7   0.3967625188337749
            8   0.8744186605648221
            9   0.4821668323147305
 Mean is    0.5150273      which passes
 Test PASSED

cuRAND example from OpenACC device code

 OpenACC の計算領域で cuRAND のデバイスルーチン・ライブラリ関数(curand_uniform等)を使用する場合の例である。これは、ホストプログラム上で使用するものではなく、device 側専用のルーチンである。以下のプログラムでは、module mtests の中で、OpenACC によるデバイスコードを定義している。これは、グローバルサブルーチンとして他のルーチンから利用できる。ここでの技術的な留意点は、PGI Fortran 専用の cuRAND OpenACC 専用 MODULE インタフェース(openacc_curand) を use 文で指定する必要があるという点である。PGI は、cuRand 関数に対して openacc routine ディレクティブを指定した Fortran MODULE を提供している。

cuRAND test from OpenACC device Code(cuRand2.f90)

module mtests
  integer, parameter :: n = 1000
  contains
    subroutine testrand( a, b )
    use openacc_curand   ! OpenACC用のcurand I/F 必ず指定する
    real :: a(n), b(n)
    type(curandStateXORWOW) :: h
    integer(8) :: seed, seq, offset

    !$acc parallel num_gangs(1) vector_length(1) copy(a,b) private(h)
    seed = 12345           ! この時点でデバイスが側に処理が移る。 a, b 配列ともにデバイスへコピー
    seq = 0
    offset = 0
    call curand_init(seed, seq, offset, h)
    !$acc loop seq       ! 以下のループは sequential 実行
    do i = 1, n
      a(i) = curand_uniform(h)
      b(i) = curand_normal(h)
    end do
    !$acc end parallel   ! この時点で a(n), b(n) の値はホスト側に戻される
    return
    end subroutine
end module mtests

program t
use mtests
real :: a(n), b(n), c(n)
logical passing
a = 1.0
b = 2.0
passing = .true.
call testrand(a,b)   ! ホスト側から call する。引数a, b はホスト上の配列を渡す
c = a
print *,"Should be uniform around 0.5"
do i = 1, n
  if (i.lt.10) print *,i,c(i)
  if ((c(i).lt.0.0) .or. (c(i).gt.1.0)) passing = .false.
end do
rmean = sum(c)/n
if ((rmean .lt. 0.4) .or. (rmean .gt. 0.6)) then
  passing = .false.
else
  print *,"Mean is ",rmean," which passes"
endif
c = b
print *,"Should be normal around 0.0"
nc1 = 0;
nc2 = 0;
do i = 1, n
  if (i.lt.10) print *,i,c(i)
  if ((c(i) .gt. -4.0) .and. (c(i) .lt. 0.0)) nc1 = nc1 + 1
  if ((c(i) .gt.  0.0) .and. (c(i) .lt. 4.0)) nc2 = nc2 + 1
end do
print *,"Found on each side of zero ",nc1,nc2
if (abs(nc1-nc2) .gt. (n/10)) npassing = .false.
rmean = sum(c,mask=abs(c).lt.4.0)/n
if ((rmean .lt. -0.1) .or. (rmean .gt. 0.1)) then
  passing = .false.
else
  print *,"Mean is ",rmean," which passes"
endif

if (passing) then
  print *,"Test PASSED"
else
  print *,"Test FAILED"
endif
end program

 コンパイルとリンクオプションには、-acc -Mcudalib=cusparse -Mcuda のオプションの指定が必要である。OpenACC 領域内で cuRAND ライブラリを利用する場合は、PGI Fortran OpenACC 専用の cuRAND のデバイスルーチン・ライブラリをリンクする必要があるため、リンクオプションに注意が必要である。リンク時に NVIDIA LLVM 配下のリンク環境を使うのではなく、従来のリンケージ処理を行うために、-ta=tesla,nollvm サブオプションの指定が必要となる。

コンパイル&実行結果

[kato@photon32 cuRAND]$ pgf90 -acc -Minfo=accel -O2 -ta=tesla,cc60,cc35,cuda8.0,nollvm  cuRand2.f90
testrand:
     10, Generating copy(b(:),a(:))
         Accelerator kernel generated
         Generating Tesla code
         16, !$acc loop seq
     10, CUDA shared memory used for h

[kato@photon32 cuRAND]$ a.out
 Should be uniform around 0.5
            1   0.2988985
            2   0.4019704
            3   0.7425825
            4   0.7073491
            5   0.5512256
            6   0.4850157
            7   9.7107515E-02
            8   0.3596036
            9   0.5777864
 Mean is    0.4975990      which passes
 Should be normal around 0.0
            1    1.348672
            2  -0.3331721
            3    1.273486
            4   -1.008033
            5  -0.6210795
            6  -0.5052772
            7    1.598479
            8    1.133891
            9   -1.831225
 Found on each side of zero           469          531
 Mean is    4.2783763E-02  which passes
 Test PASSEDD

13. cuSPARSE from OpenACC Host Code

 ホスト用プログラムからデバイス側で処理する cuSPARSE ライブラリの利用例を示す。

 cuSPARSE ルーチンの処理には直接関係ないが、以下の例では、cusparseSetStream を使って個々の cuSPARSE ライブラリルーチンによって使われる stream をセットしている。これを行うことで、対象となるハンドル h の cuSPARSE 関数は分離された stream で処理され、GPU 上で自動的なオーバラップ処理が可能となる。このアプローチはとりわけ、シングルタスクの実行が相対的に小さな場合で GPU 内の処理容量を満たさないような時に有効である。あるいは、計算とデータ転送が並行に行われるような場面で使用される。

cuSPARSE from OpenACC Host Code(cuSPARSE.f90)

program sparseMatVec
    integer n
    n = 25 ! # rows/cols in dense matrix
    call sparseMatVecSub1(n)
    n = 45 ! # rows/cols in dense matrix
    call sparseMatVecSub1(n)
end program

subroutine sparseMatVecSub1(n)
  use openacc
  use cusparse

  implicit none

  integer n

  ! dense data
  real(4), allocatable :: Ade(:,:), x(:), y(:)

  ! sparse CSR arrays
  real(4), allocatable :: csrValA(:)
  integer, allocatable :: nnzPerRowA(:), csrRowPtrA(:), csrColIndA(:)

  allocate(Ade(n,n), x(n), y(n))
  allocate(csrValA(n))
  allocate(nnzPerRowA(n), csrRowPtrA(n+1), csrColIndA(n))

  call sparseMatVecSub2(Ade, x, y, csrValA, nnzPerRowA, csrRowPtrA, &
                                                         csrColIndA, n)
  deallocate(Ade)
  deallocate(x)
  deallocate(y)
  deallocate(csrValA)
  deallocate(nnzPerRowA)
  deallocate(csrRowPtrA)
  deallocate(csrColIndA)
end subroutine

subroutine sparseMatVecSub2(Ade, x, y, csrValA, nnzPerRowA, csrRowPtrA, &
                                                             csrColIndA, n)
  use openacc
  use cusparse

  implicit none

  ! dense data
  real(4) :: Ade(n,n), x(n), y(n)

  ! sparse CSR arrays
  real(4) :: csrValA(n)
  integer :: nnzPerRowA(n), csrRowPtrA(n+1), csrColIndA(n)

  integer :: n, nnz, status, i
  type(cusparseHandle) :: h
  type(cusparseMatDescr) :: descrA

  ! parameters
  real(4) :: alpha, beta

  ! result
  real(4) :: xerr

  ! initalize CUSPARSE and matrix descriptor
  status = cusparseCreate(h)
  if (status /= CUSPARSE_STATUS_SUCCESS) &
       write(*,*) 'cusparseCreate error: ', status
  status = cusparseCreateMatDescr(descrA)
  status = cusparseSetMatType(descrA, &
       CUSPARSE_MATRIX_TYPE_GENERAL)
  status = cusparseSetMatIndexBase(descrA, &
       CUSPARSE_INDEX_BASE_ONE)
  status = cusparseSetStream(h, acc_get_cuda_stream(acc_async_sync))  ! cusparseが利用する一つのcuda streamを生成

  !$acc data create(Ade, x, y, csrValA, nnzPerRowA, csrRowPtrA, csrColIndA)

  ! Initialize matrix (upper circular shift matrix) ! Circulant matrix
  !$acc kernels
  Ade = 0.0
  do i = 1, n-1
     Ade(i,i+1) = 1.0
  end do
  Ade(n,1) = 1.0

  ! Initialize vectors and constants
  do i = 1, n
     x(i) = i
  enddo
  y = 0.0
  !$acc end kernels

  !$acc update host(x)
  write(*,*) 'Original vector:'
  write(*,'(5(1x,f7.2))') x

  ! convert matrix from dense to csr format
  !$acc host_data use_device(Ade, nnzPerRowA, csrValA, csrRowPtrA, csrColIndA)
  status = cusparseSnnz_v2(h, CUSPARSE_DIRECTION_ROW, &
       n, n, descrA, Ade, n, nnzPerRowA, nnz)
  status = cusparseSdense2csr(h, n, n, descrA, Ade, n, &
       nnzPerRowA, csrValA, csrRowPtrA, csrColIndA)
  !$acc end host_data

  ! A is upper circular shift matrix
  ! y = alpha*A*x + beta*y
  alpha = 1.0
  beta = 0.0
  !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, x, y)
  status = cusparseScsrmv(h, CUSPARSE_OPERATION_NON_TRANSPOSE, &
       n, n, n, alpha, descrA, csrValA, csrRowPtrA, &
       csrColIndA, x, beta, y)
  !$acc end host_data

  !$acc wait
  write(*,*) 'Shifted vector:'
  write(*,'(5(1x,f7.2))') y

  ! shift-down y and add original x
  ! A' is lower circular shift matrix
  ! x = alpha*A'*y + beta*x
  beta = -1.0
  !$acc host_data use_device(csrValA, csrRowPtrA, csrColIndA, x, y)
  status = cusparseScsrmv(h, CUSPARSE_OPERATION_TRANSPOSE, &
       n, n, n, alpha, descrA, csrValA, csrRowPtrA, &
       csrColIndA, y, beta, x)
  !$acc end host_data

  !$acc kernels
  xerr = maxval(abs(x))
  !$acc end kernels
  !$acc end data

  write(*,*) 'Max error = ', xerr
  if (xerr.le.1.e-5) then
    write(*,*) 'Test PASSED'
  else
    write(*,*) 'Test FAILED'
  endif

end subroutine

 コンパイルとリンクオプションには、-acc -Mcudalib=cusparse -Mcuda が必要となる。

コンパイル&実行結果

[kato@photon32 cuSPARSE]$ pgf90 -Minfo=accel -acc -O2 -ta=tesla,cc60,cuda8.0 -Mcudalib=cusparse cuSparse.f90 -Mcuda
sparsematvecsub2:
      0, Accelerator kernel generated
         Generating Tesla code
     74, Generating create(nnzperrowa(:),x(:),y(:),csrrowptra(:),csrvala(:),csrcolinda(:),ade(:,:))
     78, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         78, !$acc loop gang, vector(4) ! blockidx%y threadidx%y
             !$acc loop gang, vector(32) ! blockidx%x threadidx%x
     79, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         79, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     81, Accelerator scalar kernel generated
     85, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         85, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     88, Loop is parallelizable
         Accelerator kernel generated
         Generating Tesla code
         88, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
     91, Generating update self(x(:))
    128, Loop is parallelizable
         Accelerator scalar kernel generated
         Accelerator kernel generated
         Generating Tesla code
        128, !$acc loop gang, vector(128) ! blockidx%x threadidx%x
             Generating implicit reduction(max:x$r)
[kato@photon32 cuSPARSE]$ a.out
 Original vector:
    1.00    2.00    3.00    4.00    5.00
    6.00    7.00    8.00    9.00   10.00
   11.00   12.00   13.00   14.00   15.00
   16.00   17.00   18.00   19.00   20.00
   21.00   22.00   23.00   24.00   25.00
 Shifted vector:
    0.00    0.00    0.00    0.00    0.00
    0.00    0.00    0.00    0.00    0.00
    0.00    0.00    0.00    0.00    0.00
    0.00    0.00    0.00    0.00    0.00
    0.00    0.00    0.00    0.00    0.00
 Max error =     0.000000
 Test PASSED
 Original vector:
    1.00    2.00    3.00    4.00    5.00
    6.00    7.00    8.00    9.00   10.00
   11.00   12.00   13.00   14.00   15.00
   16.00   17.00   18.00   19.00   20.00
   21.00   22.00   23.00   24.00   25.00
   26.00   27.00   28.00   29.00   30.00
   31.00   32.00   33.00   34.00   35.00
   36.00   37.00   38.00   39.00   40.00
   41.00   42.00   43.00   44.00   45.00
 Shifted vector:
    0.00    0.00    0.00    0.00    0.00
    0.00    0.00    0.00    0.00    0.00
    0.00    0.00    0.00    0.00    0.00
    0.00    0.00    0.00    0.00    0.00
    0.00    0.00    0.00    0.00    0.00
    0.00    0.00    0.00    0.00    0.00
    0.00    0.00    0.00    0.00    0.00
    0.00    0.00    0.00    0.00    0.00
    0.00    0.00    0.00    0.00    0.00
 Max error =     0.000000
 Test PASSED

前章へ

次章へ

OpenACCプログラミングのインデックスへ