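Installing the NVIDIA HPC SDK

Download the latest version of the NVIDIA HPC SDK

The latest version of the NVIDIA HPC SDK can be used free of charge.

The NVIDIA HPC SDK includes proven compilers, libraries, and software tools that are essential for maximizing developer productivity and the performance and portability of HPC applications. This page describes how to install the NVIDIA HPC SDK. Note that although the NVIDIA HPC SDK is free to use, technical support is a paid service.
Please also note that the paid support offered by NVIDIA and Prometech Software covers only the nvfortran, nvc, and nvc++ compilers in the NVIDIA HPC SDK; nvcc (the CUDA C/C++ compiler) is excluded. Support for the other software (libraries, utilities, and third-party software) is out of scope.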

NVIDIA HPC SDK system requirements

Details of supported operating systems and distributions

Supported CUDA Toolkit versions

The NVIDIA HPC SDK uses a subset of the CUDA Toolkit when building programs that run on NVIDIA GPUs. Every NVIDIA HPC SDK installation package installs the required CUDA components into the directory [install-prefix]/[arch]/[nvhpc-version]/cuda.
Before you can run a program compiled for the GPU, the NVIDIA CUDA GPU device driver must be installed on the system that contains the GPU. The NVIDIA HPC SDK does not include the CUDA GPU driver; you must download and install a suitable driver from NVIDIA. Because the CUDA Toolkit includes the driver, either install the CUDA Toolkit or select and install only the driver component from it.
The nvaccelinfo command included in the NVIDIA HPC SDK prints the CUDA driver version as the first line of its output. Use it to check the version of the CUDA driver installed on your system.

$ nvaccelinfo

CUDA Driver Version:           11010
NVRM version:                  NVIDIA UNIX x86_64 Kernel Module  455.32.00  Wed Oct 14 22:46:18 UTC 2020

Device Number:                 0
Device Name:                   Quadro GP100
Device Revision Number:        6.0
Global Memory Size:            17033986048
Number of Multiprocessors:     56
Concurrent Copy and Execution: Yes
Total Constant Memory:         65536
Total Shared Memory per Block: 49152
Registers per Block:           65536
Warp Size:                     32
Maximum Threads per Block:     1024
Maximum Block Dimensions:      1024, 1024, 64
Maximum Grid Dimensions:       2147483647 x 65535 x 65535
Maximum Memory Pitch:          2147483647B
Texture Alignment:             512B
Clock Rate:                    1442 MHz
Execution Timeout:             Yes
Integrated Device:             No
Can Map Host Memory:           Yes
Compute Mode:                  default
Concurrent Kernels:            Yes
ECC Enabled:                   No
Memory Clock Rate:             715 MHz
Memory Bus Width:              4096 bits
L2 Cache Size:                 4194304 bytes
Max Threads Per SMP:           2048
Async Engines:                 2
Unified Addressing:            Yes
Managed Memory:                Yes
Concurrent Managed Memory:     Yes
Preemption Supported:          Yes
Cooperative Launch:            Yes
  Multi-Device:                Yes
Default Target:                cc60
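
The CUDA Driver Version value in the output above is a single integer rather than a dotted version. As a minimal sketch, assuming the usual CUDA convention of encoding a version as 1000 x major + 10 x minor, the shell function below converts it to major.minor form. The function name decode_cuda_version is hypothetical and is not part of the HPC SDK.

```shell
# Hypothetical helper (not part of the HPC SDK): decode the integer that
# nvaccelinfo reports as "CUDA Driver Version" into major.minor form.
# Assumes the CUDA encoding 1000 * major + 10 * minor.
decode_cuda_version() {
    ver=$1
    major=$((ver / 1000))
    minor=$(( (ver % 1000) / 10 ))
    echo "${major}.${minor}"
}

# Value taken from the sample nvaccelinfo output above.
decode_cuda_version 11010    # prints 11.1
```

On a system with the SDK on the PATH, the integer could be taken from the first line of the nvaccelinfo output instead of being typed by hand.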

NVIDIA HPC SDK 20.9 includes standalone support for the following CUDA Toolkit versions:

CUDA 10.1
CUDA 10.2
CUDA 11.0

For information on using nvfortran, nvc, and nvc++ with alternative versions of the CUDA toolchain, see the NVIDIA HPC Compilers User's Guide.

NVIDIA HPC SDK Installation Guide

1. Installing on Linux

This section describes how to install the NVIDIA HPC SDK in a generic manner on a Linux x86_64, OpenPOWER, or Arm Server system with NVIDIA GPUs. The example below uses the "Linux x86_64 tarball".

1.1 Preparing to install on Linux

On Linux, several GNU Compiler Collection (gcc) packages must be installed before you install the NVIDIA HPC SDK software. The HPC compilers need a 64-bit gcc compiler in order to generate 64-bit executables. The same applies to g++ for C++ compilation and linking, and gfortran is likewise required for Fortran. To check whether each compiler is installed on your system, follow these steps.

1. Create a hello.c program.

#include <stdio.h>
int main() {
    printf("hello, world!\n");
    return 0;
}

2. Compile with the -m64 option to create a 64-bit executable, then check the gcc version.

$ gcc -m64 -o hello_64_c hello.c
$ gcc --version
gcc (GCC) 8.2.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Run the file command on the generated executable. The output should look like the following. Confirm that it is a 64-bit executable, and check the compiler version as well.

$ file ./hello_64_c
./hello_64_c: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), 
for GNU/Linux 2.6.32, not stripped

3. C++ compilation requires at least g++ version 4.4; any newer version is sufficient. Create a hello.cpp program and invoke g++ with the -m64 argument. Before proceeding, make sure that you can compile, link, and run the simple hello.cpp program.

#include <iostream>
int main() {
    std::cout << "hello, world!\n";
    return 0;
}

$ g++ -m64 -o hello_64_cpp hello.cpp
$ g++ --version
g++ (GCC) 8.2.0
Copyright (C) 2018 Free Software Foundation, Inc.

Running the file command on the hello_64_cpp binary should produce results similar to the C example.

4. Make sure that gfortran works as well.

$ gfortran --version
GNU Fortran (GCC) 8.2.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

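If the GNU Compiler Collection is not installed, install it with your OS's package management system.

1.2 Installation procedure

1. Unpack the HPC SDK software. This must be done with root privileges. Place the downloaded file in a working area such as /tmp, and make sure there is enough free disk space for the installation. The uncompressed installation package requires a total of 8 GB of free disk space.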

# cd /tmp
# ls 
nvhpc_2020_209_Linux_x86_64_cuda_11.0.tar.gz   (for version 20.9, Linux_x86_64)
# tar xvpfz nvhpc_2020_209_Linux_x86_64_cuda_11.0.tar.gz
# ls
nvhpc_2020_209_Linux_x86_64_cuda_11.0  nvhpc_2020_209_Linux_x86_64_cuda_11.0.tar.gz
# cd nvhpc_2020_209_Linux_x86_64_cuda_11.0
# ls
install  install_components

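The files are extracted. A directory with the same name as the tar.gz archive is created; change into that directory.

2. Run the installation script (./install) and answer two questions. First choose whether to perform a local installation (1) or a network installation (2). Then specify where the installation directory should be placed. The default is /opt/nvidia/hpc_sdk.

[Note] In the older PGI compiler environment, the default installation location was under /opt/pgi. The NVIDIA HPC SDK, by contrast, is installed under /opt/nvidia/hpc_sdk (the default). Because the location and the command PATH differ, the old PGI environment and the NVIDIA HPC SDK environment can be used as clearly separate environments. To preserve command compatibility, the NVIDIA HPC SDK can still compile with the old PGI commands pgfortran/pgcc/pgc++, but we recommend explicitly using the new compiler commands (nvfortran/nvc/nvc++). To use the SDK alongside an older version on a system that has an old PGI environment, you need to keep the two clearly distinguished. For large-scale system operation, we recommend using Environment Modules so that users can select the command environment.
The NVIDIA HPC SDK installation script installs all of the NVIDIA HPC SDK binaries, tools, and libraries into the appropriate subdirectories of the specified installation directory. The following is an example of a local installation (for a single system).

# ./install   (or: sudo ./install)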

Welcome to the NVIDIA HPC SDK Linux installer!
You are installing NVIDIA HPC SDK 2020 version 20.9 for Linux_x86_64.
Please note that all Trademarks and Marks are the properties
of their respective owners.

Press enter to continue...

A network installation will save disk space by having only one copy of the
compilers and most of the libraries for all compilers on the network, and
the main installation needs to be done once for all systems on the network.

1  Single system install
2  Network install

Please choose install option: 1   (this guide covers the single-system install)

Please specify the directory path under which the software will be installed.
The default directory is /opt/nvidia/hpc_sdk, but you may install anywhere you wish,
assuming you have permission to do so.   (the default installation location is /opt/nvidia/hpc_sdk)

Installation directory? [/opt/nvidia/hpc_sdk]    (press Enter to install in the default location)

Note: directory /opt/nvidia/hpc_sdk was created.

Installing NVIDIA HPC SDK version 20.9 into /opt/nvidia/hpc_sdk
####### (this takes some time)
Making symbolic links in /opt/nvidia/hpc_sdk/Linux_x86_64/2020
   (creates the symbolic-link directory 2020 for version 20.9)

Installing NVIDIA HPC SDK CUDA components into /opt/nvidia/hpc_sdk/Linux_x86_64
generating environment modules for NV HPC SDK 20.9 ... done.
Installation complete.

Please check https://developer.nvidia.com for documentation,
use of NVIDIA HPC SDK software, and other questions.

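Note: Linux users can automate the installation without interacting with the usual prompts. This can be useful, for example, in large organizational settings where a script is used to install the HPC compilers on many systems efficiently. To enable the silent installation feature, set the appropriate environment variables before running the installation script. The variables are as follows.

NVHPC_SILENT	(Required) Set this variable to "true" to enable silent installation.
NVHPC_INSTALL_DIR	(Required) Set this variable to a string containing the desired installation location, for example /opt/nvidia/hpc_sdk.
NVHPC_INSTALL_TYPE	(Required) Set this variable to select the type of installation. The accepted values are "single" for a single-system install and "network" for a network install.
NVHPC_INSTALL_LOCAL_DIR	(Required for network installs) Set this variable to a file system path specifying where, on each system on the network that uses the compilers and tools, the local library files used at compile time for that system are stored. A path name such as /opt/nvidia/20.7/shared_objects is commonly used.
NVHPC_DEFAULT_CUDA	(Optional) Set this variable to the desired CUDA version in the form XX.Y (for example, 10.1 or 11.0).
NVHPC_STDPAR_CUDACC	(Optional) Set this variable to force C++ stdpar GPU compilation to target a specific compute capability (60, 70, 75, and so on) by default.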
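
As an illustration of the variables listed above, the following is a minimal sketch of a silent single-system install. It assumes that it is run with root privileges from inside the unpacked installer directory and that the default install location is wanted. The check for ./install and its error message are additions for this sketch, not part of the installer.

```shell
# Sketch of a silent single-system install using the required variables
# from the table above. Run as root from the unpacked installer directory.
export NVHPC_SILENT=true
export NVHPC_INSTALL_DIR=/opt/nvidia/hpc_sdk
export NVHPC_INSTALL_TYPE=single

# Guard added for this sketch: only start the installer if it is present.
if [ -x ./install ]; then
    ./install
else
    echo "installer script ./install not found in $(pwd)" >&2
fi
```

For a network install, NVHPC_INSTALL_TYPE would be set to network and NVHPC_INSTALL_LOCAL_DIR would also have to be set, as described in the table.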

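1.3 End-user environment settings

After the software installation is complete, each user's shell environment must be initialized to use the NVIDIA HPC SDK. Add the settings to the startup file appropriate for your system, such as .bashrc or .cshrc.
Note: each user must issue the following sequence of commands to initialize the shell environment before using the NVIDIA HPC SDK.
The NVIDIA HPC SDK keeps version numbers under an architecture-type directory (for example, Linux_x86_64/20.9). The architecture name takes the form `uname -s`_`uname -m`. For the OpenPOWER and Arm Server platforms, the expected architecture names are "Linux_ppc64le" and "Linux_aarch64", respectively. The guide below sets "NVARCH" to the value produced by the uname commands, but you can specify the architecture name explicitly if you prefer.

1.3.1 Setting the environment variables explicitly

To make the NVIDIA HPC SDK available by setting the environment variables explicitly, use the following settings.
In csh, use the following commands. The version number (20.9 in the example below) can also be replaced with 2020, the "representative directory name" that is symbolically linked to it. The example below uses version 20.9.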

% setenv NVARCH `uname -s`_`uname -m`
% setenv NVCOMPILERS /opt/nvidia/hpc_sdk
% setenv MANPATH "$MANPATH":$NVCOMPILERS/$NVARCH/20.9/compilers/man
% set path = ($NVCOMPILERS/$NVARCH/20.9/compilers/bin $path)

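In bash, sh, or ksh, use the following commands. The version number (20.9 in the example below) can also be replaced with 2020, the "representative directory name" that is symbolically linked to it. The example below uses version 20.9.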

$ NVARCH=`uname -s`_`uname -m`; export NVARCH
$ NVCOMPILERS=/opt/nvidia/hpc_sdk; export NVCOMPILERS
$ MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/20.9/compilers/man; export MANPATH
$ PATH=$NVCOMPILERS/$NVARCH/20.9/compilers/bin:$PATH; export PATH

In addition, set the following environment variables to gain access to the Open MPI commands and man pages. The version number (20.9 in the example below) can also be replaced with 2020, the "representative directory name" that is symbolically linked to it. The example below uses version 20.9 in a csh-family shell.

% set path = ($NVCOMPILERS/$NVARCH/20.9/comm_libs/mpi/bin $path)
% setenv LD_LIBRARY_PATH "$LD_LIBRARY_PATH":$NVCOMPILERS/$NVARCH/20.9/compilers/lib
% setenv MANPATH "$MANPATH":$NVCOMPILERS/$NVARCH/20.9/comm_libs/mpi/man

In bash, sh, or ksh, set them as follows.

$ export PATH=$NVCOMPILERS/$NVARCH/20.9/comm_libs/mpi/bin:$PATH
$ export LD_LIBRARY_PATH=$NVCOMPILERS/$NVARCH/20.9/compilers/lib:$LD_LIBRARY_PATH
$ export MANPATH=$MANPATH:$NVCOMPILERS/$NVARCH/20.9/comm_libs/mpi/man
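
When LD_LIBRARY_PATH is not already set, prepending a directory with a plain ":$LD_LIBRARY_PATH" suffix leaves an empty entry at the end of the list, which the dynamic linker treats as the current directory. The sketch below is one possible way to collect the bash/sh/ksh settings for a startup file while avoiding that empty entry. It assumes the default install prefix and version 20.9 from this guide. NVHPC_ROOT_DIR is a hypothetical variable name introduced only for this example, and MANPATH is left out for brevity.

```shell
# Assumed defaults from this guide: install prefix and version 20.9.
NVARCH="$(uname -s)_$(uname -m)"
NVCOMPILERS=/opt/nvidia/hpc_sdk
NVHPC_ROOT_DIR="$NVCOMPILERS/$NVARCH/20.9"   # hypothetical helper variable

# Put the compiler and Open MPI commands first on the PATH.
PATH="$NVHPC_ROOT_DIR/compilers/bin:$NVHPC_ROOT_DIR/comm_libs/mpi/bin:$PATH"

# Append the old LD_LIBRARY_PATH only if it is non-empty, so that no
# empty (current-directory) entry is created.
LD_LIBRARY_PATH="$NVHPC_ROOT_DIR/compilers/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"

export NVARCH NVCOMPILERS PATH LD_LIBRARY_PATH
```

The ${VAR:+...} expansion is standard POSIX shell syntax, so the same lines should work in sh, bash, and ksh.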

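1.3.2 Using Environment Modules

Environment Modules is middleware for dynamically switching the type and version of compilers, libraries, and other software, and is commonly used on supercomputers. Sample modulefiles are provided in $NVCOMPILERS/modulefiles and can be used as-is on the system where the SDK is installed. We recommend this approach for environments, such as large-scale systems, where multiple users work concurrently.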

$ pwd
/opt/nvidia/hpc_sdk/modulefiles  
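(the modulefiles directory is located under the directory set in the NVCOMPILERS environment variable)
$ ls    (three samples are provided)
nvhpc/  nvhpc-byo-compiler/  nvhpc-nompi/

The following example adds the modulefiles directory to Environment Modules in .bashrc and then loads the nvhpc/20.9 module. The same approach works in other shell environments.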

$ cat ~/.bashrc
...

module use --append /opt/nvidia/hpc_sdk/modulefiles

...
$ module avail
...

--------------- /opt/nvidia/hpc_sdk/modulefiles ----------------
nvhpc/20.9              nvhpc-nompi/20.9
nvhpc-byo-compiler/20.9

...
$ module load nvhpc/20.9
$ which nvc
/opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin/nvc

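Environment Modules lets you configure which modulefiles directories are read at initialization; if you need this, refer to its manual.

1.4 Using the compilers

The compilers can be used in any terminal session in which the environment variable settings above are in effect. The NVIDIA HPC SDK compiler commands are nvfortran, nvc++, and nvc, but to provide compatibility with the older PGI compilers, compilation can also be run with the PGI command names. For example, "nvfortran" replaces "pgfortran". Note that "pgcc" has been replaced by "nvc": because "nvcc" already exists as the CUDA C/C++ compiler, this avoids a name clash. For backward compatibility, older commands such as "pgfortran" internally invoke "nvfortran" and will be supported indefinitely. Makefiles written for the old PGI environment that contain PGI command names can also be run without problems in the NVIDIA HPC SDK environment.
On a system that has both the new NVIDIA HPC SDK environment and an older PGI compiler environment installed, whether the old PGI commands (pgfortran, pgc++, pgcc) invoke the new NVIDIA HPC SDK or the older PGI version depends on the order of entries in the PATH environment variable. In that case, check which version is in use with the which command (for example, which pgf90) or the compiler's --version option. The default installation path of the old PGI environment was /opt/pgi.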

Compiler	Language or feature	Command	Old PGI commands (also usable)
NVFORTRAN	ISO/ANSI Fortran 2003	nvfortran	pgfortran, pgf90, pgf95
NVC++	ISO/ANSI C++17 (GNU compatible)	nvc++	pgc++
NVC	ISO/ANSI C11 and K&R C	nvc	pgcc
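
Because the old and new command names can resolve to different installations depending on PATH order, it may help to list where each one currently points. The following is a small sketch of such a check. The function name report_compilers is hypothetical, and the command list simply repeats names used in this guide.

```shell
# Hypothetical helper: show which file each compiler command resolves to
# on the current PATH, or report that it is not found.
report_compilers() {
    for cmd in nvfortran nvc nvc++ pgfortran pgcc pgc++; do
        if path=$(command -v "$cmd" 2>/dev/null); then
            echo "$cmd -> $path"
        else
            echo "$cmd -> not found on PATH"
        fi
    done
}

report_compilers
```

A path under /opt/pgi would point to an old PGI installation, while a path under /opt/nvidia/hpc_sdk would point to the NVIDIA HPC SDK, assuming both were installed in their default locations.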

Compile options

The compile options of the NVIDIA HPC compilers are the same as those used with the older PGI compilers. In the documents below the command names appear as pgfortran, pgc++, and pgcc, but the same options can be used with the NVIDIA HPC compilers by substituting nvfortran, nvc++, and nvc.

1. Create the hello.c, hello.cpp, and hello.f90 programs.

hello.c:
#include <stdio.h>
int main() {
    printf("hello, world!\n");
    return 0;
}

hello.cpp:
#include <iostream>
int main() {
    std::cout << "hello, world!\n";
    return 0;
}

hello.f90:
print *, "Hello, world!"
end

2. Compile and run the Fortran program hello.f90.

$ nvfortran -o f_hello hello.f90 -V

nvfortran 20.9-0 LLVM 64-bit target on x86-64 Linux -tp skylake
NVIDIA Compilers and Tools
Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
$ ./f_hello
 Hello, world!
$ which nvfortran
/opt/nvidia/hpc_sdk/Linux_x86_64/2020/compilers/bin/nvfortran

3. Compile and run the C program hello.c.

$ nvc -o c_hello hello.c
$ ./c_hello
hello, world!
$ which nvc
/opt/nvidia/hpc_sdk/Linux_x86_64/2020/compilers/bin/nvc

4. Compile and run the C++ program hello.cpp.

$ nvc++ -o cpp_hello hello.cpp
$ ./cpp_hello
hello, world!
$ which nvc++
/opt/nvidia/hpc_sdk/Linux_x86_64/2020/compilers/bin/nvc++

Compile and run an actual OpenACC Fortran program.

$ which nvfortran
/opt/nvidia/hpc_sdk/Linux_x86_64/2020/compilers/bin/nvfortran

$ nvfortran -fast -O3 -Minfo -acc -ta=tesla,cc60 -flags    (-flags shows the meaning of the compile option flags)
Reading rcfile /opt/nvidia/hpc_sdk/Linux_x86_64/20.9/compilers/bin/.nvfortranrc
-fast               Common optimizations; includes -O2 -Munroll=c:1 -Mlre -Mautoinline
                    == -Mvect=simd -Mflushz -Mcache_align
-M[no]vect[=[no]altcode|[no]assoc|cachesize:|[no]fuse|[no]gather|[no]idiom|levels:
                    |nocond|[no]partial|prefetch|[no]short|[no]simd[:{128|256}]
                    |[no]simdresidual|[no]sizelimit[:n]|[no]sse|[no]tile]
                    Control automatic vector pipelining
    [no]altcode     Generate appropriate alternative code for vectorized loops
    [no]assoc       Allow [disallow] reassociation
    cachesize:   Optimize for cache size c
    [no]fuse        Enable [disable] loop fusion
    [no]gather      Enable [disable] vectorization of indirect array references
    [no]idiom       Enable [disable] idiom recognition
    levels:      Maximum nest level of loops to optimize
    nocond          Disable vectorization of loops with conditionals
    [no]partial     Enable [disable] partial loop vectorization via inner loop distribution
    prefetch        Generate prefetch instructions
    [no]short       Enable [disable] short vector operations
    [no]simd[:{128|256}]
                    Generate [don't generate] SIMD instructions
     128            Use 128-bit SIMD instructions
     256            Use 256-bit SIMD instructions
     512            Use 512-bit SIMD instructions
    [no]simdresidual
                    Enable [disable] vectorization of the residual loop of a vectorized loop
    [no]sizelimit[:n]
                    Limit size of vectorized loops
    [no]sse         The [no]sse option is deprecated, use [no]simd instead.
    [no]tile        Enable [disable] loop tiling
-M[no]flushz        Set SSE to flush-to-zero mode
-Mcache_align       Align large objects on cache-line boundaries
-O3                 Set opt level. All -O2 optimizations plus more aggressive code hoisting and scalar replacement, 
                    that may or may not be profitable, performed
                    == -Mvect=simd
-M[no]vect[=[no]altcode|[no]assoc|cachesize:|[no]fuse|[no]gather|[no]idiom|levels:|nocond|[no]partiali
                    |prefetch|[no]short|[no]simd[:{128|256}]|[no]simdresidual|[no]sizelimit[:n]|[no]sse|[no]tile]
                    Control automatic vector pipelining
    [no]altcode     Generate appropriate alternative code for vectorized loops
    [no]assoc       Allow [disallow] reassociation
    cachesize:   Optimize for cache size c
    [no]fuse        Enable [disable] loop fusion
    [no]gather      Enable [disable] vectorization of indirect array references
    [no]idiom       Enable [disable] idiom recognition
    levels:      Maximum nest level of loops to optimize
    nocond          Disable vectorization of loops with conditionals
    [no]partial     Enable [disable] partial loop vectorization via inner loop distribution
    prefetch        Generate prefetch instructions
    [no]short       Enable [disable] short vector operations
    [no]simd[:{128|256}]
                    Generate [don't generate] SIMD instructions
     128            Use 128-bit SIMD instructions
     256            Use 256-bit SIMD instructions
     512            Use 512-bit SIMD instructions
    [no]simdresidual
                    Enable [disable] vectorization of the residual loop of a vectorized loop
    [no]sizelimit[:n]
                    Limit size of vectorized loops
    [no]sse         The [no]sse option is deprecated, use [no]simd instead.
    [no]tile        Enable [disable] loop tiling
-M[no]info[=all|accel|ftn|inline|intensity|ipa|loop|lre|mp|opt|par|pcast|pfo|stat|time|vect]
                    Generate informational messages about optimizations
    all             -Minfo=accel,inline,ipa,loop,lre,mp,opt,par,vect
    accel           Enable Accelerator information
    ftn             Enable Fortran-specific information
    inline          Enable inliner information
    intensity       Enable compute intensity information
    ipa             Enable IPA information
    loop            Enable loop optimization information
    lre             Enable LRE information
    mp              Enable OpenMP information
    opt             Enable optimizer information
    par             Enable parallelizer information
    pcast           Enable PCAST information
    pfo             Enable profile feedback information
    stat            Same as -Minfo=time
    time            Display time spent in compiler phases
    vect            Enable vectorizer information
-[no]acc[=gpu|host|multicore|[no]autopar|[no]routineseq|legacy|strict|verystrict|sync|[no]wait]
                    Enable OpenACC directives
    gpu             OpenACC directives are compiled for GPU execution only; 
                    please refer to -gpu for target specific options
    host            Compile for serial execution on the host CPU
    multicore       Compile for parallel execution on the host CPU
    [no]autopar     Enable (default) or disable loop autoparallelization within acc parallel
    [no]routineseq  Compile every routine for the device
    legacy          Suppress warnings about deprecated PGI accelerator directives
    strict          Issue warnings for non-OpenACC accelerator directives
    verystrict      Fail with an error for any non-OpenACC accelerator directive
    sync            Ignore async clauses
    [no]wait        Wait for each device kernel to finish
-ta=host|multicore|tesla
                    Choose target accelerator (supported only for OpenACC, DEPRECATED please refer to -acc and -gpu)
    host            Compile for serial execution on the host CPU
    multicore       Compile for parallel execution on the host CPU
    tesla           Compile for parallel execution on a Tesla GPU

---- Generating a module for parallel execution on the GPU ----------------
$ nvfortran -fast -O3 -Minfo -acc=gpu -ta=tesla,cc60 himenobench_kernels.F90    (generates OpenACC GPU code)
himenobmtxp_f90:
     99, Generating create(a(:,:,:,:),bnd(:,:,:),b(:,:,:,:)) [if not already present]
         Generating copyout(p(:,:,:)) [if not already present]
         Generating create(wrk2(:,:,:),wrk1(:,:,:),c(:,:,:,:)) [if not already present]
initmt:
    266, Generating present(a(:,:,:,:),bnd(:,:,:),c(:,:,:,:),b(:,:,:,:),p(:,:,:),wrk2(:,:,:),wrk1(:,:,:))
    267, Generating Tesla code
        268, !$acc loop gang ! blockidx%x
        269, !$acc loop seq
        270, !$acc loop vector(128) ! threadidx%x
    269, Loop is parallelizable
    270, Loop is parallelizable
    298, Generating Tesla code
        299, !$acc loop gang ! blockidx%x
        300, !$acc loop seq
        301, !$acc loop vector(128) ! threadidx%x
    300, Loop is parallelizable
    301, Loop is parallelizable
jacobi:
    362, Generating present(a(:,:,:,:),bnd(:,:,:),c(:,:,:,:),b(:,:,:,:),p(:,:,:),wrk2(:,:,:),wrk1(:,:,:))
    365, Loop not vectorized/parallelized: contains call
    368, Generating implicit copy(gosa) [if not already present]
    370, Loop is parallelizable
    371, Loop is parallelizable
    372, Loop is parallelizable
         Generating Tesla code
        370, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
             Generating implicit reduction(+:gosa)
        371,   ! blockidx%x threadidx%x collapsed
        372,   ! blockidx%x threadidx%x collapsed
    393, Loop is parallelizable
    394, Loop is parallelizable
    396, Loop is parallelizable
         Generating Tesla code
        393, !$acc loop gang ! blockidx%y
        394, !$acc loop gang ! blockidx%x
        396, !$acc loop gang, vector(256) ! blockidx%z threadidx%x

$ ./a.out    (run)
 For example:
 Grid-size=
            XS (64x32x32)
            S  (128x64x64)
            M  (256x128x128)
            L  (512x256x256)
            XL (1024x512x512)
  Grid-size =
 initialize nvidia GPU
  mimax=          257  mjmax=          129  mkmax=          129
  imax=          256  jmax=          128  kmax=          128
  Time measurement accuracy : .10000E-05
  Start rehearsal measurement process.
  Measure the performance in 3 times.
   MFLOPS:    12349.59       time(s):   3.3305999999999995E-002   1.6939556E-03
 Now, start the actual measurement process.
 The loop will be excuted in          800  times.
 This will take about one minute.
 Wait for a while.
  Loop executed for           800  times
  Gosa :   8.3822059E-04
  MFLOPS:    57598.34       time(s):    1.904293000000000
  Score based on Pentium III 600MHz :    695.2963

---- Generating a module for parallel execution on a multicore CPU ----------------
$ nvfortran -fast -O3 -Minfo -acc=multicore  himenobench_kernels.F90
initmt:
    267, Generating Multicore code
        268, !$acc loop gang
    269, Loop is parallelizable
    270, Loop is parallelizable
         Generated vector simd code for the loop
    298, Generating Multicore code
        299, !$acc loop gang
    300, Loop is parallelizable
    301, Loop is parallelizable
         Generated vector simd code for the loop
jacobi:
    365, Loop not vectorized/parallelized: too deeply nested
    370, Loop is parallelizable
         Generating Multicore code
        370, !$acc loop gang
    370, Generating implicit reduction(+:gosa)
         FMA (fused multiply-add) instruction(s) generated
    371, Loop is parallelizable
    372, Loop is parallelizable
         Generated vector simd code for the loop containing reductions
         FMA (fused multiply-add) instruction(s) generated
    393, Loop is parallelizable
         Generating Multicore code
        393, !$acc loop gang
    394, Loop is parallelizable
    396, Loop is parallelizable
         Memory copy idiom, loop replaced by call to __c_mcopy4

$ ./a.out    (run)
 For example:
 Grid-size=
            XS (64x32x32)
            S  (128x64x64)
            M  (256x128x128)
            L  (512x256x256)
            XL (1024x512x512)
  Grid-size =
 initialize nvidia GPU
  mimax=          257  mjmax=          129  mkmax=          129
  imax=          256  jmax=          128  kmax=          128
  Time measurement accuracy : .10000E-05
  Start rehearsal measurement process.
  Measure the performance in 3 times.
   MFLOPS:    8282.797       time(s):   4.9658999999999995E-002   1.6941053E-03
 Now, start the actual measurement process.
 The loop will be excuted in          800  times.
 This will take about one minute.
 Wait for a while.
  Loop executed for           800  times
  Gosa :   8.3817181E-04
  MFLOPS:    8511.887       time(s):    12.88599300000000
  Score based on Pentium III 600MHz :    102.7509