2024年09月09日 / 最終更新日時 : 2024年09月09日 hpcworldNEWS 技術情報

NVIDIA H100のスレッドブロッククラスタと分散共有メモリを使ってみる

はじめに

プロメテックの南です。

今回は、NVIDIA GPUのcompute capability 9.0(Hopper世代)にて新たに導入された分散共有メモリ(Distributed Shared Memory) を紹介します。

分散共有メモリとは

CUDA 12より、従来のCUDA3階層(グリッド、スレッドブロック、スレッド)に対して、スレッドブロッククラスタ(Thread Block Cluster, 以下クラスタ)　という新たな階層が加わり、CUDA4階層となりました。
NVIDIAの資料によると*1、クラスタの機能は、

クラスタ内ブロックは、同時にスケジューリング

分散共有メモリ: 互いの共有メモリにアクセス可

高速なクラスタ内同期(HWサポート)

とのことです。
複数のブロックをクラスタとして扱うことで、高性能・高効率化を図る仕組みのようです。

以下に、ハードウェアとCUDAプログラミングモデルの対応を示します。

CUDA階層	Grid	Thread Block Cluster	Thread Block	Thread
ハードウェア	GPU	GPC (Graphics Processing Clusters)	SM (Streaming Multiprocessor)	CUDA Core

CUDA 12からは、GPC(複数のSM)をクラスタという単位で利用することができるようになり、その具体的機能の一つが分散共有メモリです。
これはGPC内(クラスタ内)であれば、各SMが持つ共有メモリ(Shared Memory)にアクセスすることができるという機能です。

スレッドブロッククラスタのHello, World!

では、さっそくクラスタを使ってみます。
CUDAではクラスタを利用するためのAPIが用意されています*2。

Hello, World! は以下のような感じでしょうか。
cooperative groupを用いて、グリッドおよびクラスタの情報を取得することができます。
ソースコード全文はこちらをご参照ください*3。


$ cat cluster_hello.cu
...
namespace cg = cooperative_groups;
__global__ void hello_kernel() {
    // グリッド - クラスタ - ブロック - スレッドの情報取得
    cg::grid_group grid = cg::this_grid();
    cg::cluster_group cluster = cg::this_cluster();
    int cluster_rank = grid.cluster_rank();  // クラスタのランク
    int cluster_size = grid.num_clusters();  // クラスタ数
    int block_rank = cluster.block_rank();   // ブロックID（ローカル）
    int block_idx = grid.block_rank();       // ブロックID（グローバル）
    int num_blocks = grid.dim_blocks().x;    // ブロック数
    int num_threads = blockDim.x;            // スレッド数
    int thread_rank = grid.thread_rank();    // スレッドID（グローバル）
    int thread_idx = threadIdx.x;            // スレッドID（ローカル）
    // 個々のブロックにて、代表スレッドで出力
    if (thread_idx == 0) {
        printf("Hello, World from Cluster %d of %d, Block(G) %d, Block(L) %d, Thread(G) %d, Thread(L) %d!\n", \
                cluster_rank,
                cluster_size,
                block_idx,
                block_rank,
                thread_rank,
                thread_idx);
    }
    // クラスタ内同期
    cluster.sync();
    // クラスタ内の代表スレッドで出力
    if (block_rank == 0 && thread_idx == 0) {
        printf("All blocks in Cluster %d have said hello!\n", cluster_rank);
    }
}
...

実行してみます。使用するコンパイラはNVHPC 24.7です。


$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Thu_Jun__6_02:18:23_PDT_2024
Cuda compilation tools, release 12.5, V12.5.82
Build cuda_12.5.r12.5/compiler.34385749_0

$ nvcc -o cluster_hello cluster_hello.cu -arch=sm_90

$ ./cluster_hello
Hello, World from Cluster 0 of 2, Block(G) 0, Block(L) 0, Thread(G) 0, Thread(L) 0!
Hello, World from Cluster 0 of 2, Block(G) 1, Block(L) 1, Thread(G) 32, Thread(L) 0!
Hello, World from Cluster 1 of 2, Block(G) 2, Block(L) 0, Thread(G) 64, Thread(L) 0!
Hello, World from Cluster 1 of 2, Block(G) 3, Block(L) 1, Thread(G) 96, Thread(L) 0!
All blocks in Cluster 0 have said hello!
All blocks in Cluster 1 have said hello!

ちゃんとクラスタが動いていそうですね。

分散共有メモリを使う

分散共有メモリは、クラスタ内であればブロックをまたいで共有メモリにアクセスすることができる機能です。
したがって、共有メモリとして扱えるデータ量がクラスタ内のブロック数に応じて大きくなります。
使いどころとしては、ブロック単体では共有メモリ上で処理できない量のデータでも、クラスタであれば処理できるようなケースでしょう。

今回は、公式ドキュメントでも紹介されている、ヒストグラム計算を取り上げます*2。
実装はこちらをご参照ください*3。


$ nvaccelinfo
...
Device Name:                   NVIDIA H100 PCIe   // H100
...
Total Shared Memory per Block: 49152   // 1ブロックの共有メモリ容量は48KB
...

$ cat cluster_histogram.cu
...
#define NUM_BINS 24000               // int32の場合、94KB。1ブロックの共有メモリには収まらないが、2ブロックなら収まる
#define NUM_ELEMENTS 100000000

__global__ void histogram_kernel(int* input, int input_size, int nbins_per_block, int* output) {
    cg::grid_group grid = cg::this_grid();
    cg::cluster_group cluster = cg::this_cluster();
    int cluster_rank = grid.cluster_rank();  // クラスタのランク
    int cluster_size = grid.num_clusters();  // クラスタ数
    int block_rank = cluster.block_rank();   // ブロックID（ローカル）
    int block_idx = grid.block_rank();       // ブロックID（グローバル）
    int num_blocks = grid.dim_blocks().x;    // ブロック数
    int num_threads = blockDim.x;            // スレッド数
    int thread_rank = grid.thread_rank();    // スレッドID（グローバル）
    int thread_idx = threadIdx.x;            // スレッドID（ローカル）
    // 共有メモリの初期化
    extern __shared__ int smem[];
    for (int i = thread_idx; i < nbins_per_block; i += num_threads) {
      smem[i] = 0;
    }
    cluster.sync();
    // 共有メモリ上でヒストグラム計算
    for (int i = thread_rank; i < input_size; i += num_blocks * num_threads) {
      int bin_id = input[i];
      int dst_rank = bin_id / nbins_per_block;                  // クラスタ内のランク
      int dst_offset = bin_id % nbins_per_block;                // 共有メモリのオフセット
      int *dst_smem = cluster.map_shared_rank(smem, dst_rank);  // クラスタ内の共有メモリのアドレスを取得
      atomicAdd(dst_smem + dst_offset, 1);
    }
    cluster.sync();
    // グローバルメモリへの書き込み
    int *dist_output = output + block_rank * nbins_per_block;
    for (int i = thread_idx; i < nbins_per_block; i += num_threads) {
      atomicAdd(&dist_output[i], smem[i]);
    }
    cluster.sync();
}
...
int main() {
    ...
    // カーネル起動の準備
    const int cluster_size = 2;           // クラスタサイズ。1, 2, 4, 8のいずれか
    int nbins_per_block = NUM_BINS/cluster_size;   // 各ブロックの共有メモリに保持するヒストグラムの数
    config.dynamicSmemBytes = nbins_per_block * sizeof(int);  // 各ブロックの共有メモリのサイズ
    ...

    // カーネル起動
    CUDA_CHECK(cudaLaunchKernelEx(&config, histogram_kernel, d_input, NUM_ELEMENTS, nbins_per_block, d_output));
    ...
}

さて、実行してみましょう。


$ nvcc -o cluster_histogram cluster_histogram.cu -arch=sm_90

$ ./cluster_histogram
Total count: 100000000 (should be 100000000)
Kernel execution time: 1.110144 milliseconds

$ ./simple_histogram          // 比較用。共有メモリを使わない場合
Total count: 100000000 (should be 100000000)
Kernel execution time: 1.744928 milliseconds

共有メモリを使用しない場合に比べて、 1.744928 / 1.110144 = 1.57倍程度の性能向上が得られていそうです。

まとめ

本記事では、NVIDIA GPU Hopper世代にて新たに導入されたスレッドブロッククラスタ(Thread Block Cluster)および分散共有メモリ(Distributed Shared Memory) を紹介しました。

出典

Hopper アーキテクチャで、変わること、変わらないこと
CUDA C++ Programming Guide
H100 Cluster の実装例

カテゴリー: 技術情報

タグ: CUDA GPU

HPC技術コラム

NVIDIA H100のスレッドブロッククラスタと分散共有メモリを使ってみる

はじめに

分散共有メモリとは

スレッドブロッククラスタのHello, World!

分散共有メモリを使う

まとめ

出典

MPIによる分散通信を試す

OpenACCのworker並列を使ってプログラムを高速化する