
CUDA utils

This is my first post on CUDA. I have been working with this technology for a while and want to share some utilities that can be useful for newcomers to the field. All the code will be available in the cuda-utils subdirectory of my GitHub repository. I would like to acknowledge Lambda.ai for providing me with free credits for this blog. I will be testing this code on a gpu_1x_a100_sxm4 machine, which has an A100 GPU of the Ampere architecture. This GPU is a bit dated these days, but we won't be doing any heavy compute, so it will suffice.

Project structure

The files in this project are structured as follows:

.
├── CMakeLists.txt
├── README.md
└── src
    ├── gpu_allocate.cu
    └── gpu_info.cu

GPU info tool

The first tool just displays some basic information about the GPUs available in the system. Create a file gpu_info.cu in src with the following code:

#include <iostream>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess) {
        std::cerr << "Failed to get device count: " 
                  << cudaGetErrorString(err) << std::endl;
        return 1;
    }

    std::cout << "Detected " << deviceCount << " CUDA Capable Device(s)\n\n";

    for (int dev = 0; dev < deviceCount; ++dev) {
        // Select device
        cudaSetDevice(dev);

        // Query device properties
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);

        // Query memory info
        size_t freeBytes = 0, totalBytes = 0;
        cudaMemGetInfo(&freeBytes, &totalBytes);

        std::cout << "Device " << dev << ": " << prop.name << "\n";
        std::cout << "  PCI Domain/Bus/Device ID: " 
                  << prop.pciDomainID << "/" 
                  << prop.pciBusID    << "/" 
                  << prop.pciDeviceID << "\n";
        std::cout << "  Compute capability: " 
                  << prop.major << "." << prop.minor << "\n";
        std::cout << "  Total global memory: " 
                  << (prop.totalGlobalMem  / (1024.0 * 1024.0)) << " MB\n";
        std::cout << "  Free memory (current): " 
                  << (freeBytes  / (1024.0 * 1024.0)) << " MB\n";
        std::cout << "  Total allocatable memory (current): " 
                  << (totalBytes / (1024.0 * 1024.0)) << " MB\n";
        std::cout << "  Memory clock rate: " 
                  << (prop.memoryClockRate * 1e-3) << " MHz\n";
        std::cout << "  Memory bus width: " 
                  << prop.memoryBusWidth << " bits\n";
        std::cout << "  L2 cache size: " 
                  << prop.l2CacheSize / 1024 << " KB\n";
        std::cout << "  Max shared memory per block: " 
                  << prop.sharedMemPerBlock / 1024 << " KB\n";
        std::cout << "  Total constant memory: " 
                  << prop.totalConstMem / 1024 << " KB\n";
        std::cout << "  Warp size: " 
                  << prop.warpSize << "\n";
        std::cout << "  Max threads per block: " 
                  << prop.maxThreadsPerBlock << "\n";
        std::cout << "  Max threads per multiprocessor: " 
                  << prop.maxThreadsPerMultiProcessor << "\n";
        std::cout << "  Multiprocessor count: " 
                  << prop.multiProcessorCount << "\n";
        std::cout << "  Max grid dimensions: [" 
                  << prop.maxGridSize[0] << ", " 
                  << prop.maxGridSize[1] << ", " 
                  << prop.maxGridSize[2] << "]\n";
        std::cout << "  Max block dimensions: [" 
                  << prop.maxThreadsDim[0] << ", " 
                  << prop.maxThreadsDim[1] << ", " 
                  << prop.maxThreadsDim[2] << "]\n";
        std::cout << "  Clock rate: " 
                  << (prop.clockRate * 1e-3) << " MHz\n";
        std::cout << "  Concurrent kernels: " 
                  << (prop.concurrentKernels ? "Yes" : "No") << "\n";
        std::cout << "  ECC enabled: " 
                  << (prop.ECCEnabled ? "Yes" : "No") << "\n";
        std::cout << "  Integrated device: " 
                  << (prop.integrated ? "Yes" : "No") << "\n";
        std::cout << "  Can map host memory: " 
                  << (prop.canMapHostMemory ? "Yes" : "No") << "\n";
        std::cout << "  Compute mode: ";
        switch (prop.computeMode) {
            case cudaComputeModeDefault:      std::cout << "Default\n"; break;
            case cudaComputeModeExclusive:    std::cout << "Exclusive\n"; break;
            case cudaComputeModeProhibited:   std::cout << "Prohibited\n"; break;
            case cudaComputeModeExclusiveProcess:
                                              std::cout << "Exclusive Process\n"; break;
            default:                          std::cout << "Unknown\n"; break;
        }
        std::cout << "  Unified addressing: " 
                  << (prop.unifiedAddressing ? "Yes" : "No") << "\n";
        std::cout << "  Async engines: " 
                  << prop.asyncEngineCount << "\n";
        std::cout << "  Device overlap: " 
                  << (prop.deviceOverlap ? "Yes" : "No") << "\n";
        std::cout << "  PCI bus ID: " 
                  << prop.pciBusID << "\n";
        std::cout << "  PCI device ID: " 
                  << prop.pciDeviceID << "\n";
        std::cout << "\n";
    }

    return 0;
}

The main ingredient is cudaDeviceProp, a struct defined in cuda_runtime.h (see the documentation) that holds the properties of a device. We first count the devices with cudaGetDeviceCount and then loop over all of them, printing their properties from the device property variable.
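One thing to note: in the listing above only cudaGetDeviceCount is checked for errors, while cudaGetDeviceProperties and cudaMemGetInfo are assumed to succeed. If you prefer to check every call, a common pattern is a small wrapper macro. Here is a minimal sketch (CUDA_CHECK is my own name for it, not part of the CUDA API):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a readable message if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                    \
    do {                                                                    \
        cudaError_t err_ = (call);                                          \
        if (err_ != cudaSuccess) {                                          \
            std::fprintf(stderr, "CUDA error '%s' at %s:%d\n",              \
                         cudaGetErrorString(err_), __FILE__, __LINE__);     \
            std::exit(EXIT_FAILURE);                                        \
        }                                                                   \
    } while (0)

// Usage inside the loop above, for example:
//   CUDA_CHECK(cudaGetDeviceProperties(&prop, dev));
//   CUDA_CHECK(cudaMemGetInfo(&freeBytes, &totalBytes));

With that aside, let's see what the output looks like for an A100 GPU from Lambda.ai: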

Detected 1 CUDA Capable Device(s)

Device 0: NVIDIA A100-PCIE-40GB
  PCI Domain/Bus/Device ID: 0/7/0
  Compute capability: 8.0
  Total global memory: 40442.4 MB
  Free memory (current): 40019.6 MB
  Total allocatable memory (current): 40442.4 MB
  Memory clock rate: 1215 MHz
  Memory bus width: 5120 bits
  L2 cache size: 40960 KB
  Max shared memory per block: 48 KB
  Total constant memory: 64 KB
  Warp size: 32
  Max threads per block: 1024
  Max threads per multiprocessor: 2048
  Multiprocessor count: 108
  Max grid dimensions: [2147483647, 65535, 65535]
  Max block dimensions: [1024, 1024, 64]
  Clock rate: 1410 MHz
  Concurrent kernels: Yes
  ECC enabled: Yes
  Integrated device: No
  Can map host memory: Yes
  Compute mode: Default
  Unified addressing: Yes
  Async engines: 3
  Device overlap: Yes
  PCI bus ID: 7
  PCI device ID: 0

It tells us the (global) memory is around 40 GB and that it is mostly free. The warp size is 32, which is common across architectures. The maximum number of threads per block is 1024, also very common, and the maximum block dimensions are [1024, 1024, 64]. I like this tool just to know the limits I am working against when I code kernels in CUDA (the API we use to program the NVIDIA card).
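To make those limits concrete, here is a minimal sketch of how they typically feed into a launch configuration. The kernel fillKernel and its arguments are made up for illustration; the point is that the block size must not exceed maxThreadsPerBlock and the grid is sized to cover all elements:

#include <cuda_runtime.h>

__global__ void fillKernel(float* data, int n, float value) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n) data[idx] = value;   // guard against the last, partial block
}

void launchFill(float* d_data, int n) {
    // 256 threads per block: a typical choice, well below the 1024 limit above.
    int blockSize = 256;
    // Round up so every element is covered; the grid x-dimension limit
    // (2147483647 above) is far beyond anything we need here.
    int gridSize = (n + blockSize - 1) / blockSize;
    fillKernel<<<gridSize, blockSize>>>(d_data, n, 1.0f);
}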

GPU allocate tool

This tool is a bit different: it can be used to reserve a chunk of GPU memory, and it serves as a hello-world example of basic CUDA coding. Write the following content into gpu_allocate.cu:

#include <cuda_runtime.h>
#include <iostream>
#include <string>
#include <cstdlib>
#include <chrono>
#include <thread>

// Helper to parse size strings like 1024, 100M, 2G, etc.
size_t parseSize(const std::string& s) {
    char unit = s.back();
    std::string num = s;
    size_t multiplier = 1;
    if (unit == 'K' || unit == 'k') {
        multiplier = 1024ULL;
        num = s.substr(0, s.size() - 1);
    } else if (unit == 'M' || unit == 'm') {
        multiplier = 1024ULL * 1024ULL;
        num = s.substr(0, s.size() - 1);
    } else if (unit == 'G' || unit == 'g') {
        multiplier = 1024ULL * 1024ULL * 1024ULL;
        num = s.substr(0, s.size() - 1);
    }
    return static_cast<size_t>(std::stoull(num) * multiplier);
}

// Helper to parse time strings like 10s, 5m, 1h, or raw seconds
long parseTime(const std::string& s) {
    char unit = s.back();
    std::string num = s;
    long multiplier = 1;
    if (unit == 's' || unit == 'S') {
        multiplier = 1;
        num = s.substr(0, s.size() - 1);
    } else if (unit == 'm' || unit == 'M') {
        multiplier = 60;
        num = s.substr(0, s.size() - 1);
    } else if (unit == 'h' || unit == 'H') {
        multiplier = 3600;
        num = s.substr(0, s.size() - 1);
    }
    return static_cast<long>(std::stol(num) * multiplier);
}

int main(int argc, char* argv[]) {
    if (argc != 4) {
        std::cerr << "Usage: " << argv[0] << " <gpu_id> <memory_amount (e.g., 512M, 1G, or bytes)> <duration (e.g., 10s, 5m, 1h)>" << std::endl;
        return EXIT_FAILURE;
    }

    int gpuId = std::stoi(argv[1]);
    size_t bytes = parseSize(argv[2]);
    long duration = parseTime(argv[3]);

    cudaError_t err = cudaSetDevice(gpuId);
    if (err != cudaSuccess) {
        std::cerr << "Error setting GPU device " << gpuId << ": " << cudaGetErrorString(err) << std::endl;
        return EXIT_FAILURE;
    }

    void* d_ptr = nullptr;
    err = cudaMalloc(&d_ptr, bytes);
    if (err != cudaSuccess) {
        std::cerr << "Error allocating " << bytes << " bytes on GPU " << gpuId
                  << ": " << cudaGetErrorString(err) << std::endl;
        return EXIT_FAILURE;
    }

    std::cout << "Successfully allocated " << bytes << " bytes on GPU " << gpuId
              << ", holding for " << duration << " seconds..." << std::endl;

    // Keep the allocation alive for the specified duration
    std::this_thread::sleep_for(std::chrono::seconds(duration));

    // Free the allocation and exit
    cudaFree(d_ptr);
    std::cout << "Freed memory and exiting." << std::endl;
    return EXIT_SUCCESS;
}

Let's take a look at main. It is a command line tool with three inputs in argv: the GPU id, the number of bytes, and the duration in seconds. Basically we want to allocate a given number of bytes on a specific GPU for a certain amount of time. We first select the GPU with cudaSetDevice and allocate the memory with cudaMalloc(&d_ptr, bytes), where d_ptr is a pointer to void. Then, on the CPU side, we sleep for the requested number of seconds with std::this_thread::sleep_for(std::chrono::seconds(duration)), and once that time has elapsed we deallocate the memory with cudaFree and exit the program with a success code. The functions parseSize and parseTime are two helpers that convert sizes (kilobytes, megabytes, gigabytes) to bytes and times (hours, minutes, seconds) to seconds.
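For example, ./gpu_allocate 0 512M 10m reserves 512 MiB on GPU 0 and holds it for ten minutes, which is handy when you want to simulate memory pressure on a shared machine. Note that the tool only allocates the memory and never writes to it; if you also want the region initialized, a small optional addition right after the allocation succeeds could look like this (just a sketch, not part of the tool above):

    // Optional: zero-fill the freshly allocated region so it is written once.
    err = cudaMemset(d_ptr, 0, bytes);
    if (err != cudaSuccess) {
        std::cerr << "Error zero-filling allocation: "
                  << cudaGetErrorString(err) << std::endl;
        cudaFree(d_ptr);
        return EXIT_FAILURE;
    }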

The CMakeLists.txt file

CMake is a powerful command line tool that generates the build system (for example, Makefiles) for your project. It is very convenient for C++ and CUDA projects. Write a CMakeLists.txt file with this content:

cmake_minimum_required(VERSION 3.10)
project(GPUTools LANGUAGES CXX CUDA)

# default exec names
set(GPU_INFO_OUT_NAME    "gpu_info"
    CACHE STRING "Name of the gpu_info executable")
set(GPU_ALLOC_OUT_NAME   "gpu_allocate"
    CACHE STRING "Name of the gpu_allocate executable")

# Restore old FindCUDA behavior if needed
if(POLICY CMP0146)
  cmake_policy(SET CMP0146 OLD)
endif()

# Language standards
set(CMAKE_CXX_STANDARD      14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)

set(CMAKE_CUDA_ARCHITECTURES 80 CACHE STRING
    "List of CUDA architectures to build for (e.g. 61;70;75;86)")

# Find CUDA (for older CMake) or you can use find_package(CUDAToolkit) in 3.17+
find_package(CUDA REQUIRED)
include_directories(${CUDA_INCLUDE_DIRS})

add_executable(gpu_info
  src/gpu_info.cu
)
target_link_libraries(gpu_info
  PRIVATE ${CUDA_CUDART_LIBRARY}
)
set_target_properties(gpu_info
  PROPERTIES OUTPUT_NAME ${GPU_INFO_OUT_NAME}
)

add_executable(gpu_allocate
  src/gpu_allocate.cu
)
target_link_libraries(gpu_allocate
  PRIVATE ${CUDA_CUDART_LIBRARY}
)
set_target_properties(gpu_allocate
  PROPERTIES OUTPUT_NAME ${GPU_ALLOC_OUT_NAME}
)

# (Optional) If you want to give a different on-disk name:
# set(EXE_NAME alloc_mem)
# set_target_properties(allocate_gpu_memory PROPERTIES OUTPUT_NAME ${EXE_NAME})

# Installation
install(TARGETS
  gpu_info
  gpu_allocate
  RUNTIME DESTINATION bin
)

There are two variables that have defaults but can be overridden on the cmake command line: GPU_INFO_OUT_NAME and GPU_ALLOC_OUT_NAME, the names of the two executables. We then set the C++ standard, and CMAKE_CUDA_ARCHITECTURES is the architecture of the GPU we are compiling for:

GPU Architecture   NVCC Arch Flag   Compute Capability   Example GPUs
Kepler             sm_30            3.0                  GTX 780, Tesla K20
Maxwell            sm_50            5.0                  GTX 970, Tesla M60
Pascal             sm_60            6.0                  GTX 1080, Tesla P100
Volta              sm_70            7.0                  Tesla V100
Turing             sm_75            7.5                  RTX 2080, Quadro RTX 6000
Ampere (A100)      sm_80            8.0                  A100
Ampere (GA10x)     sm_86            8.6                  RTX 3090, 3080, 3070, A10, RTX A6000
Ada Lovelace       sm_89            8.9                  RTX 4090, 4080
Hopper             sm_90            9.0                  H100
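If you are unsure which row applies to your machine, the gpu_info tool above already prints the compute capability; you can also query it directly from the runtime. A minimal sketch (it assumes device 0 exists):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return 1;
    // On an A100 this prints "sm_80", i.e. pass 80 to CMAKE_CUDA_ARCHITECTURES.
    std::printf("sm_%d%d\n", prop.major, prop.minor);
    return 0;
}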

In our case the architecture is given by the Compute capability: 8.0 line printed in the previous section; that is, our GPU is an A100 and the flag is sm_80. A general solution is set(CMAKE_CUDA_ARCHITECTURES all CACHE STRING "Target all architectures"), which makes the code compatible with any card but increases compilation time and the size of the resulting binaries. For modern GPUs you can do set(CMAKE_CUDA_ARCHITECTURES 75 80 86 89 CACHE STRING "Target common modern architectures"). Then we need to find the CUDA libraries with

find_package(CUDA REQUIRED)
include_directories(${CUDA_INCLUDE_DIRS})

which locates the CUDA toolkit and adds its headers to the project. For CMake ≥ 3.17 we would only need to declare project(GPUTools LANGUAGES CXX CUDA) (and use find_package(CUDAToolkit) for the runtime library), without needing to pull in the CUDA headers through include_directories. Finally we tell CMake which executables to compile, which libraries to link, and the output name of each executable, followed by the install instruction for the two executables.

To compile the two executables we need to run

rm -rf build && mkdir build && cd build
cmake ..
cmake --build .

which creates them in the build directory. But if you want to make them available system-wide by installing them into $HOME/.local/bin, you can do

rm -rf build && mkdir build && cd build
cmake \
  -DGPU_INFO_OUT_NAME=gpu_info \
  -DGPU_ALLOC_OUT_NAME=gpu_allocate \
  -DCMAKE_CUDA_ARCHITECTURES="70;75;80" \
  -DCMAKE_INSTALL_PREFIX=${HOME}/.local \
  ..
cmake --build .
cmake --install .

where GPU_INFO_OUT_NAME and GPU_ALLOC_OUT_NAME are the names of the executables (should we want to change them), CMAKE_CUDA_ARCHITECTURES is the list of GPU architectures we want to compile for, and CMAKE_INSTALL_PREFIX is the install directory. For this last one, even though we set it to ${HOME}/.local, the binaries end up in ${HOME}/.local/bin because of the RUNTIME DESTINATION bin clause in the CMakeLists.txt.

Bonus: nvitop

nvitop is a great tool to monitor your GPUs. I personally like it better than nvidia-smi, NVIDIA's default "top". It is distributed as a Python package, so the cleanest way to install it is in a dedicated Python virtual environment. We have covered virtual environments before on this blog, so I won't go into detail here. Just create a virtual environment called nvitop in $HOME/.venvs and pip install the tool:

mkdir -p $HOME/.venvs
python -m venv $HOME/.venvs/nvitop
$HOME/.venvs/nvitop/bin/pip install nvitop

Now you can execute nvitop with

$HOME/.venvs/nvitop/bin/nvitop

or by activating the environment and running nvitop from the command line. It is more convenient to create a symlink in the ${HOME}/.local/bin directory so that the command is in your $PATH:

ln -s  $HOME/.venvs/nvitop/bin/nvitop $HOME/.local/bin/nvitop

It seems the nvitop package is not platform specific: the built wheel I currently find on PyPI for the most recent version, 1.5.1, is nvitop-1.5.1-py3-none-any.whl. So this wheel should work on ARM64 (the new generation of Grace Hopper and Grace Blackwell systems with integrated CPU and GPU) as well as on x86 CPUs.

Now we have the executables in ${HOME}/.local/bin, which should be in our $PATH.

Conclusion

I hope you liked these tools; so far they have been useful for me. I will probably build more in the near future, so I will post about this again.

This post is licensed under CC BY 4.0 by the author.