Prefetching Memory in CUDA

GPU Programming

Threads, Blocks and Grids

A thread is the smallest unit of execution on the GPU. A kernel is executed as a grid of blocks of threads: threads are grouped into blocks, and blocks are grouped into a grid. Each block is run by a single Streaming Multiprocessor (SM) and, in most usual single-node cases, cannot be migrated to other SMs. One SM may execute several CUDA blocks concurrently.
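
As a hedged sketch of how these pieces fit together, the following launches a hypothetical square kernel over N elements, computing each thread's global index from its block and thread indices; the kernel name, N, and the block size of 256 are illustrative assumptions.

    __global__ void square(float *a, int N)
    {
        // Global index of this thread across the whole grid
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)            // guard: the grid may be slightly larger than N
            a[idx] *= a[idx];
    }

    // Launch: enough blocks of 256 threads to cover all N elements
    int threadsPerBlock = 256;
    int numberOfBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;
    square<<<numberOfBlocks, threadsPerBlock>>>(a, N);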

Paging

Paging is a memory-management technique which allows a process's physical address space to be non-contiguous. Paging avoids two main problems:

  • external memory fragmentation, and
  • the associated need for compaction

Paging is usually accomplished by breaking physical memory into fixed-sized blocks called frames, and breaking logical memory into blocks of the same size called pages. When a process is run, its pages are loaded from secondary memory (file system or backing store) into available memory frames. The most interesting aspect of this is that it allows a process to have a logical 64-bit address space even though the system has less than $2^{64}$ bytes of physical memory.
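
As a brief, hedged illustration of the page/frame split, the snippet below decomposes a logical address into a page number and an offset; the 4 KiB page size and the address are assumptions chosen only for illustration.

    #include <cstdint>
    #include <cstdio>

    const uint64_t PAGE_SIZE = 4096;    // assumed page size (4 KiB)

    int main()
    {
        uint64_t logical = 0x12345678;                  // an arbitrary logical address
        uint64_t pageNumber = logical / PAGE_SIZE;      // which page the address falls in
        uint64_t offset     = logical % PAGE_SIZE;      // where within that page
        printf("page %llu, offset %llu\n",
               (unsigned long long)pageNumber,
               (unsigned long long)offset);
        return 0;
    }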

Page Faults

A page fault occurs when a process tries to access a page that has not been brought into memory (whether device or host). The paging hardware notices that an invalid bit is set, and a straightforward procedure handles the fault:

  1. Check an internal table for the process to determine whether the reference itself was valid
  2. If the reference was invalid, terminate the process, else we page in the data
  3. Find a free frame in physical memory (a frame is a fixed-size block of physical memory)
  4. Schedule a secondary storage operation to load the needed page into the newly allocated frame
  5. When the read is complete, the internal table kept with the process and the page table are modified to indicate that the page is now in memory
  6. The instruction is restarted

There are two main advantages of GPU page-faulting:

  • the CUDA system doesn't need to synchronize all managed memory allocations to the GPU before each kernel launch, since faulting causes automatic migration, and
  • pages are mapped into the GPU address space on demand

The above process takes non-zero time, and a series of page faults can add significant overhead to any CUDA kernel's execution.

Unified Memory

The CUDA programming model streamlines kernel development by providing Unified Memory (UM), eliminating the need for explicit data movement via cudaMemcpy*(). The UM model lets all processors see a coherent memory image with a common address space, where data movement is handled implicitly by CUDA.

UM is for writing streamlined code, and does not necessarily result in a speed increase.

More significantly, leaving memory placement implicit may result in a large amount of page-faulting during kernel execution.
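
To see where those faults come from, here is a minimal sketch, assuming a hypothetical square kernel like the one above: the managed buffer is touched first on the host and then on the GPU, with no cudaMemcpy*() and no prefetching, so the GPU's first accesses are resolved by page faults and migrations.

    #include <cstdio>

    __global__ void square(float *a, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] *= a[idx];
    }

    int main()
    {
        const int N = 1 << 20;
        size_t size = N * sizeof(float);

        float *a;
        cudaMallocManaged(&a, size);      // one pointer, visible to both host and device

        for (int i = 0; i < N; ++i)       // first touch on the host: pages reside in host memory
            a[i] = (float)i;

        // The kernel's first accesses fault, triggering HtoD page migrations
        square<<<(N + 255) / 256, 256>>>(a, N);
        cudaDeviceSynchronize();

        // Host reads fault the pages back, triggering DtoH migrations
        printf("a[0] = %f\n", a[0]);

        cudaFree(a);
        return 0;
    }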

According to the CUDA Performance tuning guidelines:

  • Faults should be avoided: fault handling is expensive, since it may involve TLB invalidations, data migrations and page table updates
  • Data should be local to the accessing processor to minimize memory access latency and maximize bandwidth
  • The overhead of migration may exceed the benefits of locality if data is constantly migrated

Hence we cannot rely on UM alone, since the UM driver can't always detect common access patterns and optimize around them. When access patterns are non-obvious, it needs some guidance.
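
One way to provide that guidance is with memory-usage hints via cudaMemAdvise. The sketch below is only illustrative (the buffer and its size are assumptions): it marks a managed allocation as read-mostly and sets its preferred location to the current GPU.

    int deviceId;
    cudaGetDevice(&deviceId);                 // GPU the hints refer to

    float *a;
    size_t size = (1 << 20) * sizeof(float);
    cudaMallocManaged(&a, size);

    // Hint: the data is mostly read, so the driver may keep read-only copies on several processors
    cudaMemAdvise(a, size, cudaMemAdviseSetReadMostly, deviceId);

    // Hint: prefer keeping the physical pages resident on this GPU
    cudaMemAdvise(a, size, cudaMemAdviseSetPreferredLocation, deviceId);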

Prefetching

Data prefetching means moving data to a processor's memory and creating the mapping in that processor's page tables BEFORE data processing begins, with the aim of avoiding faults and establishing locality.

cudaMemPrefetchAsync prefetches memory to the specified destination device:

    cudaError_t cudaMemPrefetchAsync(
        const void *devPtr,     // managed memory region to prefetch
        size_t count,           // number of bytes
        int dstDevice,          // destination device ID (or cudaCpuDeviceId for the host)
        cudaStream_t stream     // stream on which the prefetch is enqueued
    );
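
A minimal usage sketch, assuming a managed buffer a of size bytes and work submitted on the default stream: the data is prefetched to the current GPU before the kernels run, and back to the host (cudaCpuDeviceId is the built-in device ID for the CPU) before host code reads the results.

    int deviceId;
    cudaGetDevice(&deviceId);                        // ID of the currently active GPU

    // Move the pages to the GPU and map them before any kernel touches them
    cudaMemPrefetchAsync(a, size, deviceId);

    // ... launch kernels that operate on a ...

    // Prefetch the results back to host memory before the CPU reads them
    cudaMemPrefetchAsync(a, size, cudaCpuDeviceId);
    cudaDeviceSynchronize();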

Profiling Prefetches

The following shows profile statistics, gathered with nsys, of two kernels which do exactly the same operation (squaring a series of float values on the GPU).

With prefetching


CUDA Memory Operation Statistics (by time):

 Time (%)  Total Time (ns)  Count  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)              Operation
 --------  ---------------  -----  ---------  ---------  --------  --------  -----------  ---------------------------------
    100.0       10,592,835     64  165,513.0  165,313.0   165,249   178,465      1,645.6  [CUDA Unified Memory memcpy DtoH]


CUDA Memory Operation Statistics (by size):

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)              Operation
 ----------  -----  --------  --------  --------  --------  -----------  ---------------------------------
    134.218     64     2.097     2.097     2.097     2.097        0.000  [CUDA Unified Memory memcpy DtoH]

Without prefetching

CUDA Memory Operation Statistics (by time):

 Time (%)  Total Time (ns)  Count  Avg (ns)  Med (ns)  Min (ns)  Max (ns)  StdDev (ns)              Operation
 --------  ---------------  -----  --------  --------  --------  --------  -----------  ---------------------------------
     67.0       22,186,252  1,536  14,444.2   3,935.0     1,278   120,226     23,567.2  [CUDA Unified Memory memcpy HtoD]
     33.0       10,937,857    768  14,242.0   3,359.0       895   102,529     23,680.5  [CUDA Unified Memory memcpy DtoH]

CUDA Memory Operation Statistics (by size):

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)              Operation
 ----------  -----  --------  --------  --------  --------  -----------  ---------------------------------
    268.435  1,536     0.175     0.033     0.004     1.044        0.301  [CUDA Unified Memory memcpy HtoD]
    134.218    768     0.175     0.033     0.004     1.044        0.301  [CUDA Unified Memory memcpy DtoH]

Two things are of note: the total time spent on memory operations without prefetching is nearly triple that of the prefetching version, and the individual memory operations are much smaller and far more numerous. The only difference between the two kernels is a call to cudaMemPrefetchAsync for each data structure that is copied to the device (GPU):

    int deviceId;
    cudaGetDevice(&deviceId);        // ID of the GPU the data will be prefetched to

    const int N = 2 << 24;
    size_t size = N * sizeof(float);

    // Declare a float pointer
    float *a;
    // Set up unified memory
    cudaMallocManaged(&a, size);

    // Up to this point everything is as usual; the only difference is this call to prefetch
    cudaMemPrefetchAsync(a, size, deviceId);

    // Go on to specify the number of threads, blocks, etc.
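
For completeness, a hedged continuation of the snippet above might look like the following; the square kernel and the launch configuration are illustrative assumptions rather than the exact profiled code, and a prefetch back to the host (as shown earlier) could precede the host-side read of the results.

    int threadsPerBlock = 256;
    int numberOfBlocks = (N + threadsPerBlock - 1) / threadsPerBlock;

    square<<<numberOfBlocks, threadsPerBlock>>>(a, N);
    cudaDeviceSynchronize();         // wait for the kernel before touching the data on the host

    cudaFree(a);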