Dynamic global memory allocation and operation in CUDA
Dynamic Global Memory Allocation and Operations
Dynamic global memory allocation and operations are only supported by devices of compute capability 2.x and higher.
__host__ __device__ void* malloc(size_t size);
__device__ void *__nv_aligned_device_malloc(size_t size, size_t align);
__host__ __device__ void free(void* ptr);
These functions allocate and free memory dynamically from a fixed-size heap in global memory.
__host__ __device__ void* memcpy(void* dest, const void* src, size_t size);
Copies size bytes from the memory location pointed to by src to the memory location pointed to by dest.
__host__ __device__ void* memset(void* ptr, int value, size_t size);
Sets the first size bytes of the memory block pointed to by ptr to value (interpreted as an unsigned char).
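For instance, a single thread might pair these functions as in the following minimal sketch (the kernel name, fill value, and buffer size are illustrative choices, not from the original text):
__global__ void copyDemo()
{
    const size_t size = 64;
    char* src = (char*)malloc(size);
    char* dst = (char*)malloc(size);
    if (src != NULL && dst != NULL) {
        memset(src, 0xAB, size);  // fill every byte of the source block
        memcpy(dst, src, size);   // copy the block byte-for-byte
    }
    // free() ignores NULL pointers, so these calls are safe even if
    // either allocation failed.
    free(src);
    free(dst);
}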
The CUDA in-kernel malloc() function allocates at least size bytes from the device heap and returns a pointer to the allocated memory, or NULL if insufficient memory exists to fulfill the request. The returned pointer is guaranteed to be aligned to a 16-byte boundary.
The CUDA in-kernel __nv_aligned_device_malloc() function allocates at least size bytes from the device heap and returns a pointer to the allocated memory, or NULL if insufficient memory exists to fulfill the requested size or alignment. The address of the allocated memory will be a multiple of align, which must be a nonzero power of two.
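As an illustrative sketch (the kernel name and the 256/128 values are my own choices, not from the original), an aligned allocation could look like:
__global__ void alignedAllocDemo()
{
    // Request 256 bytes starting at an address that is a multiple of 128;
    // the alignment argument must be a nonzero power of two.
    void* p = __nv_aligned_device_malloc(256, 128);
    if (p == NULL)
        return;  // not enough memory for the requested size or alignment
    // ... use p; ((uintptr_t)p % 128) == 0 holds here ...
    free(p);
}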
The CUDA in-kernel free() function deallocates the memory pointed to by ptr, which must have been returned by a previous call to malloc() or __nv_aligned_device_malloc(). If ptr is NULL, the call to free() is ignored. Repeated calls to free() with the same ptr have undefined behavior.
Memory allocated by a given CUDA thread via malloc() or __nv_aligned_device_malloc() remains allocated for the lifetime of the CUDA context, or until it is explicitly released by a call to free(). It can be used by any other CUDA thread, even in subsequent kernel launches. Any CUDA thread may free memory allocated by another thread, but care should be taken to ensure that the same pointer is not freed more than once.
1. Heap Memory Allocation
The device memory heap has a fixed size that must be specified before any program using malloc(), __nv_aligned_device_malloc() or free() is loaded into the context. A default heap of 8 MB is allocated if any program uses malloc() or __nv_aligned_device_malloc() without explicitly specifying the heap size.
The following API functions get and set the heap size:
cuCtxGetLimit(size_t* size, CU_LIMIT_MALLOC_HEAP_SIZE)
cuCtxSetLimit(CU_LIMIT_MALLOC_HEAP_SIZE, size_t size)
cudaDeviceGetLimit(size_t* size, cudaLimitMallocHeapSize)
cudaDeviceSetLimit(cudaLimitMallocHeapSize, size_t size)
The heap size granted will be at least size bytes. cuCtxGetLimit() and cudaDeviceGetLimit() return the currently requested heap size.
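As a host-side sketch of these calls (the 32 MB figure and the printed strings are illustrative assumptions, not from the original):
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t heapSize = 0;
    // Query the current heap size (8 MB unless previously changed).
    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("Requested heap size: %zu bytes\n", heapSize);
    // Raise the limit to 32 MB; this must happen before module load,
    // i.e. before the first kernel launch when using the runtime API.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 32 * 1024 * 1024);
    cudaDeviceGetLimit(&heapSize, cudaLimitMallocHeapSize);
    printf("Heap size is now at least: %zu bytes\n", heapSize);
    return 0;
}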
The actual memory allocation for the heap occurs when a module is loaded into the context, either explicitly via the CUDA driver API (see Module), or implicitly via the CUDA runtime API (see CUDA Runtime). If the memory allocation fails, the module load generates a CUDA_ERROR_SHARED_OBJECT_INIT_FAILED error.
Once module loading has occurred, the heap size cannot be changed, and it will not be resized dynamically according to need.
Memory reserved for the device heap is in addition to memory allocated through host-side CUDA API calls such as cudaMalloc().
2. Interoperability with Host Memory API
Memory allocated via device-side malloc() or __nv_aligned_device_malloc() cannot be freed using the runtime (i.e., by calling any of the free memory functions from Device Memory). Similarly, memory allocated via the runtime (i.e., by calling any of the memory allocation functions from Device Memory) cannot be freed via free().
In addition, memory allocated by a call to malloc() or __nv_aligned_device_malloc() in device code cannot be used in any runtime or driver API calls (i.e., cudaMemcpy, cudaMemset, etc.).
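One common workaround, sketched below with hypothetical names (stageToApiBuffer, heapBuf, apiBuf are mine, not from the original), is to copy the data inside a kernel into a buffer that was allocated with cudaMalloc(), which the host can then read with cudaMemcpy():
// Device code can access both kinds of memory, so a kernel can stage
// data from the device heap into an API-allocated buffer for the host.
__global__ void stageToApiBuffer(const char* heapBuf, char* apiBuf, size_t n)
{
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        apiBuf[i] = heapBuf[i];
}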
3. Examples
3.1. Per Thread Allocation
#include <stdlib.h>
#include <stdio.h>
__global__ void mallocTest()
{
    size_t size = 123;
    char* ptr = (char*)malloc(size);
    // Guard against allocation failure before writing to the block.
    if (ptr == NULL)
        return;
    memset(ptr, 0, size);
    printf("Thread %d got pointer: %p\n", threadIdx.x, ptr);
    free(ptr);
}
int main()
{
    // Set a heap size of 128 megabytes. Note that this must
    // be done before any kernel is launched.
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
    mallocTest<<<1, 5>>>();
    cudaDeviceSynchronize();
    return 0;
}
The above code will output:
Thread 0 got pointer: 00057020
Thread 1 got pointer: 0005708c
Thread 2 got pointer: 000570f8
Thread 3 got pointer: 00057164
Thread 4 got pointer: 000571d0
Notice how each thread encounters the malloc() and memset() commands and so receives and initializes its own allocation. (Exact pointer values will vary; these are illustrative.)
3.2. Per Thread Block Allocation
#include <stdlib.h>
__global__ void mallocTest()
{
    __shared__ int* data;
    // The first thread in the block does the allocation and then
    // shares the pointer with all other threads through shared memory,
    // so that access can easily be coalesced.
    // 64 ints (256 bytes) per thread are allocated.
    if (threadIdx.x == 0) {
        size_t size = blockDim.x * 64 * sizeof(int);
        data = (int*)malloc(size);
    }
    __syncthreads();
    // Check for failure
    if (data == NULL)
        return;
    // Threads index into the memory, ensuring coalescence
    int* ptr = data;
    for (int i = 0; i < 64; ++i)
        ptr[i * blockDim.x + threadIdx.x] = threadIdx.x;
    // Ensure all threads complete before freeing
    __syncthreads();
    // Only one thread may free the memory!
    if (threadIdx.x == 0)
        free(data);
}
int main()
{
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
    mallocTest<<<10, 128>>>();
    cudaDeviceSynchronize();
    return 0;
}
3.3. Allocation Persisting Between Kernel Launches
#include <stdlib.h>
#include <stdio.h>
#define NUM_BLOCKS 20
__device__ int* dataptr[NUM_BLOCKS]; // Per-block pointer
__global__ void allocmem()
{
    // Only the first thread in the block does the allocation
    // since we want only one allocation per block.
    if (threadIdx.x == 0)
        dataptr[blockIdx.x] = (int*)malloc(blockDim.x * sizeof(int));
    __syncthreads();
    // Check for failure
    if (dataptr[blockIdx.x] == NULL)
        return;
    // Zero the data with all threads in parallel
    dataptr[blockIdx.x][threadIdx.x] = 0;
}
// Simple example: store thread ID into each element
__global__ void usemem()
{
    int* ptr = dataptr[blockIdx.x];
    if (ptr != NULL)
        ptr[threadIdx.x] += threadIdx.x;
}
// Print the content of the buffer before freeing it
__global__ void freemem()
{
    int* ptr = dataptr[blockIdx.x];
    if (ptr != NULL)
        printf("Block %d, Thread %d: final value = %d\n",
               blockIdx.x, threadIdx.x, ptr[threadIdx.x]);
    // Only free from one thread!
    if (threadIdx.x == 0)
        free(ptr);
}
int main()
{
    cudaDeviceSetLimit(cudaLimitMallocHeapSize, 128*1024*1024);
    // Allocate memory
    allocmem<<< NUM_BLOCKS, 10 >>>();
    // Use memory
    usemem<<< NUM_BLOCKS, 10 >>>();
    usemem<<< NUM_BLOCKS, 10 >>>();
    usemem<<< NUM_BLOCKS, 10 >>>();
    // Free memory
    freemem<<< NUM_BLOCKS, 10 >>>();
    cudaDeviceSynchronize();
    return 0;
}