Implementing a matrix copy in CUDA
2022-06-30 08:31:00 【Wu lele~】
The key is how threads are assigned to matrix elements. The large matrix is divided into tiles, one tile per thread block, and each thread computes ny and nx, the row and column of the element it handles. Because a matrix is normally stored as a one-dimensional array, the row and column are finally converted into a single one-dimensional index into global memory (index = ny*N + nx for a row-major N x N matrix).
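The mapping from blocks and threads to array positions can be checked on the CPU before running anything on a GPU. The following is a minimal host-only sketch, written in plain C++ that is also valid as CUDA host code. The helper names `linear_index`, `tiles_per_dim`, and `covers_each_element_once` are hypothetical and do not appear in the original program; the loops simply imitate what `blockIdx` and `threadIdx` would provide in the copy kernel of this post.

```cpp
#include <vector>

// Hypothetical helper: row-major linear index of element (row, col)
// in an n x n matrix stored as a one-dimensional array.
int linear_index(int row, int col, int n)
{
    return row * n + col;
}

// Hypothetical helper: number of tiles (thread blocks) needed along one
// dimension so that tile * count >= n (ceiling division).
int tiles_per_dim(int n, int tile)
{
    return (n + tile - 1) / tile;
}

// Emulates the launch on the CPU: loops over every (block, thread) pair,
// computes ny and nx the same way the kernel does, and counts how often
// each element is visited. Returns true if every element is visited
// exactly once.
bool covers_each_element_once(int n, int tile)
{
    std::vector<int> visits(n * n, 0);
    const int grid = tiles_per_dim(n, tile);
    for (int by = 0; by < grid; ++by)
        for (int bx = 0; bx < grid; ++bx)
            for (int ty = 0; ty < tile; ++ty)
                for (int tx = 0; tx < tile; ++tx)
                {
                    const int ny = by * tile + ty; // row
                    const int nx = bx * tile + tx; // column
                    if (ny < n && nx < n)          // same bounds check as the kernel
                    {
                        ++visits[linear_index(ny, nx, n)];
                    }
                }
    for (int v : visits)
    {
        if (v != 1) return false;
    }
    return true;
}
```

With N = 128 and TILE_DIM = 32 the grid is 4 x 4 blocks and no thread falls outside the matrix. With a size that is not a multiple of the tile, for example 100, the grid is still 4 x 4 and the bounds check is what keeps the extra threads from touching memory past the end of the array.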
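#include <stdio.h>
#include <stdlib.h> // malloc, free

const int N = 128;                    // matrix dimension (N x N)
const int TILE_DIM = 32;              // thread-block edge length (TILE_DIM x TILE_DIM threads)
const int SIZE = sizeof(int) * N * N; // size of one matrix in bytes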
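
// Copies the n x n matrix da to db, one element per thread.
// The parameter is named n so that it does not shadow the global constant N.
__global__ void copy(const int *da, int *db, const int n)
{
    const int ny = blockIdx.y * TILE_DIM + threadIdx.y; // row
    const int nx = blockIdx.x * TILE_DIM + threadIdx.x; // column
    const int index = ny * n + nx;                      // row-major one-dimensional index
    if (ny < n && nx < n) // skip threads that fall outside the matrix
    {
        db[index] = da[index];
    }
}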

int main(int argc, char *argv[])
{
    // allocate host memory and fill the source matrix
    int *ha, *hb;
    ha = (int *)malloc(SIZE);
    hb = (int *)malloc(SIZE);
    for (int i = 0; i < N * N; ++i)
    {
        ha[i] = 100;
    }

    // allocate device memory and copy the source matrix to the device
    int *da, *db;
    cudaMalloc((void **)&da, SIZE);
    cudaMalloc((void **)&db, SIZE);
    cudaMemcpy(da, ha, SIZE, cudaMemcpyHostToDevice);

    // launch the kernel: one thread per element, rounding the grid up
    const dim3 block_size(TILE_DIM, TILE_DIM);
    const int grid_size_x = (N + TILE_DIM - 1) / TILE_DIM;
    const int grid_size_y = grid_size_x;
    const dim3 grid_size(grid_size_x, grid_size_y);
    copy<<<grid_size, block_size>>>(da, db, N);

    // copy the result back to the host (this call waits for the kernel to finish)
    cudaMemcpy(hb, db, SIZE, cudaMemcpyDeviceToHost);
    printf("%d\n", hb[100]); // every element was set to 100, so this should print 100

    // free host and device memory
    free(ha);
    free(hb);
    cudaFree(da);
    cudaFree(db);
    return 0;
}
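
The program checks the result by printing a single element, hb[100]. A slightly stronger check compares every element of the destination with the source. The following is a minimal sketch of such a check; `count_mismatches` is a hypothetical helper that is not part of the original program, and it runs entirely on the host.

```cpp
#include <cstddef>

// Hypothetical helper (not part of the original program): counts the
// positions at which the source and destination arrays differ.
std::size_t count_mismatches(const int *src, const int *dst, std::size_t count)
{
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < count; ++i)
    {
        if (src[i] != dst[i])
        {
            ++mismatches;
        }
    }
    return mismatches;
}
```

In the program it could be called as count_mismatches(ha, hb, N * N) after the device-to-host cudaMemcpy and before the buffers are freed. A result of 0 would indicate that all N * N elements arrived intact.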