2022 CUDA Summer Training Camp: Day 3 Practice
2022-07-29 10:27:00 [Hua Weiyun]

2022 CUDA Summer Training Camp Day 1 practice: https://bbs.huaweicloud.com/blogs/364478
2022 CUDA Summer Training Camp Day 2 practice: https://bbs.huaweicloud.com/blogs/364479
Today is the third day; the topics are Event, Memory, and Shared Memory.
(1) Event
Let's start with the first topic: Event.
A CUDA event marks a point in a stream and is mainly used for timing and profiling CUDA programs. To catch errors in the CUDA API calls around it, we usually also define a macro:
```cpp
#pragma once
#include <stdio.h>

#define CHECK(call)                                          \
do                                                           \
{                                                            \
    const cudaError_t error_code = call;                     \
    if (error_code != cudaSuccess)                           \
    {                                                        \
        printf("CUDA Error:\n");                             \
        printf("    File:       %s\n", __FILE__);            \
        printf("    Line:       %d\n", __LINE__);            \
        printf("    Error code: %d\n", error_code);          \
        printf("    Error text: %s\n",                       \
            cudaGetErrorString(error_code));                 \
        exit(1);                                             \
    }                                                        \
} while (0)
```

We then use this macro in the appropriate places to print an error log for failing CUDA calls.
Note: `#pragma once` belongs in header files only, not in source files. (It prevents a header from being included more than once.)
The typical calling sequence is as follows:
(1) Declare the events (here we time a kernel, so we declare a start event and a stop event):

```cpp
cudaEvent_t start, stop;
```

(2) Create the events:

```cpp
CHECK(cudaEventCreate(&start));
CHECK(cudaEventCreate(&stop));
```

(3) Record the events at the appropriate points:

```cpp
cudaEventRecord(start);
cudaEventRecord(stop);
```

(4) Wait for an event to complete:

(a) Non-blocking mode, which lets the host do other work instead of waiting:

```cpp
cudaEventQuery(start);
```

(b) Blocking mode, for processing that must wait until the kernel has finished:

```cpp
cudaEventSynchronize(stop);
```

(5) Compute the time interval between two events:

```cpp
CHECK(cudaEventElapsedTime(&elapsed_time, start, stop));
```

(6) Destroy the events:

```cpp
CHECK(cudaEventDestroy(start));
CHECK(cudaEventDestroy(stop));
```

The complete code is as follows:
```cpp
// error.cuh
#pragma once
#include <stdio.h>

#define CHECK(call)                                          \
do                                                           \
{                                                            \
    const cudaError_t error_code = call;                     \
    if (error_code != cudaSuccess)                           \
    {                                                        \
        printf("CUDA Error:\n");                             \
        printf("    File:       %s\n", __FILE__);            \
        printf("    Line:       %d\n", __LINE__);            \
        printf("    Error code: %d\n", error_code);          \
        printf("    Error text: %s\n",                       \
            cudaGetErrorString(error_code));                 \
        exit(1);                                             \
    }                                                        \
} while (0)
```

```cpp
// matrix_mul.cu
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include "error.cuh"

#define BLOCK_SIZE 32

__global__ void gpu_matrix_mult(int *a, int *b, int *c, int m, int n, int k)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int sum = 0;
    if (col < k && row < m)
    {
        for (int i = 0; i < n; i++)
        {
            sum += a[row * n + i] * b[i * k + col];
        }
        c[row * k + col] = sum;
    }
}

void cpu_matrix_mult(int *h_a, int *h_b, int *h_result, int m, int n, int k)
{
    for (int i = 0; i < m; ++i)
    {
        for (int j = 0; j < k; ++j)
        {
            int tmp = 0;
            for (int h = 0; h < n; ++h)
            {
                tmp += h_a[i * n + h] * h_b[h * k + j];
            }
            h_result[i * k + j] = tmp;
        }
    }
}

int main(int argc, char const *argv[])
{
    int m = 100;
    int n = 100;
    int k = 100;

    // (1) Declare the events (stop3 and stop4 are declared but never used
    // below; kept as in the original)
    cudaEvent_t start, stop, stop2, stop3, stop4;
    // (2) Create the events
    CHECK(cudaEventCreate(&start));
    CHECK(cudaEventCreate(&stop));
    CHECK(cudaEventCreate(&stop2));

    int *h_a, *h_b, *h_c, *h_cc;
    CHECK(cudaMallocHost((void **) &h_a, sizeof(int) * m * n));
    CHECK(cudaMallocHost((void **) &h_b, sizeof(int) * n * k));
    CHECK(cudaMallocHost((void **) &h_c, sizeof(int) * m * k));
    CHECK(cudaMallocHost((void **) &h_cc, sizeof(int) * m * k));

    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            h_a[i * n + j] = rand() % 1024;
        }
    }
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < k; ++j) {
            h_b[i * k + j] = rand() % 1024;
        }
    }

    int *d_a, *d_b, *d_c;
    CHECK(cudaMalloc((void **) &d_a, sizeof(int) * m * n));
    CHECK(cudaMalloc((void **) &d_b, sizeof(int) * n * k));
    CHECK(cudaMalloc((void **) &d_c, sizeof(int) * m * k));

    // copy matrices A and B from host to device memory
    CHECK(cudaMemcpy(d_a, h_a, sizeof(int) * m * n, cudaMemcpyHostToDevice));
    CHECK(cudaMemcpy(d_b, h_b, sizeof(int) * n * k, cudaMemcpyHostToDevice));

    unsigned int grid_rows = (m + BLOCK_SIZE - 1) / BLOCK_SIZE;
    unsigned int grid_cols = (k + BLOCK_SIZE - 1) / BLOCK_SIZE;
    dim3 dimGrid(grid_cols, grid_rows);
    dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);

    // (3) Record the start event
    cudaEventRecord(start);
    // (4a) Non-blocking query
    cudaEventQuery(start);

    gpu_matrix_mult<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, m, n, k);
    // In part (3) below, this launch is replaced by gpu_matrix_mult_shared.

    // Record the stop event
    cudaEventRecord(stop);
    // (4b) We must wait for the kernel to finish, so block here
    cudaEventSynchronize(stop);

    // (5) Elapsed time start -> stop
    float elapsed_time;
    CHECK(cudaEventElapsedTime(&elapsed_time, start, stop));
    printf("start->stop: Time = %g ms.\n", elapsed_time);

    cudaMemcpy(h_c, d_c, sizeof(int) * m * k, cudaMemcpyDeviceToHost);
    // cudaThreadSynchronize();

    // Record the stop2 event, non-blocking query
    CHECK(cudaEventRecord(stop2));
    // CHECK(cudaEventSynchronize(stop2));
    cudaEventQuery(stop2);

    // Elapsed time stop -> stop2
    float elapsed_time2;
    cudaEventElapsedTime(&elapsed_time2, stop, stop2);
    printf("stop->stop2: Time = %g ms.\n", elapsed_time2);

    // (6) Destroy the events
    CHECK(cudaEventDestroy(start));
    CHECK(cudaEventDestroy(stop));
    CHECK(cudaEventDestroy(stop2));

    // CPU reference computation
    cpu_matrix_mult(h_a, h_b, h_cc, m, n, k);

    int ok = 1;
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < k; ++j) {
            if (fabs(h_cc[i * k + j] - h_c[i * k + j]) > 1.0e-10) {
                ok = 0;
            }
        }
    }
    if (ok) {
        printf("Pass!!!\n");
    } else {
        printf("Error!!!\n");
    }

    // free memory
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_c);
    cudaFreeHost(h_a);
    cudaFreeHost(h_b);
    cudaFreeHost(h_c);
    return 0;
}
```

Some warnings appear during compilation; they can be ignored:

Running on a Quadro P1000 GPU:

Running on a Jetson Nano B01:

This example uses matrix multiplication: it prints the time spent in the matrix-multiplication kernel and the time of the cudaMemcpy that follows.
Now we deliberately change

```cpp
CHECK(cudaMemcpy(d_b, h_b, sizeof(int)*n*k, cudaMemcpyHostToDevice));
```

to

```cpp
CHECK(cudaMemcpy(d_b, h_b, sizeof(int)*n*k*2, cudaMemcpyHostToDevice));
```

so that the copy runs out of bounds.
Recompile, run, and look at the result:

The output tells you exactly which line has the problem:

This is how we can track down errors in CUDA calls.
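One related idiom is worth adding here (my own sketch, based on standard CUDA practice rather than the original post): a kernel launch returns no error code of its own, so CHECK cannot wrap the `<<<...>>>` launch directly. Instead, check `cudaGetLastError()` immediately after the launch:

```cpp
gpu_matrix_mult<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, m, n, k);
CHECK(cudaGetLastError());       // catches invalid launch configurations
CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel runs
```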
Two problems Zhang Xiaobai ran into while debugging with CHECK are worth summarizing:
(1) If you skip CHECK(cudaEventCreate()) and call cudaEventRecord() (or any later event function) directly, the timing information cannot be printed. Zhang Xiaobai made exactly this mistake with the stop2 event, so the stop->stop2 time could not be obtained.
(2) cudaEventQuery() must not be wrapped in CHECK; doing so raises an error. In the environment above, writing

```cpp
CHECK(cudaEventQuery(stop2));
```

produces the following error when the program runs:

Note: the cudaErrorNotReady returned by cudaEventQuery means the event has not yet completed (the work recorded before it has not finished); it does not indicate an error.
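To make that concrete, here is a minimal polling sketch (my own illustration, not from the original post; `do_some_cpu_work` is a hypothetical host-side function):

```cpp
// Overlap host work with the GPU by polling the event instead of blocking.
// cudaErrorNotReady means "not finished yet", NOT a failure, so the return
// value must be compared directly and never passed through CHECK.
static void wait_polling(cudaEvent_t ev)
{
    while (cudaEventQuery(ev) == cudaErrorNotReady) {
        do_some_cpu_work();      // hypothetical useful host-side work
    }
    CHECK(cudaEventQuery(ev));   // safe now: the event has completed
}
```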
On the Nano, use nvprof to check the performance:

```bash
!echo nano | sudo -S /usr/local/cuda/bin/nvprof ./matrix_mul
```
(2) Memory
This part is about CUDA storage units, which Day 2 also touched on:

Today we go a step further:

Zhang Xiaobai compiled the following table (not exhaustive; still being refined):
| Name | Location | Purpose | Usage | Limits | Remarks |
|---|---|---|---|---|---|
| Register | On the GPU's SMs | Holds local variables | Allocated automatically by the compiler | Each SM has thousands of registers, but a single thread can use at most about 255, so they must be used sparingly | Thread-private; the fastest memory; contents are lost when the thread exits |
| Shared memory | On-chip | Communication among the threads of a block; currently the fastest place for threads to exchange data | `__shared__` qualifier; needs `__syncthreads()` for synchronization | Divided into 32 banks; must be used sparingly, since usage affects the number of active warps | Accessible to every thread of one block; very fast, high bandwidth, low latency |
| Local memory | Resides in device memory; no dedicated storage unit | Holds large arrays and variables of a single thread (used when registers run out) | Placed automatically by the compiler | — | Thread-private; slow, close to global memory in speed |
| Constant memory | Resides in device memory | All threads of a warp reading the same constant data at the same time, e.g. ray tracing | `__constant__` qualifier; must be initialized on the host with `cudaMemcpyToSymbol` | No dedicated storage unit, but it has a separate cache | Read-only; global scope |
| Global memory | Equivalent to GPU memory; resides in device memory | Input data and output results | `cudaMalloc` / `cudaFree` | — | Global scope; slow |
| Texture memory | Resides in device memory | Accelerates accesses with spatial locality, e.g. heat-conduction models | — | — | Read-only; global scope; slower than shared memory (higher latency and lower bandwidth than shared memory) |
| Host memory: pageable memory | Host-side RAM | Ordinary host allocations | Allocate with `malloc`, release with `free` | Cannot be used for DMA | Its pages can be swapped out to disk |
| Host memory: page-locked (pinned) memory, also used as zero-copy memory | Host-side RAM | Fast host-device transfers | Allocate with `cudaMallocHost`, release with `cudaFreeHost` | — | Can be mapped into the device address space, acting as a kind of global memory |
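To make the constant-memory row concrete, here is a minimal sketch (my own example, assuming the standard `__constant__` / `cudaMemcpyToSymbol` API; it is not code from the camp):

```cpp
#include <stdio.h>
#include "error.cuh"

// Coefficients in constant memory: read-only on the device and cached,
// ideal when every thread of a warp reads the same values.
__constant__ float coef[2];

__global__ void axpb(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = coef[0] * x[i] + coef[1];  // warp-uniform reads
}

int main()
{
    float h_coef[2] = {2.0f, 1.0f};
    // Constant memory must be initialized from the host side:
    CHECK(cudaMemcpyToSymbol(coef, h_coef, sizeof(h_coef)));

    int n = 1024;
    float *d_x;
    CHECK(cudaMalloc(&d_x, n * sizeof(float)));
    CHECK(cudaMemset(d_x, 0, n * sizeof(float)));
    axpb<<<(n + 255) / 256, 256>>>(d_x, n);
    CHECK(cudaGetLastError());
    CHECK(cudaFree(d_x));
    return 0;
}
```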
(3) Shared Memory
This part covers how to use Shared Memory to optimize a CUDA application.

Shared memory is characterized by being very fast when used well and very slow when used badly.
When is it fast?
- All threads of the same warp access different banks; or
- all threads of the same warp read the same address (served by a broadcast).
When is it slow?
- Multiple threads of the same warp access different addresses in the same bank (a bank conflict); the conflicting accesses are serialized.
The sketch below illustrates these patterns.
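Here is a tiny illustrative kernel (my own sketch, assuming the usual layout of 32 banks of 4-byte words, so a word's bank is its word index modulo 32):

```cpp
#define WARP 32

__global__ void bank_demo(int *out)
{
    __shared__ int s[WARP * WARP];

    // Each of the 32 threads fills one row so every read below is defined.
    for (int j = 0; j < WARP; ++j)
        s[threadIdx.x * WARP + j] = j;
    __syncthreads();

    int a = s[threadIdx.x];         // fast: consecutive threads hit consecutive banks
    int b = s[threadIdx.x * WARP];  // slow: stride 32 puts every thread in bank 0
    int c = s[0];                   // fast: one address, served by a broadcast
    out[threadIdx.x] = a + b + c;
}

// Launch with a single warp, e.g.: bank_demo<<<1, WARP>>>(d_out);
```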
Note that bank conflicts arise when the warp's access pattern and the bank layout overlap badly:

How do we avoid bank conflicts? The simple trick is padding (sometimes described as patching):

By adding an empty column we force the bank mapping out of alignment, so each run of consecutive data is spread across different banks.
The change itself is very simple:

When declaring the shared memory, declare the tile not as BLOCK_SIZE x BLOCK_SIZE but as BLOCK_SIZE x (BLOCK_SIZE + 1).
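In code, the whole change is just the declaration (a sketch of the one-line edit; the screenshot that showed it is missing here):

```cpp
// One padding column per row shifts each row's starting bank by one,
// so column-wise accesses spread across different banks.
__shared__ int tile_a[BLOCK_SIZE][BLOCK_SIZE + 1];
__shared__ int tile_b[BLOCK_SIZE][BLOCK_SIZE + 1];
```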
Finally, we can use Shared Memory to optimize the code that multiplies an mXn matrix by an nXk matrix, improving the efficiency of the memory accesses.
The approach is as follows:
Allocate two shared-memory tiles, each BLOCK_SIZE x BLOCK_SIZE. One slides along the mXn matrix, the other along the nXk matrix, and each partial product is accumulated into the destination matrix:

The specific code is as follows:

```cpp
__global__ void gpu_matrix_mult_shared(int *d_a, int *d_b, int *d_result,
                                       int m, int n, int k)
{
    __shared__ int tile_a[BLOCK_SIZE][BLOCK_SIZE];
    __shared__ int tile_b[BLOCK_SIZE][BLOCK_SIZE];

    int row = blockIdx.y * BLOCK_SIZE + threadIdx.y;
    int col = blockIdx.x * BLOCK_SIZE + threadIdx.x;
    int tmp = 0;
    int idx;

    // Slide the two tiles along the shared dimension n, one BLOCK_SIZE
    // step at a time.
    for (int sub = 0; sub < (n + BLOCK_SIZE - 1) / BLOCK_SIZE; ++sub)
    {
        // Load one tile of A (m x n), zero-padding outside the matrix.
        idx = row * n + sub * BLOCK_SIZE + threadIdx.x;
        tile_a[threadIdx.y][threadIdx.x] =
            (row < m && sub * BLOCK_SIZE + threadIdx.x < n) ? d_a[idx] : 0;

        // Load one tile of B (n x k), zero-padding outside the matrix.
        idx = (sub * BLOCK_SIZE + threadIdx.y) * k + col;
        tile_b[threadIdx.y][threadIdx.x] =
            (col < k && sub * BLOCK_SIZE + threadIdx.y < n) ? d_b[idx] : 0;

        __syncthreads();

        // Accumulate the partial product of the two tiles.
        for (int i = 0; i < BLOCK_SIZE; ++i)
        {
            tmp += tile_a[threadIdx.y][i] * tile_b[i][threadIdx.x];
        }
        __syncthreads();
    }
    if (row < m && col < k)
    {
        d_result[row * k + col] = tmp;
    }
}
```

Then, in the earlier code, change the kernel call

```cpp
gpu_matrix_mult<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, m, n, k);
```

to

```cpp
gpu_matrix_mult_shared<<<dimGrid, dimBlock>>>(d_a, d_b, d_c, m, n, k);
```

Everything else stays the same.
Compile and run it on the Jetson Nano B01:

Compared with the earlier plain matrix-multiplication code (start->stop: Time = 2.25109 ms), the time drops slightly.
Zhang Xiaobai then varied the block size, trying 16, 8, and 4, and summarized the results:
| m (rows of A) | n (cols of A = rows of B) | k (cols of B) | BLOCK_SIZE | stop-start (ms) |
|---|---|---|---|---|
| 100 | 100 | 100 | 32 | 1.83286 |
| 100 | 100 | 100 | 16 | 1.27365 |
| 100 | 100 | 100 | 8 | 1.23292 |
| 100 | 100 | 100 | 4 | 3.52865 |
| 100 | 100 | 100 | 6 (extra measurement) | 2.1999 |
| 100 | 100 | 100 | 12 (extra measurement) | 1.34755 |
From the results above, BLOCK_SIZE 8, 16, and 32 make little difference, but at BLOCK_SIZE 4 the speed drops sharply, even though 100 is a multiple of 4. At first this looked like a case of the warp and bank layouts overlapping.
So we try the padding trick:

changing tile_a and tile_b from square tiles to padded tiles:

The effect turns out to be unimpressive.
Note: at BLOCK_SIZE 4 no bank conflict actually occurs! The real cause is that a 4x4 block holds only 16 threads while a warp needs 32, so half of the warp's compute capacity is wasted and the kernel runs roughly twice as slow. Teacher Huan suggested keeping NxN > 32 at least.
So Zhang Xiaobai set BLOCK_SIZE to 6 and tried again; the result has been inserted into the table above. The speed is of course still somewhat low (same below). My personal guess: 6x6 = 36 threads, so one 32-thread warp is not enough and 2 warps are needed to finish the job, which still hurts the speed. Zhang Xiaobai guessed that a BLOCK_SIZE whose square is a multiple of 32 is the most appropriate, for example 8x8 = 64; 12x12 = 144 = 32x4.5 seems unsuitable, although the speed may still improve slightly because more of each warp is used (and that is how it turned out).
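A quick arithmetic check of this guess (an illustrative host-side snippet of my own, not from the original post):

```cpp
#include <stdio.h>

// For each candidate BLOCK_SIZE, count the warps one block occupies and
// the fraction of warp lanes that do useful work.
int main()
{
    int sizes[] = {4, 6, 8, 12, 16, 32};
    for (int t = 0; t < 6; ++t) {
        int bs      = sizes[t];
        int threads = bs * bs;
        int warps   = (threads + 31) / 32;  // warps per block, rounded up
        printf("BLOCK_SIZE %2d: %4d threads, %2d warps, %5.1f%% lanes used\n",
               bs, threads, warps, 100.0f * threads / (warps * 32));
    }
    return 0;
}
```

For BLOCK_SIZE 4 this gives 50% lane utilization, for 6 about 56%, for 12 exactly 90%, and for 8, 16, and 32 a full 100%, which is consistent with the general trend in the tables.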
Worried that the matrices were too small, Zhang Xiaobai changed the size from 100 to 1000 and tried again.
But with 1000, the CPU computation became impractical:

So the CPU verification code had to be commented out of the later comparison.

Measured again:
| m (rows of A) | n (cols of A = rows of B) | k (cols of B) | BLOCK_SIZE | stop-start (ms) |
|---|---|---|---|---|
| 1000 | 1000 | 1000 | 32 | 265.106 |
| 1000 | 1000 | 1000 | 16 | 228.09 |
| 1000 | 1000 | 1000 | 8 | 202.382 |
| 1000 | 1000 | 1000 | 4 | 518.315 |
| 1000 | 1000 | 1000 | 6 (extra measurement) | 386.171 |
| 1000 | 1000 | 1000 | 12 (extra measurement) | 246.29 |
Zhang Xiaobai tried padding again:

Again there is no speedup (it is actually slower):

Note: as before, no bank conflict actually occurred; it existed only in Zhang Xiaobai's imagination.
(To be continued)