Machine Learning Compilation course notes, Lesson 2: Tensor program abstraction
2022-06-30 22:34:00 【herosunly】
The slides for this lesson are available at https://mlc.ai/summer22-zh/slides/2-TensorProgram.pdf; the notes are available at https://mlc.ai/zh/chapter_tensor_program/.
1. Outline of this lesson
- Primitive tensor functions (tensor operators)
- Tensor program abstraction
This lesson gave me a deeper understanding of batch processing: batching is similar to mini-batch gradient descent (provided the computations are independent). This will be described in more detail later.
2. Primitive tensor functions
A primitive tensor function is a tensor function representing a single unit (the smallest granularity) of computation. Take addition in PyTorch as an example:
import numpy as np
import torch

c = torch.empty((128,), dtype=torch.float32)
a = torch.tensor(np.arange(128, dtype="float32"))
b = torch.tensor(np.ones(128, dtype="float32"))
torch.add(a, b, out=c)  # torch.add can be viewed as a primitive tensor function
So how is torch.add actually implemented? And in actual deployment, how are the different steps (multiple tensor functions) merged together?
Abstraction and implementation are among the core concepts of this course. The same tensor operator can be implemented in different ways: the second representation is a refinement of the first, and the third uses a low-level language (C, or even assembly) for a more efficient implementation.
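To make the contrast concrete, below is a minimal sketch (plain Python/NumPy, with illustrative names) of the loop form of the same addition, the kind of implementation the second representation corresponds to:

import numpy as np

def lnumpy_add(a: np.ndarray, b: np.ndarray, c: np.ndarray):
    # one scalar addition per loop iteration, instead of one library call
    for i in range(128):
        c[i] = a[i] + b[i]

a = np.arange(128, dtype="float32")
b = np.ones(128, dtype="float32")
c = np.empty(128, dtype="float32")
lnumpy_add(a, b, c)
np.testing.assert_allclose(c, a + b)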
One of the most common MLC processes that many frameworks offer is to transform the implementations of primitive functions (or dispatch them at runtime) to more optimized ones based on the environment.
A common approach to tensor operator transformation is to map the operator directly to a corresponding operator library; for example, in a CUDA environment one can use cudaAdd. Another common approach is finer-grained program transformation. The former can be represented directly as a symbolic call, while the latter needs to be expressed in loop form.

3. Tensor program abstraction
The tensor operator transformation mentioned above takes two common forms: one maps directly to a corresponding operator library, the other performs finer-grained program transformation. The latter is what tensor program abstraction is for.
The figure below shows an example from TVM:
Tensor program abstraction has three key elements (a sketch follows the list):
- Multidimensional input and output buffers.
- Loops.
- The computation performed within each loop iteration.
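The following small sketch (illustrative, plain Python) annotates where each of the three elements appears, using a two-dimensional addition so the multidimensional aspect is visible:

import numpy as np

def lnumpy_add2d(A: np.ndarray, B: np.ndarray, C: np.ndarray):
    # (1) multidimensional input/output buffers: A, B, C of shape (64, 64)
    for i in range(64):          # (2) nested loops over the iteration space
        for j in range(64):
            C[i, j] = A[i, j] + B[i, j]   # (3) computation in the loop body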

This raises a question: why not just use C or another low-level language? The point is that the core of machine learning compilation is not merely that a programmer can finish the coding; more important is completing the tensor operator transformation, i.e., taking a tensor program from its original form to a more optimized form (trying different transformations and finally finding the most optimized one). If a program can explore the space of equivalent forms of a tensor function, efficiency can be improved greatly.
3.1 Why do we need tensor program abstraction

In the program-based transformations shown here, loop splitting essentially turns a single-level loop into a two-level loop (loop splitting is a common transformation), while the transformed program involves parallelization and vectorization. Vectorization means that each step operates on a unit (batch) of length 4, as the sketch below illustrates.
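A hedged illustration of what "operating in units of length 4" means (plain NumPy, illustrative names): the 128-element loop is split into 32 outer steps, each performing one vector operation on a slice of 4 elements:

import numpy as np

def add_vectorized(a, b, c):
    for xo in range(32):              # outer loop after splitting 128 = 32 * 4
        s = slice(xo * 4, xo * 4 + 4)
        c[s] = a[s] + b[s]            # one vector operation on a length-4 unit

a = np.arange(128, dtype="float32")
b = np.ones(128, dtype="float32")
c = np.empty(128, dtype="float32")
add_vectorized(a, b, c)
np.testing.assert_allclose(c, a + b)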
3.2 Common tensor transformations
3.2.1 Loop split

The equivalent index substitution is $x = x_o \times 4 + x_i$.
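A quick way to convince yourself of this equivalence is to check that the split loops visit exactly the same indices as the flat loop (a minimal sketch in plain Python):

flat = list(range(128))
split = [xo * 4 + xi for xo in range(32) for xi in range(4)]
assert flat == split  # x = x_o * 4 + x_i covers 0..127 in the same order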
3.2.2 Loop reordering
Loop reordering essentially swaps the order of two loops, as sketched below.
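The sketch below illustrates the point: swapping the two loops covers the same set of (xo, xi) pairs, only in a different execution order, which is why reordering is safe exactly when iterations are independent:

order_a = [(xo, xi) for xo in range(32) for xi in range(4)]
order_b = [(xo, xi) for xi in range(4) for xo in range(32)]
assert sorted(order_a) == sorted(order_b)  # same iteration space
assert order_a != order_b                  # different execution order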
3.2.3 Thread binding

When parallelizing on a GPU, xi and xo can be mapped to two thread axes of the GPU. Binding the loops to GPU threads yields the final result (a more in-depth explanation will be given in later chapters).
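A hedged sketch of what such a binding might look like in TVM (assuming the MyModule definition from Section 4.2 below and a CUDA-capable device; later chapters cover this properly):

import tvm

sch = tvm.tir.Schedule(MyModule)
block_c = sch.get_block("C")
(i,) = sch.get_loops(block_c)
xo, xi = sch.split(i, factors=[None, 32])
sch.bind(xo, "blockIdx.x")    # bind the outer loop to GPU blocks
sch.bind(xi, "threadIdx.x")   # bind the inner loop to GPU threads
rt_mod = tvm.build(sch.mod, target="cuda")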
Different combinations of transformations can be tried in different hardware environments, and the optimal form is then chosen by benchmarking within the space of equivalent forms of the tensor function. One thing to note is that the transformations must be equivalent. Below is a counterexample, which can essentially be written as $C = A + C$: each element of C depends on the value of C at another index (for example, the iteration xo = 1, xi = 0 writes C[4], which reads C[3] produced by the iteration xo = 0, xi = 3; after swapping the loop order, xo = 1, xi = 0 would run before xo = 0, xi = 3). If we simply reorder, the loop order changes and the dependency is broken: the C on the left-hand side of the assignment would read a C on the right-hand side that has not been computed yet. So loop reordering cannot be applied here, as the check after the code confirms.
for xo in range(32):
    for xi in range(4):
        C[xo * 4 + xi] = A[xo * 4 + xi] + C[max(xo * 4 + xi - 1, 0)]
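A small check (plain Python, illustrative) makes the danger concrete: running the two loop orders on this dependent computation produces different results, so the reorder is not an equivalent transformation:

import numpy as np

def run(xo_outer: bool):
    A = np.ones(128, dtype="float32")
    C = np.zeros(128, dtype="float32")
    pairs = ([(xo, xi) for xo in range(32) for xi in range(4)] if xo_outer
             else [(xo, xi) for xi in range(4) for xo in range(32)])
    for xo, xi in pairs:
        C[xo * 4 + xi] = A[xo * 4 + xi] + C[max(xo * 4 + xi - 1, 0)]
    return C

assert not np.allclose(run(True), run(False))  # the reorder changes the result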
3.3 Other structures in tensor program abstraction
As described above, we cannot transform the program arbitrarily (for example, some computations depend on the order between loop iterations). Fortunately, most of the primitive tensor functions we care about have good properties (such as independence between loop iterations). Tensor programs can incorporate this additional information as part of the program, making program transformation more convenient.
Here vi = T.axis.spatial(128, i) is not only equivalent to vi = i; it also declares that vi is an iterator whose iterations are all independent of one another. This information is not needed to execute the program, but it makes transforming the program more convenient: in this case we know we can safely parallelize or reorder all loops related to vi, as long as every value of vi from 0 to 127 is visited.
4. Tensor program transformation in practice
First of all: if you already have a Colab account, it is recommended to use the Colab environment directly, where you can install the package with the pip command below.
4.1 Installing the package
The official installation command is as follows, but since https://mlc.ai/wheels requires a proxy to reach, this command is not recommended for installation outside of a Colab environment:
python3 -m pip install mlc-ai-nightly -f https://mlc.ai/wheels
The recommended alternative is to manually download a wheel file suitable for your environment and then install it.
For a Linux + Python 3.8 environment, the corresponding download link is: https://github.com/mlc-ai/utils/releases/download/v0.9.dev0/mlc_ai_nightly-0.9.dev1661%2Bg2b34ced6c-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.
4.2 Constructing a tensor program
import numpy as np
import tvm
from tvm.ir.module import IRModule
from tvm.script import tir as T

@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(A: T.Buffer[128, "float32"],
             B: T.Buffer[128, "float32"],
             C: T.Buffer[128, "float32"]):
        # extra annotations for the function
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        for i in range(128):
            with T.block("C"):
                # declare a data parallel iterator on spatial domain
                vi = T.axis.spatial(128, i)
                C[vi] = A[vi] + B[vi]
TVMScript is a way to represent tensor programs as abstract syntax trees in Python form. Note that this code does not actually correspond to a Python program; rather, it corresponds to a tensor program within the machine learning compilation process. The TVMScript language is designed to correspond to Python syntax, with additional structure added on top of Python syntax to help with program analysis and transformation.
print(type(MyModule)) # tvm.ir.module.IRModule
MyModule is an instance of the IRModule data structure: a collection of tensor functions.
We can obtain the TVMScript representation of this IRModule via the script function. This function is very helpful for inspecting the IRModule between step-by-step program transformations.
print(MyModule.script())
@tvm.script.ir_module
class Module:
    @tir.prim_func
    def func(A: tir.Buffer[128, "float32"], B: tir.Buffer[128, "float32"], C: tir.Buffer[128, "float32"]) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with tir.block("root")
        for i in tir.serial(128):
            with tir.block("C"):
                vi = tir.axis.spatial(128, i)
                tir.reads(A[vi], B[vi])
                tir.writes(C[vi])
                C[vi] = A[vi] + B[vi]
4.3 Compiling and running
The build function turns an IRModule into executable functions, where llvm indicates the CPU as the hardware execution environment.
rt_mod = tvm.build(MyModule, target="llvm") # The module for CPU backends.
type(rt_mod)
tvm.driver.build_module.OperatorModule
After compilation, rt_mod contains a collection of executable functions. We can retrieve a function by its name.
func = rt_mod["main"]
func
<tvm.runtime.packed_func.PackedFunc>
First we construct three tensors, where c receives the output:
a = tvm.nd.array(np.arange(128, dtype="float32"))
b = tvm.nd.array(np.ones(128, dtype="float32"))
c = tvm.nd.empty((128,), dtype="float32")
func(a, b, c)
print(c)
[ 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.
15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28.
29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42.
43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56.
57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69. 70.
71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84.
85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97. 98.
99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111. 112.
113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126.
127. 128.]
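We can also verify the result against NumPy (a form of this check appears in the official notes as well):

np.testing.assert_allclose(c.numpy(), a.numpy() + b.numpy())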
4.4 Tensor program transformation
Now we begin transforming the tensor program. A tensor program can be transformed via an auxiliary data structure called a schedule.
sch = tvm.tir.Schedule(MyModule)
type(sch)
tvm.tir.schedule.schedule.Schedule
We first try to split the loop .
# Get the block by its name
block_c = sch.get_block("C")
# Get the loops surrounding the block
(i,) = sch.get_loops(block_c)
# Tile the loop nest
i_0, i_1, i_2 = sch.split(i, factors=[None, 4, 4])
print(sch.mod.script())
The split replaces the i in for i in tir.serial(128) with i_0, i_1, and i_2. Also note the one-to-one correspondence between factors=[None, 4, 4] and tir.grid(8, 4, 4), and that vi is now equivalent to tir.axis.spatial(128, i_0 * 16 + i_1 * 4 + i_2): the original index is expressed in terms of the triple loop.
@tvm.script.ir_module
class Module:
    @tir.prim_func
    def func(A: tir.Buffer[128, "float32"], B: tir.Buffer[128, "float32"], C: tir.Buffer[128, "float32"]) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with tir.block("root")
        for i_0, i_1, i_2 in tir.grid(8, 4, 4):
            with tir.block("C"):
                vi = tir.axis.spatial(128, i_0 * 16 + i_1 * 4 + i_2)
                tir.reads(A[vi], B[vi])
                tir.writes(C[vi])
                C[vi] = A[vi] + B[vi]
We can also reorder the loops, for example moving i_2 outside of i_1.
sch.reorder(i_0, i_2, i_1)
print(sch.mod.script())
@tvm.script.ir_module
class Module:
    @tir.prim_func
    def func(A: tir.Buffer[128, "float32"], B: tir.Buffer[128, "float32"], C: tir.Buffer[128, "float32"]) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with tir.block("root")
        for i_0, i_2, i_1 in tir.grid(8, 4, 4):
            with tir.block("C"):
                vi = tir.axis.spatial(128, i_0 * 16 + i_1 * 4 + i_2)
                tir.reads(A[vi], B[vi])
                tir.writes(C[vi])
                C[vi] = A[vi] + B[vi]
Another common transformation is parallelization, for example running the outermost loop in parallel.
sch.parallel(i_0)
print(sch.mod.script())
@tvm.script.ir_module
class Module:
    @tir.prim_func
    def func(A: tir.Buffer[128, "float32"], B: tir.Buffer[128, "float32"], C: tir.Buffer[128, "float32"]) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with tir.block("root")
        for i_0 in tir.parallel(8):
            for i_2, i_1 in tir.grid(4, 4):
                with tir.block("C"):
                    vi = tir.axis.spatial(128, i_0 * 16 + i_1 * 4 + i_2)
                    tir.reads(A[vi], B[vi])
                    tir.writes(C[vi])
                    C[vi] = A[vi] + B[vi]
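As a final hedged sketch, the transformed module can be built and checked against the original result (assuming a, b, and c from Section 4.3 are still in scope), confirming that the transformations are equivalent:

transformed_mod = tvm.build(sch.mod, target="llvm")
c2 = tvm.nd.empty((128,), dtype="float32")
transformed_mod["main"](a, b, c2)
np.testing.assert_allclose(c2.numpy(), c.numpy())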
5. Summary
- Primitive tensor functions represent the single units of computation in a machine learning model.
  - A machine learning compilation process can selectively transform the implementations of primitive tensor functions.
- A tensor program is an effective abstraction for representing primitive tensor functions.
  - Key elements include: multidimensional buffers, nested loops, and computation statements.
  - Program transformations can be used to speed up the execution of tensor programs.
  - Additional structure in tensor programs provides more information for program transformation.