Machine Learning Compilation, Lesson 2: Tensor Program Abstraction
2022-07-05 16:58:00 · Self driving pupils
02 Tensor Program Abstraction [MLC - Machine Learning Compilation, Chinese version]
Course home: https://mlc.ai/summer22-zh/
In this chapter, we discuss the abstraction of single units of computation and the possible transformations of these abstractions in machine learning compilation.
2.1 Primitive tensor functions
In the overview in the previous chapter, we introduced how the machine learning compilation process can be viewed as transformations between tensor functions. The execution of a typical machine learning model involves many steps that convert the input tensors into the final prediction; each of these steps is called a primitive tensor function.
In the figure above, the tensor operators linear, add, relu, and softmax are all primitive tensor functions. Notably, many different abstractions can represent (and implement) the same primitive tensor function (as shown in the figure below). We can choose to call into a precompiled framework library (such as torch.add or numpy.add) and leverage an implementation in Python. In practice, primitive tensor functions are implemented in low-level languages such as C or C++, and sometimes include assembly code.
Many machine learning frameworks provide a compilation pipeline for machine learning models, transforming primitive tensor functions into more specialized functions for particular workloads and deployment environments.
The figure above shows an example in which the implementation of the primitive tensor function add is transformed into a different implementation. The code on the right is pseudocode representing a possible optimization: the loop in the code on the left is split into units of length 4, and f32x4.add corresponds to a special function that performs vectorized addition.
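This transformation can be illustrated with a small pure-Python sketch (this is plain numpy code standing in for the generated implementation, not actual TVM output; the slice addition plays the role of the f32x4.add vector instruction):

```python
import numpy as np

def add_naive(a, b, c, n=128):
    # original form: one scalar addition per loop iteration
    for i in range(n):
        c[i] = a[i] + b[i]

def add_vectorized(a, b, c, n=128):
    # the loop is split into units of length 4; each slice addition
    # models a single f32x4.add vector instruction
    for i in range(0, n, 4):
        c[i:i + 4] = a[i:i + 4] + b[i:i + 4]

a = np.arange(128, dtype="float32")
b = np.ones(128, dtype="float32")
c_naive = np.empty(128, dtype="float32")
c_vec = np.empty(128, dtype="float32")
add_naive(a, b, c_naive)
add_vectorized(a, b, c_vec)

# both forms compute the same result
assert np.array_equal(c_naive, c_vec)
```

The transformed loop runs 32 iterations instead of 128, with each iteration doing four elements' worth of work.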
2.2 Tensor program abstraction
The previous section discussed the need to transform primitive tensor functions. To transform them effectively, we need an effective abstraction to represent them.
Generally, the abstraction of a typical primitive tensor function implementation contains the following components: multidimensional buffers that store the data, loop nests that drive the tensor computation, and the computation statements themselves.
We call this kind of abstraction tensor program abstraction. An important property of tensor programs is that they can be changed through a sequence of valid program transformations.
For example, through a set of transformation operations (such as loop splitting, parallelization, and vectorization), we can change the initial loop program on the left side of the figure above into the program on the right.
2.2.1 Extra structure in tensor program abstraction
Importantly, we cannot transform a program arbitrarily; for example, some computations may depend on the order of the loops. Fortunately, most of the primitive tensor functions we are interested in have nice properties, such as independence among loop iterations.
Tensor programs can incorporate this extra information as part of the program to make program transformation more convenient.
For example, the program in the figure above contains an extra T.axis.spatial annotation, which indicates that the variable vi is mapped to the loop variable i and that all iterations are independent. This information is not needed to execute the program, but it makes transforming the program more convenient. In this case, we know we can safely parallelize or reorder all loops related to vi, as long as the execution visits every value of vi from 0 to 128.
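As a minimal illustration (plain Python, not TIR), iteration independence is exactly what makes reordering safe: running the vi loop in a different order produces the same result, which would not hold if one iteration read what another wrote.

```python
import numpy as np

a = np.arange(128, dtype="float32")
b = np.ones(128, dtype="float32")
c_fwd = np.empty(128, dtype="float32")
c_rev = np.empty(128, dtype="float32")

# forward iteration order
for vi in range(128):
    c_fwd[vi] = a[vi] + b[vi]

# reversed order: safe because no iteration depends on another
for vi in reversed(range(128)):
    c_rev[vi] = a[vi] + b[vi]

assert np.array_equal(c_fwd, c_rev)
```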
2.3 Tensor program transformation in action
2.3.1 Installing the packages
For the purposes of this course, we will use some ongoing development in TVM, an open-source machine learning compilation framework. We provide the following command to install a packaged version for the MLC course.
python3 -m pip install mlc-ai-nightly -f https://mlc.ai/wheels
2.3.2 Constructing a tensor program
Let us first construct a tensor program that performs the addition of two vectors.
import numpy as np
import tvm
from tvm.ir.module import IRModule
from tvm.script import tir as T
@tvm.script.ir_module
class MyModule:
    @T.prim_func
    def main(A: T.Buffer[128, "float32"],
             B: T.Buffer[128, "float32"],
             C: T.Buffer[128, "float32"]):
        # extra annotations for the function
        T.func_attr({"global_symbol": "main", "tir.noalias": True})
        for i in range(128):
            with T.block("C"):
                # declare a data parallel iterator on spatial domain
                vi = T.axis.spatial(128, i)
                C[vi] = A[vi] + B[vi]
TVMScript is a way for us to represent tensor programs as an abstract syntax tree in Python. Note that this code does not actually correspond to a Python program; rather, it corresponds to a tensor program in the machine learning compilation process. The TVMScript language is designed to match Python syntax, and it adds additional structure on top of Python syntax to facilitate program analysis and transformation.
type(MyModule)
tvm.ir.module.IRModule
MyModule is an instance of the IRModule data structure, which is a collection of tensor functions.
We can obtain the TVMScript representation of this IRModule through the script function. This is very helpful for inspecting the IRModule between incremental program transformations.
print(MyModule.script())
@tvm.script.ir_module
class Module:
    @tir.prim_func
    def func(A: tir.Buffer[128, "float32"], B: tir.Buffer[128, "float32"], C: tir.Buffer[128, "float32"]) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with tir.block("root")
        for i in tir.serial(128):
            with tir.block("C"):
                vi = tir.axis.spatial(128, i)
                tir.reads(A[vi], B[vi])
                tir.writes(C[vi])
                C[vi] = A[vi] + B[vi]
2.3.3 Compile and run
At any point, we can turn an IRModule into executable functions through build.
rt_mod = tvm.build(MyModule, target="llvm") # The module for CPU backends.
type(rt_mod)
tvm.driver.build_module.OperatorModule
After compilation, rt_mod contains a collection of executable functions. We can retrieve each of them by name.
func = rt_mod["main"]
func
<tvm.runtime.packed_func.PackedFunc at 0x7fd5ad30aa90>
a = tvm.nd.array(np.arange(128, dtype="float32"))
b = tvm.nd.array(np.ones(128, dtype="float32"))
c = tvm.nd.empty((128,), dtype="float32")
Above, we created three NDArrays in the TVM runtime; to execute the function, we now call it on them.
func(a, b, c)
print(a)
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13.
14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27.
28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41.
42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55.
56. 57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69.
70. 71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83.
84. 85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97.
98. 99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111.
112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
126. 127.]
print(b)
[1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.
1. 1. 1. 1. 1. 1. 1. 1.]
print(c)
[ 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.
15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28.
29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42.
43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56.
57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69. 70.
71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84.
85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97. 98.
99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111. 112.
113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126.
127. 128.]
2.3.4 Tensor program transformation
Now we begin to transform the tensor program. A tensor program can be transformed through an auxiliary data structure called a schedule.
sch = tvm.tir.Schedule(MyModule)
type(sch)
tvm.tir.schedule.schedule.Schedule
We first try to split the loop.
# Get block by its name
block_c = sch.get_block("C")
# Get loops surrounding the block
(i,) = sch.get_loops(block_c)
# Tile the loop nesting.
i_0, i_1, i_2 = sch.split(i, factors=[None, 4, 4])
print(sch.mod.script())
@tvm.script.ir_module
class Module:
    @tir.prim_func
    def func(A: tir.Buffer[128, "float32"], B: tir.Buffer[128, "float32"], C: tir.Buffer[128, "float32"]) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with tir.block("root")
        for i_0, i_1, i_2 in tir.grid(8, 4, 4):
            with tir.block("C"):
                vi = tir.axis.spatial(128, i_0 * 16 + i_1 * 4 + i_2)
                tir.reads(A[vi], B[vi])
                tir.writes(C[vi])
                C[vi] = A[vi] + B[vi]
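As a quick sanity check (plain Python, an illustration rather than TVM API), we can verify that the split index mapping vi = i_0 * 16 + i_1 * 4 + i_2 over the tiled loop bounds (8, 4, 4) still enumerates every original index exactly once:

```python
# enumerate vi over the tiled loop nest shown above
indices = [i0 * 16 + i1 * 4 + i2
           for i0 in range(8)
           for i1 in range(4)
           for i2 in range(4)]

# every original index 0..127 is visited exactly once
assert sorted(indices) == list(range(128))
```

This is why the split is a valid transformation: it changes only the iteration structure, not the set of elements computed.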
We can reorder these loops. Here we move i_2 outside of i_1.
sch.reorder(i_0, i_2, i_1)
print(sch.mod.script())
@tvm.script.ir_module
class Module:
    @tir.prim_func
    def func(A: tir.Buffer[128, "float32"], B: tir.Buffer[128, "float32"], C: tir.Buffer[128, "float32"]) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with tir.block("root")
        for i_0, i_2, i_1 in tir.grid(8, 4, 4):
            with tir.block("C"):
                vi = tir.axis.spatial(128, i_0 * 16 + i_1 * 4 + i_2)
                tir.reads(A[vi], B[vi])
                tir.writes(C[vi])
                C[vi] = A[vi] + B[vi]
Finally, we can annotate the outermost loop for parallel execution.
sch.parallel(i_0)
print(sch.mod.script())
@tvm.script.ir_module
class Module:
    @tir.prim_func
    def func(A: tir.Buffer[128, "float32"], B: tir.Buffer[128, "float32"], C: tir.Buffer[128, "float32"]) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with tir.block("root")
        for i_0 in tir.parallel(8):
            for i_2, i_1 in tir.grid(4, 4):
                with tir.block("C"):
                    vi = tir.axis.spatial(128, i_0 * 16 + i_1 * 4 + i_2)
                    tir.reads(A[vi], B[vi])
                    tir.writes(C[vi])
                    C[vi] = A[vi] + B[vi]
We can compile and run the transformed program.
transformed_mod = tvm.build(sch.mod, target="llvm") # The module for CPU backends.
transformed_mod["main"](a, b, c)
print(c)
[ 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.
15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28.
29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42.
43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. 56.
57. 58. 59. 60. 61. 62. 63. 64. 65. 66. 67. 68. 69. 70.
71. 72. 73. 74. 75. 76. 77. 78. 79. 80. 81. 82. 83. 84.
85. 86. 87. 88. 89. 90. 91. 92. 93. 94. 95. 96. 97. 98.
99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109. 110. 111. 112.
113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125. 126.
127. 128.]
2.3.5 Constructing tensor programs with tensor expression (TE)
In the previous example, we constructed the tensor program directly in TVMScript. In practice, it is often convenient to construct these functions from existing definitions. Tensor expression (TE) is an API that helps us turn tensor computations expressible as expressions into tensor programs.
# namespace for tensor expression utility
from tvm import te
# declare the computation using the expression API
A = te.placeholder((128, ), name="A")
B = te.placeholder((128, ), name="B")
C = te.compute((128,), lambda i: A[i] + B[i], name="C")
# create a function with the specified list of arguments.
func = te.create_prim_func([A, B, C])
# mark that the function name is main
func = func.with_attr("global_symbol", "main")
ir_mod_from_te = IRModule({"main": func})
print(ir_mod_from_te.script())
@tvm.script.ir_module
class Module:
    @tir.prim_func
    def func(A: tir.Buffer[128, "float32"], B: tir.Buffer[128, "float32"], C: tir.Buffer[128, "float32"]) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with tir.block("root")
        for i0 in tir.serial(128):
            with tir.block("C"):
                i = tir.axis.spatial(128, i0)
                tir.reads(A[i], B[i])
                tir.writes(C[i])
                C[i] = A[i] + B[i]
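To build intuition for what te.compute declares, here is a numpy analogue (an illustration of the semantics, not TVM code): the lambda gives the value of each output element as a function of its index.

```python
import numpy as np

A = np.arange(128, dtype="float32")
B = np.ones(128, dtype="float32")

# te.compute((128,), lambda i: A[i] + B[i]) declares C element by element:
C = np.array([A[i] + B[i] for i in range(128)], dtype="float32")

assert np.array_equal(C, A + B)
```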
2.3.6 Transforming a matrix multiplication program
In the examples above, we showed how to transform vector addition. Now let us try to apply some transformations to a slightly more complex program: matrix multiplication. We first construct the initial tensor program with the tensor expression API, then compile and execute it.
from tvm import te
M = 1024
K = 1024
N = 1024
# The default tensor type in tvm
dtype = "float32"
target = "llvm"
dev = tvm.device(target, 0)
# Algorithm
k = te.reduce_axis((0, K), "k")
A = te.placeholder((M, K), name="A")
B = te.placeholder((K, N), name="B")
C = te.compute((M, N), lambda m, n: te.sum(A[m, k] * B[k, n], axis=k), name="C")
# Default schedule
func = te.create_prim_func([A, B, C])
func = func.with_attr("global_symbol", "main")
ir_module = IRModule({"main": func})
print(ir_module.script())
func = tvm.build(ir_module, target="llvm") # The module for CPU backends.
a = tvm.nd.array(np.random.rand(M, K).astype(dtype), dev)
b = tvm.nd.array(np.random.rand(K, N).astype(dtype), dev)
c = tvm.nd.array(np.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
evaluator = func.time_evaluator(func.entry_name, dev, number=1)
print("Baseline: %f" % evaluator(a, b, c).mean)
@tvm.script.ir_module
class Module:
    @tir.prim_func
    def func(A: tir.Buffer[(1024, 1024), "float32"], B: tir.Buffer[(1024, 1024), "float32"], C: tir.Buffer[(1024, 1024), "float32"]) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with tir.block("root")
        for i0, i1, i2 in tir.grid(1024, 1024, 1024):
            with tir.block("C"):
                m, n, k = tir.axis.remap("SSR", [i0, i1, i2])
                tir.reads(A[m, k], B[k, n])
                tir.writes(C[m, n])
                with tir.init():
                    C[m, n] = tir.float32(0)
                C[m, n] = C[m, n] + A[m, k] * B[k, n]
Baseline: 2.967772
We can transform the loops in the tensor program so that the memory access pattern under the new loops is more cache friendly. Let us try the following schedule.
sch = tvm.tir.Schedule(ir_module)
type(sch)
block_c = sch.get_block("C")
# Get loops surrounding the block
(y, x, k) = sch.get_loops(block_c)
block_size = 32
yo, yi = sch.split(y, [None, block_size])
xo, xi = sch.split(x, [None, block_size])
sch.reorder(yo, xo, k, yi, xi)
print(sch.mod.script())
func = tvm.build(sch.mod, target="llvm") # The module for CPU backends.
c = tvm.nd.array(np.zeros((M, N), dtype=dtype), dev)
func(a, b, c)
evaluator = func.time_evaluator(func.entry_name, dev, number=1)
print("after transformation: %f" % evaluator(a, b, c).mean)
@tvm.script.ir_module
class Module:
    @tir.prim_func
    def func(A: tir.Buffer[(1024, 1024), "float32"], B: tir.Buffer[(1024, 1024), "float32"], C: tir.Buffer[(1024, 1024), "float32"]) -> None:
        # function attr dict
        tir.func_attr({"global_symbol": "main", "tir.noalias": True})
        # body
        # with tir.block("root")
        for i0_0, i1_0, i2, i0_1, i1_1 in tir.grid(32, 32, 1024, 32, 32):
            with tir.block("C"):
                m = tir.axis.spatial(1024, i0_0 * 32 + i0_1)
                n = tir.axis.spatial(1024, i1_0 * 32 + i1_1)
                k = tir.axis.reduce(1024, i2)
                tir.reads(A[m, k], B[k, n])
                tir.writes(C[m, n])
                with tir.init():
                    C[m, n] = tir.float32(0)
                C[m, n] = C[m, n] + A[m, k] * B[k, n]
after transformation: 0.296419
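The blocked loop order produced by this schedule can be modeled in plain Python (a sketch at a reduced problem size so it runs quickly, not the generated code; bs stands in for block_size):

```python
import numpy as np

M = K = N = 64   # reduced sizes for a quick pure-Python check
bs = 32          # plays the role of block_size in the schedule above

A = np.random.rand(M, K).astype("float32")
B = np.random.rand(K, N).astype("float32")
C = np.zeros((M, N), dtype="float32")

# loop order after the transformation: yo, xo, k, yi, xi
for yo in range(M // bs):
    for xo in range(N // bs):
        for k in range(K):
            for yi in range(bs):
                for xi in range(bs):
                    m = yo * bs + yi
                    n = xo * bs + xi
                    C[m, n] += A[m, k] * B[k, n]

# the blocked order computes the same result as a plain matmul
assert np.allclose(C, A @ B, rtol=1e-4)
```

The reordering changes only which elements are touched close together in time, which is where the cache-friendliness (and the speedup above) comes from; the computed values are identical.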
Try changing the value of block_size and see what performance you can get. In practice, we use automated systems to search over a space of possible transformations and find optimized program transformations.
2.4 Summary
A primitive tensor function represents a single unit of computation in a machine learning model, and a machine learning compilation process can choose to transform the implementations of primitive tensor functions.
A tensor program is an effective abstraction for representing tensor functions. Its key components include multidimensional buffers, loop nests, and computation statements. Program transformations can be used to speed up the execution of tensor programs, and extra structure in tensor programs provides additional information for those transformations.