当前位置：网站首页>[Galaxy Kirin V10] [server] NUMA Technology

[Galaxy Kirin V10] [server] NUMA Technology

2022-07-04 10:32:00 【GUI Anjun @kylinos】

1、numa Introduce

In the early , about x86 A computer of architecture , At that time, the memory controller had not been integrated CPU, All memory accesses need to be completed through Beiqiao chip . The memory access at this time is shown in the following figure , go by the name of UMA（uniform memory access, Consistent memory access ）. Such access is very easy to achieve at the software level ： The bus model ensures that all memory accesses are consistent , There is no need to consider the differences before different memory addresses .

After that x86 The platform has experienced a change from “ Spelling frequency ” To “ Put together the number of cores ” The transformation of , More and more cores are crammed into the same chip as much as possible , The competing access of each core to memory bandwidth has become a bottleneck ; Software at this time 、OS For SMP Multi core CPU The support of is becoming more and more mature ; Plus various commercial considerations ,x86 The platform also pushed the boat forward NUMA（Non-uniform memory access, Inconsistent memory access ）. Under this framework , Every Socket There will be an independent memory controller IMC（integrated memory controllers, Integrated memory controller ）, Belong to different socket Within IMC Through between QPI link Communications .

Then there is further architecture evolution , Because each socket There will be more than one in every city core Make memory access , This will happen in every core There is a similar phenomenon in the interior of the first SMP Memory access bus with similar architecture , This bus is called IMC bus.

therefore , The obvious , Under this framework , Two socket Their management 1/2 The memory slot of , If you want to access something that does not belong to this socket Of memory must pass QPI link. That is to say, there is a local problem of memory access / long-range （local/remote） The concept of , There will be a significant difference in memory latency . That's why NUMA The reason why the performance of some applications under the architecture is worse .

Back to the present world CPU, The engineering implementation is actually more complicated . In view of , Two Socket Between them through their own 9.6GT/s Of QPI link Mutual visits . And each Socket In fact, there are 2 A memory controller . Double channel , Each controller has two more memory channels （channel）, Each channel supports up to 3 Root memory module （DIMM）. Theoretically, the largest order socket Support 76.8GB/s The memory bandwidth of , And two QPI link, Every QPI link Yes 9.6GT/s Rate （~57.6GB/s） in fact QPI link There has been a bottleneck .

Kernel NUMA Default behavior of ,Linux The kernel , The document defines NUMA Data structure and operation mode . In one, enabled NUMA Supported by Linux in ,Kernel Does not remove task memory from a NUMA node Move to another NUMA node.

Once a process is enabled , Where it is NUMA node They won't be moved , In order to optimize the performance as much as possible , In normal dispatch ,CPU Of core It will be used as much as possible local Visit the local core, Throughout the life cycle of a process ,NUMA node remain unchanged .

Once someone NUMA node The load of exceeds that of another node A threshold （ Default 25%）, I think we need to be here node Reduce the load on , Different NUMA Structure and different load conditions , The system gives a delay task migration —— Similar to the leaky cup algorithm . In this case, memory will be generated remote visit .

NUMA node There are different topologies between , each node There will be a distance between visits （node distances） The concept of , Such as numactl -H The result of the command has such a description ：node distances:
node 0 1 2 3
0: 10 11 21 21
1: 11 10 21 21
2: 21 21 10 11
3: 21 21 11 10

It can be seen that ：0 node To 0 node The distance between is 10, This must be the closest distance , Not to mention .0-1 The distance between them is far less than 2 or 3 Distance of . This distance is convenient for the system to choose the most suitable one in complex situations NUMA Set up .

2、numa Tool installation

# yum install numa* -y

3、numa see

See if... Is supported numa

# dmesg | grep -i numa

see numa state

# numastat

[[email protected]  desktop ]# numastat
                           node0
numa_hit                 2186088   #numa_hit Is to allocate memory on this node , The last number of times allocated from this node 
numa_miss                      0   #numa_miss Is to allocate memory on this node , The number of times it was finally allocated from other nodes 
numa_foreign                   0   #numa_foregin Is to allocate memory on other nodes , The number of times it was finally allocated from this node 
interleave_hit             27325   #interleave_hit Is to use interleave The number of times the policy was last allocated from that node ;
local_node               2186088   #local_node The number of times processes on this node have been allocated on this node 
other_node                     0   #other_node Is the number of times other node processes have allocated on that node 
[[email protected]  desktop ]#

#lscpu | grep -i numa // Look at each numa node Of cpu Information

see Numa node Information

# numactl --hardware

see numa The binding information

# numactl --show

[[email protected]  desktop ]# numactl --show
policy: default
preferred node: current
physcpubind: 0 1     # Nuclear binding 
cpubind: 0     #CPU binding 
nodebind: 0    #node binding 
membind: 0     # Memory binding

4、numa test

（ Accessing the memory of different nodes IO）
1) write test
# numactl --cpubind=0 --membind=0 dd if=/dev/zero of=/dev/shm/A bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.823497 s, 1.3 GB/s

# numactl --cpubind=0 --membind=1 dd if=/dev/zero of=/dev/shm/A bs=1M count=1024
1024+0 records in
1024+0 records out
1073741824 bytes (1.1 GB) copied, 0.936182 s, 1.1 GB/s

Obviously, accessing the memory of the same node is faster than accessing the memory of different nodes .

2） read test
# numactl --cpubind=0 --membind=0 dd if=/dev/shm/A of=/dev/null bs=1K count=1024K
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 1.09543 s, 980 MB/s

# numactl --cpubind=0 --membind=1 dd if=/dev/shm/A of=/dev/null bs=1K count=1024K
1048576+0 records in
1048576+0 records out
1073741824 bytes (1.1 GB) copied, 1.11862 s, 960 MB/s
Conclusion and write identical . But the gap is small .

5、numa Open and close

NUMA Can be in BIOS and OS Two level switch , In fact, there is no detailed information about the difference between the two levels of switches on the Internet . There is a saying that ：BIOS and OS The closing of the NUMA stay interleave There are differences in granularity ,BIOS It should be cache line（64B） Granularity , and OS Use the kernel page table, so it is a page （4kB） Granularity . In effect BIOS The performance should be more stable , but OS The configuration of is relatively more convenient .

BIOS layer Of NUMA Set up ：
1. see BIOS Is the layer on NUMA

# grep -i numa /var/log/dmesg

# 1. If "No NUMA configuration found"

# shows NUMA by disable.

# 2. If not "No NUMA configuration found"

# shows NUMA by enable.

2. modify BIOS interleave

Be careful , because BIOS A wide variety , Please take the actual situation as the case may be .

Parameter path ：

BIOS: interleave

The set value ：

Disable # interleave close , Turn on NUMA.