[Original] The influence of memory and cache on Xenomai real-time performance on the ARM platform

2020-11-07 20:37:00 【Muduo】

1. Problem overview
2. How stress generates memory pressure
2. Cache factors
- 2.1 No load
- 2.2 Under load (CPU/IO)
3. Memory management factors

1. Problem overview

While testing Xenomai latency on a TI AM5728, I found that memory pressure has a large impact on latency. The data without memory pressure are shown below. (Notes: latency values in this article are in microseconds; all tests use the default gravity; the CPU that runs the real-time tasks is isolated with isolcpus=1; and the conclusions may hold only for the ARM platform.)

 stress -c 16 -i 4  -d 2 --hdd-bytes 256M

	user-task latency	kernel-task latency	TimerIRQ
Minimum	0.621	-0.795	-1.623
Average	3.072	0.970	-0.017
Maximum	16.133	12.966	7.736

Adding the parameters --vm 2 --vm-bytes 128M simulates memory pressure. This creates 2 worker processes that repeat the following cycle: request 128MB of memory, write the character 'Z' every 4096 bytes, read each one back to check that it is still 'Z', free the memory, and start again.

 stress -c 16 -i 4  -d 2 --hdd-bytes 256M --vm 2 --vm-bytes 128M

Latency was then measured with memory pressure added. The test ran for 10 minutes (a full 1-hour run was skipped for lack of time). The data are as follows:
vm-load-all

	user-task latency	kernel-task latency	TimerIRQ
Minimum	0.915	-1.276	-1.132
Average	3.451	0.637	0.530
Maximum	30.918	25.303	8.240
Standard deviation	0.626	0.668	0.345

With memory pressure added, the maximum latency is roughly twice the maximum without it: 30.918 vs 16.133 for the user task and 25.303 vs 12.966 for the kernel task. The TimerIRQ maximum barely changes (8.240 vs 7.736).

2. How stress generates memory pressure

The stress options related to memory pressure are:

-m, --vm N fork N workers that repeatedly malloc()/free() memory
--vm-bytes B each worker operates on B bytes of memory (default 256MB)
--vm-stride B touch one byte every B bytes (default 4096)
--vm-hang N sleep N seconds after malloc() before calling free() (no sleep by default)
--vm-keep allocate memory only once and release it when the process ends

These options can be combined to simulate different kinds of pressure. --vm-bytes sets the amount of memory allocated each time. --vm-stride touches one byte every B bytes and mainly simulates cache misses. --vm-hang sets how long the memory is held, which controls the allocation frequency.

The parameters used above, --vm 2 --vm-bytes 128M, create 2 worker processes that repeat the following cycle: request 128MB of memory, write the character 'Z' every 4096 bytes, read each one back to check that it is still 'Z', free the memory, and start again. Going back to the original problem, the variables that can affect real-time performance are:

(1). The size of each memory allocation

(2). Whether stress allocates/frees memory while the latency test runs

(3). Whether the allocated memory is actually accessed

(4). The stride of each memory access

Summarised further, the memory-related factors that can affect real-time performance are:

Cache effects
- High cache miss rate
- Memory throughput (bandwidth)
Memory management
- Memory allocation/release operations
- Page faults on memory access (MMU congestion)

The tests below use parameters designed to isolate each of these effects.

2. Cache factors

Turning the cache off simulates a 100% cache miss rate. This measures the worst-case impact of cache misses, such as those caused by congestion on the memory bus and in off-chip memory.

2.1 No load

The effect of the L1 cache on the AM5728 is not tested here; the tests focus on the L2 cache. To turn it off, disable the L2 cache in the kernel configuration and rebuild the kernel.

System Type  --->
	[ ] Enable the L2x0 outer cache controller

To confirm that the L2 cache is really off, use the program below. It allocates memory for SIZE ints and adds 3 to them in two loops: the first loop uses a stride of 1 and the second a stride of 16 (each int is 4 bytes, so 16 ints are 64 bytes, which is exactly the cache line size). Because the second loop performs 1/16 as many accesses, its execution time without a cache should be about 1/16 of the first loop's, and that ratio is what verifies the L2 cache is off.

With the L2 cache on, the two loops take 2000ms and 153ms (a ratio of 13). With the L2 cache off, they take 2618ms and 153ms (a ratio of 17). The ratio exceeds 16 because both loops use the same buffer: no physical memory is mapped right after the allocation, so the first loop also pays for page fault handling and takes slightly longer.

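#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE (64 * 1024 * 1024)   /* number of ints, i.e. 256MB in total */

/* Elapsed time in ns, using 64-bit math so that tv_sec * 1e9
 * cannot overflow a 32-bit long on ARM. */
static long long elapsed_ns(const struct timespec *start,
                            const struct timespec *end)
{
        return (long long)(end->tv_sec - start->tv_sec) * 1000000000LL
               + (end->tv_nsec - start->tv_nsec);
}

int main(void)
{
        struct timespec time_start, time_end;
        int i;
        /* The contents are irrelevant; only the access time is measured. */
        int *buff = malloc(SIZE * sizeof(int));

        if (buff == NULL) {
                perror("malloc");
                return 1;
        }

        /* Loop 1: touch every int (stride of 4 bytes). */
        clock_gettime(CLOCK_MONOTONIC, &time_start);
        for (i = 0; i < SIZE; i++)
                buff[i] += 3;
        clock_gettime(CLOCK_MONOTONIC, &time_end);
        printf("1:%lldms  ", elapsed_ns(&time_start, &time_end) / 1000000);

        /* Loop 2: touch one int per 64-byte cache line (stride of 16 ints). */
        clock_gettime(CLOCK_MONOTONIC, &time_start);
        for (i = 0; i < SIZE; i += 16)
                buff[i] += 3;
        clock_gettime(CLOCK_MONOTONIC, &time_end);
        printf("64:%lldms\n", elapsed_ns(&time_start, &time_end) / 1000000);

        free(buff);
        return 0;
}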

With no load, latency was measured with the L2 cache on and off (10 minutes per test). The data are as follows:

	L2 Cache ON	L2 Cache OFF
min	-0.879	2.363
avg	1.261	4.174
max	8.510	13.161

The data show that latency rises across the board once the L2 cache is turned off. With no load, the L2 cache hit rate is high, so the same piece of code executes in less time, which significantly improves the real-time performance of the system.

2.2 Under load (CPU/IO)

This test applies no memory pressure. It only checks whether turning the L2 cache on or off affects latency under CPU-intensive and IO load. The stress parameters are:

stress -c 16 -i 4

The test again runs for 10 minutes. The data are as follows:

	L2 ON	L2 OFF
min	0.916	1.174
avg	4.134	4.002
max	10.463	11.414

Conclusion: under CPU and IO load, turning the L2 cache on or off makes little difference to latency (average 4.134 vs 4.002, maximum 10.463 vs 11.414).

Analysis:

With no load, the L2 cache is mostly idle, so the real-time task has a high cache hit rate and the average latency is low. With the L2 cache off, the L2 hit rate is effectively zero, and both the average and the maximum increase.
Once CPU and IO load is added, the stress workers (16 CPU workers and 4 IO workers) compete for CPU resources. By the time the real-time task preempts and runs, the L2 cache has already been filled with data from the stress workers, so nearly every access by the real-time task misses. That is why latency under CPU and IO load is about the same whether the L2 cache is on or off.

3. Memory management factors

The tests in section 2 show that latency under load is almost the same with or without the cache, so the cache can be ruled out. The next tests measure how memory allocation/release and page faults on memory access (MMU congestion) affect latency.

3.1 Memory allocation/release

On top of the load used in section 2, memory allocation/release pressure is added. Latency is measured while stress allocates and releases memory in sizes of 1M, 2M, 4M, 8M, 16M, 32M, 64M, 128M and 256M, with 3 minutes per test. To test MMU congestion as well, the allocated memory is accessed with strides of '1' '16' '32' '64' '128' '256' '512' '1024' '2048' '4096'. The test script is shown below. (The script as posted sets test_time=300, i.e. 5 minutes, and sweeps sizes from 64M to 1024M; change vm_size and test_time for other runs.)

#!/bin/bash
test_time=300 #5min
base_stride=1
VM_MAXSIZE=1024
STRIDE_MAXSIZE=('1' '16' '32' '64' '128' '256' '512' '1024' '2048' '4096')

# SIGKILL cannot be trapped, so clean up on INT, TERM and normal exit instead
trap 'killall stress 2>/dev/null' INT TERM EXIT

for((vm_size = 64;vm_size <= VM_MAXSIZE; vm_size = vm_size * 2));do
        for stride in ${STRIDE_MAXSIZE[@]};do
                stress -c 16 -i 4 -m 2 --vm-bytes ${vm_size}M --vm-stride $stride  &
                echo "--------${vm_size}-${stride}------------"
                latency -p 100  -s -g ${vm_size}-${stride} -T $test_time -q 
                killall stress >/dev/null
                sleep 1
                stress -c 16 -i 4 -m 2 --vm-bytes ${vm_size}M --vm-stride $stride --vm-keep &
                echo "--------${vm_size}-${stride}-keep----------------"
                latency -p 100  -s -g ${vm_size}-${stride}-keep -T $test_time -q 
                killall stress >/dev/null
                sleep 1
                stress -c 16 -i 4 -m 2 --vm-bytes ${vm_size}M --vm-stride $stride --vm-hang 2 &
                echo "--------${vm_size}-${stride}-hang----------------"
                latency -p 100  -s -g ${vm_size}-${stride}-hang -T $test_time -q 
                killall stress >/dev/null
                sleep 1
        done
done

With the L2 cache on, the latency data for allocating and releasing memory of different sizes are plotted below. The horizontal axis is the amount of memory requested each time by the memory pressure task, and the vertical axis is the maximum latency under that load:

The plot shows two inflection points, at 4MB and 16MB. When the allocated/released memory is within 4MB, latency is unaffected and stays at its normal level. When it exceeds 16MB, latency reaches 30us or more, which matches the problem described in section 1. This shows that memory allocation and release by ordinary Linux tasks can affect real-time performance.

3.2 MMU congestion

The kernel page size is 4K, so on top of the setup in 3.1 the parameter --vm-stride 4096 is added so that every memory access made by stress lands on a different page and faults, which simulates MMU congestion. With the L2 cache off, the test data plot as follows:

With the L2 cache on, the test data plot as follows:

MMU congestion has little effect on real-time performance.

4. Summary

With the individual factors separated, the tests show that the degraded real-time performance under memory pressure is caused by memory allocation and release. On this platform, memory allocation and release requests from ordinary Linux tasks running on cpu0 affect the real-time performance of real-time tasks running on cpu1.

The AM5728 has only two cache levels. The L2 cache significantly improves real-time performance when the CPU is idle, but under heavy CPU load its contents are evicted and refilled so often that real-time tasks rarely hit in the cache, and it provides almost no real-time benefit.

For more information, see another article on this blog: some configuration suggestions for improving Xenomai real-time performance

Copyright notice
This article was written by [Muduo]. Please include a link to the original when reposting. Thank you.