[Original] The influence of ARM platform memory and cache on the real-time performance of Xenomai

2020-11-07 20:37:00 Muduo

1. Problem overview

While testing Xenomai latency on the TI AM5728, I found that memory pressure has a large impact on latency. The data without memory pressure are shown below. (Notes: all tests in this article use the default gravity; the CPU used for real-time tasks is isolated with isolcpus=1; the conclusions here may hold only for ARM platforms.)

 stress -c 16 -i 4  -d 2 --hdd-bytes 256M  
            user-task latency (us)   kernel-task latency (us)   TimerIRQ (us)
Minimum     0.621                    -0.795                     -1.623
Average     3.072                    0.970                      -0.017
Maximum     16.133                   12.966                     7.736

Adding the parameters --vm 2 --vm-bytes 128M simulates memory pressure. (This creates 2 processes that generate memory pressure by repeating the following cycle: allocate 128MB, write the character 'Z' every 4096 bytes across the allocated memory, read each one back to check that it is still 'Z', free the memory, and start again from the allocation.)

 stress -c 16 -i 4  -d 2 --hdd-bytes 256M --vm 2 --vm-bytes 128M 

Latency after adding memory pressure, tested for 10 minutes (a 1-hour run was not done for lack of time). The test data are as follows:

                     user-task latency (us)   kernel-task latency (us)   TimerIRQ (us)
Minimum              0.915                    -1.276                     -1.132
Average              3.451                    0.637                      0.530
Maximum              30.918                   25.303                     8.240
Standard deviation   0.626                    0.668                      0.345

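After adding memory pressure, the maximum user-task latency rises from 16.133 to 30.918 and the maximum kernel-task latency from 12.966 to 25.303, roughly 2 times the values without memory pressure. The TimerIRQ maximum barely changes (7.736 to 8.240).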

2. How stress generates memory pressure

The stress options related to memory pressure are:

-m, --vm N        fork N processes that loop on malloc()/free()
--vm-bytes B      each process operates on B bytes of memory (default 256MB)
--vm-stride B     touch one byte every B bytes (default 4096)
--vm-hang N       sleep N seconds after malloc before calling free (default: no sleep)
--vm-keep         allocate memory only once and hold it until the process exits

These options can simulate different kinds of pressure. --vm-bytes sets the amount of memory allocated each time. --vm-stride sets the spacing between the bytes that are touched and mainly simulates cache misses. --vm-hang sets how long the memory is held, which controls the allocation frequency.
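
The way --vm-bytes and --vm-stride shape the memory traffic can be illustrated with a small C sketch of one worker pass. This is only a model of the behaviour described in this article, not code taken from the stress source; the function name vm_worker_pass is hypothetical, and --vm-hang and --vm-keep are left out for brevity.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * One pass of a hypothetical memory-pressure worker (illustrative only).
 * Allocates `bytes`, writes 'Z' every `stride` bytes, reads each byte back,
 * then frees the buffer. Returns the number of bytes touched, or -1 if the
 * allocation fails or a byte does not read back as 'Z'.
 */
long vm_worker_pass(size_t bytes, size_t stride)
{
    assert(stride > 0);

    /* volatile keeps the compiler from optimizing the memory touches away */
    volatile char *buf = malloc(bytes);
    if (buf == NULL)
        return -1;

    long touched = 0;

    /* Write phase: dirty one byte every `stride` bytes. */
    for (size_t i = 0; i < bytes; i += stride) {
        buf[i] = 'Z';
        touched++;
    }

    /* Check phase: read every written byte back and verify it. */
    for (size_t i = 0; i < bytes; i += stride) {
        if (buf[i] != 'Z') {
            free((void *)buf);
            return -1;
        }
    }

    /* A real worker would now loop back and allocate again. */
    free((void *)buf);
    return touched;
}
```

With the parameters used in this article, one pass corresponds to vm_worker_pass(128u << 20, 4096), which touches 32768 bytes, one in every 4KB of the 128MB buffer.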

With the parameters used above, --vm 2 --vm-bytes 128M, stress creates 2 processes that generate memory pressure by repeating the following cycle: allocate 128MB, write the character 'Z' every 4096 bytes across the allocated memory, read each one back to check that it is still 'Z', free the memory, and start again from the allocation. Looking back at the problem, the variables that may affect real-time performance are:

(1) The size of each memory allocation

(2) Whether stress allocates/frees memory during the latency test

(3) Whether the allocated memory is actually accessed

(4) The stride of each memory access

These can be grouped into the following memory-related factors that affect real-time performance:

  • Cache effects

    • High cache miss rate
    • Memory speed (bandwidth)
  • Memory management

    • Memory allocation/release operations
    • Page faults on memory access (MMU congestion)

The test parameters below are designed to isolate and check each of these effects.

2. Cache factors

Disabling the cache simulates a 100% cache miss rate. This measures the worst-case impact of cache misses, such as those caused by congestion on the memory bus and in off-chip memory.

2.1 No load

The L1 cache of the AM5728 is not tested here; the tests focus on the L2 cache. To disable it, turn off the L2 cache option in the kernel configuration and recompile the kernel.

System Type  --->
	[ ] Enable the L2x0 outer cache controller  

To confirm that the L2 cache is disabled, use the following program. It allocates memory for SIZE ints and adds 3 to the integers in that memory. The first for loop uses a step of 1 and the second a step of 16 (each int is 4 bytes, so 16 ints are 64 bytes, exactly the size of one cache line). Because the second loop performs only 1/16 as many accesses, its execution time without a cache should be about 1/16 of the first loop's, which verifies that the L2 cache is disabled.

With the L2 cache enabled, the two loops take 2000ms and 153ms (13 times). With the L2 cache disabled, they take 2618ms and 153ms (17 times). The ratio exceeds 16 because both loops use the same buffer: no physical memory is backing it right after the allocation, so the first loop also pays for page-fault handling and takes a little longer.

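#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Number of ints in the buffer; parenthesized so the macro expands safely. */
#define SIZE (64 * 1024 * 1024)

/* Elapsed time in ms; 64-bit arithmetic avoids overflow on 32-bit ARM. */
static long long elapsed_ms(const struct timespec *start, const struct timespec *end)
{
        long long ns = (long long)(end->tv_sec - start->tv_sec) * 1000000000LL
                     + (end->tv_nsec - start->tv_nsec);
        return ns / 1000000;
}

int main(void)
{
        struct timespec time_start, time_end;
        int i;
        /* Contents are left uninitialized; only the access pattern matters. */
        int *buff = malloc(SIZE * sizeof(int));

        if (buff == NULL) {
                perror("malloc");
                return 1;
        }

        /* Pass 1: touch every int (step 1). */
        clock_gettime(CLOCK_MONOTONIC, &time_start);
        for (i = 0; i < SIZE; i++)
                buff[i] += 3;
        clock_gettime(CLOCK_MONOTONIC, &time_end);
        printf("1:%lldms  ", elapsed_ms(&time_start, &time_end));

        /* Pass 2: touch one int per 64-byte cache line (step 16). */
        clock_gettime(CLOCK_MONOTONIC, &time_start);
        for (i = 0; i < SIZE; i += 16)
                buff[i] += 3;
        clock_gettime(CLOCK_MONOTONIC, &time_end);
        printf("64:%lldms\n", elapsed_ms(&time_start, &time_end));

        free(buff);
        return 0;
}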
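
The 1/16 expectation comes from counting memory accesses, while the number of cache lines involved stays the same in both loops. A minimal sketch of that arithmetic follows; the helper names access_count and lines_touched are hypothetical, and the buffer is assumed to start on a cache-line boundary.

```c
#include <assert.h>
#include <stddef.h>

/* Number of loop iterations (memory accesses) over n elements at a given step. */
size_t access_count(size_t n, size_t step)
{
    assert(step > 0);
    return (n + step - 1) / step;
}

/* Number of distinct cache lines those accesses fall into. */
size_t lines_touched(size_t n, size_t step, size_t elem_bytes, size_t line_bytes)
{
    assert(elem_bytes > 0 && line_bytes >= elem_bytes);

    size_t per_line = line_bytes / elem_bytes;   /* elements per cache line */

    if (step <= per_line)
        return (n + per_line - 1) / per_line;    /* no line is skipped */

    return access_count(n, step);                /* at most one access per line */
}
```

For SIZE = 64*1024*1024 ints, the step-1 loop makes 16 times as many accesses as the step-16 loop, yet both touch the same 4194304 cache lines of 64 bytes. Without a cache every access goes to memory, so time should scale with the access count. With a cache, each line only has to be fetched once in either loop, which may be why the measured ratio with L2 enabled (13 times) is below 16.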

Without load, latency was measured with the L2 cache enabled and disabled (10 minutes each). The data are as follows:

        L2 Cache ON   L2 Cache OFF
min     -0.879        2.363
avg     1.261         4.174
max     8.510         13.161

The data show that latency rises across the board once the L2 cache is disabled. Without load, the L2 cache hit rate is high, so the same piece of code executes in less time, which significantly improves the real-time performance of the system.

2.2 Under load (CPU/IO)

No memory pressure is applied here. This test only checks whether disabling the L2 cache affects latency under CPU-intensive and IO load. The stress parameters are as follows:

stress -c 16 -i 4

Again tested for 10 minutes. The data are as follows:

        L2 ON     L2 OFF
min     0.916     1.174
avg     4.134     4.002
max     10.463    11.414

Conclusion: under CPU and IO load, it makes little difference whether the L2 cache is enabled or disabled.

Analysis:

  • Without load, the L2 cache is mostly idle, so the real-time task sees a high cache hit rate and the average latency is low. With the L2 cache disabled, every access misses, and both the average and the maximum increase.

  • With CPU and IO load added, the stress workers (16 CPU and 4 IO workers with -c 16 -i 4) compete for CPU resources. By the time the real-time task preempts and runs, the L2 cache has been filled with data from the stress workers, so the real-time task misses almost 100% of the time. This is why latency is about the same with the L2 cache enabled or disabled under CPU and IO load.

3. Memory management factors

The cache tests above show that, under load, latency is almost the same with or without the cache, so the cache can be ruled out. Next we test the impact on latency of memory allocation/release and of page faults on memory access (MMU congestion).

3.1 Memory allocation/release

On top of the load used in the cache tests, add memory allocation/release pressure and measure latency for allocation sizes of 1M, 2M, 4M, 8M, 16M, 32M, 64M, 128M and 256M, 3 minutes per test. To also test MMU congestion, the allocated memory is accessed with strides of '1' '16' '32' '64' '128' '256' '512' '1024' '2048' '4096'. The test script is as follows (as listed, it sets test_time=300, i.e. 5 minutes, and sweeps vm_size from 64 to 1024):

#!/bin/bash
test_time=300    # seconds per latency run (5 min)
VM_MAXSIZE=1024  # largest --vm-bytes value, in MB
STRIDE_MAXSIZE=('1' '16' '32' '64' '128' '256' '512' '1024' '2048' '4096')

# SIGKILL cannot be trapped; clean up stress on Ctrl-C or termination instead.
trap 'killall stress; exit 1' INT TERM

# Run one latency test under stress.
# $1 = size in MB, $2 = stride, $3 = label suffix, remaining args = extra stress options
run_case() {
        local vm_size=$1 stride=$2 suffix=$3
        shift 3
        stress -c 16 -i 4 -m 2 --vm-bytes "${vm_size}M" --vm-stride "$stride" "$@" &
        echo "--------${vm_size}-${stride}${suffix}------------"
        latency -p 100 -s -g "${vm_size}-${stride}${suffix}" -T "$test_time" -q
        killall stress >/dev/null
        sleep 1
}

for ((vm_size = 64; vm_size <= VM_MAXSIZE; vm_size = vm_size * 2)); do
        for stride in "${STRIDE_MAXSIZE[@]}"; do
                run_case "$vm_size" "$stride" ""                   # allocate and free continuously
                run_case "$vm_size" "$stride" "-keep" --vm-keep    # allocate once and keep the memory
                run_case "$vm_size" "$stride" "-hang" --vm-hang 2  # hold the memory 2 s before freeing
        done
done

With the L2 cache enabled, the latency data for allocating and releasing memory of different sizes are plotted below. The horizontal axis is the amount of memory the memory-pressure task allocates each time, and the vertical axis is the maximum latency under that pressure:

The plot shows two inflection points, at 4MB and 16MB. When the allocated/released memory is within 4MB, latency is unaffected and stays at its normal level. When it exceeds 16MB, latency reaches 30us or more, which matches the original problem. This shows that memory allocation and release by ordinary Linux tasks can affect real-time performance.

3.2 MMU congestion

Since the kernel page size is 4K, the parameter --vm-stride 4096 is added on top of the 3.1 setup so that every memory access made by stress causes a page fault, which simulates MMU congestion. With the L2 cache off, the test data are plotted as follows:

With the L2 cache on, the test data are plotted as follows:

MMU congestion has little effect on real-time performance.
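
The premise that a 4096-byte stride makes every access to freshly allocated memory fault can be checked on a Linux host with a sketch like the one below. It is not part of the original test setup: the function name faults_for_page_stride is hypothetical, it uses an anonymous mmap rather than malloc, and the count it returns depends on the kernel (huge pages, for example, would lower it).

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Touch one byte per page of a fresh anonymous mapping and return how many
 * minor page faults the process took while doing so (-1 on error).
 */
long faults_for_page_stride(size_t pages)
{
    assert(pages > 0);

    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;
    size_t len = pages * (size_t)page;

    /* Fresh anonymous memory: no physical pages are attached yet. */
    volatile char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    struct rusage before, after;
    if (getrusage(RUSAGE_SELF, &before) != 0) {
        munmap((void *)buf, len);
        return -1;
    }

    /* One write per page, the same pattern as a stride equal to the page size. */
    for (size_t off = 0; off < len; off += (size_t)page)
        buf[off] = 'Z';

    if (getrusage(RUSAGE_SELF, &after) != 0) {
        munmap((void *)buf, len);
        return -1;
    }

    munmap((void *)buf, len);
    return after.ru_minflt - before.ru_minflt;
}
```

On a typical configuration the returned value should be close to the number of pages touched, since each first write to a page has to be resolved by the kernel before the access completes.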

4. Summary

Testing each factor separately shows that the poor real-time performance under memory pressure is caused by memory allocation and release. On this platform, memory allocation/release requests from ordinary Linux tasks running on cpu0 affect the real-time performance of real-time tasks running on cpu1.

The AM5728 has only two levels of cache. The L2 cache significantly improves real-time performance when the CPU is idle, but under heavy CPU load its contents are swapped in and out so frequently that the real-time task rarely hits, and it provides almost no real-time benefit.

For more information, see another article on this blog: Some configuration suggestions that help improve Xenomai real-time performance

Copyright notice
This article was written by [Muduo]. Please include a link to the original when reposting. Thank you.