Deep learning the principle of armv8/armv9 cache
2022-06-13 02:12:00 【Code changes the world CTW】
Contents
- 1、Why use cache?
- 2、Background: changes in the architecture
- 2、The cache hierarchy in the big.LITTLE architecture (A53 as an example)
- 3、The cache hierarchy in the DynamIQ architecture (A76 as an example)
- 4、DSU / L3 cache
- 5、How big are the L1/L2/L3 caches?
- 6、Cache-related terms
- 7、Cache allocation and write policies (allocation, write-through, write-back)
- 8、Memory types defined in the architecture
- 9、Cache domains defined in the architecture (inner, outer)
- 10、Memory types in the architecture (the MAIR_ELx registers)
- 11、Cache types (VIVT, PIPT, VIPT)
- 12、Inclusive and exclusive caches
- 13、Cache lookup flow (unofficial, in plain language)
- 14、How a cache is organized (index, way, set)
- 15、What is in a cache line
- 16、Cache lookup examples
- 17、Cache lookup principle
- 18、Cache maintenance
- 19、Maintaining memory coherency in software – invalidating the cache
- 20、Maintaining memory coherency in software – flushing the cache
- 21、Introduction to cache coherency instructions
- 22、Introduction to the PoC/PoU points
- 23、Summary of cache coherency instructions
- 24、Examples of cache coherency instructions used in the kernel
- 25、Linux Kernel Cache API
- 26、The caches of the A76
- 27、The caches of the A78
- 28、Cache-related system registers in armv8/armv9
- 29、Cache coherency between multiple cores
- 30、Introduction to MESI/MOESI
1、Why use cache?
In the early days of the ARM architecture, the clock speed of the processor and the access speed of memory were roughly similar. Today's processor cores are far more complex and can be clocked orders of magnitude faster, but the frequencies of external buses and memory devices have not scaled at the same rate. It is possible to implement small on-chip SRAM blocks that run at the same speed as the core, but such RAM is very expensive compared with standard DRAM blocks, whose capacity can be thousands of times higher. In many ARM-based processor systems, accessing external memory takes tens or even hundreds of core cycles.
A cache is a small, fast block of memory between the core and main memory. It holds copies of items in main memory. Accesses to the cache are much faster than accesses to main memory. Whenever the core reads or writes a particular address, it first looks in the cache. If it finds the address in the cache, it uses the data in the cache instead of performing an access to main memory. This significantly increases the potential performance of the system by reducing the effect of slow external memory access times, and it also reduces the power consumption of the system by avoiding the need to drive external signals.
2、Background: changes in the architecture

- DynamIQ is a new generation of multi-core microarchitecture technology announced by Arm in 2017. Its official name is DynamIQ big.LITTLE (hereinafter DynamIQ), and it replaces the big.LITTLE technology that had been in use for many years.
- big.LITTLE divides the cores of a multi-core processor IP into two clusters, each cluster holding at most 4 cores, for at most 4+4=8 cores in total; a single DynamIQ cluster supports up to 8 cores.
- With big.LITTLE, big and little cores must be placed in different clusters, e.g. 4+4 (4 big cores + 4 little cores). A single DynamIQ cluster can contain both big and little cores (a heterogeneous cluster), and big and little cores can be combined freely, allowing flexible configurations such as 1+3 or 1+7 that were previously impossible.
- In big.LITTLE, each cluster can only use one voltage, so the CPU cores in the same cluster run at a single frequency; in DynamIQ, each CPU core can have its own voltage and its own frequency.
- In big.LITTLE, the CPU cores inside a cluster share one L2 cache; in DynamIQ, each CPU core has its own private L2 cache, and all cores share an L3 cache. The capacities of the L2 and L3 caches are configurable: each core's private L2 cache can range from 256KB to 512KB, and the shared L3 cache from 1MB to 4MB. This design greatly improves the speed of cross-core data exchange. The L3 cache is part of the DynamIQ Shared Unit (DSU).

2、The cache hierarchy in the big.LITTLE architecture (A53 as an example)


3、The cache hierarchy in the DynamIQ architecture (A76 as an example)

4、DSU / L3 cache
The DSU-AE implements the system control registers, which are common to all cores in the cluster and can be accessed from any core in the cluster. These registers provide:
- Control of cluster power management.
- L3 cache control.
- CHI QoS bus control and scheme ID assignment.
- DSU-AE hardware configuration information, including the selected Split-Lock cluster execution mode.
- L3 cache hit and miss count information.
L3 cache:
- Cache size options: 512KB, 1MB, 1.5MB, 2MB, or 4MB; cache line = 64 bytes
- The 1.5MB cache is 12-way set associative
- The 512KB, 1MB, 2MB, and 4MB caches are 16-way set associative
5、How big are the L1/L2/L3 caches?
Refer to the ARM documentation; for each core the cache size is either fixed or configurable.
6、Cache-related terms
Think: what are set, way, TAG, index, cache line, and entry?
7、Cache allocation and write policies (allocation, write-through, write-back)
Read allocation (read allocation)
When the CPU reads data and a cache miss occurs, a cache line is allocated to hold the data read from main memory. By default, caches support read allocation.
Write allocation (write allocation)
The write allocation policy is considered when a CPU write misses in the cache. Without write allocation, a write instruction simply updates the data in main memory and nothing more. With write allocation, the data is first loaded from main memory into a cache line (as if a read allocation were performed first), and then the data in the cache line is updated.
Write-through (write through)
When the CPU executes a store instruction that hits in the cache, both the cache and main memory are updated, so the cache always stays consistent with the data in main memory.
Write-back (write back)
When the CPU executes a store instruction that hits in the cache, only the data in the cache is updated. Each cache line has a bit recording whether its data has been modified, called the dirty bit (looking back at the earlier figure, the D next to the cache line is the dirty bit); on a write hit the dirty bit is set. The data in main memory is only updated when the cache line is replaced or explicitly cleaned. Therefore the data in main memory may be stale while the modified data sits in the cache, i.e. the cache and main memory may be inconsistent.

8、Memory types defined in the architecture

9、Cache domains defined in the architecture (inner, outer)
For the cacheable attribute, inner and outer describe how the caches are classified. For example, L1/L2 can be treated as inner and L3 as outer.
Usually, caches integrated on the processor side are inner caches, while caches on the external AMBA bus are outer caches. For example:
- In the big.LITTLE architecture above (A53 as an example), L1/L2 are inner caches; if an L3 is attached at the SoC level, it is an outer cache
- In the DynamIQ architecture above (A76 as an example), L1/L2/L3 are inner caches; if a System cache (or whatever name it goes by) is attached at the SoC level, it is an outer cache
Each class of cache can then be configured with its own attributes, for example:
- Configure inner Non-cacheable, inner Write-Through Cacheable, or inner Write-Back Cacheable
- Configure outer Non-cacheable, outer Write-Through Cacheable, or outer Write-Back Cacheable

For the shareable attribute, inner and outer describe the scope of the sharing domain. For example, inner may cover the caches within L1/L2, while outer covers the caches within L1/L2/L3.

A short summary of the Inner/Outer attributes:

If a block of memory is configured as Non-cacheable, its data will not be cached, so all observers see consistent memory; in other words, it is equivalent to Outer Shareable.
The official documentation also describes this, in section B2.7.2: "Data accesses to memory locations are coherent for all observers in the system, and correspondingly are treated as being Outer Shareable". If a block of memory is configured as write-through cacheable or write-back cacheable, the data will be cached, using the write-through or write-back cache policy respectively.
If a block of memory is configured as non-shareable, then when core0 accesses it, the data is cached in core0's L1 d-cache and cluster0's L2 cache, and not in any other cache.
If a block of memory is configured as inner-shareable, then when core0 accesses it, the data will be cached in the L1 d-caches of core0 and core1 and in cluster0's L2 cache, but not in any cache of cluster1.
If a block of memory is configured as outer-shareable, then when core0 accesses it, the data may be cached in all caches.
| | Non-cacheable | write-through cacheable | write-back cacheable |
|---|---|---|---|
| non-shareable | Data is not cached (to observers this is equivalent to outer-shareable) | When Core0 reads, the data is cached in Core0's L1 d-cache and cluster0's L2 cache. If core0 and core1 both read and write the memory, it is cached in both core0's and core1's L1 d-caches; when core0 then reads the data, core0's L1 d-cache is updated but core1's L1 d-cache is not | Same as write-through |
| inner-shareable | Data is not cached (to observers this is equivalent to outer-shareable) | When Core0 reads, the data is cached in all caches within Cluster0 | Same as write-through |
| outer-shareable | Data is not cached (to observers this is equivalent to outer-shareable) | When Core0 reads, the data is cached in all caches | Same as write-through |
10、Memory types in the architecture (the MAIR_ELx registers)

11、Cache types (VIVT, PIPT, VIPT)
The MMU consists of the TLB and the address translation logic:
- Translation Lookaside Buffer (TLB)
- Address Translation

Caches are classified as:
- PIPT
- VIVT
- VIPT

12、Inclusive and exclusive caches

Let's start with a simple memory read on a single core, e.g. LDR X0, [X1], assuming X1 points to main memory and the region is cacheable.
(1) The core first looks in the L1 cache; if it hits, the data is returned directly to the core
(2) The core first looks in the L1 cache; if it misses, the L2 cache is queried next. If L2 hits, the L2 cache returns the data to the core, and this cache line also replaces one of the cache lines in L1
(3) If both L1 and L2 miss, the data is read from memory, cached in L1 and L2, and returned to the core
Now consider a more complex multi-core system, ignoring L3.
(1) With an inclusive cache, the data is cached in both L1 and L2
(2) With an exclusive cache, the data is cached only in L1, not in L2
- Strictly inclusive: any cache line present in an L1 cache will also be present in the L2
- Weakly inclusive: a cache line will be allocated in L1 and L2 on a miss, but it can later be evicted from L2
- Fully exclusive: any cache line present in an L1 cache will not be present in the L2
13、Cache lookup flow (unofficial, in plain language)

Suppose a 4-way set associative cache of size 64KB with a 64-byte cache line. Then 1 way = 16KB, and the number of indexes (sets) = 16KB / 64 bytes = 256. (Note: 0x4000 = 16KB, 0x40 = 64 bytes.)
0x4000 – index 0
0x4040 – index 1
0x4080 – index 2
…
0x7fc0 – index 255
0x8000 – index 0
0x8040 – index 1
0x8080 – index 2
…
0xbfc0 – index 255
14、How a cache is organized (index, way, set)

- Fully associative
- Direct mapped
- 4-way set associative

For example, the A76:
L1 i-cache: 64KB, 4-way 256-set associative, 64-byte cache line
i-side TLB: fully associative, supporting 4KB, 16KB, 64KB, 2MB, and 32MB pages
L1 d-cache: 64KB, 4-way 256-set associative, 64-byte cache line
d-side TLB: fully associative, supporting 4KB, 16KB, 64KB, 2MB, and 512MB pages
L2 cache: 8-way set associative, size options 128KB, 256KB, or 512KB
15、What is in a cache line

Each line in the cache includes:
• A tag value from the associated physical address.
• Valid bits to indicate whether the line exists in the cache, that is, whether the tag is valid. Valid bits can also serve as state bits for the MESI state if the cache is coherent across multiple cores.
• Dirty data bits to indicate whether the data in the cache line is not coherent with external memory.
• The data itself.
So what exactly is stored in the TAG? (Section 13 said that the TAG equals the TAG field of the physical address.)
Taking the A78 as an example, the figure shows what is in the TAG:

Additionally: what is in a TLB entry? Again taking the A78 as an example:

16、Cache lookup examples

17、Cache lookup principle
First use the index to select a set in the cache, then compare the TAG; when comparing tags, the valid flag is checked as well.
18、Cache maintenance

There are three types of instructions for software cache maintenance:
- Invalidation: modifies the valid bit to invalidate the cache contents; mainly used on the read path
- Cleaning: what we usually call flushing the cache; writes the cached data back to memory and clears the dirty flag
- Zero: sets a block of data in the cache to 0; typically used to zero memory quickly without first fetching its old contents from main memory
When does software need to maintain the cache?
(1) When another master has changed external memory, e.g. a DMA operation
(2) When the MMU is enabled or disabled for accesses across a whole memory range, e.g. the REE has the MMU enabled while the TEE has it disabled
Regarding point (2), how is the cache related to the MMU? Because turning the MMU on or off affects memory permissions and cache policies.
19、Maintaining memory coherency in software – invalidating the cache

20、Maintaining memory coherency in software – flushing the cache

21、Introduction to cache coherency instructions
<cache> <operation>{, <Xt>}
Here <cache> is DC (data cache) or IC (instruction cache), <operation> names the maintenance operation and target point (e.g. IVAC, CVAC, CIVAC), and Xt holds the virtual address for by-VA operations.

22、Introduction to the PoC/PoU points

- PoC (Point of Coherency) is the point at which all observers, for example, cores, DSPs, or DMA engines, that can access memory, are guaranteed to see the same copy of a memory location
- PoU (Point of Unification) for a core is the point at which the instruction and data caches and translation table walks of the core are guaranteed to see the same copy of a memory location
23、Summary of cache coherency instructions

24、Examples of cache coherency instructions used in the kernel

25、Linux Kernel Cache API
linux/arch/arm64/mm/cache.S
linux/arch/arm64/include/asm/cacheflush.h
void __flush_icache_range(unsigned long start, unsigned long end);
int invalidate_icache_range(unsigned long start, unsigned long end);
void __flush_dcache_area(void *addr, size_t len);
void __inval_dcache_area(void *addr, size_t len);
void __clean_dcache_area_poc(void *addr, size_t len);
void __clean_dcache_area_pop(void *addr, size_t len);
void __clean_dcache_area_pou(void *addr, size_t len);
long __flush_cache_user_range(unsigned long start, unsigned long end);
void sync_icache_aliases(void *kaddr, unsigned long len);
void flush_icache_range(unsigned long start, unsigned long end)
void __flush_icache_all(void)
26、The caches of the A76
A76
L1 i-cache: 64KB, 4-way 256-set associative, 64-byte cache line
L1 d-cache: 64KB, 4-way 256-set associative, 64-byte cache line
L2 cache: 8-way set associative, size options 128KB, 256KB, or 512KB
L1 i-side TLB: 48 entries, fully associative, supporting 4KB, 16KB, 64KB, 2MB, and 32MB pages
L1 d-side TLB: 48 entries, fully associative, supporting 4KB, 16KB, 64KB, 2MB, and 512MB pages
L2 TLB: 1280 entries, 5-way set associative
L3 cache
Cache size options: 512KB, 1MB, 1.5MB, 2MB, or 4MB; cache line = 64 bytes
The 1.5MB cache is 12-way set associative
The 512KB, 1MB, 2MB, and 4MB caches are 16-way set associative

27、The caches of the A78
A78
L1 i-cache: 32 or 64KB, 4-way set associative, 64-byte cache line, VIPT
L1 d-cache: 32 or 64KB, 4-way set associative, 64-byte cache line, VIPT
L1 i-side TLB: 32 entries, fully associative, supporting 4KB, 16KB, 64KB, 2MB, and 32MB pages
L1 d-side TLB: 32 entries, fully associative, supporting 4KB, 16KB, 64KB, 2MB, and 512MB pages
L2 TLB: 1024 entries, 4-way set associative
L3 cache
Cache size options: 512KB, 1MB, 1.5MB, 2MB, or 4MB; cache line = 64 bytes
The 1.5MB cache is 12-way set associative
The 512KB, 1MB, 2MB, and 4MB caches are 16-way set associative

28、Cache-related system registers in armv8/armv9
ID registers
CTR_EL0, Cache Type Register
- IminLine, bits [3:0]: Log2 of the number of words in the smallest cache line of all the instruction caches that are controlled by the PE
- DminLine, bits [19:16]: Log2 of the number of words in the smallest cache line of all the data caches and unified caches that are controlled by the PE
29、Cache coherency between multiple cores
For the big.LITTLE architecture
For the DynamIQ architecture

30、Introduction to MESI/MOESI



Events:
- RH = Read Hit
- RMS = Read miss, shared
- RME = Read miss, exclusive
- WH = Write hit
- WM = Write miss
- SHR = Snoop hit on read
- SHI = Snoop hit on invalidate
- LRU = LRU replacement
Bus Transactions:
- Push = Write cache line back to memory
- Invalidate = Broadcast invalidate
- Read = Read cache line from memory