当前位置:网站首页>[DSP] [Part 1] start DSP learning

[DSP] [Part 1] start DSP learning

2022-07-06 20:27:00 Kshine2017

2022 year 6 month 14 Japan

1. Deployment development environment

A little . Work needs ,TMS320C6678.
Learn from today DSP Knowledge .
How to install the deployment environment will be added later .

1.1 Components

 Insert picture description here

1.1.1 SDK

  • MCSDK(CCSv5,CCSv6)+Path( All English path , No Chinese , No spaces )
    After installation , Need a patch .

  • Pocessor SDK (CCSv6)

1.1.2 Components

  • XDCTools
  • SYS/BIOS
  • NDK
  • UIA
  • XDAIS
  • Framework Components

1.1.3 Algorithm library

  • DSPLIB
  • MATHLIB
  • IMGLIB
  • VLIB

1.2 Integrated development environment

  • Code Composer Studio (CCSv5,CCSv6)

1.3 other

1.3.1 compiler

CGT 7.4.x ,CGT 8.1.x

1.3.2 code / Decoding algorithm library

  • video
    HEVC[H265], High compression efficiency , Large amount of computation .
    H264 BP/MP/HP
    MPEG4
    JPEG
    JPEG2000

  • voice
    G711
    G722/G722…1/G722.2
    G726
    G728
    OPUS

  • Medicine related

  • Telecommunications related

1.4 setup script

1.4.1 download CCS Mirror image

- Official website , Search for CCS, Download the relevant version .
https://www.ti.com.cn/tool/cn/download/CCSTUDIO
 Insert picture description here

1.4.2 install CCS Mirror image

  • If during installation , Tips VC++ Runtime installation failed , You can install it first VC++2008 as well as VC++2012 Runtime [32 position ], And then install it CCSv5.

1.4.3 Install Fonts

  • It is recommended to install equal width fonts , The width is the same in both Chinese and English .
  • Download constant width fonts , Put it in Windows Of font In the folder
  • stay CCS Code settings in the software , Font settings . Font size recommendations 12 Number , Script selection chinese GB2312.

1.4.4 Modify the theme

  • install eclipse Theme plugin ,Eclipse Marketplace, to update , Search for theme,
  • Choose a dark theme , Eye protection .

1.4.5 The menu displays Chinese

  • install eclipse Language Translation , According to Chinese .

1.5 Install software components

Download address of embedded software products :https://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/index.html
Older software :https://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/legacy.html

1.5.1 TI-RTOS

https://www.ti.com.cn/tool/zh-cn/TI-RTOS-MCU?keyMatch=&tisearch=search-everything&usecase=software#downloads
https://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/tirtos/index.html

1.5.2 SYS/BIOS

https://www.ti.com.cn/tool/cn/SYSBIOS?keyMatch=SYS%20BIOS%20DOWNLOAD
https://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/bios/index.html
 Insert picture description here
https://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/bios/sysbios/index.html
 Insert picture description here

1.5.3 The Internet NDK

https://www.ti.com.cn/tool/zh-cn/NDKTCPIP?keyMatch=&tisearch=search-everything&usecase=software
https://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/ndk/index.html

1.5.4 UIA

https://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/uia/index.html

1.5.5 IPC

https://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/ipc/index.html

  • Communication between multiple cores , Communication between multiple chips .
  • It is no longer updated

1.5.6 Multimedia Framework Products

  • Codec Engine Management resources , frame
    https://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/ce/index.html

  • Framework components Provide abstract interface
    https://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/fc/index.html

  • XDAIS,XDM Algorithm
    https://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/xdais/index.html

1.5.7 XDCtools Real time software components

https://software-dl.ti.com/dsps/dsps_public_sw/sdo_sb/targetcontent/rtsc/index.html

1.5.8 CGT compiler

https://www.ti.com/tool/TI-CGT
https://software-dl.ti.com/codegen/non-esd/downloads/download_archive.htm

1.6 Software library

1.6.1 MATHLIB

https://www.ti.com.cn/tool/cn/MATHLIB?keyMatch=MATHLIB#downloads

1.6.2 IQMath

  • C64x+ IQMath library - Virtual floating point engine
    https://www.ti.com.cn/tool/zh-cn/SPRC542?keyMatch=&tisearch=search-everything&usecase=partmatches

1.6.3 FastRTS

  • Optimization of basic mathematical operations .

1.6.4 DSPLIB

  • Digital signal processing , Handle floating point

1.6.5 IMGLIB

  • The image processing ,DCT Transformation , Discrete cosine transform
  • Image analysis , Histogram

1.6.6 VLB

  • Higher level image processing

1.6.7 VICP

  • signal processing

1.6.8 VoLIB

  • Sound processing library

1.6.9 FaxLIB

  • Fax , codec

1.6.10 AER/AEC

  • codec

1.6.11 Codecs

  • codec
  • voice , video , Image coding

1.7 Software development package

  • MCSDK Multi core software development suite ( Older )
  • MCSDK-VIDEODEMO
  • PROCESSOR-SDK-C667X

2. Basic knowledge of

B Learn video on the Internet : Wang Jun, Beihang Electronic Information Engineering College 《DSP Architecture 》

2.1 Instruction cycle

Von Neumann structure .

  • 1 Instruction cycles = 1 One or more machine cycles .

  • 1 Machine cycles = 6 State cycles = 12 Clock cycles .6 section (s)12 pat (p).

  • 1 State cycles = 2 Clock cycles .
     Insert picture description here

  • 1 The execution of instruction cycles : Take command , decoding , perform ,( Write back to ).
    Take command : From the data register MDR, Through the bus , Read to instruction register IR in .
    decoding : Instruction decoder ID. Register the instruction IR Translate the instructions in .
    perform : Execute instruction decoder ID The instructions given .

  • stay s1p1 and s4p1 When the falling edge , , respectively, Read the instruction once ( Read one byte at a time ).
     Insert picture description here

  • Different instructions , The length of instructions is different , The execution cycle of instructions is also different . There is a single byte and a single cycle , Double byte single cycle , Single byte double cycle , Double byte double cycle , Double byte three cycles ......

2.2 Executive performance

  • When the dominant frequency is the same ,CPI The smaller it is , The better the performance .

  • clock frequency : reaction DSP implementation technique , Production process

  • CPI: reaction DSP Instruction set structure ,DSP Architecture .

  • IC: reaction DSP Instruction set structure and Compilation Technology , And programming level .
     Insert picture description here

  • Instruction cycle , It refers to the time required to execute an instruction . Usually, the ns In units of .

  • MAC Time , The time of a multiplication and a wig . Most of the DSP Multiplication and addition can be completed once in an instruction cycle .

  • FFT execution time , To measure DSP An indicator of the computing power of a chip .

  • MIPS: namely M+IPS. Execute millions of instructions per second .

  • MFLOPS, Perform millions of floating-point operations per second .

  • in application ,tdsp The time must be less than the sampling time ts,

2.2 Assembly line

  • PipeLine.
  • Using assembly line , take CPI=12(24) drop to CPI=1.

2.2.1 The type of pipeline

(1) According to the number of functions
Single function pipeline , Only one function can be completed .
Multifunctional pipeline , Can complete different functions .

(2) According to the connection mode between segments at the same time
Static pipeline , At the same time , Each segment is connected according to one function .
Dynamic pipeline , At the same time , Each segment can be connected according to different functions .

(3) According to the level of the assembly line in the photo
Intraslice , Command pipeline DLX, Computing pipeline ( In core assembly line , Inter core assembly line ).
Between films , Multi chip parallel processor .
Machine room , Multiprocessor parallel processing , Task flow .

(4) Divide according to the data representation

(5) Divide according to whether there is a feedback loop

2.2.2 Command pipeline DLX(RISC framework )

  • The architecture of the processor , Take command , decoding , perform , Visiting and depositing , Write back to .
     Insert picture description here  Insert picture description here

2.2.3 Instruction pipeline competition

(1) Structural competition , Hardware resources cannot meet the instruction pipeline execution .

 Insert picture description here

  • Von Neumann structure There is only one bus , Used for data and instruction transmission .
  • When both parts of the pipeline need to access memory , There will be competition . As shown in the figure below , When the first instruction is transmitted to the fourth link , The first link and the fourth link compete for the bus .
     Insert picture description here
  • The solution is , Move the command backward , And it must be moved to a position different from the fourth link and the first link . It's inefficient .
     Insert picture description here
     Insert picture description here
  • Another solution , Another structure , Harvard structure . Add a bus , Data and instructions are stored separately , To avoid conflict .
     Insert picture description here

(2) Data competition , When the previous instruction result cannot be output in time .

  • Rely on compilers and programming methods to solve .

  • Write before they are read Same register , There will be competition . If the method of moving back and waiting is adopted , It's going to be inefficient .
     Insert picture description here

  • Change to First learns to write Same register , There will be no competition .
     Insert picture description here

  • Read continuously perhaps Write in succession Same register , There will be no competition .
     Insert picture description here

  • Eliminate data competition , Try to avoid operating on the same register ; Moving forward does not produce competitive instructions ( Insert other instructions in the middle ).

  • Compiler choice :
    The clock of the pipeline is correct ,
    The data result of the pipeline is correct

(3) Control competition , Jump and other changes PC Value instruction pre read does not match the actual .

  • Instructions 5 Used to judge the jump direction , One is to continue executing instructions 6, One is jump back instruction 1.
  • If a jump back instruction appears 1 The situation of , Others under the assembly line 678 The part of the instruction that has been loaded needs to be cleared , Then there will be a gap in the middle , The assembly line is disconnected .
     Insert picture description here
  • Method 1 : Branch prediction , Load instructions with high execution probability .
  • Method 2 : The jump command moves forward . As shown in the figure below , Except for instructions 1 And instructions 6 Outside ,234 It must be implemented .
     Insert picture description here

2.3 bring CPI Further down

Video link 《DSP Architecture

  • It is known that , Assembly line makes CPI=12 or 24 drop to CPI=1. This section describes how to make CPI<1.
  • Traditional assembly line CPI = 1.
  • Complete multiple instructions in one clock CPI<1.

2.3.1 Super pipeline

  • CPI=1, The main frequency is increased .
  • Refine the flow , Increase the main frequency , Space for time .
  • Pentium4 Galloping 4 On the processor , application 20 Class assembly line .
  • Baidu Encyclopedia 》 Super pipelined processors are relative to benchmark processors , commonly cpu The pipeline of is basic instruction prefetching , decoding , Execute and write back result level 4 . Super pipeline (superpiplined) It refers to a certain type CPU The internal assembly line exceeds the usual 5~6 Step above , for example Pentium pro The assembly line is as long as 14 Step . The steps of pipeline design ( level ) More , The faster it completes an instruction , Therefore, it can adapt to work with higher dominant frequency CPU.

2.3.2 Very long instruction words (VLIW,very long instruction word)

  • Complete multiple instructions in one clock CPI<1
  • A very long instruction set ,8x32bit = 256bits. Multiple assembly lines are combined .
  • TI The company's C6000 Start using after series VLIW technology .
  • Program compile time , According to parallelism , Combine into multi operation super long instruction words . There will be many 32 Bit null instruction , After compiling the program , The larger , Program storage efficiency is not high .
  • VLIW Each segment corresponds to a different operation unit .D,D,M,M,L,L,S,S.
    (1) Each clock cycle starts one VLIW Instructions .
    (2) Each section controls an operation unit .
    (3) compile , When programming , Solve the competition problem , The control hardware is relatively simple .
     Insert picture description here

2.3.3 Superscalar (super scalar)

  • And VLIW The structure is similar to .
  • Execution time , According to resources , Data competition , Decide whether a unit executes .
  • On the basis of ultra long instruction set, the storage efficiency of the program is improved .

2.4 DSP Software optimization technology

  • The essence of software optimization : How to use DSP Hardware resources in .
  • We need to know what hardware resources there are .

2.4.1 understand DSP Hardware resources

sketch

  • With TI The company's TMS320C6678 For example .
  • L1 Memory (L1P Program ,L1D data ),L2 Memory .

Functional units

  • There are two register groups A and B, Each register group has 4 A functional unit MLSD.
    (1)M,multiplier Multiplier .
    (2)L,ALU Logical unit .
    (3)S,Data Data unit .
    (4)D,Control Control and jump unit .
     Insert picture description here

Swap paths

A Side and B Side exchange data .

Internal bus

Command pipeline

Three instructions , When there is no assembly line , Serial execution , need 9 Clock cycles . Adopt assembly line , It only needs 5 Clock cycles .
 Insert picture description here

Instruction delay

  • Most instructions can get results immediately .
  • Multiplication , Delay 1 pat .
  • Floating point multiplication , Delay 2 pat .
  • Data loading , You need to give an address first , The memory only returns one data , Need to delay 4 pat .
  • Jump , You need to clear the pre stored instructions , Need to delay 5 pat .
     Insert picture description here

2.4.2 DSP Software optimization

  • The purpose of software optimization : Running speed and code size .
  • Optimization programming requirements : Familiar with processor architecture , Familiar with programming language (C, assembly , Linear assembly ).
  • Learn about code generation tools ( compiler , Assembler , The linker )
  • C6X Optimize C compiler , Use ANSI C Source code , It can achieve manual optimization 80%. Need to understand various optimization levels .

Assembly optimization means

(1) Instruction parallelism .
(2) fill NOP Instructions .
(3) Loop unrolling .( eliminate sub,b Instruction overhead ).
(4) Word or double word access .

2.4.3 Refer to the optimization case

 Insert picture description here

  • Instructions for loading data , It can be executed in parallel .
  • eliminate sub+b Jump time waiting .
     Insert picture description here
     Insert picture description here
  • Loop unrolling , Eliminate the overhead of jump instructions .
     Insert picture description here
  • Use word or doubleword access
     Insert picture description here

2.4.4 Software pipelining

B Stop video 《DSP Software optimization

1. C Language implementation algorithm

2. C6x Linear assembly of

  • No delay Nop
  • No instruction parallel
  • There is no need to consider functional units and registers
     Insert picture description here

3. Design diagrams

  • In the process of getting the code , Dependency , Data path , Guide us in allocating resources .
  • Draw algorithm nodes and paths
  • Write instruction cycles consumption
  • Assign functional units to nodes
  • Divide nodes into A,B All functional units on both sides
     Insert picture description here

4. Allocation register

 Insert picture description here

5. Create a process table

  • front 7 Clock cycles , In the process of establishing a cycle .
  • The first 8 A clock cycle begins , It's a circular body .
  • The figure does not show the , The process of cycle ending processing .
     Insert picture description here

6. transformation C6x Code

  • Prolog, Cycle building

  • Loop, loop

  • Epilog, Cycle ending , The known loop body includes loading two data , Multiplication , Add . Cycle closure and cycle establish correspondence and complementarity .

  • additional load
     Insert picture description here

  • Complete schedule , Including cycle closure , Here's the picture .
     Insert picture description here

  • Loop Only, During the cycle , Realize the closing work .
    Remove all instructions except jump instructions .
    Clear the input register , Accumulator and intermediate product .
    Adjust the subtraction quantity .
     Insert picture description here

2.4.5 C Code optimization

  • Intrinsics Inline Technology .
    (1) Direct threshold C6000 Special inline functions corresponding to assembly instructions , No function call expenses .
    (2) Use C Variable name ( It's not a register ), And C Environment compatible .
    (3) Don't add C Programming workload .
    (4) Code efficiency is the same as assembly .

  • C Code :( Code efficiency is low )

y=a*b;
  • Use Intrinsics Of C Code :
y = _mpy(a,b);
  • Embedded assembly :( Easy to destroy C Environmental Science )
asm("MPY A0,A1,A2");
  • Assembly code :( There's a lot of programming )
MPY A0,A1,A2   ;a,b,y

3. other

原网站

版权声明
本文为[Kshine2017]所创,转载请带上原文链接,感谢
https://yzsam.com/2022/187/202207061224354328.html