4. Model Optimizer and Inference Engine
1 Introduction to DLDT
DLDT: Deep Learning Deployment Toolkit
With DLDT, a model can be converted into an IR file and deployed on Intel-supported hardware devices. In this process, the original model passes through two levels of optimization: the model optimizer and the inference engine. The overall workflow is shown in the figure below.

2 Model Optimizer
The model optimizer is a cross-platform command-line tool that converts models from various deep learning frameworks into IR files, so that the inference engine can read, load, and run inference on them.
Characteristics of the model optimizer
The model optimizer is independent of the hardware environment: all processing of the model is completed without knowledge of the final deployment device, so the resulting IR files can run on any supported device.
The generated IR files can be reused across the inference stages of AI applications. After conversion to IR, the model's accuracy may drop slightly, but its performance improves.
Under the ./OpenVINO/deployment_tools/model_optimizer/extensions/front/ path you can find the actual code for each layer of the model and customize it on this basis. Taking the LayerNorm layer as an example, part of the code is shown below:

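As a rough illustration only — the class name, matched pattern, and module paths below are simplified assumptions and may not match the actual LayerNorm extension shipped with a given OpenVINO release — a front extension typically subclasses one of the replacement base classes and declares the subgraph pattern it rewrites:

# Simplified sketch of a model optimizer front extension (illustrative only)
from mo.front.common.replacement import FrontReplacementSubgraph
from mo.graph.graph import Graph

class LayerNormFusion(FrontReplacementSubgraph):
    enabled = True

    def pattern(self):
        # Describe the subgraph to match: node ops and the edges between them
        return dict(
            nodes=[
                ('mvn', dict(op='MVN')),
                ('mul', dict(op='Mul')),
                ('add', dict(op='Add')),
            ],
            edges=[
                ('mvn', 'mul'),
                ('mul', 'add'),
            ],
        )

    def replace_sub_graph(self, graph: Graph, match: dict):
        # Rewire or replace the matched nodes with a supported structure here
        pass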
If a layer in the model is not supported, you can create a custom layer. If OpenVINO does not support your topology, you can cut and paste the model network, replacing some of its parts or subgraphs with supported structures.
Functions of the model optimizer
Converts models from various deep learning frameworks into IR files
Maps network operations to supported libraries, kernels, or layers
Performs preprocessing operations, e.g. --reverse_input_channels swaps the input channel order from RGB to BGR
Optimizes the neural network, e.g. adjusting its input batch size and input shape (an example mo.py command is given after this list)
Adjusts the data or weight format of the model, such as FP32, FP16, and INT8; different devices support different data formats, as shown below:

Edits the network model
Supports building custom layers
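For example (the model path is a placeholder), the channel order, input shape, and data type can all be set on the mo.py command line; --batch can be used instead of --input_shape when only the batch dimension needs to change:

python mo.py --input_model model.pb --reverse_input_channels --input_shape=[1,300,300,3] --data_type FP16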
Using the model optimizer to optimize the SSD-MobileNet model
Install the necessary components
Before using the model optimizer, make sure the necessary components are installed. Go to the ./model_optimizer/install_prerequisites/ directory and run a bat file: either the script for a single framework, or the script that installs the prerequisites for all supported suites.
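For example, on Windows (script names follow the standard OpenVINO layout and may vary slightly between releases), install_prerequisites.bat covers every supported framework, while install_prerequisites_tf.bat installs only the TensorFlow prerequisites:

cd deployment_tools\model_optimizer\install_prerequisites
install_prerequisites.bat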

Download the model through the model downloader
python downloader.py --name ssd_mobilenet_v2_coco -o output_dir
The contents of the downloaded model are as follows:

The pb file is the model frozen at the end of training; all variables in a frozen model have fixed values. If the model is not frozen, it needs to be frozen first (a sketch follows).
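Freezing is done on the training side with the framework itself; a minimal sketch using the TensorFlow 1.x API (the checkpoint path and output node names below are placeholders) looks roughly like this:

# Restore a TF 1.x checkpoint and bake the variable values into constants ("freeze")
import tensorflow as tf

with tf.Session() as sess:
    saver = tf.train.import_meta_graph('model.ckpt.meta')
    saver.restore(sess, 'model.ckpt')
    frozen_graph = tf.graph_util.convert_variables_to_constants(
        sess, sess.graph_def,
        ['num_detections', 'detection_boxes', 'detection_scores', 'detection_classes'])
    with tf.gfile.GFile('frozen_inference_graph.pb', 'wb') as f:
        f.write(frozen_graph.SerializeToString())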
The pipeline.config file describes the network topology and is required by the model optimizer. To find the parameters the model optimizer needs, go to the folder of the corresponding model under ./deployment_tools/open_model_zoo/models/public/.

The parameters required by the model optimizer can be found in the yml file

According to the yml file, run the model optimizer with the following command to convert the frozen pb file into IR files:
python $mo_dir$\mo.py --input_model $model_path$\frozen_inference_graph.pb --reverse_input_channels --input_shape=[1,300,300,3] --input=image_tensor --transformations_config=$model_optimizer_path$\extensions\front\tf\ssd_v2_support.json --tensorflow_object_detection_api_pipeline_config=$pipeline_path$\pipeline.config --output=detection_classes,detection_scores,detection_boxes,num_detections --model_name ssd-mobilenet
Run inference with the IR files to perform the detection task
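The full detection script from the tutorial is shown in the screenshots below. As a minimal sketch (model paths, the test image, and the post-processing are placeholders, and it avoids the deprecated IENetwork.layers attribute discussed next), inference with the Python inference engine API looks roughly like this:

import cv2
from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model='ssd-mobilenet.xml', weights='ssd-mobilenet.bin')
input_blob = next(iter(net.input_info))   # net.inputs in older releases
exec_net = ie.load_network(network=net, device_name='CPU')

# Preprocess: resize to the network input size and reorder HWC -> NCHW
image = cv2.imread('test.jpg')
n, c, h, w = net.input_info[input_blob].input_data.shape
blob = cv2.resize(image, (w, h)).transpose(2, 0, 1).reshape(n, c, h, w)

result = exec_net.infer(inputs={input_blob: blob})
# Post-process the entries in result according to the output layout of the converted model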
Note that the layers attribute is no longer supported in newer versions of OpenVINO and needs to be commented out; otherwise you will get the error 'openvino.inference_engine.ie_api.IENetwork' object has no attribute 'layers'.

After commenting out this section, the script runs normally.

The output image is shown in the figure below:

Cutting and pasting the network model
Look up the layer name corresponding to each layer id:
grep "layer id" mobilenetv2-7.xml | head -10
Run mo.py and change the input to the specified layer name:
python mo.py --input_model mobilenetv2-7.onnx --reverse_input_channels --output_dir $output_path$ --input mobilenetv20_features_conv0_fwd --model_name mobilenetv2-7-no-head
Note: the official tutorial specifies values for mean_values and scale_values, but in my own experiments the cut-and-pasted model reported a scale_values mismatch, so these values are not specified here.
3 Inference Engine
Inference engine optimization
An IR model is not optimized for a specific target device when it is generated; once the IR files are fed into the inference engine, the inference engine optimizes them for the specific hardware environment.
Because different hardware devices have different instruction sets and memory architectures, the inference engine uses a flexible plugin architecture for environment configuration. This plugin architecture makes it possible to run tasks on completely different devices with almost the same code.
Each plugin has its own specific library. Take MKL-DNN on the CPU as an example: MKL-DNN provides the kernels, layers, and functions that implement neural network optimization on all Intel CPUs. If the library does not support your layer, you can build a custom layer and register it with the inference engine (see the sketch below).
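A compiled custom-layer library can be registered with a single call, for example (the library file name is a placeholder):

from openvino.inference_engine import IECore

ie = IECore()
# Register a compiled custom-layer (extension) library with the CPU plugin
ie.add_extension('libcustom_cpu_extension.so', 'CPU')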

Before inference, the inference engine maps the network to the correct library units and sends it to the hardware plugin, which performs several levels of hardware-specific optimization.
- Network-level optimization: not every operation is simply mapped to a kernel; the relationships between operations, such as data reorganization, are also mapped. This improves network performance and minimizes data-conversion time during inference.
- Memory-level optimization: data is reorganized in memory according to the requirements of the specific device.
- Kernel-level optimization: the inference engine chooses the implementation that best suits the architecture's instruction set; for example, if the CPU supports AVX-512, that instruction set will be used.
Inference engine API
Intel provides a simple, unified API for all Intel-architecture hardware devices. Its plugin architecture supports optimizing inference performance and memory usage, and it is mainly implemented in C++.
API interfaces:
IECore class: defines an inference engine object; there is no need to specify a particular device
- read_network(): reads in the IR files
- load_network(): loads the network onto the specified device. The HETERO plugin hands layers that the primary device does not support back to other devices, e.g. HETERO:FPGA,CPU; the MULTI plugin lets each inference call run on a different device, making full use of all devices in the system and executing inference in parallel, e.g. device_name = MULTI:MYRIAD,CPU (see the sketch after this list)
InferRequest class: used for inference tasks
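A brief sketch of these calls (the model paths are placeholders); only the device string changes between single-device, HETERO, and MULTI execution:

from openvino.inference_engine import IECore

ie = IECore()
net = ie.read_network(model='model.xml', weights='model.bin')

exec_net = ie.load_network(network=net, device_name='CPU')                 # single device
exec_net = ie.load_network(network=net, device_name='HETERO:FPGA,CPU')     # unsupported layers fall back to the CPU
exec_net = ie.load_network(network=net, device_name='MULTI:MYRIAD,CPU')    # requests are spread across both devices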
4 Performance evaluation
Inference engine workflow
- Declare the inference engine object, read the neural network, and load the network into the plugin. Run inference according to the actual sizes of the network's input and output blobs.
- Note that model accuracy is not the same as performance; accuracy is only a quality metric for deep learning. In fact, a model with higher accuracy may have more parameters, which makes its performance more vulnerable.
Measures of model performance
- Throughput: the number of frames the neural network can process in one second, measured in inferences per second and expressed in FPS
- Latency: the time from submitting the data to reading back the result, measured in milliseconds (ms)
- Efficiency: measured in frames per second per watt, or frames per second per unit price, depending on whether power consumption or cost matters for the system (a benchmark example follows this list)
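OpenVINO's benchmark_app reports both throughput and latency; for example (the script location, model path, and device are placeholders):

python benchmark_app.py -m ssd-mobilenet.xml -d CPU -api async -niter 100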
Factors affecting neural network performance
The topology of the neural network and the number of model parameters
Heterogeneous devices: CPU, GPU, FPGA, AI accelerators (VPU, vision processing unit). When building an application, first decide where the neural network runs, e.g. run inference on the vision computing device, run video processing on the GPU, and run other tasks and logic on the CPU.
Model precision (data format): Intel instruction set architectures offer many packed data types; many data elements can be packed into one packed data type and a single operation then applied to all of them at once, i.e. single instruction, multiple data.
- SSE4.2: can pack 16 bytes of INT8 data and perform the same operation on all of them in one clock cycle
- AVX2: can pack 32 bytes of INT8 data
- AVX-512: can pack 64 bytes of INT8 data
Calibration is performed after the IR files are generated. The calibration process converts the data of as many layers as possible to integers without compromising accuracy. The advantage is that smaller data types occupy less memory, reduce the amount of computation, and speed up execution. If the model data format is integer, VNNI (Vector Neural Network Instructions) in Intel DL Boost can bring roughly a 3x performance improvement on convolution layers.

Batching: increasing the batch size can improve computational efficiency, but large batches also increase latency
Asynchronous execution: processing frames asynchronously can bring a large increase in throughput
Throughput mode: by monitoring the degree of parallelism, CPU resources are allocated intelligently and multiple inference requests are dispatched; the more CPU cores, the more effective this feature is (see the sketch below)
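A minimal sketch of asynchronous execution combined with CPU throughput streams (the request count, stream setting, and random input data are illustrative):

import numpy as np
from openvino.inference_engine import IECore

ie = IECore()
# Let the plugin pick a stream count that matches the available CPU cores
ie.set_config({'CPU_THROUGHPUT_STREAMS': 'CPU_THROUGHPUT_AUTO'}, 'CPU')
net = ie.read_network(model='model.xml', weights='model.bin')
exec_net = ie.load_network(network=net, device_name='CPU', num_requests=4)

input_blob = next(iter(net.input_info))
n, c, h, w = net.input_info[input_blob].input_data.shape
frames = [np.random.rand(n, c, h, w).astype(np.float32) for _ in range(4)]

# Issue several inference requests without waiting for each one to finish
for i, frame in enumerate(frames):
    exec_net.start_async(request_id=i, inputs={input_blob: frame})
for request in exec_net.requests:
    request.wait(-1)                      # block until this request completes
    outputs = request.output_blobs        # per-request output blobs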