当前位置:网站首页>Principle and performance analysis of lepton lossless compression
Principle and performance analysis of lepton lossless compression
2022-07-05 14:16:00 【Vivo Internet technology】
author :vivo Internet database team - Li Shihai
This paper mainly introduces the outline process and principle of lossless compression pictures , as well as Lepton The problems and solutions of lossless compression found in the previous research .
One 、 Start with a game
1.1 Game fault finding
Please take out your stopwatch to time , stay 15 Find out the difference between the following pictures in seconds .

time out , Did you find the difference between the two pictures ?
Two 、 The growth of the wise
In the game above , You may not find any difference between the two pictures , In fact, one of them is 3.7MB Of jpg Original drawing of format , The other one is of size 485KB Of jpg Format compressed pictures , It's just different sizes . You may be a little angry , Indignant to this is cheating , However, smart you soon have a series of questions in your brain , These question marks let you uncover the veil of the game layer by layer , No longer regret for fooling , Instead, get happiness from new knowledge .
2.1 Socrates: midwifery
- Why does the picture above become smaller ?
- Where is the lost information ?
- Why is the picture quality declining , I can't see it ?
- Can I make it smaller ?
- Can I restore it to its original size ?
- Why compress my pictures ?
Why is the picture above smaller ? Picture from 3.7MB become 485KB It's because I used the image viewing tool to save the original image as a new image , In the process of saving , There is a parameter for image quality selection , I chose the lowest quality , After saving, a smaller picture is generated . But the quality of the picture has declined , Why can't you see ? This requires understanding the principle of image compression .
2.2 Explore the story behind the appearance
Take advantage of the weakness of the human eye .
There are two kinds of cells in human retina , Cones and rods . Cone cells are used to sense color , Rod cells are used to sense brightness . And relative to color , Our perception of light and dark is more obvious .
Therefore, color information can be compressed to reduce the size of the picture .
So we will transform the color space before image compression ,JPEG Pictures are usually transformed into YCbCr Color space ,Y For brightness ,Cb Blue chroma ,Cr Red chroma , After the transformation, it is easier for us to deal with the color part . Then we cut a picture into pieces 8*8 Pixel block of , Then use the discrete cosine transform algorithm (DCT) Calculate the high frequency region and low frequency region .
Because the human eye is not sensitive to complex information in the high-frequency region , Therefore, this part can be compressed , This process is called quantification . Finally, package the new file . This process completes the image compression .
The basic flow chart is as follows :

JPEG Compression is lossy .
In the above process , After the color space conversion of the prediction module , By discarding some color density information , Increase the compression ratio . Common options are 4:2:0, After this step, we need 8 Information represented by numbers , Now it's just a matter of 2 individual , Directly abandoned 75% Of Cb Cr Information , However, this step is irreversible , This leads to the loss of image compression . In addition, in the entropy coding module , Will further use stroke length coding or Huffman Encoding further compresses the picture information , And the compression of this part is lossless , It's reversible .

(YCbCr Space conversion )
The principle of Huffman coding is as follows :
If the total number of characters to be encoded is 38 Symbol data , Count them , The symbols and corresponding frequencies obtained are shown in the following table :

First , Sort all symbols by frequency , After sorting, see the following figure :

then , Select two leaf nodes with the lowest frequency , The one with the least frequency is the left child node , The other is the right child node , The root node is the sum of the frequencies of the two leaf nodes .



(Huffman Trees )
Go through the above steps , It forms a Huffman Trees ,Huffman Coding is often used in lossless compression , Its basic idea is to use short encoding to represent characters with high frequency , Use long encoding to represent characters with low frequency , This makes the average length of the encoded string 、 The expectation of length decreases , So as to achieve the purpose of compression .
3、 ... and 、 The protagonist of the story Lepton
Not perfect .
above JPEG Although the compression reduces the size of the image and the quality is good, it is difficult for the human eye to distinguish the differences , But because of the lossy compression , The image quality cannot be restored to the original quality , And actually at this time jpg The picture still has compressed space .
Lepton You can do it in JPEG On this basis, the image is further lossless compressed .
3.1 Why choose Lepton
And lepton Similar compression tools include jpegcan,MozJPEG,PackJPG,PAQ8PX. But these tools have some defects more or less , Make it inferior to lepton More suitable for industrial production .
such as PackJPG You need to rearrange all compressed pixel values in the file in the order of global sorting . This means that decompression is single threaded , At the same time, the whole image needs to be put into memory, which leads to high delay and low throughput of image processing .
The picture below is lepton The comparison of several tools in the paper :

3.2 Lepton What optimizations have been made .
First, on the algorithm Lepton Divide the image into two parts header And the image data itself ,header Use DEFLATE Do lossless compression , The image itself uses arithmetic coding instead of Holman coding for lossless compression . because JPEG Use Huffman code , This makes it difficult to use multithreading ,Lepton Use "Huffman Switch word " Improved .
secondly Lepton A complex adaptive probability model is used , This model was developed by testing on a large number of field images . The goal of this model is to produce the most accurate prediction of the value of each coefficient , This produces smaller files ; Allow multithreading concurrent processing in Engineering , Allow distributed processing across multiple servers in chunks , Stream processing line by line effectively controls the memory , At the same time, it also ensures the safety of data reading and output .
It is Lepton In the optimization of the above key issues , So it can be used in production environment .
3.3 Lepton stay vivo Exploration in storage
Expected earnings :
At present, there are about 100PB data , Among them, picture data accounts for about 70%, And in the picture 90% All the pictures are jpeg Type picture , If on average 23% Compression ratio of , that 100PB * 70% * 90% * 23% = 14.5PB, Will achieve approximately 14.5PB Cost savings .
At the same time, because it is lossless compression , The user experience is well guaranteed . At present lepton The design of compression function is shown in the figure below :

Current challenges :
- lepton Compression and decompression require high computing performance of the server 、 High consumption .
- Expect to make full use of idle servers CPU resources , To achieve the purpose of reducing cost and increasing efficiency .
- It has the ability to dynamically expand and shrink capacity in the face of tidal phenomena .
The main problems faced at present :
At present, most of the pictures are in 4M-5M, Tested for 4M-5M The file compression delay of size is 1s Left and right , The server is required at least 16 The core 、 bearing 5QPS. At this time, the utilization rate of each core is 95% above . so Lepton Compression of requires high computing performance . The current common solution is to use FPGA Card for hardware acceleration 、 And the horizontal expansion of a large number of computing nodes .FPGA The use of will increase the cost of hardware , Reduce the costs and benefits of compression .
Solution :
In order to solve the above problems and challenges , We try to use physical servers and Kubernetes The hybrid deployment method solves the problems of the use of computing resources and dynamic expansion , The schematic diagram of the structure is as follows :
The management of physical servers and the expansion of capacity are elastically expanded through the registration and discovery of services 、 Through this cgroup/Taskset And so on cpu Use to manage . At the same time, connect and use Kubernetes Manage as a container 、 The flexibility of the container is more suitable for this kind of computing service .
3.4 performance evaluation
Whether synchronous compression , Or asynchronous compression , Usually, we pay more attention to the delay of image reading . Reading a large number of pictures will bring great pressure to the server , The pressure mainly comes from the image decompression calculation . To improve decompression efficiency , And make full use of the company's resources , We will lepton Compressed services are distributed in cpu Idle server , According to the degree of resource idleness , free time , Make full use of the peak and valley of resources to improve computing performance .
Piezometric data :
We selected image files of different sizes , Compression and decompression tests are carried out in a stand-alone environment , The test results are as follows :

The compression ratio is kept at 22% about .

The above figure shows the time scale of file compression and decompression of different sizes , Orange is the decompression time , Blue is compression time .

The above picture shows pictures of different sizes , stay 32 Threads concurrent , Each thread processes 100 Test data of files .
Four 、 Common problems of image compression
4.1 Distinguish lossy and lossless compression by file format

4.2 Common lossless compression algorithms

5、 ... and 、 summary
Lepton Lossless compression can provide a relatively high compression ratio , At the same time, it does not affect the user's image quality and use experience 、 In the scenario of large amount of data, you will get obvious benefits .
deficiencies It requires high computing performance 、 Only support jpeg Type of picture . For performance requirements, the industry also has relatively mature solutions , As mentioned above FPGA And elastic calculation scheme . The key is to choose a reasonable plan according to the needs of the enterprise .
quote :
- 《The Design, Implementation, and Deployment of a System to Transparently Compress Hundreds of Petabytes of Image Files For a File-Storage Service》
- 《 Based on deep learning JPEG Research on image cloud storage 》
- 《JPEG-Lepton Key modules of compression technology VLSI Research on structural design 》
边栏推荐
- [buuctf.reverse] 152-154
- Laravel - view (new and output views)
- Fault analysis | analysis of an example of MySQL running out of host memory
- Scenario based technology architecture process based on tidb - Theory
- Comparison of several distributed databases
- What category does the Internet of things application technology major belong to
- 2022 driller (drilling) examination question bank and simulation examination
- Why do I support bat to dismantle "AI research institute"
- After the microservice project is deployed, static resources and files uploaded to upload cannot be accessed. Solution
- Time to calculate cron expression based on cronsequencegenerator
猜你喜欢

Make the seckill Carnival more leisurely: the database behind the promotion (Part 2)

Why do mechanical engineers I know complain about low wages?

让秒杀狂欢更从容:大促背后的数据库(下篇)

What is the future development trend of neural network Internet of things

Sqllab 1-6 exercise

Financial one account Hong Kong listed: market value of 6.3 billion HK $Ye wangchun said to be Keeping true and true, long - term work

魅族新任董事长沈子瑜:创始人黄章先生将作为魅族科技产品战略顾问

瑞能实业IPO被终止:年营收4.47亿 曾拟募资3.76亿

神经网络物联网未来发展趋势怎么样

非技术部门,如何参与 DevOps?
随机推荐
VC开发非MFC程序内存泄漏跟踪代码
金融壹賬通香港上市:市值63億港元 葉望春稱守正篤實,久久為功
LeetCode_ 2 (add two numbers)
最简单不用证书也可以多开功能的方式
The forked VM terminated without saying properly goodbye
openGauss数据库源码解析系列文章—— 密态等值查询技术详解(下)
ASP. Net large takeout ordering system source code (PC version + mobile version + merchant version)
LeetCode_ 69 (square root of x)
动态规划
TDengine 社区问题双周精选 | 第三期
R Language ggplot2 Visualization: visualize linegraph, using Legend in Theme function. Paramètre de position emplacement de la légende personnalisée
软件测试人在深圳有哪些值得去的互联网公司【软件测试人员专供版】
R language ggplot2 visualization: gganimate package is based on Transition_ The time function creates dynamic scatter animation (GIF) and uses shadow_ Mark function adds static scatter diagram as anim
Laravel - view (new and output views)
SSH免密码登录详解
How to make a second clip of our media video without infringement
Assembly language
鸿蒙第四次培训
The IPO of Ruineng industry was terminated: the annual revenue was 447million and it was planned to raise 376million
Leetcode array question brushing notes