[ELT.ZIP] OpenHarmony Paper Club - Memory Compression for Data-Intensive Applications
2022-06-27 18:51:00 【InfoQ】
- This article comes from the ELT.ZIP team. ELT <=> Elite, and .ZIP is a compression format, so ELT.ZIP means "compressing the elite".
- Members:
- Sophomore, Shanghai University of Engineering Science
- Sophomore, Hefei Normal University
- Sophomore, Tsinghua University
- Freshman, Chengdu University of Information Technology
- Freshman, Heilongjiang University
- Freshman, South China University of Technology
- We are students from 6 different places. In the OpenHarmony Growth Plan Paper Club, we study and research operating system technology together with companies such as Huawei, iSoftStone, HopeRun Software, Talkweb, and Shenkaihong…
Uncovering why Google brought EROFS to Android
Why EROFS shines among compressed file systems
How LZ4m brings new vitality to memory compression

Introduction
Key terms: data-intensive, WKdm, LZ4m
LZ4 analysis
Algorithm

- The core of LZ4 lies in its sliding window and its hash table. At each step the sliding window covers 4 bytes of the input stream, and LZ4 checks whether the string in the window has appeared before.
- To assist matching, LZ4 maintains a hash table that maps each 4-byte string to the position in the input stream where it starts.
- If the hash table already contains the string in the current window, the match is extended forward from the current scan position: the two substrings share the same prefix and are compared until the longest common substring is found. The corresponding hash table entry is then updated to the current scan position.
- The window keeps sliding, adding entries for substrings that are not yet in the hash table, until it reaches the end of the input.
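The scan described above can be sketched in a few lines of Python. This is a minimal illustration, not the LZ4 reference implementation: the function name `find_matches` is hypothetical, and a Python dict keyed by the exact 4-byte string stands in for LZ4's fixed-size hash table (which can have collisions).

```python
def find_matches(data: bytes, min_match: int = 4):
    """Return (position, earlier_position, match_length) triples.

    Simplified sketch of an LZ4-style scan: a dict keyed by the exact
    4-byte window stands in for the real hash table.
    """
    table = {}      # 4-byte window -> most recent position where it started
    matches = []
    pos = 0
    while pos + min_match <= len(data):
        key = data[pos:pos + min_match]
        candidate = table.get(key)
        table[key] = pos    # the entry now points at the current scan position
        if candidate is None:
            pos += 1        # window content not seen before: keep sliding
            continue
        # Same 4-byte prefix: extend the match as far as both substrings agree.
        length = min_match
        while (pos + length < len(data)
               and data[candidate + length] == data[pos + length]):
            length += 1
        matches.append((pos, candidate, length))
        pos += length       # continue scanning after the matched region
    return matches
```

For example, on `b"abcdabcdabcd"` the first four windows are new, and the window at position 4 matches the one at position 0 and extends to the end of the input, so the result is a single match `(4, 0, 8)`.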
Structure

- The input stream is encoded into encoding units. An encoding unit consists of two parts: a token and a body.
- Every encoding unit starts with a 1-byte token. The upper four bits of the token give the literal length of the body, and the lower four bits give the match length.
- If the literal string is 15 bytes or longer, all 4 bits of the literal length field are set to 1 (1111), and the remainder (the literal length minus 15) is stored in the body, right after the token.
- The body consists of the literal data and a match description, and the match description consists of a backward match offset and a match length.
- The offset in the body is encoded in 2 bytes, so LZ4 can look back up to 64 KB (2^16/1024) to find a match.
- A match that immediately follows another match is encoded in the same way, except that the literal length field in the token is set to 0000 and the body omits the literal part.
LZ4m analysis
General-purpose compression algorithm
Key terms: offset; token (literal length, match length); body (literal data, match offset)
Evaluation
- LZ4 and LZO1x are evaluated as representatives of general-purpose algorithms, and WKdm as a specialized algorithm. The paper collects its memory data by capturing the pages swapped out of main memory.
- The compression ratio is averaged over pages; the smaller the ratio, the smaller the compressed size for the same data. WKdm has the largest (worst) compression ratio, followed by LZ4m, LZ4, and finally LZO1x. Speeds are normalized to LZ4m. Compared with the general-purpose algorithms (LZ4 and LZO1x), LZ4m shows a comparable compression ratio, only about 3% worse.
- LZ4m outperforms these algorithms in speed by up to 2.1× for compression and 1.8× for decompression. Its compression ratio and decompression speed are much better than WKdm's, at the cost of a 21% lower compression speed. In short, LZ4m gives up a little compression ratio in exchange for a substantial improvement over LZ4 in compression and decompression speed.

- The following figure shows the cumulative distribution of per-page compression ratios. The LZ4m curve is only slightly behind those of LZO1x and LZ4, with no large difference. WKdm's curve is clearly worse and far behind the other algorithms. In addition, 6.8% of the pages cannot be compressed by WKdm at all, while for the other algorithms this share is below 1%. This suggests that WKdm's advantage in compression speed can be offset by its poor compression ratio.
- To examine the effect of 4-byte granularity on the match offset and match length, the analysis starts with the match length: the lengths of the matched substrings in the original LZ4 and in LZ4m are plotted against the cumulative match count. Zooming in on match lengths between 0 and 32, the coarser granularity reduces the total number of matches by only 2.5%. This means the 4-byte granularity scheme has little effect on the chance of finding a match, and the loss in match length is also negligible.
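The per-page metrics used in this evaluation (average compression ratio and the share of pages that do not compress at all) can be computed as in the sketch below. It is only an illustration: the function names and the 4 KB page size are assumptions, and Python's `zlib` is swapped in as a stand-in compressor because it ships with the standard library; it is not one of the algorithms evaluated in the paper.

```python
import zlib

PAGE_SIZE = 4096  # assumed page size in bytes


def page_ratios(memory: bytes, compress=zlib.compress):
    """Compression ratio (compressed size / original size) of each full page."""
    ratios = []
    for start in range(0, len(memory) - PAGE_SIZE + 1, PAGE_SIZE):
        page = memory[start:start + PAGE_SIZE]
        ratios.append(len(compress(page)) / PAGE_SIZE)
    return ratios


def summarize(ratios):
    """Return (mean ratio, share of pages whose ratio is 1.0 or worse)."""
    mean = sum(ratios) / len(ratios)
    incompressible = sum(r >= 1.0 for r in ratios) / len(ratios)
    return mean, incompressible
```

Sorting the list returned by `page_ratios` and plotting each value against its rank divided by the number of pages gives a cumulative distribution of the kind described above.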

- Compression time versus compression ratio: the compression speed of each algorithm is obtained by measuring the compression time of every page and averaging the times of pages with the same compression ratio. For pages that compress well, the time is similar across the algorithms. Compared with LZ4 and LZO1x, LZ4m shows excellent compression speed. This is because, during LZ4m's scan, the window advances by 4 bytes whenever no prefix match is found, which makes scanning up to four times faster on pages that are hard to compress.
- Decompression speed versus compression ratio: the average decompression speed per compression ratio is obtained in the same way as the average compression speed. LZ4m decompresses faster than the other algorithms over almost the whole range of compression ratios.
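The effect of advancing the scan window by 4 bytes instead of 1 can be seen by counting hash-table probes on data with few repeats. The sketch below is illustrative only: `count_probes` is a hypothetical helper, a dict stands in for the hash table, and `stride=1` and `stride=4` merely model byte-granular and 4-byte-granular scanning; it is not the LZ4 or LZ4m code.

```python
def count_probes(data: bytes, stride: int, window: int = 4) -> int:
    """Count hash-table lookups made while scanning `data`.

    The window advances by `stride` bytes when no match is found, and
    matches are extended in steps of `stride` bytes.
    """
    table = {}
    probes = 0
    pos = 0
    while pos + window <= len(data):
        key = data[pos:pos + window]
        probes += 1
        prev = table.get(key)
        table[key] = pos
        if prev is None:
            pos += stride           # no prefix match: advance the window
            continue
        length = window
        while (pos + length + stride <= len(data)
               and data[prev + length:prev + length + stride]
               == data[pos + length:pos + length + stride]):
            length += stride        # extend the match at the chosen granularity
        pos += length
    return probes


# A 4 KB page of distinct 4-byte counters: no aligned 4-byte word repeats.
page = b"".join(i.to_bytes(4, "big") for i in range(1024))
aligned_probes = count_probes(page, stride=4)
bytewise_probes = count_probes(page, stride=1)
```

On this page the 4-byte stride probes each of the 1024 aligned words exactly once, while the 1-byte stride has to probe close to four times as many positions.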

Conclusion
- LZ4 is, on the whole, the most efficient compression algorithm at present. It emphasizes compression and decompression speed; its compression ratio is not the best.
- LZ4m optimizes a popular general-purpose compression algorithm by exploiting the inherent characteristics of memory data. According to the data, LZ4m greatly improves compression and decompression speed without a substantial loss of compression ratio.
- LZ4m is optimized for small block sizes. Its maximum offset is 270 (in LZ4 it is 65535).
- The LZ4m developers planned to use the new algorithm in real-world memory compression systems, but no further code can be found after 2017.
