Database storage series (1) column storage
2022-07-27 14:26:00 【ag9920】
OLTP vs OLAP
From the perspective of data storage and retrieval, database systems are usually divided into two categories:
- systems for transaction processing, Online Transaction Processing (OLTP);
- systems for data analysis, Online Analytical Processing (OLAP).
This division reflects the different expectations we have for storage efficiency and query efficiency in different scenarios.
TP scenarios are the ones you meet every day: placing an order in an e-commerce app or opening TikTok to watch videos both count as TP. Such scenarios require high availability and low latency from the database, and the queries are generally simple and direct.
AP scenarios focus on large-scale data analysis. We often need statistical indicators, such as the age, preferences, and spending of users in a certain region. The amount of computation is much larger than in TP and sometimes involves complex joins across tables, but the latency requirement is less strict, and some computation time can be tolerated.
In the traditional approach, TP and AP call for different technologies: relational databases such as Oracle and MySQL handle TP problems, while big data infrastructure such as MapReduce and Spark handles AP problems.
As data volumes grow and technology advances, we encounter more and more scenarios where TP and AP overlap: some computing power is needed, though the computation is not necessarily complex, and availability and low latency must be guaranteed at the same time. Let's set aside HTAP solutions such as OceanBase and TiDB for now and think about the problem from the data itself.
Scenario analysis
Imagine the following scenario:
Suppose you have a relational table that represents a User entity, with the attributes Name, Height, Age, Salary, and Address. Based on the three statistical fields (Height, Age, Salary), you need to provide an efficient formula calculation service, and the formula results must be available immediately after new data is inserted.
For example:
- Calculate the average age;
- Calculate the total income;
- Calculate the average height.
If the data is small, we can simply fetch all of it and compute in memory. Latency stays low, and tens of thousands of rows put no pressure on the system. From an I/O perspective, however, the 【row storage】 model of a traditional relational database is very inefficient here.
Take the total income as an example: we have to read every row from disk and then add up the values, although all we actually need is the 【Salary】 column. Suppose you are using a classic key-value store: how much data has to be read to compute this total?
The answer is the whole table! You have to load every record even though you only want one column, because the data is organized on disk by 【row】. Unless every row is loaded, you cannot get all the values of that column.
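To make that cost concrete, here is a minimal Python sketch, not any particular database's on-disk format. It assumes a hypothetical fixed-width row layout (the field widths, the sample users, and the function names are all made up for illustration) and counts how many bytes a row-oriented scan touches just to sum Salary.

```python
import struct

# Hypothetical fixed-width row: name (16 bytes), height, age, salary
# (4-byte ints), address (32 bytes). "<" disables padding, so 60 bytes per row.
ROW_FORMAT = "<16siii32s"
ROW_SIZE = struct.calcsize(ROW_FORMAT)

users = [
    (b"alice", 165, 30, 9000, b"1 Main St"),
    (b"bob", 180, 41, 12000, b"2 Oak Ave"),
    (b"carol", 172, 25, 7000, b"3 Pine Rd"),
]


def build_row_file(rows):
    """Serialize rows back to back, the way a row store lays them out."""
    return b"".join(struct.pack(ROW_FORMAT, *row) for row in rows)


def total_salary_row_store(data):
    """Sum the salary field; every full row must be read to reach it."""
    total = 0
    bytes_read = 0
    for offset in range(0, len(data), ROW_SIZE):
        record = data[offset:offset + ROW_SIZE]
        bytes_read += len(record)
        total += struct.unpack(ROW_FORMAT, record)[3]  # field 3 is salary
    return total, bytes_read


row_file = build_row_file(users)
total, bytes_read = total_salary_row_store(row_file)
print(total, bytes_read)  # 28000 180
```

Only 12 of the 180 bytes read (three 4-byte salaries) contribute to the result; the rest is name, height, age, and address data that the query never uses.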
Such an I/O cost is obviously too high. Suppose there are many statistical columns: does every single formula calculation require a full table scan? Neither the disk I/O nor the network I/O would be acceptable.
This is easy to understand; otherwise TP and AP would not have been separated in the first place. Scenarios that require computation inevitably load large amounts of data, which is intuitively at odds with low latency.
For this reason, the industry proposed the concept of 【column storage】.
Column storage
Unlike row storage, which stores the data of each row contiguously, column storage stores the data of each column contiguously.
Analytical scenarios often read a large number of rows but only a few columns. With column storage, only the columns involved in the calculation are read, which greatly reduces I/O cost and speeds up queries.
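A minimal Python sketch of this layout, under the assumption that each column is kept as its own packed array of 4-byte integers. The `to_column_files` and `column_total` names and the sample values are hypothetical and not taken from a real engine.

```python
import struct

COLUMN_NAMES = ("height", "age", "salary")


def to_column_files(rows):
    """Transpose (height, age, salary) tuples into one byte string per column."""
    count = len(rows)
    return {
        name: struct.pack(f"<{count}i", *(row[i] for row in rows))
        for i, name in enumerate(COLUMN_NAMES)
    }


def column_total(files, name):
    """Sum one column while reading only that column's bytes."""
    data = files[name]
    values = struct.unpack(f"<{len(data) // 4}i", data)
    return sum(values), len(data)


rows = [(165, 30, 9000), (180, 41, 12000), (172, 25, 7000)]
files = to_column_files(rows)
total_salary, bytes_read = column_total(files, "salary")
print(total_salary, bytes_read)  # 28000 12
```

The aggregate touches 12 bytes regardless of how many other columns the table has, which is the property the paragraph above describes.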
Data in the same column has the same type, so it compresses very well, and the same amount of memory can hold more data.
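As one illustration of why same-typed columns compress well, here is a sketch using run-length encoding. The article does not name a specific compression scheme; run-length encoding is chosen here as an assumption because it is simple, and the sample age values are made up.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, count in runs for _ in range(count)]


# A sorted Age column: neighbouring values repeat, so runs are long.
ages = [25, 25, 30, 30, 30, 41]
encoded = rle_encode(ages)
print(encoded)  # [(25, 2), (30, 3), (41, 1)]
```

In a row layout the ages would be interleaved with names, salaries, and addresses, so runs like these would not form. Real columnar engines typically combine several encodings; this sketch only shows the basic effect.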
Classic row-oriented databases include Oracle, MySQL, and SQL Server, while HBase, Cassandra, and ClickHouse are well-known representatives on the column-oriented side (strictly speaking, HBase and Cassandra are column-family, or wide-column, stores rather than pure columnar engines like ClickHouse).
When you need to sort or aggregate a column, you can load just that column's data. At the storage level, the better compression means less disk space is used, and the amount of data that has to be loaded can drop by an order of magnitude or more.
But column storage also has its costs:
- Write amplification: updating a record touches one place in row storage, but in column storage it requires updating multiple columns. Because column storage keeps different columns in non-contiguous regions of the disk, updating several columns means random disk writes. In row storage, the columns of a row sit in one contiguous region, so a single disk write is enough. The random write efficiency of column storage is therefore much lower than that of row storage.
- The high compression ratio of column storage also becomes a disadvantage for updates: the stored data has to be decompressed, updated, compressed again, and finally written back to disk.
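Both costs can be sketched in a few lines of Python. This is a toy model under stated assumptions: `RowStore`, `ColumnStore`, and `update_compressed_column` are hypothetical names, "writes" simply counts distinct locations touched rather than real disk operations, and zlib stands in for whatever codec an actual engine would use.

```python
import struct
import zlib


class RowStore:
    """All fields of a row live together, so an insert is one write."""

    def __init__(self):
        self.file = []
        self.writes = 0

    def insert(self, row):
        self.file.append(tuple(row))
        self.writes += 1  # one contiguous write for the whole row


class ColumnStore:
    """Each column has its own file, so an insert touches every file."""

    def __init__(self, column_names):
        self.files = {name: [] for name in column_names}
        self.writes = 0

    def insert(self, row):
        for name, value in zip(self.files, row):
            self.files[name].append(value)
            self.writes += 1  # one write per column file


def update_compressed_column(blob, index, new_value):
    """Decompress, patch one 4-byte value, recompress the whole column."""
    raw = bytearray(zlib.decompress(blob))
    struct.pack_into("<i", raw, index * 4, new_value)
    return zlib.compress(bytes(raw))


row_store = RowStore()
column_store = ColumnStore(["name", "height", "age", "salary", "address"])
user = ("alice", 165, 30, 9000, "1 Main St")
row_store.insert(user)
column_store.insert(user)
print(row_store.writes, column_store.writes)  # 1 5

salaries = zlib.compress(struct.pack("<3i", 9000, 12000, 7000))
salaries = update_compressed_column(salaries, 1, 13000)
print(struct.unpack("<3i", zlib.decompress(salaries)))  # (9000, 13000, 7000)
```

Changing a single salary forces the whole compressed column (or, in a real system, at least one compressed block of it) through a decompress and recompress cycle, which is the second cost listed above.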
Row storage vs column storage
Row storage performs better for insert/update/delete and point lookup queries, because the data of the affected row is stored contiguously and, in theory, there is no read or write amplification. To serve a query, the table's index quickly locates the page, the index at the end of the page quickly locates the start of the row, and the data is returned. This matches the OLTP workload very well, so OLTP scenarios mainly use row storage.
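The lookup path described above can be sketched as follows. This assumes that "the index at the end of the page" refers to a per-page slot directory, which is how many row stores organize pages. The `SlottedPage` class, the `index` dictionary, and the row encoding are all hypothetical, and the directory is kept as a Python list rather than as bytes at the end of the page.

```python
class SlottedPage:
    """Row bytes are appended to the page; a slot directory records
    where each row starts and how long it is."""

    def __init__(self):
        self.data = bytearray()
        self.slots = []  # slot number -> (offset, length)

    def insert(self, row_bytes):
        self.slots.append((len(self.data), len(row_bytes)))
        self.data += row_bytes
        return len(self.slots) - 1

    def read(self, slot):
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])


pages = [SlottedPage()]
index = {}  # hypothetical table index: user id -> (page number, slot number)


def insert_user(user_id, row_bytes):
    """Store the row in page 0 and record its location in the index."""
    slot = pages[0].insert(row_bytes)
    index[user_id] = (0, slot)


def point_lookup(user_id):
    """Index -> page -> slot -> contiguous row bytes, with no scanning."""
    page_no, slot = index[user_id]
    return pages[page_no].read(slot)


insert_user(1, b"alice,165,30,9000")
insert_user(2, b"bob,180,41,12000")
print(point_lookup(2))  # b'bob,180,41,12000'
```

Because the whole row comes back from one contiguous slice, a point lookup reads exactly the data it needs, which is why this layout suits OLTP.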
But row storage is not perfect. Suppose you need to traverse the whole table to find the rows that qualify, but only a few columns are used for grouping, sorting, or aggregation. Row storage is a poor fit here: reads pull in a large amount of irrelevant column data, which is disastrous when storage is the bottleneck of the system and also reduces the effectiveness of the memory cache. During computation, because whole rows are laid out sequentially in memory, the values of a single column are scattered, which is unfriendly to the CPU cache.
Column storage solves these problems well: only the columns of interest are read, and the computation is very friendly to the CPU cache, so data analysis scenarios with many complex queries (OLAP) mainly use column storage. Its weakness is updates: every insert/update/delete of a row has to modify columns stored in different locations, which causes I/O amplification, and the I/O is random.
Given these strengths and weaknesses, column storage is generally used for offline big data analysis and statistics, where queries mostly touch a few individual columns and the data does not need to be updated or deleted after it is written.
Summary
Neither row storage nor column storage is absolutely better; it mainly depends on the use case. The more 【columns】 your calculations need and the higher your write frequency, the higher the cost of using column storage. The advantages of column storage only show up in specific business scenarios, the typical one being statistics over massive amounts of data. Without such a scenario, those advantages disappear and can even turn into disadvantages.