
Database Storage Series (1): Column Storage

2022-07-27 14:26:00 ag9920


OLTP vs OLAP

From the perspective of data storage and retrieval, database systems are usually divided into two categories:

  1. Systems for transaction processing: Online Transaction Processing (OLTP);
  2. Systems for data analysis: Online Analytical Processing (OLAP).

This division reflects our different expectations of storage efficiency and query efficiency in different scenarios.

TP scenarios are the ones you run into every day: placing an order in an e-commerce app or opening TikTok to watch videos both fall under TP. The requirements on the database in these scenarios are high availability and low latency, and the queries are generally fairly direct.

AP scenarios focus on large-scale data analysis. We often need to look at statistical indicators, such as the age, likes, or spending of users in a certain region. The amount of computation in these scenarios is far larger than in TP and sometimes involves complex foreign-key joins, but the latency requirements are not as strict, so longer computation times can be tolerated.

In the traditional approach, TP and AP call for different technologies: relational databases such as Oracle and MySQL solve TP problems, while big data infrastructure such as MapReduce and Spark solves AP problems.

As the data era develops and technology advances, we run into more and more scenarios that cross TP and AP: they need basic computing power, though not necessarily complex computation, while still requiring availability and low latency. Setting aside HTAP solutions such as OceanBase and TiDB, let's think about the problem starting from the data itself.

Scenario analysis

Imagine this scenario:

You have a relational table that semantically represents a User entity, with the attributes name (Name), height (Height), age (Age), salary (Salary), and home address (Address). Based on the three statistical fields (height, age, salary), you need to provide an efficient formula-calculation service, and the formula results must be available immediately whenever new data is inserted.

For example:

  • Calculate the average age;
  • Calculate the total income;
  • Calculate the average height.

If the data scale is small, we can simply fetch all the data back and compute in memory; latency stays low, and tens of thousands of rows pose no pressure at all. But from an I/O perspective, the row-oriented storage model of a traditional relational database is very inefficient here.
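
A minimal sketch of that in-memory approach (Python, with made-up rows): load every full record, then compute the three aggregates, even though only three of the five columns are actually needed.

```python
# Minimal sketch: the "small data" approach — load full rows, aggregate in memory.
# All rows and values below are made up for illustration.
rows = [
    {"name": "alice", "height": 165, "age": 30, "salary": 12000, "address": "..."},
    {"name": "bob",   "height": 178, "age": 25, "salary": 15000, "address": "..."},
    {"name": "carol", "height": 170, "age": 41, "salary": 20000, "address": "..."},
]

avg_age    = sum(r["age"]    for r in rows) / len(rows)   # average age
total_pay  = sum(r["salary"] for r in rows)               # total income
avg_height = sum(r["height"] for r in rows) / len(rows)   # average height

print(avg_age, total_pay, avg_height)
```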

Take calculating the total income as an example: we have to read every row record from disk and then sum them up, even though all we actually need is the salary column. Now imagine you are using a classic key-value store: how much does this one aggregation cost?

The answer is the whole table! You have to load every record: the data you want is only one column, but the data on disk is organized by row, and unless every row is loaded you cannot get all the values of that column.
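
A hedged sketch of that key-value case (a toy serialization, not any real KV engine): each value holds an entire serialized row, so even a single-column sum must fetch and decode every record.

```python
import json

# Toy key-value store: key = user id, value = the entire serialized row.
kv_store = {
    1: json.dumps({"name": "alice", "age": 30, "salary": 12000, "address": "..."}),
    2: json.dumps({"name": "bob",   "age": 25, "salary": 15000, "address": "..."}),
    3: json.dumps({"name": "carol", "age": 41, "salary": 20000, "address": "..."}),
}

# Summing one column still means fetching and decoding every full record.
total_income = sum(json.loads(v)["salary"] for v in kv_store.values())
print(total_income)   # 47000
```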

This I/O cost is clearly too high. If I have many columns of statistical indicators, does every formula calculation require a full table scan? That much disk and network I/O is unacceptable.

This is easy to understand; otherwise TP and AP would not have been separated in the first place. Scenarios that require computation inevitably involve loading large amounts of data, which is intuitively at odds with low latency.

Thus, the industry proposed the concept of column-oriented storage.

Column-oriented storage


Unlike row storage, which stores each row's data contiguously, column storage stores each column's data contiguously.

Analytical scenarios often read a large number of rows but only a few columns. With column-oriented storage, only the columns involved in the calculation need to be read, which greatly reduces I/O cost and speeds up queries.
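
A rough way to picture it, continuing the earlier Python sketch: organize the same table as one array per column, and an aggregation only touches the arrays it needs.

```python
# Column-oriented view of the same toy table: one contiguous array per column.
columns = {
    "name":   ["alice", "bob", "carol"],
    "height": [165, 178, 170],
    "age":    [30, 25, 41],
    "salary": [12000, 15000, 20000],
}

# Summing salaries reads one column; name, height, and the rest are never touched.
total_income = sum(columns["salary"])
avg_age = sum(columns["age"]) / len(columns["age"])
print(total_income, avg_age)
```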

Data within the same column is all of the same type, so compression works remarkably well, and the same amount of storage can hold more data.
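
As a rough, hedged illustration of why this matters, here is run-length encoding (one of the encodings commonly used by column stores, alongside dictionary encoding and similar schemes) applied to a sorted, low-cardinality column:

```python
from itertools import groupby

# A sorted, low-cardinality column collapses into a handful of
# (value, run_length) pairs.
ages_column = [25] * 4000 + [30] * 3000 + [35] * 3000   # 10,000 values

rle = [(value, len(list(group))) for value, group in groupby(ages_column)]
print(rle)   # [(25, 4000), (30, 3000), (35, 3000)] — 3 pairs describe 10,000 values

# In a row-oriented layout the same ages are interleaved with names, salaries,
# and addresses on disk, so runs like this never appear and this kind of
# encoding cannot be applied as effectively.
```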

Classic row-store databases include Oracle, MySQL, and SQL Server, while HBase, Cassandra, and ClickHouse are representative column-store databases.

When you need to sort or aggregate a column, you can load just that column's data. At the storage level, the better compression means less disk space is used; at the same time, the amount of data that has to be loaded also drops by more than an order of magnitude.

But column storage also comes with its costs:

  1. Write amplification: a record that is a single update in a row-store scenario becomes an update to multiple columns in a column-store scenario. Because column storage keeps different columns in non-contiguous regions of the disk, updating several columns turns into random disk writes, whereas in row storage the columns of one row sit in a contiguous region and can be updated with a single write; the random-write efficiency of column storage is therefore much lower than that of row storage (see the sketch after this list).
  2. The high compression ratio of column storage also becomes a disadvantage in update scenarios: the stored data must be decompressed, updated, recompressed, and finally written back to disk.
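
A rough sketch of the write-amplification point, using hypothetical in-memory layouts rather than any particular storage engine: updating one logical row in a columnar layout touches every column array (each of which would live in a different file or disk region), while in a row layout it is one contiguous record.

```python
# Hypothetical layouts of the same 3-row table.
row_layout = [
    ["alice", 165, 30, 12000, "addr-1"],   # one contiguous record per row
    ["bob",   178, 25, 15000, "addr-2"],
    ["carol", 170, 41, 20000, "addr-3"],
]
column_layout = {
    "name":    ["alice", "bob", "carol"],
    "height":  [165, 178, 170],
    "age":     [30, 25, 41],
    "salary":  [12000, 15000, 20000],
    "address": ["addr-1", "addr-2", "addr-3"],
}

# Update the whole record at row index 1 ("bob").
row_layout[1] = ["bob", 180, 26, 16000, "addr-2b"]            # 1 contiguous write

new_values = {"name": "bob", "height": 180, "age": 26,
              "salary": 16000, "address": "addr-2b"}
for col, value in new_values.items():                          # 5 scattered writes,
    column_layout[col][1] = value                              # one per column region
```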

Row storage vs. column storage


Row storage performs better for insert/update/delete and point-lookup queries, because the data of a row is stored contiguously and there is, in theory, no read or write amplification. To handle a point query, the table index quickly locates the page, and the row offsets recorded at the end of the page then locate the start of the row, so the data can be returned directly. This matches the OLTP workload very well, which is why OLTP scenarios mainly use row storage.
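
A loose sketch of the point-lookup case (a toy primary-key index, not a real B-tree or page layout): the index maps a key to the position of one contiguous record, so a single read returns the whole row.

```python
# Toy point lookup against a row layout.
rows = [
    ("alice", 165, 30, 12000),
    ("bob",   178, 25, 15000),
    ("carol", 170, 41, 20000),
]
pk_index = {"alice": 0, "bob": 1, "carol": 2}   # primary key -> row position

def point_lookup(name: str) -> tuple:
    return rows[pk_index[name]]                 # one contiguous record read

print(point_lookup("bob"))                      # ('bob', 178, 25, 15000)

# A column store answering the same query would have to pull position 1 out of
# every column array and stitch the record back together.
```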

But row storage is not perfect. For example, when you need to traverse the whole table to find the qualifying rows but only take a few columns for grouping, sorting, or aggregation, row storage is a poor fit: a large amount of irrelevant column data gets read, and with large data volumes that is a disaster in an era when storage is the system's bottleneck; it also hurts the efficiency of the in-memory cache. During computation, because the values of one column are interleaved with the other columns of each row in memory rather than packed together, row storage is also unfriendly to the CPU cache.

Column storage solves these problems nicely: you only read the column data you care about, and the layout is very friendly to the CPU cache during computation, which is why data-analysis scenarios with complex queries (OLAP) mainly use column storage. But column storage has obvious defects in update scenarios: every insert/update/delete of a single row has to update columns stored in different locations, which brings I/O amplification, and random I/O at that.

Given these pros and cons, column storage is generally used in offline big-data analysis and statistics scenarios, where queries mainly touch a few individual columns and the data does not need to be updated or deleted after it is written.

Summary

There is no absolute winner between row storage and column storage; it mainly depends on the developer's usage scenario. The more columns your computation needs and the higher your write frequency, the higher the cost of using column storage. The advantages of column storage only show up in the right business scenarios; without such a scenario, those advantages disappear and can even become disadvantages. A typical fit is statistics over massive amounts of data.
