Introduction to data fragmentation

2022-07-07 08:43:00 【Blue sky ⊙ white clouds】

Background

Traditionally, a data set is stored centrally on a single node. For massive data volumes, this approach struggles in three respects: performance, availability, and operation and maintenance cost.

In terms of performance, most relational databases use B+ tree indexes. Once the amount of data exceeds a certain threshold, the index becomes deeper, each lookup needs more disk I/O, and query performance declines. At the same time, highly concurrent access requests make the centralized database the biggest bottleneck of the system.
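To make the index-depth argument concrete, here is a rough back-of-the-envelope sketch. The fan-out of 1000 keys per internal page and 100 rows per leaf page are illustrative assumptions, not figures taken from any particular database engine.

```python
def bplus_tree_height(rows: int, fanout: int = 1000, rows_per_leaf: int = 100) -> int:
    """Estimate how many levels a B+ tree needs to index `rows` records.

    `fanout` and `rows_per_leaf` are assumed, illustrative values.
    """
    if rows <= 0:
        return 0
    # Number of leaf pages needed (ceiling division).
    pages = -(-rows // rows_per_leaf)
    height = 1  # the leaf level itself
    # Add one internal level at a time until a single root page remains.
    while pages > 1:
        pages = -(-pages // fanout)
        height += 1
    return height


for rows in (10_000, 10_000_000, 10_000_000_000):
    print(rows, bplus_tree_height(rows))
```

Under these assumptions, every extra level is roughly one more page read per lookup when the page is not cached, which is the effect the paragraph above describes.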

In terms of availability, stateless services can be scaled out at will at low cost, which inevitably pushes the final pressure of the system onto the database. A single data node, or a simple master-slave architecture, finds it increasingly hard to bear that load. Database availability has become the key to the whole system.

In terms of operation and maintenance cost, once the data in a database instance grows past a threshold, the pressure on the DBA increases. The time needed for backup and recovery grows with the amount of data and becomes increasingly hard to control. As a rule of thumb, keeping the data of a single database instance within 1 TB is considered a reasonable range.

Because traditional relational databases cannot meet the needs of Internet scenarios, there have been more and more attempts to store data in NoSQL systems with native distributed support. However, NoSQL's incompatibility with SQL and its immature ecosystem have kept it from dealing a decisive blow to relational databases, whose position remains unshaken.

Data fragmentation means splitting the data stored in a single database across multiple databases or tables along some dimension, in order to relieve performance bottlenecks and improve availability. For relational databases, the effective way to do this is to split databases and split tables. Both can effectively avoid the query bottleneck caused by a data volume that exceeds the tolerable threshold. In addition, splitting databases disperses the access load that would otherwise fall on a single database node. Splitting tables cannot relieve pressure on the database, but it makes it possible to turn distributed transactions into local transactions wherever possible, which matters because distributed transactions tend to complicate things once updates span databases. Splitting across multiple masters and multiple slaves effectively avoids a single point of failure for the data and so improves the availability of the data architecture.

Splitting databases and tables keeps the amount of data in each table below the threshold, and spreading the traffic copes with high access volume. Together they are an effective way to handle high concurrency and massive data. Data fragmentation is divided into vertical fragmentation and horizontal fragmentation.

Vertical fragmentation

Splitting by business is called vertical fragmentation, also known as vertical splitting. Its core idea is a dedicated database for each purpose. Before the split, one database contains multiple tables, each corresponding to a different business. After the split, the tables are grouped by business and distributed to different databases, which spreads the pressure across them. For example, the user table and the order table can be vertically split into separate databases according to business needs.
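A minimal sketch of the user/order split described above: vertical fragmentation amounts to a fixed mapping from each table to the database that owns it. The database names `user_db` and `order_db` and the table name `t_order` are hypothetical, chosen only for illustration.

```python
# Hypothetical mapping from table name to the database that owns it
# after a vertical split by business.
TABLE_TO_DATABASE = {
    "t_user": "user_db",
    "t_order": "order_db",
}


def database_for(table: str) -> str:
    """Return the database that holds `table`, or fail loudly if unmapped."""
    try:
        return TABLE_TO_DATABASE[table]
    except KeyError:
        raise ValueError(f"no database configured for table {table!r}") from None


print(database_for("t_user"))
print(database_for("t_order"))
```

Note that every row of a given table still lives in one database, which is why this kind of split alone cannot help once a single table outgrows a node.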

Vertical fragmentation often requires changes to architecture and design. Generally speaking, it cannot keep up with the rapid change of Internet business requirements, and it does not really solve the single-node bottleneck. It can alleviate the problems caused by data volume and access volume, but it cannot cure them. If the amount of data in a table still exceeds what a single node can carry after vertical splitting, horizontal fragmentation is needed as a further step.

Horizontal fragmentation

Horizontal fragmentation is also called horizontal splitting. Unlike vertical fragmentation, it does not classify data by business logic. Instead, it spreads data across multiple databases or tables according to some rule applied to one or more fields, so that each shard contains only part of the data. For example, when sharding by primary key, records with an even primary key go into database (or table) 0 and records with an odd primary key go into database (or table) 1. The two queries below would then be routed to shard 1 and shard 0 respectively.

select * from t_user where id=1

select * from t_user where id=2
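The even/odd rule can be sketched as a modulo on the primary key. The physical table names `t_user_0` and `t_user_1` are an assumed naming scheme, and the function returns a parameterized statement instead of embedding the value in the SQL text.

```python
SHARD_COUNT = 2  # even ids go to shard 0, odd ids go to shard 1


def shard_for(user_id: int) -> int:
    """Pick the shard index for a primary key."""
    return user_id % SHARD_COUNT


def route(user_id: int):
    """Rewrite the logical query on t_user into a query on the actual shard.

    The t_user_<n> naming is hypothetical.
    """
    table = f"t_user_{shard_for(user_id)}"
    return f"select * from {table} where id=?", (user_id,)


print(route(1))
print(route(2))
```

With more shards, only `SHARD_COUNT` changes, but existing rows would then need to be redistributed, which is one reason the choice of rule deserves care.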

In theory, horizontal fragmentation breaks through the limit of what a single machine can process and can be scaled out relatively freely, which makes it the standard solution for data fragmentation.

Challenges

Although data fragmentation solves the problems of performance, availability, and single-node backup and recovery, the distributed architecture introduces new problems along with its benefits.

With data this scattered after fragmentation, operating on the database becomes one of the major challenges for application developers and database administrators. They need to know which specific table in which database holds the data they want.
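As a rough illustration of what "knowing where the data is" involves, the sketch below resolves an order id to a database and table, and falls back to every data node when no sharding key is available. The layout (two databases `ds_0` and `ds_1`, each holding `t_order_0` and `t_order_1`) and the routing rule are assumptions made up for this example.

```python
from typing import List, Optional

DATABASES = 2            # hypothetical: ds_0, ds_1
TABLES_PER_DATABASE = 2  # hypothetical: t_order_0, t_order_1


def data_nodes(order_id: Optional[int] = None) -> List[str]:
    """Return the database.table nodes a query must visit."""
    if order_id is None:
        # No sharding key in the query: every node has to be queried.
        return [
            f"ds_{d}.t_order_{t}"
            for d in range(DATABASES)
            for t in range(TABLES_PER_DATABASE)
        ]
    database = order_id % DATABASES
    table = (order_id // DATABASES) % TABLES_PER_DATABASE
    return [f"ds_{database}.t_order_{table}"]


print(data_nodes(5))
print(data_nodes())
```

Without a layer that hides this logic, every query in the application has to carry it by hand.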

Another challenge is that SQL which runs correctly on a single-node database does not necessarily run correctly on a fragmented one. For example, splitting tables changes the table names, and operations such as pagination, sorting, and aggregation with grouping can be handled incorrectly.
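Pagination is a good illustration. The sketch below uses in-memory lists as stand-ins for two shards (the data and helper names are invented for the example) and compares two ways of answering a query ordered by value with offset 2 and page size 2: forwarding the same offset to every shard, versus asking each shard for its first offset + size rows and paginating after the merge.

```python
import heapq
from itertools import islice


def fetch(rows, offset, size):
    """Stand-in for `ORDER BY value LIMIT offset, size` on one shard."""
    return sorted(rows)[offset:offset + size]


def merge_page(shard_results, offset, size):
    """Merge already-sorted per-shard results, then cut out one page."""
    merged = heapq.merge(*shard_results)
    return list(islice(merged, offset, offset + size))


shards = [[1, 4, 7, 10], [2, 3, 8, 9]]  # made-up values held by two shards
offset, size = 2, 2

# Naive: send the original offset to every shard, then merge.
naive = merge_page([fetch(s, offset, size) for s in shards], 0, size)

# Correct: each shard returns its first offset + size rows,
# and the offset is applied only after merging.
correct = merge_page([fetch(s, 0, offset + size) for s in shards], offset, size)

print(naive)    # [7, 8]
print(correct)  # [3, 4]
```

The globally third and fourth smallest values are 3 and 4, so only the second approach returns the right page. The price is that each shard has to return more rows as the offset grows.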

Cross-database transactions are also a thorny problem for a distributed database cluster. Splitting tables sensibly reduces the amount of data per table while still allowing local transactions wherever possible, and making good use of different tables in the same database effectively avoids the trouble of distributed transactions. Where cross-database transactions cannot be avoided, some businesses still need transactional consistency. XA-based distributed transactions do not perform well enough under high concurrency, so the large Internet companies have not adopted them at scale. Most use flexible transactions with eventual consistency instead of strongly consistent transactions.
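One way to picture the local-versus-distributed distinction is to check how many databases a set of writes touches. The sketch below assumes a hypothetical rule that places each order in `ds_<order_id % 2>`; a transaction whose writes all resolve to one database can stay local.

```python
DATABASE_COUNT = 2  # hypothetical: ds_0 and ds_1


def database_for(order_id: int) -> str:
    """Assumed placement rule: order id modulo the number of databases."""
    return f"ds_{order_id % DATABASE_COUNT}"


def needs_distributed_transaction(order_ids) -> bool:
    """True when the writes span more than one database."""
    return len({database_for(order_id) for order_id in order_ids}) > 1


print(needs_distributed_transaction([2, 4]))  # both in ds_0 -> False
print(needs_distributed_transaction([1, 2]))  # ds_1 and ds_0 -> True
```

Choosing the sharding key so that rows updated together land in the same database is what keeps the common case in the first category.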

Goal

The goal is to make the impact of splitting databases and tables as transparent as possible, so that users can work with a horizontally fragmented database cluster as if it were a single database.

Copyright notice
This article was written by [Blue sky ⊙ white clouds]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/188/202207070547398960.html