Most people analyze a problem by working through its symptoms, its causes, and its solutions, and I want to analyze horizontal database splitting in the same way.
Start with the symptoms. As data accumulates in a table over time and the record count reaches tens of millions or even billions, access to that table slows down significantly. Applications built on top of it become slow as well, response times rise sharply, and the user experience suffers. A telltale sign of an oversized table is that operations touching that table are very slow while operations unrelated to it remain fast.
Looking at these symptoms, one obvious cause is that some tables hold too many records, which lowers database access efficiency.
Since the tables hold too many records, the solution is to reduce the number of records per table so that access efficiency is not affected. The data will keep growing, so we also need to plan for expansion: it should be possible to split the data horizontally again and again without modifying the upper-layer application. With a proper design, horizontal splitting can in theory scale without limit.
So we start by dividing the table with too many records into multiple tables, and this is where the questions begin.
1. Once we split a table with many records, what do we do with the tables associated with it? This comes up constantly in real development, because database tables are usually related to other tables. Some people propose never joining tables, at least at the SQL level, which sidesteps the problem of related data. But if the SQL layer is completely decoupled and the association is done in the application layer, the number of database accesses goes up and so does the amount of data sent over the network. For example, if tables A and B are related and joined at the SQL level, one SQL statement is enough. If they are queried independently, you need two statements, one for A and one for B. Because neither is filtered by the join condition, they return far more data than the join would, and the application then has to associate the rows itself. So I personally feel that systems with high performance requirements still need SQL-level joins. One principle must be observed, though: do not join several tables that each need to be split, because their splitting criteria will be inconsistent and the split becomes impossible. If one table in a join needs to be split while the others are relatively static and do not, the solution is to split that table across multiple databases and synchronize the static tables to every split database.
Taking this one step further, analyze the system's table structure. Tables generally fall into dynamic tables (large tables whose data changes a lot) and static tables (tables whose data changes little; these are usually basic reference tables and are not very large). Put the static tables in a common database and split the dynamic tables into sub-databases according to the splitting criterion. After the split, the basic data is maintained in the common database and synchronized to the sub-databases, while the dynamic tables are maintained in the sub-databases. Queries can then join the dynamic table with the static table inside a sub-database, which solves the problem.
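The arrangement above can be sketched with in-memory SQLite databases. This is only an illustration under assumptions: the `city` (static) and `orders` (dynamic) tables, their columns, and the full-copy synchronization are all hypothetical, and a real system would use its own database and replication mechanism.

```python
import sqlite3

# Common database: the static reference table is maintained here.
public = sqlite3.connect(":memory:")
public.execute("CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT)")
public.executemany("INSERT INTO city VALUES (?, ?)",
                   [(1, "Beijing"), (2, "Shanghai")])

# Two sub-databases: each holds a copy of `city` plus its slice of `orders`.
shards = [sqlite3.connect(":memory:") for _ in range(2)]
for shard in shards:
    shard.execute("CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT)")
    shard.execute("CREATE TABLE orders "
                  "(id INTEGER PRIMARY KEY, city_id INTEGER, amount INTEGER)")


def sync_static_tables():
    """Copy the static table from the common database into every shard."""
    rows = public.execute("SELECT id, name FROM city").fetchall()
    for shard in shards:
        shard.execute("DELETE FROM city")
        shard.executemany("INSERT INTO city VALUES (?, ?)", rows)
        shard.commit()


def orders_with_city(shard):
    """Join the dynamic table with the static table inside one shard."""
    return shard.execute(
        "SELECT o.id, c.name, o.amount "
        "FROM orders o JOIN city c ON c.id = o.city_id ORDER BY o.id"
    ).fetchall()


sync_static_tables()
shards[0].execute("INSERT INTO orders VALUES (1, 1, 30)")
shards[1].execute("INSERT INTO orders VALUES (2, 2, 50)")
```

Because every shard carries its own copy of `city`, the join runs as a single SQL statement inside the shard and never has to cross databases.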
2. What is the criterion for splitting, that is, what do we split by? It can be a data range, for example records 1 to 1,000,000 in one table and 1,000,000 to 2,000,000 in another. It can be time, for example one table per year of data. It can be geography, for example by city, with one database per city or per group of cities. I think it has to be decided case by case. In general, tables with a strong natural split key can be split on that key, while tables without one can only be split by the crudest method, such as a data range. Sometimes, to improve the quality of the split, you can use a compound scheme: split by one split key first and then partition by another.
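A minimal sketch of these criteria as routing functions may make them concrete. The table and database names (`orders_0`, `db_north`, and so on), the one-million-row range size, and the city mapping are assumptions made up for the example.

```python
from datetime import date

ROWS_PER_TABLE = 1_000_000  # assumed range size


def table_by_range(record_id: int) -> str:
    """Range split: ids 1..1,000,000 go to orders_0, the next million to orders_1."""
    return f"orders_{(record_id - 1) // ROWS_PER_TABLE}"


def table_by_year(created: date) -> str:
    """Time split: one table per year of data."""
    return f"orders_{created.year}"


# Geographic split: one database per city or per group of cities (hypothetical).
CITY_TO_DB = {"Beijing": "db_north", "Tianjin": "db_north", "Shanghai": "db_east"}


def db_by_city(city: str) -> str:
    return CITY_TO_DB[city]


def compound_target(city: str, created: date) -> tuple[str, str]:
    """Compound split: pick the database by city, then the table by year."""
    return db_by_city(city), table_by_year(created)
```

For instance, `compound_target("Shanghai", date(2013, 7, 3))` resolves to the `orders_2013` table in `db_east`.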
3. After a table is split horizontally, every SQL statement that accesses it must carry a split key so that the target table can be determined. A query that spans several split tables has to access the database several times and merge the results at the application level. There are usually at least two ways (or split keys) to locate the target table. Take a large online game with so many players (tens of millions or even billions) that player information is split across databases and tables. The common practice at login is to let users choose their zone and determine the target table from that choice. But suppose users do not know which zone they belong to and only know their user number. The system then has to route to the target table automatically from the user number. In that case I think user numbers need to follow a rule. For example, at registration, generate different user numbers for different zones: zone 1 uses aaa plus an 8-digit sequence number, and zone 2 uses bbb plus an 8-digit sequence number. Prefixes such as aaa and bbb can themselves be managed in a database table, for example with a rule that user numbers beginning with aaa or abc belong to zone 1. When the user enters a user number, the system takes its first three characters, looks up the corresponding zone in that table, and so obtains the user's target table.
Once the user information has been retrieved, it necessarily contains the split key (in this example, the zone). Cache this user information, and later requests no longer need the prefix lookup: the target table can be determined directly from the zone stored in the cached record. Comparing the two ways, the prefix lookup is more complicated and costs more, and it is used only when the cached zone is unavailable, so it runs relatively rarely. The cached zone is simple and fast, is the default, and is used most of the time. But it sometimes cannot be used because the information is incomplete, so the prefix lookup must exist as a fallback.
4. When the database is split horizontally into multiple databases, one issue inevitably arises: database connections. Application servers usually manage database connections through a connection pool rather than opening connections on demand, which greatly improves efficiency. But when one database becomes many, the number of connection pools grows with it. Within a certain range this is not a problem, but an application server's capacity has an upper limit, and once the database is split into N databases the server may no longer cope. At that point the application tier has to be scaled out as well, for example by having each application server connect to only a few of the databases. This is a deeper topic; here I only raise the problem and will not go further.
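As a rough illustration of the arithmetic, assume one pool per database on each application server and a fixed pool size. The round-robin assignment below is only one hypothetical way to give each server a few databases, and it presumes that requests are routed to a server that holds the right database.

```python
def connections_per_app_server(databases_served: int, pool_size: int) -> int:
    """One pool per database, so connections grow linearly with databases served."""
    return databases_served * pool_size


def assign_databases(databases: list[str], app_servers: list[str]) -> dict[str, list[str]]:
    """Round-robin assignment so that each server connects to only a few databases."""
    assignment = {server: [] for server in app_servers}
    for i, db in enumerate(databases):
        assignment[app_servers[i % len(app_servers)]].append(db)
    return assignment


databases = [f"db_{i}" for i in range(16)]
servers = ["app_a", "app_b", "app_c", "app_d"]
assignment = assign_databases(databases, servers)

all_dbs = connections_per_app_server(len(databases), pool_size=20)
assigned = connections_per_app_server(len(assignment["app_a"]), pool_size=20)
```

With these assumed numbers, a server that connects to all 16 databases holds 320 connections, while a server responsible for 4 of them holds 80.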
5. Now for application design and development. First, the module that routes to the target table must be independent. If it is not, any later change of routing policy will be extremely painful. Routing is a textbook application of the strategy pattern, and it is best implemented that way so that the application can switch routing policies seamlessly in the future. Second, for development it is better to use a framework such as iBATIS, which provides a degree of encapsulation and lets SQL be configured separately. Developers can then write SQL flexibly to implement logic, the SQL statements can be managed in one place, and the DBA can conveniently tune them without touching application code. Some pioneers have also been building projects that encapsulate the effects of database splitting in a proxy, for example the Amoeba project. Such projects will make the impact of database splitting on application development smaller and smaller.
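A minimal sketch of an independent routing module built on the strategy pattern might look like this. The class names are hypothetical, and `HashRouting` is an extra strategy added only to show a switch; it is not a policy discussed in this article.

```python
import zlib
from abc import ABC, abstractmethod


class RoutingStrategy(ABC):
    """Common interface: map a split key to a target database name."""

    @abstractmethod
    def route(self, split_key: str) -> str: ...


class PrefixRouting(RoutingStrategy):
    """Route by the first three characters of the key, as in the zone example."""

    def __init__(self, prefix_to_db: dict[str, str]):
        self.prefix_to_db = prefix_to_db

    def route(self, split_key: str) -> str:
        return self.prefix_to_db[split_key[:3]]


class HashRouting(RoutingStrategy):
    """Illustrative alternative: route by a stable hash of the key."""

    def __init__(self, databases: list[str]):
        self.databases = databases

    def route(self, split_key: str) -> str:
        return self.databases[zlib.crc32(split_key.encode()) % len(self.databases)]


class Router:
    """The only place in the application that knows the routing policy."""

    def __init__(self, strategy: RoutingStrategy):
        self.strategy = strategy

    def set_strategy(self, strategy: RoutingStrategy) -> None:
        self.strategy = strategy

    def target_db(self, split_key: str) -> str:
        return self.strategy.route(split_key)


router = Router(PrefixRouting({"aaa": "user_db_1", "bbb": "user_db_2"}))
```

Callers only ever use `router.target_db(...)`, so replacing the policy with `router.set_strategy(...)` needs no change in application code. Moving the existing data to match a new policy is a separate task that this sketch does not cover.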
The last point is transactions. If you can accept the performance of distributed transactions, that is certainly the best option. If not, the usual practice is business compensation: within one business operation, if transaction A has committed but transaction B fails and rolls back, then to keep the operation consistent and the data correct you perform the reverse of A's operation to cancel the effect of A's commit on the data. Compensation does increase the development workload, however. Alternatively, for business operations that are not very important, you can raise transaction B's success rate (for example by querying or pre-executing first) until its failure rate is negligible, and then ignore the transaction problem.
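Business compensation can be sketched as below. The two dictionaries stand in for data living in two different databases, and the purchase scenario with its function names is a made-up example, not something from a specific system.

```python
class OutOfStock(Exception):
    pass


accounts = {"alice": 100}  # stands in for data in database A
stock = {"sword": 0}       # stands in for data in database B


def debit(user: str, amount: int) -> None:
    """Transaction A: take the payment."""
    if accounts[user] < amount:
        raise ValueError("insufficient balance")
    accounts[user] -= amount


def credit(user: str, amount: int) -> None:
    """Compensation for A: the reverse operation of debit."""
    accounts[user] += amount


def reserve(item: str) -> None:
    """Transaction B: fails (and changes nothing) when the item is out of stock."""
    if stock[item] <= 0:
        raise OutOfStock(item)
    stock[item] -= 1


def purchase(user: str, item: str, price: int) -> bool:
    debit(user, price)       # A has committed at this point
    try:
        reserve(item)        # B may fail and roll back
    except OutOfStock:
        credit(user, price)  # compensate A so the data stays consistent
        return False
    return True
```

The extra `credit` operation is exactly the additional development work mentioned above: every operation that may need to be undone requires a hand-written reverse operation.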
6. One more very important point: in many cases the work is a retrofit of an existing system, so the existing data has to be cut over. This matters a great deal. However good the splitting scheme is, if the existing data cannot be migrated seamlessly into the new split databases, it is all for nothing. There are also business-level issues, such as changes to business processes and to users' operating habits caused by the split, which should be considered and resolved in advance.
All in all, horizontal splitting of database tables is very complicated, and every aspect needs to be considered and refined. To quote the netizen cauherk: "System splitting is a very complex technical activity. Think about it comprehensively, not just at the database level. Business usage, the splitting principle, data cutover, intrusiveness to development, ease of operation, later management, and so on are all factors that need to be considered."
Reposted from:
http://www.cnblogs.com/glorysword/archive/2013/07/03/3168920.html