
How to choose an appropriate partition key, routing rule, and number of partitions

2022-06-22 08:14:00 abckingaa

1 Choosing the split key

What is a split key

The split key is the sub-database / sub-table field, called a dimension in Zebra. It is the data table field used to generate the splitting rule during horizontal splitting. Zebra splits the data table horizontally across the physical sub-databases according to the value of this key.

The first principle of table splitting is to find, as far as possible, the business entity that the data in the table revolves around, and to confirm that most (or at least the core) database operations are performed around that entity's data. The field corresponding to that entity can then be used as the split key for sub-database and sub-table partitioning.

The business entity usually depends on the application scenario. The following typical scenarios have a clear business entity that can serve as the split key:

  • For user-facing Internet applications, all operations revolve around the user dimension, so the business entity is the user and the user field can be used as the split key;

  • For seller-oriented e-commerce applications, all operations revolve around the seller dimension, so the business entity is the seller and the seller field can be used as the split key;

Similarly, most other application scenarios can also find an appropriate business entity to use as the split key.

If no suitable business entity can be found, consider the following ways to choose the split key:

  • Choose the split key based on how evenly it distributes data and access, so that rows are spread relatively evenly across the physical sub-databases and sub-tables. This suits workloads dominated by analytical queries (where query concurrency can be kept at 1);

  • Combine a numeric (or string) field with a time field as the split key for sub-database and sub-table partitioning. This suits log-retrieval workloads (a sketch of such a composite key follows this list).
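
A rough sketch of such a combined key (this is not Zebra's rule syntax; the field names and the database-by-hash / table-by-month layout are assumptions for illustration):

    import java.time.LocalDateTime;

    // Hypothetical composite routing for a log table: the sub-database is picked by hashing a
    // string field, and the sub-table is picked by the month of a time field.
    public class CompositeKeyRouter {
        static String route(String traceId, LocalDateTime logTime, int dbCount) {
            int dbIndex = Math.floorMod(traceId.hashCode(), dbCount); // spread writes across databases
            int tableIndex = logTime.getMonthValue();                 // one sub-table per month
            return "log_db" + dbIndex + ".log_" + tableIndex;
        }

        public static void main(String[] args) {
            // prints something like log_db2.log_6 for a June entry
            System.out.println(route("trace-8f3a01", LocalDateTime.of(2022, 6, 22, 8, 14), 4));
        }
    }

Routing by month keeps time-range log queries confined to a few sub-tables, while the hash keeps write load spread out.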

Note: whatever split key and split strategy you choose, watch out for hot spots among the key values, and try to avoid choosing a split key whose data is hot.

Note: the split key does not have to be the database primary key; other business fields can also be used. The advantage of using the primary key is that it hashes evenly, which reduces hot-spot problems.

How to handle multiple split keys

In most scenarios, the query conditions on a table are fairly simple and a single split key is enough. Sometimes, however, the business genuinely needs multiple split keys and cannot get by with one. In that case there are generally four ways to handle it.

Terminology:

  • Primary split key = primary dimension: data can be inserted, deleted, updated, and queried on the primary dimension;

  • Auxiliary split key = auxiliary dimension: only queries can be performed on the auxiliary dimension.

Full table scan on the primary dimension

Because the SQL does not contain the primary dimension, a query on the auxiliary dimension can only scan every primary-dimension table once and then aggregate the results. Zebra's concurrency granularity is currently at the database level: with 4 databases and 32 tables, 4 threads query the 32 tables concurrently and the results are merged before being returned.
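
A minimal sketch of that scatter-gather flow (this is not Zebra's internal implementation; the placeholder strings stand in for the real per-table JDBC queries):

    import java.util.*;
    import java.util.concurrent.*;

    // One worker per physical database; each worker scans its own sub-tables and the results
    // are merged at the end.
    public class ScatterGather {
        static List<String> queryAllShards(int dbCount, int tablesPerDb, String orderId)
                throws InterruptedException, ExecutionException {
            ExecutorService pool = Executors.newFixedThreadPool(dbCount); // concurrency = database level
            List<Future<List<String>>> futures = new ArrayList<>();
            for (int db = 0; db < dbCount; db++) {
                final int dbIndex = db;
                futures.add(pool.submit(() -> {
                    List<String> rows = new ArrayList<>();
                    for (int t = 0; t < tablesPerDb; t++) {
                        // Placeholder for: SELECT * FROM order_db{dbIndex}.order_{t} WHERE order_id = ?
                        rows.add("scanned order_db" + dbIndex + ".order_" + t + " for " + orderId);
                    }
                    return rows;
                }));
            }
            List<String> merged = new ArrayList<>();
            for (Future<List<String>> f : futures) merged.addAll(f.get()); // aggregate per-database results
            pool.shutdown();
            return merged;
        }

        public static void main(String[] args) throws Exception {
            // 4 databases x 8 tables = 32 physical tables, queried by 4 concurrent workers
            System.out.println(queryAllShards(4, 8, "20220622001").size()); // 32
        }
    }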

Applicable scenario: queries on the auxiliary dimension are very rare, are operations-side queries, and have low performance requirements.

Redundant synchronization of multi-dimensional data

The primary-dimension data is synchronized to the auxiliary dimension via the binlog. Queries on the auxiliary dimension then hit the auxiliary-dimension copy of the data.
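
A sketch of the replay step on the auxiliary-dimension side, assuming an upstream CDC consumer has already decoded the binlog event into plain column values (the order_by_seller table and its columns are made-up names; the point is an idempotent write keyed by the auxiliary dimension):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    // Replays a row from the buyer-dimension order table into a seller-dimension copy.
    public class SellerDimensionSync {
        static void apply(Connection sellerShard, String orderId, long buyerId, long sellerId, String payload)
                throws SQLException {
            String sql = "INSERT INTO order_by_seller (order_id, buyer_id, seller_id, payload) "
                       + "VALUES (?, ?, ?, ?) ON DUPLICATE KEY UPDATE payload = VALUES(payload)";
            try (PreparedStatement ps = sellerShard.prepareStatement(sql)) {
                ps.setString(1, orderId);
                ps.setLong(2, buyerId);
                ps.setLong(3, sellerId);   // the redundant copy is sharded by seller_id
                ps.setString(4, payload);
                ps.executeUpdate();        // idempotent upsert, so binlog replays are safe
            }
        }
    }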

Applicable scenario: the query volume on the auxiliary dimension is considerable, so the full-table-scan approach above cannot be used directly.

Cleverly reducing two dimensions to one

Sometimes the auxiliary dimension can be derived from the primary dimension. For example, in the order table Order, OrderID and UserID are effectively one-to-one. The primary dimension of the Order table is UserID and OrderID is the auxiliary dimension, but 6 digits within the OrderID are exactly the UserID; in other words, the UserID is embedded in the OrderID.

When routing, if the SQL carries the UserID, route directly by hashing the UserID and taking the modulus; if the SQL carries the OrderID, extract the 6-digit UserID from the OrderID and hash it the same way. The result is identical.
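
A minimal sketch of this routing trick, assuming a 6-digit UserID appended as the last 6 digits of the OrderID (the exact formats here are hypothetical):

    // Both keys route to the same shard because the OrderID carries the UserID inside it.
    public class OrderRouting {
        static int shardByUserId(String userId, int shardCount) {
            return Math.floorMod(userId.hashCode(), shardCount);
        }

        static int shardByOrderId(String orderId, int shardCount) {
            String embeddedUserId = orderId.substring(orderId.length() - 6); // last 6 digits = UserID
            return Math.floorMod(embeddedUserId.hashCode(), shardCount);
        }

        public static void main(String[] args) {
            String userId = "938271";
            String orderId = "20220622" + userId;          // order id ends with the user id
            System.out.println(shardByUserId(userId, 32)
                    == shardByOrderId(orderId, 32));       // true: either key finds the same table
        }
    }

Because both keys resolve to the same shard, a query that carries only the OrderID still avoids the full fan-out.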

Applicable scenario: the value of the auxiliary dimension carries the value of the primary dimension, so the primary-dimension value can be recovered from it.

Building index tables

For the auxiliary dimension, you can create a mapping table from the auxiliary dimension to the primary dimension.

For example, table A has two dimensions: primary dimension a and auxiliary dimension b. Currently there is only one copy of the data, sharded by the primary dimension.

Now if a SQL statement like select * from A where b = ? comes in, it is bound to trigger a full scan across all primary-dimension tables.

So build a new table B_A_Index with only two fields, the values of a and b. This table may or may not be split; splitting it is recommended, and its primary dimension is b.

You can then run select a from B_A_Index where b = ? first to obtain the value of a, and then run select * from A where a = <the value just obtained> and b = ?.
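
A minimal JDBC sketch of this two-step lookup (the connection is assumed to go through the sharding layer, which routes each statement by its own split key; error handling is omitted):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    // Step 1: resolve the primary-dimension key a from the index table (routed by b).
    // Step 2: query table A by its split key a, which hits a single sub-table.
    public class IndexTableLookup {
        static int countByB(Connection conn, String b) throws SQLException {
            String a;
            try (PreparedStatement ps = conn.prepareStatement("SELECT a FROM B_A_Index WHERE b = ?")) {
                ps.setString(1, b);
                try (ResultSet rs = ps.executeQuery()) {
                    if (!rs.next()) return 0;            // no mapping for this b, nothing to fetch
                    a = rs.getString("a");
                }
            }
            int rows = 0;
            try (PreparedStatement ps = conn.prepareStatement("SELECT * FROM A WHERE a = ? AND b = ?")) {
                ps.setString(1, a);
                ps.setString(2, b);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) rows++;            // consume / map the matching rows here
                }
            }
            return rows;
        }
    }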

Applicable scenario: the primary and auxiliary dimensions correspond to each other. The advantage is that the business data itself is not duplicated; only a redundant index is needed. The disadvantage is that the business code needs minor changes.

2 Choosing the number of shards

Zebra supports two levels of horizontal splitting: sub-database and sub-table.

Deciding the number of tables

In general, it is recommended that a single physical sub-table hold no more than 10 million rows. You can usually estimate the data growth over the next 3 to 5 years; divide the estimated total volume by the total number of physical sub-databases, and then by the recommended per-table maximum of 10 million rows, to get the number of physical sub-tables to create in each physical sub-database:

  • (total number of rows over the next 3 to 5 years) / (recommended number of rows for a single table, i.e. 10 million)

The number of tables should not be too large. SQL statements that involve aggregate queries, or whose split-key values span multiple tables, fan out concurrently to more tables. For example, compare splitting into 4 tables with splitting into 2: the first case has to execute concurrently against 4 tables, the second only against 2, and the latter is clearly more efficient.

The number of tables should not be too small either. The drawback of too few tables is that once capacity runs out you have to expand, and expanding a sub-database / sub-table setup is troublesome. It is generally recommended to allocate enough capacity in one go.

It is suggested that the number of tables be a power of 2, to make any future migration easier.
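
A worked example of this sizing rule, rounding up to a power of 2 as suggested (the row estimates are made up for illustration):

    // 320 million rows expected over 3-5 years, 4 physical sub-databases, 10 million rows per table.
    public class TableCountEstimate {
        public static void main(String[] args) {
            long estimatedRows = 320_000_000L;   // rows expected over the next 3 to 5 years
            int databases = 4;                   // physical sub-databases
            long rowsPerTable = 10_000_000L;     // recommended ceiling for a single table

            long needed = (estimatedRows + (long) databases * rowsPerTable - 1)
                          / (databases * rowsPerTable);   // ceiling division -> 8 tables per database
            long tables = 1;
            while (tables < needed) tables <<= 1;         // round up to a power of 2
            System.out.println(needed + " needed, create " + tables + " tables per database");
        }
    }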

Deciding the number of sub-databases

  • Formula, based on storage capacity: (storage required over the next 3 to 5 years) / (recommended storage capacity for a single sub-database, which should stay within 300 GB)

In DBA practice, several sub-databases are generally placed on one instance. Later, once capacity runs out and a migration is needed, it is usually done database by database, so the number of sub-databases ultimately determines the capacity ceiling.

In the worst case, all sub-databases share one database machine; in the best case, each sub-database has its own machine. It is generally recommended to place 8 sub-databases on one database machine.
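
A similar worked example for the sub-database count (the storage figures are again made-up estimates):

    // 1.2 TB expected over 3-5 years, at most 300 GB per sub-database, about 8 sub-databases per machine.
    public class LibraryCountEstimate {
        public static void main(String[] args) {
            double estimatedGb = 1200.0;   // storage expected over the next 3 to 5 years, in GB
            double maxGbPerDb = 300.0;     // recommended ceiling for a single sub-database

            int databases = (int) Math.ceil(estimatedGb / maxGbPerDb);   // = 4 sub-databases
            int machines = (int) Math.ceil(databases / 8.0);             // at ~8 sub-databases per machine
            System.out.println(databases + " sub-databases on " + machines + " machine(s)");
        }
    }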

3 Choosing the table-splitting strategy

Each splitting method, with its description, advantages, disadvantages, and applicable scenarios:

Hash

Description: hash the value of the split key and take the modulus for routing. This is the most commonly used splitting method.
Advantages:
  • The data volume is hash-balanced, so each table holds roughly the same amount of data.
  • The request load is hash-balanced, so there are no access hot spots.
Disadvantages: once the existing tables need to be expanded again, data has to be moved, which is troublesome; the general recommendation is therefore to allocate enough capacity in one go.
Applicable scenarios: online services, usually hashed by something like UserID or ShopID.

Range

Description: route by ranges of the split-key value, e.g. ids 1-10000 go to the first table, 10001-20000 to the second, and so on. In this case the split key must be a numeric type.
Advantages:
  • The data volume is controllable: it can be balanced, or deliberately unbalanced.
  • Expansion is easy: when the ID range runs out, just adjust the rule and create a new table.
Disadvantages: cannot solve hot-spot problems; if one piece of data is accessed at very high QPS, all of that load lands on a single table.
Applicable scenarios: offline services.

Time

Description: route by time ranges of the split key, e.g. January goes to the first table, February to the second, and so on. In this case the split key must be a time type.
Advantages: expansion is easy; when the time range runs out, just adjust the rule and create a new table.
Disadvantages:
  • The data volume is uncontrollable: some tables may hold a very large amount of data while others hold very little.
  • Cannot solve hot-spot problems; if one piece of data is accessed at very high QPS, all of that load lands on a single table.
Applicable scenarios: offline services, e.g. tables used for offline jobs, log tables, and so on.
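
A side-by-side sketch of the three routing styles above (the table naming, range width, and month granularity are illustrative assumptions, not Zebra's rule configuration):

    import java.time.LocalDate;

    public class RoutingStrategies {
        // Hash: even spread, no hot spots, but resharding later means moving data.
        static String hashRoute(long userId, int tableCount) {
            return "order_" + Math.floorMod(Long.hashCode(userId), tableCount);
        }

        // Range: table i holds ids i*10000+1 .. (i+1)*10000; growth only needs a new rule and table.
        static String rangeRoute(long id) {
            return "order_" + ((id - 1) / 10_000);
        }

        // Time: one table per month; old months go cold, new months get a fresh table.
        static String timeRoute(LocalDate createdAt) {
            return "order_" + createdAt.getYear() + "_" + createdAt.getMonthValue();
        }

        public static void main(String[] args) {
            System.out.println(hashRoute(938271L, 32));
            System.out.println(rangeRoute(10001L));                   // order_1, i.e. the second table
            System.out.println(timeRoute(LocalDate.of(2022, 2, 15))); // order_2022_2
        }
    }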