Introduction: PolarDB-X is a cloud-native distributed database that separates compute from storage. In the AUTO mode of PolarDB-X 2.0, the database automatically hash-partitions tables and distributes the data evenly across all data nodes. Ideally, data and traffic are balanced across partitions, so the distributed processing capacity of multiple nodes is fully used. To achieve this, the database must avoid hot partitions as far as possible, both traffic hot spots and data-volume hot spots. Avoiding hot spots starts with finding hot partitions quickly and easily so that they can be handled in a targeted way. Fast, accurate detection of hot partitions has therefore become an essential capability for a distributed database.

Background

PolarDB-X is a distributed database that separates compute from storage, and distributed processing is one of its core features. The compute nodes of a single database instance share all SQL traffic, so different peak-traffic scenarios can be handled quickly by scaling nodes out and in.

In the PolarDB-X 1.0 era, users typically sharded databases and tables themselves to balance data and traffic across nodes. In that mode the choice of shard key is critical to database performance, and choosing the best combination of shard keys requires users to know the table structure and data distribution of the business database very well at the moment the tables are created.

To lower the technical barrier to using a distributed database, PolarDB-X 2.0 introduced the concept of transparent distribution. Users no longer need to specify shard keys table by table; using the distributed database is as simple as using a standalone MySQL, while still getting the benefits of a distributed database. This is an upgrade in user experience and also a leap in architecture and philosophy, from middleware mode to a cloud-native architecture. The database is no longer an advanced component that users have to maintain carefully, but an on-demand cloud service that lets users fully enjoy the technical dividends of cloud architecture.

In the AUTO mode of PolarDB-X 2.0, the database automatically hash-partitions tables and distributes the data evenly across all data nodes. In the best case, data and traffic are balanced across partitions, so the distributed processing capacity of multiple nodes is fully used. To achieve this, the database must avoid hot partitions as far as possible, both traffic hot spots and data-volume hot spots. Avoiding hot spots starts with being able to find hot partitions quickly and easily so that they can be handled in a targeted way. Fast, accurate detection of hot partitions has therefore become an essential capability for PolarDB-X 2.0.

Feature demonstration

Feature overview

We start with a small data set. In the figure below, the vertical axis shows the hierarchy of logical databases, logical tables, and logical partitions, with partitions sorted by logical sequence number, and the horizontal axis shows time. The bar charts at the bottom and on the right of the image show summary data. The bottom chart is the vertical sum, that is, the total number of accesses to all partitions at a given time. The right-hand chart is the horizontal sum, that is, the total number of accesses to one partition over the whole time range.
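The two summary charts described above are simply column sums and row sums of the heat matrix. The following is a minimal sketch of that arithmetic, not the PolarDB-X implementation; the function name `summarize`, the partition names, and the sample numbers are all made up for illustration.

```python
from typing import Dict, List, Tuple


def summarize(heat: Dict[str, List[int]]) -> Tuple[List[int], Dict[str, int]]:
    """heat maps a partition name to its access counts, one per time slot."""
    rows = list(heat.values())
    # Bottom bar chart: total accesses to all partitions in each time slot.
    per_time = [sum(column) for column in zip(*rows)]
    # Right-hand bar chart: total accesses to each partition over all time slots.
    per_partition = {name: sum(counts) for name, counts in heat.items()}
    return per_time, per_partition


# Hypothetical sample data: two partitions, three time slots.
heat = {
    "db1.orders.p0": [5, 0, 2],
    "db1.orders.p1": [1, 9, 3],
}
per_time, per_partition = summarize(heat)
print(per_time)        # [6, 9, 5]
print(per_partition)   # {'db1.orders.p0': 7, 'db1.orders.p1': 13}
```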

Storage node perspective

To view hot spots from the perspective of storage nodes, click the “DN View” button at the top. The data is then grouped by storage node, which makes it easy to see whether data is balanced across physical storage nodes and whether any physical storage node is a hot spot.

TPC-C hot spot analysis

Running a TPC-C workload produces a complete heat distribution. The figure clearly shows two hot areas in the TPC-C traffic, and data-volume hot spots can also be found by comparing widths along the vertical axis.

Design considerations

1. The presentation should be as simple and understandable as possible

Hot spot data is multi-dimensional and tightly coupled: data volume, traffic volume, time, the relationships between partitions, the relationships between logical databases and tables and their partitions, the relationships between physical nodes and logical tables, and the difference between hot and cold areas. All of these elements are needed for analysis and none can be left out. The design has to untangle this complex information and give users a clear, concise presentation.

2. Avoid affecting the core functions of the database

Finding hot areas accurately requires collecting the data volume and request volume of the database. Because traffic and data volume change constantly, collection also has to be continuous, so the collection process must not have a negative impact on the core functions of the database.

3. Minimize the implementation's dependence on external components

As PolarDB-X has evolved, it has come to be deployed in several forms: a public cloud edition on Alibaba Cloud, a K8s edition for offline PoC scenarios, a lightweight DBStack edition for users' private environments, an open source edition contributed to the community, and so on. To give as many editions as possible the same capability, the feature should use as few external components as possible, which minimizes the compatibility problems that come with multiple deployment forms.

4. Control the amount of data collected

Because traffic data is collected continuously, the statistics would in theory grow without bound. The size of the statistics must therefore be limited and the data must age out after a fixed period, since unbounded data cannot be stored. Keeping the data as small as possible also reduces IO and network pressure and lowers the impact on the core functions of the kernel.

Design

Interaction design

After comparing various chart types and other solutions in the industry, we chose a heat map to display partition heat. The horizontal axis represents time, the vertical axis represents partitions, and the brightness of each rectangle indicates how heavily the partition was accessed. At a glance, the brightest rectangles are the hottest areas.

The heat map expresses traffic hot spots well, but how should data-volume hot spots be shown? We make inventive use of the vertical axis: each partition on the vertical axis has the same height, but its width can differ. The more data a partition holds, the wider it is, so comparing widths reveals the partition with the most data at a glance.

On top of these two basic elements, interactions such as zooming, dragging, color adjustment, and hover are added, so that the information about hot partitions is expressed clearly and completely.

Data processing

Processing the time axis

Given the display characteristics of a heat map, data is collected once per minute and the statistics are retained for at most 7 days. That gives at most 7 × 24 × 60 = 10080 time points, so storing the data over time requires 10080 rows. However, a browser page is usually about 1000px wide. If users want to see the full 7 days of data, each pixel of width would have to hold about 10 time points, which seriously degrades the display. The time-axis data therefore has to be processed to reduce the density of time points, but without losing data.

The obvious way to reduce the number of time points is to lower the sampling precision, for example by collecting once every 30 minutes. But if users only look at 1 hour of data, the page would then show just 2 time points, which is clearly unacceptable as well. Lowering the sampling frequency therefore creates a conflict between the precision needed over short time ranges and the display quality needed over long time ranges.

We therefore chose to tier the time axis: data from further back is kept at lower precision and recent data is kept at high precision. This also matches how most users work, since the most recent data is where the most detail is needed. The collection precision becomes one point per minute for the latest 1 hour, one point per 2 minutes for hours 1 to 8, one point per 6 minutes for hours 8 to 24, and one point per 30 minutes from 24 hours to 7 days. This reduces the maximum number of time points from 10080 to 60 + 210 + 160 + 288 = 718.
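The point counts above can be checked with a few lines of arithmetic. This is only a worked calculation of the tiers described in the text; the names `TIERS` and `points_per_tier` are illustrative.

```python
# Each tier is (span in minutes, sampling interval in minutes), newest first.
TIERS = [
    (1 * 60, 1),         # latest hour, one point per minute
    (7 * 60, 2),         # hours 1 to 8, one point per 2 minutes
    (16 * 60, 6),        # hours 8 to 24, one point per 6 minutes
    (6 * 24 * 60, 30),   # 24 hours to 7 days, one point per 30 minutes
]


def points_per_tier(tiers):
    """Number of time points each tier holds."""
    return [span // interval for span, interval in tiers]


flat_points = 7 * 24 * 60          # one point per minute for 7 days
tiered_points = points_per_tier(TIERS)

print(flat_points)                 # 10080
print(tiered_points)               # [60, 210, 160, 288]
print(sum(tiered_points))          # 718
```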

The data structure used is the multi-layer ring queue shown in the figure below. Each layer inserts new data at the tail of its queue. A specified number of entries are taken from the head, merged, and inserted at the tail of the next layer, and the merged entries are then removed from the head. Each ring has a fixed size: when a ring is full its data is merged downward, and when the last ring is full the data is simply discarded.
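The multi-layer ring queue described above can be sketched roughly as follows. This is a simplified illustration, not the PolarDB-X kernel code: the class names `Layer` and `TieredRing` are hypothetical, each entry is reduced to a single access count, and merging by summation is an assumption (the text does not say how entries are combined).

```python
from collections import deque
from typing import Deque, List


class Layer:
    """One ring: a bounded queue plus the number of entries merged on spill."""

    def __init__(self, capacity: int, merge_count: int):
        self.capacity = capacity
        self.merge_count = merge_count
        self.items: Deque[int] = deque()


class TieredRing:
    def __init__(self, layers: List[Layer]):
        self.layers = layers

    def append(self, value: int) -> None:
        """Insert a new sample at the tail of the first (finest) layer."""
        self._push(0, value)

    def _push(self, level: int, value: int) -> None:
        layer = self.layers[level]
        if len(layer.items) >= layer.capacity:
            if level + 1 < len(self.layers):
                # Ring is full: merge the oldest entries from the head
                # and insert the result at the tail of the next layer.
                n = min(layer.merge_count, len(layer.items))
                merged = sum(layer.items.popleft() for _ in range(n))
                self._push(level + 1, merged)
            else:
                # Last ring is full: discard the oldest entry.
                layer.items.popleft()
        layer.items.append(value)

    def snapshot(self) -> List[List[int]]:
        return [list(layer.items) for layer in self.layers]


# Tiny demo with made-up sizes so the behaviour is easy to follow.
ring = TieredRing([Layer(capacity=4, merge_count=2), Layer(capacity=2, merge_count=2)])
for sample in range(1, 10):
    ring.append(sample)
print(ring.snapshot())  # [[7, 8, 9], [7, 11]]
```

With the tiers described earlier, one plausible configuration would be capacities of 60, 210, 160, and 288 with merge counts of 2, 3, and 5 between successive layers (1-minute points into 2-minute points, then 6-minute, then 30-minute); those merge counts are inferred from the intervals, not stated in the text.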

Processing the partition axis

To avoid depending on external components, collection uses the kernel's own scheduler. The primary compute node starts a collection task every minute, the task is pushed down to each storage node to fetch the raw data, and the data is finally processed on the primary compute node. The performance cost of collection is therefore closely tied to the number of partitions. With few partitions there is almost no cost, but when the number of partitions is very large, every storage node returns a large amount of data to the primary compute node, which has to parse and sort it, putting heavy pressure on memory and CPU.

The number of partitions collected must therefore be kept within a limit, so that hot spot diagnosis stays available without affecting the performance of the database kernel. Based on the visual effect and the data size observed in practice, keeping the number of displayed partitions within 1600 gives the best result. With the default of 16 partitions per table, this supports hot spot analysis for 100 tables, which covers most application scenarios.

Databases with far more partitioned tables do exist in practice. For that reason, when the number of partitions is above 1600 and below 8000, partition statistics can be merged, lowering the partition precision to support hot spot analysis with a large number of partitions. In theory this already supports hot spot analysis for 1000 tables.
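One way to picture the merging described above is to fold adjacent partitions into buckets until no more than 1600 rows remain. The sketch below is only an illustration under assumptions: the text does not specify which partitions are merged or how, so grouping neighbours by sequence order and summing their counts are both guesses, and `merge_partitions` is a hypothetical name.

```python
import math
from typing import List

MAX_ROWS = 1600  # display limit quoted in the text


def merge_partitions(counts: List[int], max_rows: int = MAX_ROWS) -> List[int]:
    """Merge adjacent partitions so that at most max_rows buckets remain."""
    if len(counts) <= max_rows:
        return list(counts)
    # Number of neighbouring partitions folded into one bucket.
    group = math.ceil(len(counts) / max_rows)
    return [sum(counts[i:i + group]) for i in range(0, len(counts), group)]


# 8000 partitions with one access each collapse into 1600 buckets of 5.
merged = merge_partitions([1] * 8000)
print(len(merged), merged[0])                          # 1600 5

# Small example with a made-up limit of 2 rows.
print(merge_partitions([1, 2, 3, 4, 5], max_rows=2))   # [6, 9]
```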

With tens or hundreds of thousands of tables, both collection and front-end display put great resource pressure on the kernel and on the whole processing path. In such extreme cases hot spot data is not collected by default, but users can dynamically modify database parameters to specify which databases and tables need hot spot analysis, limiting the scope and analyzing on demand.

In summary, small numbers of partitions are displayed exactly, medium numbers of partitions are displayed at reduced precision, and very large numbers of partitions are displayed for a user-specified scope, which covers a wide range of user needs.
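The three cases can be summarized as a simple rule on the partition count. The sketch below uses the 1600 and 8000 thresholds from the text; the function name, the mode labels, and the exact handling of the boundary values are assumptions made for illustration.

```python
EXACT_LIMIT = 1600    # up to this many partitions are shown exactly
MERGED_LIMIT = 8000   # up to this many partitions are shown after merging


def display_mode(partition_count: int) -> str:
    """Pick how partition heat is collected and shown (labels are hypothetical)."""
    if partition_count <= EXACT_LIMIT:
        return "exact"
    if partition_count <= MERGED_LIMIT:
        return "merged"
    # Beyond the upper limit nothing is collected by default;
    # the user specifies which databases and tables to analyze.
    return "on_demand"


for count in (160, 1600, 5000, 100000):
    print(count, display_mode(count))
```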

Performance analysis

To measure the impact of hot spot analysis on kernel performance, we ran several groups of TPC-C comparison experiments. With the PolarDB-X kernel CPU driven to its maximum, we tested the extreme case of enabling the feature while refreshing the page to fetch diagnostic results continuously. The impact on performance fluctuated around 1%. Allowing for the normal statistical error of the test, the feature can be considered to have almost no impact on kernel performance.

Easter eggs

When the user has not created any partitioned tables, the page has no data to display. The conventional approach is to show a line of text such as “No data yet”, but that gives users no taste of what the feature can do. To let users enjoy hot spot analysis even before they have data, PolarDB-X has some fun with the blank page: using the front-end capabilities of the heat map, it draws a “NO DATA” image, so users can try out the feature without any data of their own.

When the user's number of partitions exceeds the display limit, a “TOO BIG” image is drawn instead.

Beyond the “mainstream” uses above, a little imagination opens up all sorts of “non-mainstream” uses. For example, by exploiting the color behavior of the heat map and precisely controlling partition access, you can draw a “love” heat map and become the first person on Earth to declare your love with a database!