The Practice of TiDB Slow Logs at PalFish
2022-06-24 04:00:00 【PingCAP】
About the author: Liu Jiang, head of the database team at PalFish and a TUG 2020 MOA, responsible for PalFish's database operations, big data operations, and database platform construction.
This article is based on the talk that Liu Jiang, PalFish's DBA team lead, gave at the second 「Energy Titanium」 event, where he shared PalFish's practice with TiDB slow logs. It covers three aspects:
- The first part, Background and requirements, introduces the background of PalFish's TiDB slow log system and, given that background, what we wanted the system to look like;
- The second part explains how the slow log system is built;
- The third part walks through several online cases to show how the slow log system locates production problems.
Background and requirements
In the first half of this year, Alibaba Cloud released its new-generation DAS (Database Autonomy Service), which claims that more than 90% of database problems come from abnormal requests. That matches our daily experience: most database problems are caused by abnormal SQL requests, while problems caused by database bugs or machine failures are relatively rare. For locating abnormal SQL, slow log analysis is a particularly effective tool.
So what pain points do we hit when we analyze problems with slow logs? Before we built the slow log system, a cluster's slow logs were scattered across multiple machines; when the database had a problem, we had to log in to each machine and analyze the logs one by one, which made troubleshooting very inefficient. With a large cluster, there was basically no way to locate a problem quickly.
Admittedly, TiDB 4.0 supports Dashboard, through which we can view the slow logs of the whole cluster, e.g. the last 15 minutes or the last half hour. But when the system really goes wrong, a flood of slow logs is produced and Dashboard runs into performance problems such as high compute load; meanwhile, Dashboard supports neither retrieval nor statistical analysis, which makes it hard for us to quickly locate the abnormal SQL.
TiDB also ships a system table (INFORMATION_SCHEMA.SLOW_QUERY) that exposes slow logs in real time, and we can use it to locate abnormal SQL as well. But it is a relational table with no indexes of its own, so multidimensional retrieval and analysis become particularly slow once the data volume is large; multidimensional retrieval and statistical analysis are simply not what a relational table is good at.
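For illustration, here is a minimal sketch of querying this table directly; the host, credentials, and the 30-minute window are assumptions for the example, not part of the original setup.

```python
import pymysql

# Minimal sketch: pull the slowest user queries of the last 30 minutes
# from INFORMATION_SCHEMA.SLOW_QUERY. Host/credentials are assumptions.
conn = pymysql.connect(host="tidb-server", port=4000,
                       user="root", password="", charset="utf8mb4")
try:
    with conn.cursor() as cur:
        # Without indexes on this table, the scan below gets slower
        # and slower as the slow log volume grows.
        cur.execute("""
            SELECT time, db, query_time, total_keys, process_keys, query
            FROM INFORMATION_SCHEMA.SLOW_QUERY
            WHERE is_internal = FALSE
              AND time > NOW() - INTERVAL 30 MINUTE
            ORDER BY query_time DESC
            LIMIT 20
        """)
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```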
Based on the pain points above, PalFish's slow log system needs to meet the following requirements:
First, centralized collection of slow logs: the slow logs of many online clusters, even dozens of them, are gathered in one place for centralized analysis, giving us a single unified entry point.

Second, the collected slow logs must be near real time; if collection lags too far behind, they are of no help in handling and analyzing online problems.

Third, the slow logs must support retrieval and statistical analysis; a problem produces a lot of slow logs, and being able to search and aggregate them lets us locate the abnormal SQL quickly.

Finally, the slow log system needs to support monitoring and alerting.
System details
Given the background and requirements above, let's look at how PalFish's slow log system is built.
System architecture
The overall architecture of PalFish's slow log system is shown in the figure below. We deploy a Filebeat component on each TiDB server when the machine is initialized; it collects the slow logs, stamps each event with the machine's IP, and writes them to Kafka. Logstash then parses out the fields we care about and stores them in ES. ES is itself a search engine, so data analysis and statistics on it are very fast. Finally, we use Kibana to view the slow log data in ES and do visualized statistics and retrieval.
There is, of course, an alternative architecture, shown in the figure below. ClickHouse is an analytical database that has become popular in recent years; some companies send their monitoring data to ClickHouse for real-time monitoring and alert analysis. In that setup, Flink replaces the Logstash component and, after some simple computation, writes the slow log data into ClickHouse.

PalFish built its slow log system relatively early, which is why it uses the ELK architecture. The reasons are as follows:
First, Filebeat is lightweight enough. We ran parsing tests against log files of several hundred megabytes online and concluded that it has essentially no impact on database performance.

Second, when something goes wrong online, the instantaneous log volume is particularly large. Writing slow logs directly to Logstash would put heavy load on the Logstash machines, so Kafka sits in between to absorb the peak.

When Logstash parses slow logs, fuzzy-matching rules should be used as sparingly as possible, because fuzzy-matching patterns can drive the machine's CPU usage high.

Next, ES indexes are schema-free and built on inverted indexes, which makes them well suited to statistical analysis scenarios.

Meanwhile, Kibana is used for visualized retrieval and statistical analysis.

Finally, we read the slow log data from ES every 2 minutes for monitoring and alerting.
Log collection
Now let's look at the details of each component. On the left is the Filebeat collection configuration, as shown in the figure below. When we deploy Filebeat, we pass in the machine's IP, so that later statistics can tell which machine each slow log came from. On the right is the Kafka configuration: after the data is collected, it is sent to the Kafka cluster.
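A minimal sketch of what such a Filebeat configuration might look like; the log path, topic name, and IP value are illustrative assumptions rather than PalFish's actual settings.

```yaml
# Sketch of a Filebeat input that ships TiDB slow logs to Kafka;
# the path, topic, and IP below are illustrative assumptions.
filebeat.inputs:
  - type: log
    paths:
      - /data/tidb/log/tidb_slow_query.log  # TiDB slow-query-file location
    multiline.pattern: '^# Time:'           # each slow log entry starts here
    multiline.negate: true
    multiline.match: after                  # fold one entry into one event
    fields:
      ip: 10.0.0.1                          # machine IP passed in at deploy time
    fields_under_root: true

output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092", "kafka3:9092"]
  topic: tidb_slow_log
```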
Below is an example of a TiDB slow log entry; this is its format.
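An abridged entry in this format looks roughly like the following (the values are illustrative):

```
# Time: 2021-03-01T10:00:00.123456+08:00
# Txn_start_ts: 423456789012345678
# Query_time: 1.527627037
# Prewrite_time: 0.335 Commit_time: 0.032
# Process_time: 0.07 Total_keys: 131073 Process_keys: 131072
# DB: test
# Index_names: [t:idx_a]
# Is_internal: false
select * from t where a = 1;
```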
A slow log entry starts with Time, carries fields such as Query_time, Total_keys, Process_keys, and DB, and ends with the SQL statement itself. After Filebeat collects the entry, it becomes a single line of text, as shown in the figure below. If we stored that line of text directly in ES, there would be no way to do retrieval or statistical analysis on it.
Field filtering
A raw line of text cannot be used for statistical analysis or multidimensional retrieval; to get there, we have to parse the fields out of that line. So which fields do we care about? Let's first look at a MySQL 5.7 slow log, shown in the figure below. When handling a MySQL incident, the first thing we look at is a statement's query time; if some SQL's query time is relatively long, we suspect it may be the cause of the online problem.
But under heavy online traffic, a long query time does not necessarily mean the statement is the root cause; other keywords have to be combined for a full analysis. For example, a particularly important keyword in the slow log is Rows_examined, which represents the number of rows the data logic scanned. We usually analyze Query_time and Rows_examined together to locate the problem SQL.
Next, let's look at TiDB slow logs, starting with a TiDB 3.0.13 slow log for a statement going through the KV interface, shown in the figure below. It contains important keywords such as Query_time, DB, SQL, and Prewrite_time, which are very helpful for locating online problems.

Below is another TiDB 3.0.13 slow log format, this time for a statement going through the DistSQL interface, as shown in the figure below.

Besides printing Query_time and Total_keys, it also records Index_names, which indicates whether the SQL used an index; the Index_names field also contains the table name and other information.

That covers TiDB 3.0.13 slow logs. Now let's look at TiDB 4.0.13 slow logs, whose content adds some fields compared with 3.0.13, such as KV_total, PD_total, and Backoff_total.

From the slow logs above we can see that they contain a great deal of key information; we can even see where inside the database a request was slow. If we collect these slow logs, retrieve them by certain relations, and even aggregate them, that is very helpful for finding problems.
At PalFish, the TiDB slow log fields we care about mainly include the following:
- TiDB IP: passed in when Filebeat is deployed. With this IP we know which machine a log came from and can aggregate by the IP dimension;
- DB: the DATABASE used when executing the statement. We can aggregate by the DB dimension, and our internal system also maps each DB to its database cluster;
- TABLE: for some slow logs the table name can be parsed out, so statistics can be done by the table dimension;
- IDX_NAME: except for Insert statements and statements going through the KV interface, the slow log records which index the statement used;
- TOTAL_KEYS: the number of keys scanned by Coprocessor;
- PROCESS_KEYS: the number of keys processed by Coprocessor;
- QUERY_TIME: the statement's execution time;
- SQL: the SQL statement itself.
Field parsing
We use Logstash's Grok syntax to parse the required fields out of a slow log, as shown in the figure below.
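The sketch below mirrors in Python regexes what such Grok patterns extract; the patterns are assumptions based on the documented slow log format, not PalFish's actual Grok rules.

```python
import re

# Field extraction in the spirit of the Grok rules; the patterns are
# assumptions based on the documented TiDB slow log format.
PATTERNS = {
    "query_time":   re.compile(r"# Query_time: ([\d.]+)"),
    "db":           re.compile(r"# DB: (\S+)"),
    "index_names":  re.compile(r"# Index_names: \[(.*?)\]"),
    "total_keys":   re.compile(r"Total_keys: (\d+)"),
    "process_keys": re.compile(r"Process_keys: (\d+)"),
}

def parse_slow_log(entry: str) -> dict:
    """Parse one multi-line slow log entry into a flat field dict."""
    doc = {}
    for field, pat in PATTERNS.items():
        m = pat.search(entry)
        if m:
            doc[field] = m.group(1)
    # The SQL statement is whatever trails the "#"-prefixed header lines.
    sql_lines = [l for l in entry.splitlines() if l and not l.startswith("#")]
    if sql_lines:
        doc["sql"] = " ".join(sql_lines)
    return doc
```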
Statistical analysis
The figure below shows the slow logs of all our clusters over the last 30 minutes. Through Kibana we can see the total number of slow logs, search by DB, Query_time, Total_keys, and so on, and aggregate by DB, Table, IP, and other dimensions. This greatly improves the efficiency of locating problem SQL.
Besides visualized retrieval and aggregation in Kibana, our internal platform also aggregates and ranks slow logs along various dimensions, including cluster, database, table, IP, and operation type; the ranking can be by total time, average time, maximum time, or log count. We also send slow log reports to R&D on a regular schedule. A per-database ranking of this kind can be expressed as an ES aggregation, as sketched below.
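A minimal sketch of such a ranking via the ES terms aggregation; the index name and field names are illustrative assumptions.

```python
from elasticsearch import Elasticsearch

# Rank databases by total slow log time over the last 30 minutes;
# index and field names below are illustrative assumptions.
es = Elasticsearch(["http://es-host:9200"])
resp = es.search(index="tidb-slow-*", body={
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-30m"}}},
    "aggs": {
        "by_db": {
            "terms": {"field": "db", "size": 10,
                      "order": {"total_time": "desc"}},
            "aggs": {
                "total_time": {"sum": {"field": "query_time"}},
                "avg_time":   {"avg": {"field": "query_time"}},
                "max_time":   {"max": {"field": "query_time"}},
            },
        }
    },
})
for b in resp["aggregations"]["by_db"]["buckets"]:
    print(b["key"], b["doc_count"], b["total_time"]["value"],
          b["avg_time"]["value"], b["max_time"]["value"])
```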
Monitoring and alerting
Besides visualized retrieval and statistical analysis, we also monitor and alert on slow logs, as shown in the figure below. We count the slow SQL entries of each database and set two alert rules per database. Take the advertising database for example: the general rule fires when statements taking more than 200 milliseconds reach a threshold of 5 entries.

However, we found that some statements execute very rarely online but take particularly long, and the general rule cannot cover them. So we later added another rule: fire when statements taking more than 500 milliseconds reach a threshold of 2 entries. Together these rules basically cover the slow SQL online.
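A minimal sketch of the 2-minute check behind these two rules; the ES index, field names, and the way an alert is emitted are assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-host:9200"])

# (query_time threshold in seconds, count threshold), per the rules above.
RULES = [(0.2, 5),   # general rule: > 200 ms, alert at 5 entries
         (0.5, 2)]   # low-frequency rule: > 500 ms, alert at 2 entries

def check_db(db: str) -> None:
    """Run both alert rules for one database over the last 2 minutes."""
    for latency, count_threshold in RULES:
        resp = es.count(index="tidb-slow-*", body={
            "query": {"bool": {"filter": [
                {"term": {"db": db}},
                {"range": {"@timestamp": {"gte": "now-2m"}}},
                {"range": {"query_time": {"gt": latency}}},
            ]}}
        })
        if resp["count"] >= count_threshold:
            # In practice this would notify an alerting channel.
            print(f"ALERT db={db}: {resp['count']} queries slower than "
                  f"{latency}s in the last 2 minutes")

check_db("advertising")
```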
An alert message is shown in the figure below. Looking at the figure on the left: because we collect the DB field of the slow log and our internal system associates each DB with its database cluster, we can see which cluster and which database the slow logs occurred in, how many slow logs exceeded the threshold when they were generated, the total SQL execution time, the average time, and so on. From such an alert we can judge within a short time how big an impact the problem SQL has on production.
Case sharing
Having covered how the slow log system is built, let's look at how we use it to find online problems.
In the first case, shown in the figure below, we found one day that the cluster's Coprocessor CPU went up, and the latency on the right went up with it. Through the slow log system we quickly found the problem SQL: its Total_keys and Process_keys were very large, and its index_name was empty, meaning it used no index. After adding an appropriate index for this SQL, the performance problem was quickly resolved.

The second case is similar: we found the Coprocessor CPU metric rising, searched the slow log system, and found SQL that used no index. Locating the problem was fast, because aggregation and retrieval in ES are particularly fast.
Summary
The above is some of PalFish's experience with using the slow log system to detect performance problems and guard against database performance risks; we hope it helps. Going forward, we will keep mining the information in slow logs and enriching the system's features to safeguard PalFish's databases.