The Practice of TiDB Slow Logs at PalFish
2022-06-24 04:00:00 【PingCAP】
About the author: Liu Jiang, head of the database team at PalFish and a TUG 2020 MOA, responsible for PalFish's database operations, big data operations, and database platform construction.
This article is based on the talk that Liu Jiang, PalFish's DBA team lead, gave at the second 「Energy Titanium」 event, where he shared PalFish's practice with TiDB slow logs. It covers three aspects:
- The first part, Background and requirements, introduces the background of PalFish's TiDB slow log system and, given that background, what we wanted the system to look like;
- The second part explains how the slow log system is built;
- The third part walks through several online cases to show how the slow log system locates production problems.
Background and requirements
In the first half of this year, Alibaba Cloud released its new-generation DAS (Database Autonomy Service), which claims that more than 90% of database problems come from abnormal requests. That matches our daily experience: most database problems are caused by abnormal SQL requests, while problems caused by database bugs or machine failures are relatively rare. For locating abnormal SQL, slow log analysis is a particularly effective tool.
So what pain points do we hit when we analyze problems with slow logs? Before we built the slow log system, a cluster's slow logs were scattered across multiple machines; when the database had a problem, we had to log in to each machine and analyze the logs one by one, which made troubleshooting very inefficient. With a large cluster, there was basically no way to locate a problem quickly.
Admittedly, TiDB 4.0 supports Dashboard, through which we can view the slow logs of the whole cluster, e.g. the last 15 minutes or the last half hour. But when the system really goes wrong, a flood of slow logs is produced and Dashboard runs into performance problems such as high compute load; meanwhile, Dashboard supports neither retrieval nor statistical analysis, which makes it hard for us to quickly locate the abnormal SQL.
TiDB also ships a system table (INFORMATION_SCHEMA.SLOW_QUERY) that exposes slow logs in real time, and we can use it to locate abnormal SQL as well. But it is a relational table with no indexes of its own, so multidimensional retrieval and analysis become particularly slow once the data volume is large; multidimensional retrieval and statistical analysis are simply not what a relational table is good at.
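For illustration, here is a minimal sketch of querying this table directly; the host, credentials, and the 30-minute window are assumptions for the example, not part of the original setup.

```python
import pymysql

# Minimal sketch: pull the slowest user queries of the last 30 minutes
# from INFORMATION_SCHEMA.SLOW_QUERY. Host/credentials are assumptions.
conn = pymysql.connect(host="tidb-server", port=4000,
                       user="root", password="", charset="utf8mb4")
try:
    with conn.cursor() as cur:
        # Without indexes on this table, the scan below gets slower
        # and slower as the slow log volume grows.
        cur.execute("""
            SELECT time, db, query_time, total_keys, process_keys, query
            FROM INFORMATION_SCHEMA.SLOW_QUERY
            WHERE is_internal = FALSE
              AND time > NOW() - INTERVAL 30 MINUTE
            ORDER BY query_time DESC
            LIMIT 20
        """)
        for row in cur.fetchall():
            print(row)
finally:
    conn.close()
```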
Based on the pain points above, PalFish's slow log system needs to meet the following requirements:
First, centralized collection of slow logs: the slow logs of many online clusters, even dozens of them, are gathered in one place for centralized analysis, giving us a single unified entry point.

Second, the collected slow logs must be near real time; if collection lags too far behind, they are of no help in handling and analyzing online problems.

Third, the slow logs must support retrieval and statistical analysis; a problem produces a lot of slow logs, and being able to search and aggregate them lets us locate the abnormal SQL quickly.

Finally, the slow log system needs to support monitoring and alerting.
System details
Given the background and requirements above, let's look at how PalFish's slow log system is built.
System architecture
The overall architecture of PalFish's slow log system is shown in the figure below. We deploy a Filebeat component on each TiDB server when the machine is initialized; it collects the slow logs, stamps each event with the machine's IP, and writes them to Kafka. Logstash then parses out the fields we care about and stores them in ES. ES is itself a search engine, so data analysis and statistics on it are very fast. Finally, we use Kibana to view the slow log data in ES and do visualized statistics and retrieval.
There is, of course, an alternative architecture, shown in the figure below. ClickHouse is an analytical database that has become popular in recent years; some companies send their monitoring data to ClickHouse for real-time monitoring and alert analysis. In that setup, Flink replaces the Logstash component and, after some simple computation, writes the slow log data into ClickHouse.

PalFish built its slow log system relatively early, which is why it uses the ELK architecture. The reasons are as follows:
First, Filebeat is lightweight enough. We ran parsing tests against log files of several hundred megabytes online and concluded that it has essentially no impact on database performance.

Second, when something goes wrong online, the instantaneous log volume is particularly large. Writing slow logs directly to Logstash would put heavy load on the Logstash machines, so Kafka sits in between to absorb the peak.

When Logstash parses slow logs, fuzzy-matching rules should be used as sparingly as possible, because fuzzy-matching patterns can drive the machine's CPU usage high.

Next, ES indexes are schema-free and built on inverted indexes, which makes them well suited to statistical analysis scenarios.

Meanwhile, Kibana is used for visualized retrieval and statistical analysis.

Finally, we read the slow log data from ES every 2 minutes for monitoring and alerting.
Log collection
Now let's look at the details of each component. On the left is the Filebeat collection configuration, as shown in the figure below. When we deploy Filebeat, we pass in the machine's IP, so that later statistics can tell which machine each slow log came from. On the right is the Kafka configuration: after the data is collected, it is sent to the Kafka cluster.
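A minimal sketch of what such a Filebeat configuration might look like; the log path, topic name, and IP value are illustrative assumptions rather than PalFish's actual settings.

```yaml
# Sketch of a Filebeat input that ships TiDB slow logs to Kafka;
# the path, topic, and IP below are illustrative assumptions.
filebeat.inputs:
  - type: log
    paths:
      - /data/tidb/log/tidb_slow_query.log  # TiDB slow-query-file location
    multiline.pattern: '^# Time:'           # each slow log entry starts here
    multiline.negate: true
    multiline.match: after                  # fold one entry into one event
    fields:
      ip: 10.0.0.1                          # machine IP passed in at deploy time
    fields_under_root: true

output.kafka:
  hosts: ["kafka1:9092", "kafka2:9092", "kafka3:9092"]
  topic: tidb_slow_log
```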
Below is an example of a TiDB slow log entry; this is its format.
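An abridged entry in this format looks roughly like the following (the values are illustrative):

```
# Time: 2021-03-01T10:00:00.123456+08:00
# Txn_start_ts: 423456789012345678
# Query_time: 1.527627037
# Prewrite_time: 0.335 Commit_time: 0.032
# Process_time: 0.07 Total_keys: 131073 Process_keys: 131072
# DB: test
# Index_names: [t:idx_a]
# Is_internal: false
select * from t where a = 1;
```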
A slow log entry starts with Time, carries fields such as Query_time, Total_keys, Process_keys, and DB, and ends with the SQL statement itself. After Filebeat collects the entry, it becomes a single line of text, as shown in the figure below. If we stored that line of text directly in ES, there would be no way to do retrieval or statistical analysis on it.
Field filtering
A raw line of text cannot be used for statistical analysis or multidimensional retrieval; to get there, we have to parse the fields out of that line. So which fields do we care about? Let's first look at a MySQL 5.7 slow log, shown in the figure below. When handling a MySQL incident, the first thing we look at is a statement's query time; if some SQL's query time is relatively long, we suspect it may be the cause of the online problem.
But under heavy online traffic, a long query time does not necessarily mean the statement is the root cause; other keywords have to be combined for a full analysis. For example, a particularly important keyword in the slow log is Rows_examined, which represents the number of rows the data logic scanned. We usually analyze Query_time and Rows_examined together to locate the problem SQL.
Next, let's look at TiDB slow logs, starting with a TiDB 3.0.13 slow log for a statement going through the KV interface, shown in the figure below. It contains important keywords such as Query_time, DB, SQL, and Prewrite_time, which are very helpful for locating online problems.

Below is another TiDB 3.0.13 slow log format, this time for a statement going through the DistSQL interface, as shown in the figure below.

Besides printing Query_time and Total_keys, it also records Index_names, which indicates whether the SQL used an index; the Index_names field also contains the table name and other information.

That covers TiDB 3.0.13 slow logs. Now let's look at TiDB 4.0.13 slow logs, whose content adds some fields compared with 3.0.13, such as KV_total, PD_total, and Backoff_total.

From the slow logs above we can see that they contain a great deal of key information; we can even see where inside the database a request was slow. If we collect these slow logs, retrieve them by certain relations, and even aggregate them, that is very helpful for finding problems.
At PalFish, the TiDB slow log fields we care about mainly include the following:
- TiDB IP: passed in when Filebeat is deployed. With this IP we know which machine a log came from and can aggregate by the IP dimension;
- DB: the DATABASE used when executing the statement. We can aggregate by the DB dimension, and our internal system also maps each DB to its database cluster;
- TABLE: for some slow logs the table name can be parsed out, so statistics can be done by the table dimension;
- IDX_NAME: except for Insert statements and statements going through the KV interface, the slow log records which index the statement used;
- TOTAL_KEYS: the number of keys scanned by Coprocessor;
- PROCESS_KEYS: the number of keys processed by Coprocessor;
- QUERY_TIME: the statement's execution time;
- SQL: the SQL statement itself.
Field parsing
We use Logstash's Grok syntax to parse the required fields out of a slow log, as shown in the figure below.
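The sketch below mirrors in Python regexes what such Grok patterns extract; the patterns are assumptions based on the documented slow log format, not PalFish's actual Grok rules.

```python
import re

# Field extraction in the spirit of the Grok rules; the patterns are
# assumptions based on the documented TiDB slow log format.
PATTERNS = {
    "query_time":   re.compile(r"# Query_time: ([\d.]+)"),
    "db":           re.compile(r"# DB: (\S+)"),
    "index_names":  re.compile(r"# Index_names: \[(.*?)\]"),
    "total_keys":   re.compile(r"Total_keys: (\d+)"),
    "process_keys": re.compile(r"Process_keys: (\d+)"),
}

def parse_slow_log(entry: str) -> dict:
    """Parse one multi-line slow log entry into a flat field dict."""
    doc = {}
    for field, pat in PATTERNS.items():
        m = pat.search(entry)
        if m:
            doc[field] = m.group(1)
    # The SQL statement is whatever trails the "#"-prefixed header lines.
    sql_lines = [l for l in entry.splitlines() if l and not l.startswith("#")]
    if sql_lines:
        doc["sql"] = " ".join(sql_lines)
    return doc
```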
Statistical analysis
The figure below shows the slow logs of all our clusters over the last 30 minutes. Through Kibana we can see the total number of slow logs, search by DB, Query_time, Total_keys, and so on, and aggregate by DB, Table, IP, and other dimensions. This greatly improves the efficiency of locating problem SQL.
Besides visualized retrieval and aggregation in Kibana, our internal platform also aggregates and ranks slow logs along various dimensions, including cluster, database, table, IP, and operation type; the ranking can be by total time, average time, maximum time, or log count. We also send slow log reports to R&D on a regular schedule. A per-database ranking of this kind can be expressed as an ES aggregation, as sketched below.
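A minimal sketch of such a ranking via the ES terms aggregation; the index name and field names are illustrative assumptions.

```python
from elasticsearch import Elasticsearch

# Rank databases by total slow log time over the last 30 minutes;
# index and field names below are illustrative assumptions.
es = Elasticsearch(["http://es-host:9200"])
resp = es.search(index="tidb-slow-*", body={
    "size": 0,
    "query": {"range": {"@timestamp": {"gte": "now-30m"}}},
    "aggs": {
        "by_db": {
            "terms": {"field": "db", "size": 10,
                      "order": {"total_time": "desc"}},
            "aggs": {
                "total_time": {"sum": {"field": "query_time"}},
                "avg_time":   {"avg": {"field": "query_time"}},
                "max_time":   {"max": {"field": "query_time"}},
            },
        }
    },
})
for b in resp["aggregations"]["by_db"]["buckets"]:
    print(b["key"], b["doc_count"], b["total_time"]["value"],
          b["avg_time"]["value"], b["max_time"]["value"])
```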
Monitoring and alerting
Besides visualized retrieval and statistical analysis, we also monitor and alert on slow logs, as shown in the figure below. We count the slow SQL entries of each database and set two alert rules per database. Take the advertising database for example: the general rule fires when statements taking more than 200 milliseconds reach a threshold of 5 entries.

However, we found that some statements execute very rarely online but take particularly long, and the general rule cannot cover them. So we later added another rule: fire when statements taking more than 500 milliseconds reach a threshold of 2 entries. Together these rules basically cover the slow SQL online.
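A minimal sketch of the 2-minute check behind these two rules; the ES index, field names, and the way an alert is emitted are assumptions.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://es-host:9200"])

# (query_time threshold in seconds, count threshold), per the rules above.
RULES = [(0.2, 5),   # general rule: > 200 ms, alert at 5 entries
         (0.5, 2)]   # low-frequency rule: > 500 ms, alert at 2 entries

def check_db(db: str) -> None:
    """Run both alert rules for one database over the last 2 minutes."""
    for latency, count_threshold in RULES:
        resp = es.count(index="tidb-slow-*", body={
            "query": {"bool": {"filter": [
                {"term": {"db": db}},
                {"range": {"@timestamp": {"gte": "now-2m"}}},
                {"range": {"query_time": {"gt": latency}}},
            ]}}
        })
        if resp["count"] >= count_threshold:
            # In practice this would notify an alerting channel.
            print(f"ALERT db={db}: {resp['count']} queries slower than "
                  f"{latency}s in the last 2 minutes")

check_db("advertising")
```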
An alert message is shown in the figure below. Looking at the figure on the left: because we collect the DB field of the slow log and our internal system associates each DB with its database cluster, we can see which cluster and which database the slow logs occurred in, how many slow logs exceeded the threshold when they were generated, the total SQL execution time, the average time, and so on. From such an alert we can judge within a short time how big an impact the problem SQL has on production.
Case sharing
Having covered how the slow log system is built, let's look at how we use it to find online problems.
In the first case, shown in the figure below, we found one day that the cluster's Coprocessor CPU went up, and the latency on the right went up with it. Through the slow log system we quickly found the problem SQL: its Total_keys and Process_keys were very large, and its index_name was empty, meaning it used no index. After adding an appropriate index for this SQL, the performance problem was quickly resolved.

The second case is similar: we found the Coprocessor CPU metric rising, searched the slow log system, and found SQL that used no index. Locating the problem was fast, because aggregation and retrieval in ES are particularly fast.
Summary
The above is some of PalFish's experience with using the slow log system to detect performance problems and guard against database performance risks; we hope it helps. Going forward, we will keep mining the information in slow logs and enriching the system's features to safeguard PalFish's databases.