A Performance Optimization Journey for a Database Full-SQL Analysis and Audit System
2022-06-10 22:53:00 [Meituan Technical Team]
Full SQL (all SQL statements that access the database) can effectively support database security auditing and help businesses quickly troubleshoot performance problems. It is usually obtained by enabling the MySQL general log or an audit plugin; Meituan instead chose a non-intrusive bypass packet-capture scheme, implemented in Go. Whichever approach is used, the performance overhead on the database host must be kept in check. This article describes the performance problems Meituan's basic R&D platform encountered with the packet-capture scheme in database audit practice, and the optimizations made, in the hope that it is helpful or enlightening.
1 Background
Database security has always been a focus of Meituan's information security and database teams. For historical reasons, however, database access was only audited by sampling, so some attacks could not be quickly detected, assessed, and mitigated. Based on historical experience, the security team found that attack-related database access exhibits certain characteristics and often involves specific SQL statements. We therefore hoped to analyze the full MySQL access traffic, identify those characteristic SQL statements, and make database security work far more targeted.
2 Status Quo and Challenges
The figure below shows the architecture of the sampling-based MySQL audit system. The data collection side is implemented with pcap-based packet capture, and the data processing side uses the log ingestion scheme of Meituan's big data center. Every MySQL instance is deployed with rds-agent, which collects MySQL-related data, and log-agent for log collection. rds-agent captures the MySQL access data and reports it through log-agent to the log receiver; to reduce latency, reporting and receiving are scheduled within the same data center. The log receiver writes the data into an agreed Kafka topic; the security team consumes Kafka in real time via Storm to analyze attack events, and periodically pulls the data into Hive for persistence.
We found that it is usually the core MySQL clusters that get attacked. Statistics show that for these clusters the p9995 line of single-machine QPS is around 50,000. As a process co-located on the MySQL host, rds-agent must also be strictly resource-controlled for the sake of host stability. To evaluate rds-agent's performance at high QPS, we stress-tested MySQL with Sysbench and observed rds-agent's packet loss rate and CPU consumption at different QPS levels. The stress-test data below shows rather poor results:
| QPS | Loss rate | CPU utilization |
|---|---|---|
| 10368.72 | 1.03% | 307.35% |
| 17172.61 | 7.23% | 599.90% |
| 29005.51 | 28.75% | 662.39% |
| 42697.05 | 51.73% | 622.34% |
| 50833.50 | 63.95% | 601.39% |
How to achieve a low loss rate and low CPU consumption at high QPS became the system's urgent problem and challenge.
3 Analysis and Optimization
This section focuses on the loss-rate and CPU-consumption problems, covering our analysis and improvements on the data collection side in the areas of desensitization, scheduling, garbage collection, and protocol handling.
3.1 Introduction to the Data Collection Side
First, a brief introduction to the data collection side, rds-agent. It is a process running on the MySQL instance host, written in Go and adapted from the agent of the open-source project MysqlProbe. By listening to the MySQL port traffic on the network card, it parses out audit information such as the client's access time, source IP, user name, SQL, target database, and target IP. Its architecture is shown below and consists of five main functional modules:
1. probe
The probe adopts gopacket, an open-source Go packet-capture library from Google that wraps pcap, as its capture scheme. The probe encapsulates the captured raw data-link-layer frames into TCP-layer packets, and hashes the source and destination IP and port fields with a variant of the Fowler-Noll-Vo algorithm to quickly split database connections across different workers. The variant guarantees that inbound and outbound packets of the same connection produce the same hash value.
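The exact variant is not published; the sketch below shows one way to obtain the direction-insensitive property (the XOR combination and all names are assumptions, not rds-agent's actual code): hash each IP:port endpoint with standard 64-bit FNV-1a, then combine the two hashes with a commutative operation.

```go
package main

import "fmt"

// FNV-1a constants (64-bit).
const (
	fnvOffset64 uint64 = 14695981039346656037
	fnvPrime64  uint64 = 1099511628211
)

// fnv64a is plain FNV-1a over a byte slice.
func fnv64a(data []byte) uint64 {
	h := fnvOffset64
	for _, b := range data {
		h ^= uint64(b)
		h *= fnvPrime64
	}
	return h
}

// endpoint serializes an IP:port pair into bytes for hashing.
func endpoint(ip string, port uint16) []byte {
	return []byte(fmt.Sprintf("%s:%d", ip, port))
}

// connHash hashes each endpoint separately and XORs the two results.
// XOR is commutative, so packets in either direction of the same
// connection map to the same value, and hence to the same worker.
func connHash(srcIP string, srcPort uint16, dstIP string, dstPort uint16) uint64 {
	return fnv64a(endpoint(srcIP, srcPort)) ^ fnv64a(endpoint(dstIP, dstPort))
}

func main() {
	const numWorkers = 8
	in := connHash("10.0.0.1", 51234, "10.0.0.2", 3306)
	out := connHash("10.0.0.2", 3306, "10.0.0.1", 51234)
	fmt.Println(in == out) // true: both directions pick the same worker
	fmt.Println(in % numWorkers)
}
```

The worker index is then simply `connHash(...) % numWorkers`.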
2. watcher
The login user name is essential for auditing, but clients usually access MySQL over long-lived connections, and the login information appears only in the authentication handshake phase of the MySQL protocol, so packet capture alone can easily miss it.
The watcher periodically runs show processlist to obtain all current connections of the database, and compensates for the missing user name by comparing the Host field with the client IP and port of the current packet.
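As a sketch of that compensation step (the struct and field names below are illustrative, not rds-agent's actual types), matching a captured packet's client endpoint against the Host column might look like:

```go
package main

import (
	"fmt"
	"net"
)

// processEntry is a hypothetical, pared-down row of `show processlist`.
type processEntry struct {
	User string
	Host string // e.g. "10.0.0.1:51234"
}

// userForClient looks up the user name for a captured packet's client
// endpoint by comparing it with the Host column of `show processlist`.
func userForClient(entries []processEntry, clientIP, clientPort string) (string, bool) {
	for _, e := range entries {
		host, port, err := net.SplitHostPort(e.Host)
		if err != nil {
			continue // skip malformed Host values
		}
		if host == clientIP && port == clientPort {
			return e.User, true
		}
	}
	return "", false
}

func main() {
	entries := []processEntry{{User: "app_rw", Host: "10.0.0.1:51234"}}
	user, ok := userForClient(entries, "10.0.0.1", "51234")
	fmt.Println(user, ok) // app_rw true
}
```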
3. worker
Each worker manages the life cycle of a set of database connections; one worker manages multiple connections. By periodically comparing its current connection list with the watcher's, the worker detects expired connections in time, closes them, and releases the associated resources to prevent memory leaks.
4. connStream
connStream holds the core logic of the entire data collection side. It parses TCP packets according to the MySQL protocol and identifies specific SQL statements; each connection corresponds to one connStream goroutine. Because SQL may contain sensitive data, connStream is also responsible for desensitizing it. For security reasons, the specific SQL identification strategy is not elaborated here.
5. sender
The sender is responsible for the reporting logic: it reports the audit data parsed by connStream to log-agent over the thrift protocol.
3.2 Baseline Performance Test
The performance of the packet-capture library gopacket directly determines the upper bound of the system's performance. To find out whether the problem lay in gopacket, we wrote a simple tcp-client and tcp-server and performance-tested, in isolation, the first three steps of the data flow diagram that involve gopacket (shown below). The test data below shows that the performance bottleneck was not gopacket.
| QPS | pcap buffer | Loss rate | CPU utilization |
|---|---|---|---|
| 100000 | 100MB | 0% | 144.9% |
3.3 CPU Profiling
The loss rate is inseparable from CPU consumption. To explore the causes of such high CPU consumption, we profiled the process's CPU usage with Go's built-in pprof tool. From the calling functions in the flame graph below, several major contributors can be identified: SQL desensitization, packet parsing, GC, and goroutine scheduling. The following sections describe the optimization work around each of them.
3.4 Desensitization Analysis and Improvement
Because SQL may contain sensitive information, rds-agent desensitizes every SQL statement for security.
The desensitization step uses pingcap's SQL parser to templatize the SQL, i.e., to replace every value in the SQL with "?". This requires building the SQL's abstract syntax tree, which is expensive. Since the current requirement is only to sample and capture specific SQL, it is unnecessary to desensitize every statement at parse time. We therefore optimized the flow by sinking desensitization into the reporting module, so that only the samples that are finally sent out are desensitized.
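The templatization contract can be illustrated with a toy stand-in (the real implementation walks a full AST built by pingcap's SQL parser; the regular expressions below are purely illustrative and not production-safe):

```go
package main

import (
	"fmt"
	"regexp"
)

// Toy literal matchers standing in for real AST-based desensitization:
// string literals and bare numeric literals become "?".
var (
	stringLit = regexp.MustCompile(`'(?:[^'\\]|\\.)*'`)
	numberLit = regexp.MustCompile(`\b\d+(?:\.\d+)?\b`)
)

// templatize replaces every value in the SQL with "?".
func templatize(sql string) string {
	out := stringLit.ReplaceAllString(sql, "?")
	return numberLit.ReplaceAllString(out, "?")
}

// report sketches the optimized flow: desensitize only the sampled SQL
// that is about to be sent out, not every parsed statement.
func report(samples []string) []string {
	masked := make([]string, 0, len(samples))
	for _, s := range samples {
		masked = append(masked, templatize(s))
	}
	return masked
}

func main() {
	samples := []string{"SELECT * FROM user WHERE id = 42 AND name = 'alice'"}
	fmt.Println(report(samples)[0])
	// SELECT * FROM user WHERE id = ? AND name = ?
}
```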
The results of this optimization are as follows :
| Contrast item | QPS | Loss rate | CPU utilization |
|---|---|---|---|
| Before improvement | 50833.50 | 63.95% | 601.39% |
| After improvement | 51246.47 | 31.95% | 259.59% |
3.5 Scheduling Analysis and Improvement
The data flow diagram below shows that the overall link is relatively long and prone to performance bottlenecks. It also contains many high-frequency goroutines (shown in red). Because there are so many of them, Go has to keep switching between these goroutines, and those scheduling switches are a heavy burden for a CPU-intensive program like ours.
For this problem, we made the following optimizations:
- Shorten the link: merge the dispatching, worker, SQL-parsing, and other goroutines into a single Parser goroutine.
- Reduce the switching frequency: the Parser fetches from the network protocol packet queue once every 5ms, which is in effect manually triggered batching of switches. (The 5ms interval is a compromise reached after many tests: smaller values consume more CPU, larger ones lose data.)
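The batched-dequeue idea can be sketched as follows (function and channel names are illustrative): instead of one channel receive, and one potential goroutine switch, per packet, the Parser wakes every 5ms and drains whatever has accumulated, trading a small bounded delay for far fewer context switches.

```go
package main

import (
	"fmt"
	"time"
)

// runParser wakes on a 5ms ticker, drains the packet queue into a
// batch, and hands the whole batch to handle in one call.
func runParser(packets <-chan []byte, done <-chan struct{}, handle func([][]byte)) {
	ticker := time.NewTicker(5 * time.Millisecond)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			var batch [][]byte
		drain:
			for {
				select {
				case p := <-packets:
					batch = append(batch, p)
				default:
					break drain // queue empty: stop draining
				}
			}
			if len(batch) > 0 {
				handle(batch)
			}
		case <-done:
			return
		}
	}
}

func main() {
	packets := make(chan []byte, 1024)
	done := make(chan struct{})
	for i := 0; i < 3; i++ {
		packets <- []byte{byte(i)}
	}
	go runParser(packets, done, func(batch [][]byte) {
		fmt.Println("batch of", len(batch)) // batch of 3
		close(done)
	})
	<-done
}
```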
The results of this optimization are as follows :
| Contrast item | QPS | Loss rate | CPU utilization |
|---|---|---|---|
| Before improvement | 51246.47 | 31.95% | 259.59% |
| After improvement | 51229.54 | 0% | 206.87% |
3.6 Garbage Collection Pressure Analysis and Improvement
The figure below is a flame graph of the pointer objects allocated by rds-agent during 30 seconds of packet capture. It shows that more than 40 million objects had been allocated; the resulting GC pressure can be imagined. We know of two common ways to optimize for GC:
- Pooling: Go's standard library provides the sync.Pool object pool, which reduces object allocation by reusing objects, thereby relieving GC pressure.
- Manual memory management: request memory directly from the OS via the mmap system call, bypassing the GC entirely.
However, scheme 2 is prone to memory leaks. For stability, we finally chose scheme 1 to manage the pointer objects created in frequently called functions. The results of this optimization are as follows:
| Contrast item | QPS | Loss rate | CPU utilization |
|---|---|---|---|
| Before improvement | 51229.54 | 0% | 206.87% |
| After improvement | 51275.11 | 0% | 153.32% |
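The pooling approach of scheme 1 can be sketched with sync.Pool as follows (the packet type and function names are illustrative, not rds-agent's actual structs):

```go
package main

import (
	"fmt"
	"sync"
)

// packetBuf is a hypothetical per-packet object that a hot path would
// otherwise allocate on every packet.
type packetBuf struct {
	data []byte
}

var packetPool = sync.Pool{
	New: func() interface{} {
		// Pre-size for a typical Ethernet MTU.
		return &packetBuf{data: make([]byte, 0, 1500)}
	},
}

// handlePacket borrows a buffer from the pool instead of allocating,
// and returns it when done so the next packet can reuse it.
func handlePacket(raw []byte) int {
	buf := packetPool.Get().(*packetBuf)
	defer func() {
		buf.data = buf.data[:0] // reset before returning to the pool
		packetPool.Put(buf)
	}()
	buf.data = append(buf.data, raw...)
	return len(buf.data)
}

func main() {
	fmt.Println(handlePacket([]byte("SELECT 1"))) // 8
}
```

Pooled objects must be fully reset before reuse; a stale field leaking from one packet into the next is the classic sync.Pool bug.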
3.7 Packet Parsing Analysis and Improvement
The MySQL protocol runs on top of TCP. While debugging this feature, we found a large number of empty packets. As the MySQL client-server interaction diagram below shows, when the client sends a SQL command and the server responds with the result, TCP's acknowledgement mechanism causes the client to send an empty ack packet to confirm the message. Such empty packets make up a large share of the whole interaction; they penetrate into the parsing flow and, at high QPS, are an unnecessary burden on goroutine scheduling and GC.
The figure below shows the format of a MySQL packet. From analysis, we observed the following characteristics:
- A complete MySQL packet is at least 4 bytes long.
- When the client sends a new command, the sequence id is always 0 or 1.
pcap supports filter rules, which let us exclude the empty packets at the kernel level. The two filter rules corresponding to the characteristics above are:
Characteristic 1: `ip[2:2] - ((ip[0] & 0x0f) << 2) - ((tcp[12:1] & 0xf0) >> 2) >= 4`
Characteristic 2: `(dst host {localIP} and dst port 3306 and (tcp[(((tcp[12:1] & 0xf0) >> 2) + 3)] <= 0x01))`
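The arithmetic in characteristic 1 is dense; the Go sketch below replicates it on a synthetic IPv4/TCP packet to show what the BPF expression computes (the byte layout is hand-built purely for illustration):

```go
package main

import "fmt"

// tcpPayloadLen computes what the BPF expression of characteristic 1
// computes: IP total length minus IP header length minus TCP header
// length, i.e. the TCP payload size. pkt must start at the IP header.
func tcpPayloadLen(pkt []byte) int {
	ipTotal := int(pkt[2])<<8 | int(pkt[3]) // ip[2:2]: IP total length
	ipHdr := int(pkt[0]&0x0f) << 2          // (ip[0] & 0x0f) << 2: IP header bytes
	tcpHdr := int(pkt[ipHdr+12]&0xf0) >> 2  // (tcp[12:1] & 0xf0) >> 2: TCP header bytes
	return ipTotal - ipHdr - tcpHdr         // >= 4 means "not an empty ack"
}

func main() {
	// Synthetic headers: IHL=5 (20-byte IP header), total length 44,
	// TCP data offset=5 (20-byte TCP header): payload = 44-20-20 = 4.
	pkt := make([]byte, 44)
	pkt[0] = 0x45  // version 4, IHL 5
	pkt[3] = 44    // total length, low byte
	pkt[32] = 0x50 // TCP data offset 5 (byte 12 of the TCP header)
	fmt.Println(tcpPayloadLen(pkt)) // 4
}
```

An empty ack has a zero-length payload and is rejected by the `>= 4` check before it ever reaches user space.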
The results of this optimization are as follows:
| Contrast item | QPS | Loss rate | CPU utilization |
|---|---|---|---|
| Before improvement | 51275.11 | 0% | 153.32% |
| After improvement | 51246.02 | 0% | 142.58% |
Drawing on the experience above, we refactored the data collection side's code and made a number of other optimizations as well.
4 Final Results
The following is a comparison of the data before and after optimization: the loss rate dropped from a maximum of 60% to 0%, and CPU consumption dropped from a maximum of 6 cores to about 1 core.
To measure the performance overhead that packet capture imposes on MySQL, we ran a comparative performance test with Sysbench. The result data below shows that the feature costs at most about 6% on MySQL's TPS, QPS, and p99 response-time metrics.
5 Future Plans
Although we have made various optimizations to the packet-capture scheme, its performance cost is still too high for some latency-sensitive services, and it handles some special scenarios poorly: packet loss, retransmission, and out-of-order delivery at the TCP layer, and compression or very large SQL transfers at the MySQL protocol layer. Meanwhile, the industry has generally adopted the approach of modifying the MySQL kernel directly to output the full SQL, which also supports outputting more metric data. The database kernel team has completed the development of such a scheme, and it is being rolled out via gray release to replace the packet-capture scheme online. In addition, we will successively fill in the end-to-end loss-rate metric that is currently missing for online full SQL.
About the author
Su Han, from the Database Technology Center, Basic Technology Department, Meituan Basic R&D Platform.
Recruitment
The Database Technology Center of Meituan's Basic Technology Department is recruiting senior technical experts, based in Shanghai and Beijing. Meituan's relational database fleet is large and grows rapidly every year, carrying hundreds of billions of requests every day. Here you can experience the business challenges of high concurrency, high availability, and high scalability, keep pace with and drive the industry's cutting-edge technology, and realize the productivity gains that technical progress brings. Please send your resume to: [email protected].
This article was produced by the Meituan technical team and its copyright belongs to Meituan. You are welcome to reprint or use it for non-commercial purposes such as sharing and communication; please credit it as "reproduced from the Meituan technical team". It may not be reprinted or used commercially without permission. For any commercial activity, please email [email protected] to apply for authorization.