A Kong performance bug kept us up all night
2022-07-24 09:08:00 【Hua Weiyun】
Background
In Erda's technical architecture, we chose Kong as our API gateway. It offers high concurrency and low latency, and combined with the Kubernetes Ingress Controller and cloud-native declarative configuration, it supports a rich set of API policies.
A series of failed attempts
Parameter tuning
We tuned the Kong parameters NGINX_WORKER_PROCESSES, MEM_CACHE_SIZE, DB_UPDATE_FREQUENCY and WORKER_STATE_UPDATE_FREQUENCY, as well as postgres's work_mem and shared_buffers. However, this had no effect.
Clean up the data
Separating Kong instances for reads and writes
Once we were sure the admin interface was the cause, we decided to separate the Kong instances serving admin calls from those serving business traffic, hoping that admin calls would no longer affect normal business access. The expectation was that even if Kong's admin interface was slow, business access performance would not suffer.
However, this had no effect.
Migrating postgres to RDS
After our efforts at the Kong level proved fruitless, we also observed that during admin interface tests the number of postgres processes increased considerably and CPU utilization rose as well, so we decided to migrate pg to a more professional RDS service.
Still no effect.
Roll back
Finally, we rolled back to version 0.14 in pursuit of temporary “peace of mind”.
With that, the online attempts came to an end for the time being. They had also roughly revealed the conditions for reproducing the problem, so we decided to build an offline environment and continue looking for the cause.
Reproducing the problem
Import data
We backed up the postgres data from the problematic cluster and imported it into a new cluster:
psql -h 127.0.0.1 -U kong < kong.sql
We also enabled Kong's prometheus plugin so that the performance charts could be viewed in grafana:
curl -X POST http://10.97.4.116:8001/plugins --data "name=prometheus"
Phenomenon one
curl http://10.97.4.116:8001
Phenomenon two
Next, we simulated the poor business access performance seen online after admin interface calls. First, we called the admin interface to create a business API. For the test, we created one service and one route:
curl -i -X POST http://10.97.4.116:8001/services/ -d 'name=baidu2' -d 'url=http://www.baidu.com'
curl -i -X POST http://10.97.4.116:8001/services/baidu2/routes \
  -d "name=test2" \
  -d "paths[1]=/baidu2"
Afterwards, curl http://10.97.4.116:8000/baidu2 can be used to simulate a business request for testing.
Prepare an admin interface test script that creates and deletes a service/route, with a service list call in between.
#!/bin/bash
curl -i -X POST http://10.97.4.116:8001/services/ -d 'name=baidu' -d 'url=http://www.baidu.com'
curl -i -X POST http://10.97.4.116:8001/services/baidu/routes \
  -d "name=test" \
  -d "paths[1]=/baidu"
curl -s http://10.97.4.116:8001/services
curl -i -X DELETE http://10.97.4.116:8001/services/baidu/routes/test

for i in `seq 1 100`; do sh 1.sh ; done

curl http://10.97.4.116:8000/baidu2
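To put numbers on how slow the business interface gets while the admin loop runs, the request can be timed repeatedly. The following is a minimal sketch, not the script we used: the function names time_request and summarize_times and the sample count of 20 are assumptions made for illustration, and it relies only on curl's --write-out variable time_total.

```shell
#!/bin/bash
# Sketch: time the business endpoint while the admin test loop runs
# in another terminal. Function names are hypothetical.

# Print the total time in seconds of one GET request (curl's %{time_total}).
time_request() {
  curl -s -o /dev/null -w '%{time_total}\n' "$1"
}

# Read one timing per line on stdin; print sample count, average and maximum.
summarize_times() {
  awk 'NF { n++; sum += $1; if ($1 > max) max = $1 }
       END { if (n) printf "n=%d avg=%.3f max=%.3f\n", n, sum / n, max }'
}

# Usage (not executed here; 20 samples is an arbitrary choice):
#   for i in $(seq 1 20); do time_request http://10.97.4.116:8000/baidu2; done | summarize_times
```

Running the usage line once before and once during the admin loop gives two summaries that can be compared directly.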
Accompanying phenomenon
Both the CPU and the memory of the Kong instance keep rising, and this does not stop even after the admin interface calls end. Once memory rises to a certain level, the nginx worker process is OOM-killed and then restarted, which may be the reason for the slow access.
With KONG_NGINX_WORKER_PROCESSES set to 4 and the pod memory set to 4G, the overall pod memory is stable at 2.3G. However, during an admin interface test the pod memory keeps rising to more than 4G and triggers an OOM of the worker, so we raised the pod memory to 8G. On calling the admin interface again, the pod memory still rose, but it stopped at 4.11G. This seems to suggest that setting the pod memory to twice KONG_NGINX_WORKER_PROCESSES (8G for 4 workers) solves the problem, although an important question remains: why does a single admin interface call make the memory rise so much? In addition, when we keep calling the admin interface, the memory continues to grow and finally stabilizes at 6.9G.
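The memory figures above can be tracked from inside the container while the test runs. The sketch below is an assumption about how one might do this, not our original tooling: sum_rss_gb is a hypothetical helper, and summing per-process RSS counts shared pages more than once, so the result will not match the pod memory reported by Kubernetes exactly.

```shell
#!/bin/bash
# Sketch: add up resident memory values (KB, one per line on stdin)
# and print the total in GB. The function name is hypothetical.
sum_rss_gb() {
  awk '{ kb += $1 } END { printf "%.2f\n", kb / 1024 / 1024 }'
}

# Usage inside the Kong container (procps ps; not executed here):
#   while true; do ps -C nginx -o rss= | sum_rss_gb; sleep 5; done
```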
At this point, we can abstract the problem:
We continued to investigate what was taking up the memory:
pmap -x [pid]
We checked the memory distribution of the worker process twice. What changed is the part framed in the second picture. Judging from the addresses, the whole block of memory had been replaced, but exporting the memory data and running strings on it yielded no useful information for further investigation.
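Comparing two pmap snapshots by eye is tedious, so the comparison can be scripted. This is a rough sketch under stated assumptions: the two files are saved outputs of pmap -x for the same worker, the columns are in the usual order (Address, Kbytes, RSS, ...), and rss_growth is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: list mappings whose RSS grew between two saved `pmap -x` outputs.
# Usage: rss_growth before.txt after.txt
# Prints "<growth in KB> <address>", largest growth first. Mappings that
# exist only in the second snapshot are reported with their full RSS.
rss_growth() {
  awk '
    # Keep only mapping rows: hex address in column 1, numeric RSS in column 3.
    $1 !~ /^[0-9a-fA-F]+$/ || $3 !~ /^[0-9]+$/ { next }
    # First file: remember the RSS of each address.
    NR == FNR { before[$1] = $3; next }
    # Second file: report addresses whose RSS increased.
    { delta = $3 - before[$1]; if (delta > 0) print delta, $1 }
  ' "$1" "$2" | sort -rn
}

# Usage (not executed here):
#   pmap -x [pid] > before.txt; sh 1.sh; pmap -x [pid] > after.txt
#   rss_growth before.txt after.txt
```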

Conclusion
The problem has nothing to do with the Kong upgrade (0.14 --> 2.2.0); using version 2.2.0 directly shows the same problem. Every worker_state_update_frequency, Kong rebuilds the router in memory, and once the rebuild starts, memory goes up. After reading the code, we found that the problem lies in the Router.new method, which allocates an lrucache but never calls flush_all. However, after following the latest 2.8.1 version and releasing the lrucache, the problem still exists. In other words, it is some other logic in Kong's Router.new method that makes the memory rise.

This shows that Kong has a performance bug that still exists in the latest version. When the number of routes and services reaches a certain order of magnitude, calling the admin interface makes the memory of Kong's workers rise rapidly, which triggers OOM and in turn degrades business access performance. A temporary workaround is to reduce NGINX_WORKER_PROCESSES and increase the memory of the Kong pod, so that the memory needed after an admin interface call is available without triggering OOM and the business keeps working normally.
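On Kubernetes, the temporary workaround could be applied roughly as follows. This is only a sketch: print_mitigation is a hypothetical helper, the deployment name is an assumption, and the 2 GiB per worker rule is simply the empirical observation from our test (8G for 4 workers), not a guarantee. The function prints the kubectl commands instead of running them so they can be reviewed first.

```shell
#!/bin/bash
# Sketch: print (not run) the kubectl commands for the temporary workaround.
# Usage: print_mitigation <deployment-name> <worker-count>
# Memory limit = 2 GiB per worker, following the observation in this article.
print_mitigation() {
  local deploy="$1" workers="$2"
  local mem_gi=$(( workers * 2 ))
  echo "kubectl set env deployment/${deploy} KONG_NGINX_WORKER_PROCESSES=${workers}"
  echo "kubectl set resources deployment ${deploy} --limits=memory=${mem_gi}Gi"
}

# Usage (the deployment name "kong" is an assumption):
#   print_mitigation kong 2
```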