A Kong performance bug kept us up all night
2022-07-26 21:11:00 【Erda technical team】
Background
In Erda's technical architecture, we chose Kong as our API gateway. It offers high concurrency and low latency, and combined with the Kubernetes Ingress Controller and cloud-native declarative configuration, it supports a rich set of API policies.
In the cluster we delivered earliest, Kong was still on the relatively old 0.14 release. As the business's security requirements grew, we needed to implement security plugins on top of Kong to strengthen the system. Because 0.14 cannot use go-pluginserver to extend Kong's plugin mechanism, we had to upgrade Kong to the relatively recent 2.2.0.
We won't go through the upgrade itself here; we followed the official documentation step by step and it went smoothly. In the days after the upgrade, however, our SRE team was flooded with questions and even complaints: services deployed on the cluster were intermittently unreachable, and latency was very high.
A series of failed attempts
Parameter tuning
At first, to fix the problem quickly, we tuned Kong's NGINX_WORKER_PROCESSES, MEM_CACHE_SIZE, DB_UPDATE_FREQUENCY and WORKER_STATE_UPDATE_FREQUENCY parameters, as well as postgres's work_mem and shared_buffers.
It had no effect.
Cleaning up the data
For historical reasons, this cluster registers and deletes API data frequently, so it had accumulated more than 50,000 route and service records.
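A minimal sketch of how such a count could be checked directly in Kong's postgres database. It assumes the standard `services` and `routes` tables and reuses the host and user from the import command later in this post; the function name and the `DB_HOST` / `DB_USER` variables are hypothetical.

```shell
# Print the row count of Kong's services and routes tables.
# DB_HOST and DB_USER are illustrative overrides, not Kong settings.
kong_row_counts() {
  for table in services routes; do
    # -A: unaligned output, -t: rows only, -c: run a single command
    count=$(psql -h "${DB_HOST:-127.0.0.1}" -U "${DB_USER:-kong}" \
      -At -c "SELECT count(*) FROM ${table};")
    printf '%s %s\n' "$table" "$count"
  done
}
```

Calling `kong_row_counts` before and after a cleanup gives a quick view of how much data Kong still has to load.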
We suspected that the sheer volume of data was degrading performance, so we used Erda's own data to identify and delete stale records in Kong. The deletions were slow, and Kong's performance dropped sharply while they ran.
After several tests we concluded that simply calling the admin interface degrades Kong's performance. This closely matches an issue reported in the community:
https://github.com/Kong/kong/issues/7543
Read/write separation of Kong instances
Having confirmed that the admin interface was the trigger, we separated the Kong instances that serve admin calls from those that serve business traffic. The hope was that even if admin calls stayed slow, they would no longer affect the access performance of business traffic.
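One way such a split could be configured, sketched under assumptions: Kong's `proxy_listen` and `admin_listen` settings both accept `off`, so one group of instances can serve only proxy traffic and another only the Admin API. The helper names and listen addresses below are illustrative and are not taken from our deployment.

```shell
# Hypothetical helpers that print the environment for each role.
# KONG_PROXY_LISTEN and KONG_ADMIN_LISTEN are standard Kong settings;
# the addresses are illustrative, not our production values.

# Instances that serve business traffic only: Admin API disabled.
proxy_only_env() {
  printf '%s\n' \
    'KONG_PROXY_LISTEN=0.0.0.0:8000' \
    'KONG_ADMIN_LISTEN=off'
}

# Instances that serve the Admin API only: proxy disabled.
admin_only_env() {
  printf '%s\n' \
    'KONG_PROXY_LISTEN=off' \
    'KONG_ADMIN_LISTEN=0.0.0.0:8001'
}

proxy_only_env
admin_only_env
```

Both groups would still read the same postgres database, so every instance still sees each configuration change made through the admin instances.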
Again, it had no effect.
Migrating postgres to RDS
With our efforts at the Kong level exhausted, we also noticed that calling the admin interface caused the number of postgres processes and the CPU utilization to rise considerably, so we decided to migrate pg to a more professional RDS.
Still no effect.
Rolling back
In the end we rolled back to 0.14 for some temporary peace of mind.
That brought the online attempts to a close. They had at least roughly identified the conditions under which the problem recurs, so we decided to build an offline environment and keep looking for the cause.
Reproducing the problem
We imported a copy of the postgres data from the affected Kong into a development environment to simulate the sharp performance drop when the admin interface is called, and to look for a solution.
Importing the data
We backed up the postgres data from the affected cluster and imported it into a new cluster:
psql -h 127.0.0.1 -U kong < kong.sql
We also enabled Kong's prometheus plugin so that we could view performance charts in grafana:
curl -X POST http://10.97.4.116:8001/plugins --data "name=prometheus"
Symptom 1
Calling the admin service's / endpoint is just as slow, which is consistent with what we saw online: the more data there is, the longer a call to the admin / path takes.
curl http://10.97.4.116:8001
Symptom 2
Next we simulated what we saw online: poor business access performance after the admin interface is called. First we called the admin interface to create a business API. For the test we created one service and one route:
curl -i -X POST http://10.97.4.116:8001/services/ -d 'name=baidu2' -d 'url=http://www.baidu.com'
curl -i -X POST http://10.97.4.116:8001/services/baidu2/routes \
  -d "name=test2" \
  -d "paths[1]=/baidu2"
After that, curl http://10.97.4.116:8000/baidu2 can be used to simulate a business request.
Then we prepared a test script for the admin interface. It creates a service and a route and then deletes the route, with a service list call in between:
#!/bin/bash
# Create a service and a route through the admin interface
curl -i -X POST http://10.97.4.116:8001/services/ -d 'name=baidu' -d 'url=http://www.baidu.com'
curl -i -X POST http://10.97.4.116:8001/services/baidu/routes \
  -d "name=test" \
  -d "paths[1]=/baidu"
# List all services
curl -s http://10.97.4.116:8001/services
# Delete the route again
curl -i -X DELETE http://10.97.4.116:8001/services/baidu/routes/test
Then we ran the script repeatedly:
for i in `seq 1 100`; do sh 1.sh ; done
While the script was running, requests to the business interface became very slow, exactly as we had seen online.
curl http://10.97.4.116:8000/baidu2
PS: Even when the script is trimmed down to a single write or a single delete, the same symptom appears.
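To put numbers on "very slow", the business endpoint can be sampled while the admin loop runs. The following is a minimal sketch rather than the tooling we actually used; the function names are hypothetical, and it relies only on curl's `%{time_total}` write-out variable and awk.

```shell
# Time one request; prints the total time in seconds as reported by curl.
time_request() {
  curl -s -o /dev/null -w '%{time_total}\n' "$1"
}

# Summarise latencies read from stdin, one value per line.
summarize_latencies() {
  awk '
    { sum += $1
      if (NR == 1 || $1 < min) min = $1
      if (NR == 1 || $1 > max) max = $1 }
    END { if (NR) printf "n=%d min=%.3f avg=%.3f max=%.3f\n", NR, min, sum / NR, max }
  '
}

# Request a URL N times (default 30), one request per second, then summarise.
sample_latency() {
  url=$1
  n=${2:-30}
  i=0
  while [ "$i" -lt "$n" ]; do
    time_request "$url"
    i=$((i + 1))
    sleep 1
  done | summarize_latencies
}
```

Running `sample_latency http://10.97.4.116:8000/baidu2` once with the admin loop stopped and once while it is running gives two summaries that can be compared directly.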
Accompanying symptoms
- The CPU and memory of the Kong instance both keep rising, and this continues even after the admin calls have finished. Once memory rises far enough, the nginx worker process is OOM-killed and restarted, which may be why access is slow;
- With KONG_NGINX_WORKER_PROCESSES set to 4 and the pod memory set to 4G, the pod's overall memory normally stays around 2.3G. When the admin interface is called, pod memory keeps rising past 4G and triggers a worker OOM. We therefore raised the pod memory to 8G and called the admin interface again: pod memory still rose, but it stopped at 4.11G. This seems to suggest that setting the pod memory (in GB) to twice KONG_NGINX_WORKER_PROCESSES avoids the problem (though an important question remains: why does a single admin call make memory rise so much?);
- In addition, when we call the admin interface continuously, memory keeps growing and eventually stabilizes at 6.9G.
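The memory growth described above can also be watched from inside the pod. This sketch assumes a procps-style `ps` is available in the container and that worker processes are titled "nginx: worker process"; both function names are hypothetical.

```shell
# Sum the RSS (kB) of nginx worker processes from "rss args" lines on stdin.
# Matching on fields 2 and 3 avoids counting this awk process itself,
# whose own command line would otherwise contain the search text.
sum_worker_rss_kb() {
  awk '$2 == "nginx:" && $3 == "worker" { total += $1 } END { print total + 0 }'
}

# Print a timestamped total every INTERVAL seconds (default 5) until interrupted.
watch_worker_rss() {
  interval=${1:-5}
  while :; do
    printf '%s %s\n' "$(date +%T)" "$(ps -e -o rss= -o args= | sum_worker_rss_kb)"
    sleep "$interval"
  done
}
```

Leaving `watch_worker_rss` running while the admin test script executes shows whether the total keeps climbing after the admin calls stop.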
At this point we can state the problem abstractly:
Calling the Kong admin interface makes memory keep rising, which triggers an OOM that kills the worker, and business access ends up slow.
We went on to investigate what was taking up the memory:
We ran pmap -x [pid] twice to inspect the memory layout of a worker process. The part that changed is the region framed in the second screenshot; judging by the addresses, the whole memory region had been replaced. However, after dumping the memory and running strings on it, we found nothing useful for further investigation.
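Two `pmap -x` snapshots can also be compared mechanically instead of by eye. The sketch below assumes the usual procps column layout (Address, Kbytes, RSS, ...); the function name `pmap_growth` is hypothetical.

```shell
# Print "address delta_kB" for every mapping whose RSS grew between two
# saved `pmap -x` outputs. Header, separator and total lines are skipped
# because their first field is not a lowercase hex address.
pmap_growth() {
  awk '
    NR == FNR { if ($1 ~ /^[0-9a-f]+$/) before[$1] = $3; next }
    $1 ~ /^[0-9a-f]+$/ {
      delta = $3 - before[$1]   # mappings absent from the first snapshot count from 0
      if (delta > 0) print $1, delta
    }
  ' "$1" "$2"
}
```

A typical run would save `pmap -x [pid]` to one file before calling the admin interface and to another afterwards, then call `pmap_growth` with the two file names.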


Conclusion
- The problem has nothing to do with the Kong upgrade (0.14 --> 2.2.0); a fresh 2.2.0 installation has the same problem;
- Every worker_state_update_frequency, Kong rebuilds the router in memory, and memory starts to rise as soon as the rebuild begins. Reading the code, we traced the problem to the Router.new method, which allocates an lrucache but never calls flush_all. After releasing the lrucache as the latest 2.8.1 version does, the problem still exists;
- In other words, some other logic in Kong's Router.new method is causing the memory to rise;
- This shows that Kong has a performance bug that still exists in the latest version. Once the number of routes and services reaches a certain order of magnitude, calling the admin interface makes the memory of Kong's workers rise rapidly, and the resulting OOM degrades business access. A temporary workaround is to reduce NGINX_WORKER_PROCESSES and increase the memory of the Kong pod, so that the memory needed after an admin call is available without triggering an OOM and the business keeps working normally.
Finally, we will add these findings to the issue https://github.com/Kong/kong/issues/7543. You are welcome to follow it and join the discussion.
For more technical content, follow the 【Erda】 official account and grow together with fellow open source enthusiasts.