A Kong performance bug kept us up all night
2022-07-25 17:10:00 【Open source small e】
Background
In Erda's technical architecture we chose Kong as our API gateway. It offers high concurrency and low latency, and combined with the Kubernetes Ingress Controller and declarative, cloud-native configuration, it lets us implement rich API policies.
In our earliest delivered clusters, Kong was still on the fairly old 0.14 version. As the business raised its security requirements, we needed to build security plugins on top of Kong to give the platform stronger security capabilities. Since the old 0.14 version cannot use go-pluginserver to extend Kong's plugin mechanism, we had to upgrade Kong to the relatively new 2.2.0 version.
The upgrade process itself is not worth retelling; we basically followed the official documentation and upgraded smoothly step by step. But in the days after the upgrade, our SRE team received a steady stream of questions and even complaints: services deployed on the cluster were intermittently unreachable, with very high latency.
A series of failed attempts
Parameter tuning
At first, to fix the problem quickly, we carefully tuned Kong's NGINX_WORKER_PROCESSES, MEM_CACHE_SIZE, DB_UPDATE_FREQUENCY and WORKER_STATE_UPDATE_FREQUENCY parameters, as well as postgres's work_mem and shared_buffers.
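For context, all of these Kong knobs are exposed as KONG_-prefixed environment variables on the container, and the postgres settings can be changed with ALTER SYSTEM (assuming a superuser connection). The sketch below only shows the shape of the tuning, with illustrative values rather than the exact ones we tried:
# Kong side: tuning via environment variables (illustrative values)
export KONG_NGINX_WORKER_PROCESSES=4
export KONG_MEM_CACHE_SIZE=512m
export KONG_DB_UPDATE_FREQUENCY=5
export KONG_WORKER_STATE_UPDATE_FREQUENCY=5
# postgres side: work_mem reloads on the fly, shared_buffers only takes effect after a restart
psql -h 127.0.0.1 -U kong -c "ALTER SYSTEM SET work_mem = '64MB';"
psql -h 127.0.0.1 -U kong -c "ALTER SYSTEM SET shared_buffers = '2GB';"
psql -h 127.0.0.1 -U kong -c "SELECT pg_reload_conf();"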
However, no effect.
Cleaning up the data
For historical reasons, this cluster registers and deletes API data frequently, so it had accumulated more than 50,000 route and service records.
We suspected that this large amount of data was causing the performance degradation, so, cross-checking against Erda's data, we deleted the stale historical records in Kong. The deletion itself was slow, and while it ran, Kong's performance dropped sharply.
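The cleanup itself was driven by comparing against Erda's records, but in shape it is just paging through the admin API and deleting stale objects. A minimal sketch of that shape (the jq loop and the staleness decision here are hypothetical, not our actual tooling):
#!/bin/bash
# Walk the routes known to the admin API and delete the ones judged stale.
ADMIN=http://10.97.4.116:8001
curl -s "$ADMIN/routes?size=100" | jq -r '.data[].id' | while read -r route_id; do
  # In reality the decision comes from cross-checking against erda's own data.
  curl -s -X DELETE "$ADMIN/routes/$route_id"
done
# A service can only be deleted once all of its routes are gone:
# curl -s -X DELETE "$ADMIN/services/<service_id>"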
After several tests, we determined 「 Just call admin The interface of leads to kong Performance degradation 」 This conclusion , It is very similar to the problem of the community , Links are as follows :
https://github.com/Kong/kong/issues/7543
Kong instance read/write separation
Once we were sure the admin API was the trigger, we decided to split the admin-related work onto a separate Kong instance, hoping that admin calls would no longer affect normal business traffic: the admin API could stay slow, as long as business access performance was untouched.
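In practice the split is just two flavors of the same Kong deployment: the business-facing instances switch the admin listener off, and a single dedicated instance keeps it on. Roughly (an environment-variable sketch; the actual Deployment/Service wiring is omitted):
# Business-facing (proxy-only) instances: no admin API exposed at all
export KONG_PROXY_LISTEN="0.0.0.0:8000"
export KONG_ADMIN_LISTEN="off"
# Dedicated admin instance: admin API only, kept out of the business traffic path
export KONG_PROXY_LISTEN="off"
export KONG_ADMIN_LISTEN="0.0.0.0:8001"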
Again, no effect.
Migrating postgres to RDS
With the Kong-level efforts going nowhere, we also observed that while the admin API was being called, the postgres process got noticeably busier and its CPU usage rose, so we decided to migrate pg to a more professional RDS instance.
Still, no effect.
Rolling back
In the end we rolled back to version 0.14, buying ourselves some temporary "peace of mind".
At this point the online attempts had run their course, but we had a rough idea of the conditions under which the problem reproduces, so we decided to build an offline environment and keep digging for the root cause.
Reproducing the problem
We imported a copy of the problematic Kong cluster's postgres data into a development environment, reproduced the situation where "calling the admin API makes performance drop sharply", and looked for a fix.
Importing the data
We backed up the postgres data from the problematic cluster and imported it into a new cluster:
psql -h 127.0.0.1 -U kong < kong.sql
We also enabled Kong's prometheus plugin so we could watch the performance charts in grafana:
curl -X POST http://10.97.4.116:8001/plugins --data "name=prometheus"
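With the plugin enabled, Kong 2.x exposes the Prometheus metrics on the admin API's /metrics endpoint, so a quick sanity check before wiring up grafana looks like:
curl -s http://10.97.4.116:8001/metrics | head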
Phenomenon one
Calling / on the admin service is slow, consistent with what we saw online: with this amount of data, a call to the admin / endpoint takes noticeably longer.
curl http://10.97.4.116:8001
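To put a number on "slow", the latency can be measured directly with curl's timing output (a measurement snippet added here for illustration, not part of the original test):
# Print the total time of five consecutive requests to the admin root endpoint
for i in `seq 1 5`; do
  curl -s -o /dev/null -w "%{time_total}s\n" http://10.97.4.116:8001
done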
Phenomenon two
Next we simulated the situation we hit online, where business access performance degrades after the admin API is called. First we called the admin API to create a business API; for the test we created one service and one route:
curl -i -X POST http://10.97.4.116:8001/services/ -d 'name=baidu2' -d 'url=http://www.baidu.com'
curl -i -X POST http://10.97.4.116:8001/services/baidu2/routes \
  -d "name=test2" \
  -d "paths[1]=/baidu2"
Later we can use curl http://10.97.4.116:8000/baidu2 as the simulated business interface for testing.
We then prepared an admin API test script that creates and deletes a service/route, with a service list call interspersed in between.
#!/bin/bash
curl -i -X POST http://10.97.4.116:8001/services/ -d 'name=baidu' -d 'url=http://www.baidu.com'
curl -i -X POST http://10.97.4.116:8001/services/baidu/routes \
-d "name=test" \
-d "paths[1]=/baidu"
curl -s http://10.97.4.116:8001/services
curl -i -X DELETE http://10.97.4.116:8001/services/baidu/routes/test
Then we ran the script repeatedly:
for i in `seq 1 100`; do sh 1.sh ; done
While the script is running, accessing a business interface is very slow, exactly matching the phenomenon we saw online.
curl http://10.97.4.116:8000/baidu2
PS: the script can be streamlined; triggering only a single write, or a single delete, also reproduces the phenomenon.
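Our streamlined variant looked roughly like this; one write (or, equally, one delete) against the admin API is already enough:
#!/bin/bash
# Streamlined reproduction: a single admin write is enough to trigger the memory climb
curl -i -X POST http://10.97.4.116:8001/services/ -d 'name=baidu' -d 'url=http://www.baidu.com'
# a single DELETE of an existing service/route reproduces it as well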
Accompanying phenomena
- The Kong instance's CPU and memory both keep climbing, and the climb does not stop even after the admin API calls end. Once memory rises far enough, the nginx worker processes are OOM-killed and restarted, which may be what makes business access slow;
- With KONG_NGINX_WORKER_PROCESSES set to 4 and the pod memory limit at 4G, the pod's overall memory sits steadily around 2.3G. But once we call the admin API, pod memory keeps rising past 4G and the workers are OOM-killed, so we raised the pod memory to 8G. Calling the admin API again, pod memory still rose, but stopped at about 4.11G. This seems to say that if the pod memory (in GB) is set to roughly twice KONG_NGINX_WORKER_PROCESSES, the symptom goes away (though the bigger question remains: why does a single admin API call make memory grow so much?);
- In addition, if we keep calling the admin API, memory keeps growing and eventually stabilizes around 6.9G.
At this point we can abstract the problem:
Calling the Kong admin API causes memory to keep rising, which eventually triggers OOM kills of the workers and makes business access slow.
We kept digging into what was taking up the memory:
Using pmap -x [pid] we captured the worker process's memory map twice. The part that changed between the two captures is the region framed in the second screenshot; judging from the addresses, it is the same large block of memory that changed, but after dumping that memory and running strings over it, we found nothing useful for further investigation.
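For anyone who wants to repeat the inspection, the snapshots were taken with plain pmap against the nginx workers; a small helper along these lines (the pgrep pattern is an assumption about how the workers show up in ps):
#!/bin/bash
# Dump the memory map of every nginx worker so snapshots can be diffed later
for pid in $(pgrep -f "nginx: worker process"); do
  pmap -x "$pid" > "pmap.$pid.$(date +%s).txt"
done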
Conclusion
- The problem has nothing to do with the upgrade (0.14 --> 2.2.0); a cluster running 2.2.0 directly shows the same problem;
- Every worker_state_update_frequency interval, Kong rebuilds the router in memory, and each rebuild drives memory up. Reading the code, the suspect is the Router.new path, which allocates lrucache objects without ever calling flush_all; however, even after releasing the lrucache as in the latest 2.8.1 version, the problem persists;
- In other words, it is other logic inside Kong's Router.new that makes the memory rise;
- This means Kong has a performance bug, and it still exists in the latest version: once routes and services reach a certain order of magnitude, calling the admin API makes the workers' memory rise rapidly, OOM kills follow, and business access performance suffers. A temporary workaround is to reduce NGINX_WORKER_PROCESSES and increase the Kong pod's memory, so that the memory needed after an admin API call fits without triggering OOM and business traffic keeps working normally (a sketch of this workaround follows the list).
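As a concrete illustration of that workaround, assuming Kong runs as a Kubernetes Deployment named kong (the numbers are illustrative, not a recommendation):
# Fewer workers and a larger memory limit, so one admin call cannot push the pod into OOM
kubectl set env deployment/kong KONG_NGINX_WORKER_PROCESSES=2
kubectl set resources deployment/kong --requests=memory=8Gi --limits=memory=8Gi
kubectl rollout status deployment/kong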
Finally, we will add these observations to https://github.com/Kong/kong/issues/7543; you are welcome to keep following the issue and discuss with us~