A Kong performance bug that kept us up all night
2022-07-26 17:32:00 [51CTO]
Background
In Erda's technical architecture, we chose Kong as the API gateway. It offers high concurrency and low latency, and combined with the Kubernetes Ingress Controller and cloud-native declarative configuration, it supports rich API policies.
Our earliest delivered cluster still ran the fairly old Kong 0.14. As security requirements at the business level kept growing, we needed to implement security plugins on top of Kong to give the system stronger security capabilities. Since the older 0.14 cannot use go-pluginserver to extend Kong's plugin mechanism, we had to upgrade Kong to the relatively recent 2.2.0.
The upgrade itself went smoothly, step by step per the official documentation, and will not be detailed here. But in the days after the upgrade, our SRE team received a stream of questions and even complaints: services deployed on the cluster were intermittently unreachable, with very high latency.
A series of failed attempts
Parameter tuning
To fix the problem quickly, we first tuned Kong's NGINX_WORKER_PROCESSES, MEM_CACHE_SIZE, DB_UPDATE_FREQUENCY, and WORKER_STATE_UPDATE_FREQUENCY, along with Postgres's work_mem and shared_buffers.
However, no effect.
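For context, this tuning goes through environment variables on the Kong deployment; a sketch of the knobs we adjusted (the concrete values below are illustrative, not the exact numbers we used):

```yaml
# Kong container env in the Kubernetes Deployment -- illustrative values only
env:
  - name: KONG_NGINX_WORKER_PROCESSES
    value: "4"
  - name: KONG_MEM_CACHE_SIZE
    value: "1024m"
  - name: KONG_DB_UPDATE_FREQUENCY
    value: "5"        # seconds between polling the DB for config changes
  - name: KONG_WORKER_STATE_UPDATE_FREQUENCY
    value: "10"       # seconds between worker state syncs (router rebuilds)
```

On the Postgres side, work_mem and shared_buffers were raised correspondingly in postgresql.conf.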
Clean up the data
For historical reasons, this cluster registers and deletes API data frequently, which had accumulated more than 50,000 route and service records.
We suspected the performance degradation came from this data volume, so we cross-referenced Erda's data and deleted the historical records from Kong. The deletion itself was slow, and while it ran, Kong's performance dropped sharply.
After several tests we pinned down the conclusion that "merely calling the Admin API causes Kong's performance to degrade". It looks very similar to a community issue, linked below:
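For illustration, Kong's Admin API pages its list endpoints with an offset token, so the cleanup first has to walk every page to collect ids before issuing deletes. A minimal Python model of that pagination loop (the fetch_page callable stands in for a real GET /routes?size=... request, and the page contents are fabricated):

```python
# Models walking Kong Admin API list pages: each response carries a `data`
# array and, if more pages remain, an `offset` token for the next request.

def collect_ids(fetch_page):
    """Walk offset-based pages and return every object id."""
    ids, offset = [], None
    while True:
        page = fetch_page(offset)
        ids.extend(item["id"] for item in page["data"])
        offset = page.get("offset")
        if offset is None:
            return ids

# A fake two-page Admin API response for illustration:
pages = {
    None: {"data": [{"id": "r1"}, {"id": "r2"}], "offset": "tok"},
    "tok": {"data": [{"id": "r3"}]},
}
print(collect_ids(pages.__getitem__))  # ['r1', 'r2', 'r3']
```

In the real cleanup, each collected id then gets a DELETE /routes/{id} call; with ~50,000 records this walk alone means a great many Admin API calls, which is exactly what made performance drop further while the deletion ran.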
https://github.com/Kong/kong/issues/7543
Read/write separation of Kong instances
Once we confirmed the Admin API was the trigger, we split admin traffic onto a dedicated Kong instance, hoping admin calls would no longer affect normal business traffic: the Admin API could stay slow, as long as business access performance was unaffected.
Again, no effect.
Moving Postgres to RDS
After the fruitless efforts at the Kong level, we also observed during the Admin API tests that the Postgres process grew much busier and its CPU utilization climbed, so we decided to migrate the database to a more professional managed RDS.
Still, no effect.
Roll back
Finally, we rolled back to 0.14, in pursuit of some temporary "peace of mind".
With that, the online attempts came to an end. We had, however, roughly identified the conditions to reproduce the problem, so we decided to build an offline environment and keep hunting for the root cause.
Reproducing the problem
We imported a copy of the problematic Kong Postgres data into a development environment, to simulate the "calling the Admin API causes a sharp performance drop" situation and search for a solution.
Import data
We backed up the Postgres data from the problematic cluster and imported it into a new cluster:
We also enabled Kong's prometheus plugin, making it easy to view performance charts in Grafana:
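Enabling the plugin globally is a one-liner in Kong's declarative config (a sketch; how you apply it depends on your deployment):

```yaml
# kong.yml (declarative config) -- enable the prometheus plugin globally
_format_version: "2.1"
plugins:
  - name: prometheus
```

Metrics are then scraped from the /metrics endpoint for Grafana to chart.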
Phenomenon one
Calling the Admin API's / endpoint is just as slow as online: with a large amount of data, requests to the Admin / endpoint take significantly longer.

Phenomenon two
We then simulated the poor business access performance seen online after Admin API calls. First, we called the Admin API to create a business API; for the test we created one service and one route:
After that, curl http://10.97.4.116:8000/baidu2 simulates a business request for testing.
We prepared an Admin API test script that creates and then deletes a service/route pair, with a service list call interspersed in between.
Then we ran the script continuously:
While the script was running, accessing a business endpoint became extremely slow, exactly matching the online symptom.
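The script itself did not survive into this post, so here is a hypothetical reconstruction of the loop it describes (the admin address, names, and jq parsing are placeholders; it assumes a live Kong Admin API on port 8001):

```shell
#!/usr/bin/env sh
# Hypothetical reconstruction: create a service + route, list services,
# then delete both -- repeated forever to stress the Admin API.
ADMIN=http://10.97.4.116:8001
while true; do
  # create a service and a route on it
  sid=$(curl -s -X POST "$ADMIN/services" \
        -d name=perf-test -d url=https://www.baidu.com | jq -r '.id')
  curl -s -X POST "$ADMIN/services/$sid/routes" \
       -d 'paths[]=/perf-test' > /dev/null
  # intersperse a service list, which is what gets slow at ~50k objects
  curl -s "$ADMIN/services" > /dev/null
  # delete the route and the service again
  rid=$(curl -s "$ADMIN/services/$sid/routes" | jq -r '.data[0].id')
  curl -s -X DELETE "$ADMIN/routes/$rid"
  curl -s -X DELETE "$ADMIN/services/$sid"
done
```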

PS: with a trimmed-down script, a single write, or a single delete, is enough to trigger the phenomenon.
Accompanying phenomena
- The Kong instance's CPU and memory both keep climbing, and the climb does not stop even after the Admin API call finishes. Memory eventually rises until an nginx worker process is OOM-killed and restarted, which may be the reason access becomes slow;
- With KONG_NGINX_WORKER_PROCESSES set to 4 and the pod memory limit at 4G, overall pod memory stabilized at 2.3G; but after calling the Admin API, pod memory kept rising past 4G and triggered a worker OOM. We raised the pod memory to 8G and called the Admin API again: memory still rose, but topped out at 4.11G. This seems to mean the pod needs roughly twice its steady-state memory to absorb the spike, which works around the OOM (though the bigger question remains: why does a single Admin API call drive memory up this much);
- In addition, when we kept calling the Admin API continuously, memory kept growing and eventually stabilized at 6.9G.
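The sizing rule of thumb above can be written down as a back-of-envelope calculation. The per-worker spike figure below is an assumption back-derived from our own measurements (2.3G steady state, 4.11G peak with 4 workers), not a Kong internal:

```python
# Back-of-envelope sizing for the Kong pod memory limit: steady-state usage
# plus a rebuild spike for each worker, all workers spiking at once.

def required_pod_memory_gib(steady_state_gib: float, workers: int,
                            spike_gib_per_worker: float) -> float:
    """Steady state plus the worst case where every worker spikes at once."""
    return steady_state_gib + workers * spike_gib_per_worker

# With our observed numbers: 2.3 GiB steady, 4 workers, ~0.45 GiB spike each.
peak = required_pod_memory_gib(2.3, 4, 0.45)
print(round(peak, 2))  # 4.1 -> close to the 4.11G peak we measured
```

This is only a heuristic for picking a pod limit that survives the spike; it says nothing about why the spike happens.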
At this point we can abstract the problem:
Calling the Kong Admin API makes memory climb continuously until an OOM kill takes down the workers, and business access ends up slow.
We continued to investigate what was occupying the memory:
Using pmap -x [pid], I captured the worker process's memory map twice; what changed is the region boxed in the second screenshot. Judging by the addresses, the whole region had changed, but dumping that memory and running strings over it yielded no information useful for further investigation.
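Comparing two pmap snapshots by eye is tedious; here is a small sketch of diffing the captures programmatically (the snapshot text below is fabricated for illustration; real `pmap -x` output has Address / Kbytes / RSS / Dirty / Mode / Mapping columns):

```python
# Diff two `pmap -x <pid>` snapshots to see which mappings grew in RSS.

def parse_pmap(text):
    """Map address -> RSS KiB from `pmap -x` style lines."""
    rss = {}
    for line in text.strip().splitlines():
        parts = line.split()
        # keep only data lines, which start with a lowercase hex address
        if len(parts) >= 3 and all(c in "0123456789abcdef" for c in parts[0]):
            rss[parts[0]] = int(parts[2])
    return rss

def diff_rss(before, after):
    """Return mappings whose RSS grew, as address -> delta KiB."""
    a, b = parse_pmap(before), parse_pmap(after)
    return {addr: b[addr] - a.get(addr, 0)
            for addr in b if b[addr] > a.get(addr, 0)}

BEFORE = """
00007f0000000000  102400   51200   51200 rw---   [ anon ]
00007f0010000000    4096    1024    1024 rw---   [ anon ]
"""
AFTER = """
00007f0000000000  102400  102400  102400 rw---   [ anon ]
00007f0010000000    4096    1024    1024 rw---   [ anon ]
"""
print(diff_rss(BEFORE, AFTER))  # {'00007f0000000000': 51200}
```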


Conclusion
- The problem is unrelated to the Kong upgrade (0.14 → 2.2.0); a fresh 2.2.0 deployment exhibits it as well;
- Every worker_state_update_frequency interval, Kong rebuilds the router in memory, and each rebuild drives memory up. Reading the code, the problem lies in the Router.new method, which allocates an lrucache but never calls flush_all; even after swapping in the lrucache from the latest 2.8.1 release, the problem persists;
- In other words, memory climbs when the other logic inside Kong's Router.new method runs;

- This shows the problem is a performance bug in Kong that still exists in the latest version: once routes and services reach a certain order of magnitude, calling the Admin API makes Kong's worker memory rise rapidly, and the resulting OOMs degrade business access. As a temporary workaround, reduce NGINX_WORKER_PROCESSES and increase the Kong pod's memory, so that the memory needed after an Admin API call fits without triggering an OOM and the business keeps working normally.
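To make the conclusion concrete, here is a toy Python model (not Kong's actual Lua code) of why a router rebuild spikes memory: the new router and its fresh cache are fully built while the old ones are still referenced, so both copies are live at the peak.

```python
# Toy model of a per-rebuild cache allocation, as described for Router.new:
# each rebuild allocates a fresh cache and never flushes the one it replaces.

class Router:
    def __init__(self, routes):
        # models Router.new allocating a fresh lru-style cache per rebuild
        self.cache = {r: "match-data-%d" % r for r in routes}

def rebuild(current, routes):
    new = Router(routes)          # the old router is still referenced here...
    peak_entries = len(new.cache) + (len(current.cache) if current else 0)
    return new, peak_entries      # ...so the peak is old + new combined

router, peak = rebuild(None, range(50_000))
router, peak = rebuild(router, range(50_000))
print(peak)  # 100000 entries live at the rebuild peak
```

With ~50,000 routes and one such rebuild per worker, the spikes multiply across NGINX_WORKER_PROCESSES, which matches the workaround of fewer workers plus more pod memory.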
Finally, we will continue to follow up in https://github.com/Kong/kong/issues/7543.