A Kong performance bug that kept us up all night

2022-07-24 09:08:00 【Hua Weiyun】

Background

In Erda's technical architecture, we chose Kong as our API gateway. It offers high concurrency and low latency, and combined with the Kubernetes Ingress Controller and cloud-native declarative configuration, it supports a rich set of API policies.

In our earliest delivered cluster, Kong was still at the fairly old 0.14 version. As business requirements for security kept growing, we needed to implement security plugins on top of Kong to strengthen the system. Because version 0.14 cannot use go-pluginserver to extend Kong's plugin mechanism, we had to upgrade Kong to the relatively new 2.2.0 version.

We will not go through the upgrade process here; it went smoothly, step by step, following the official documentation. But in the days after the upgrade, our SRE team received a flood of questions and even complaints: businesses deployed on the cluster were intermittently inaccessible, and latency was very high.

A series of failed attempts

Parameter tuning

At first, to fix the problem quickly, we tuned Kong's NGINX_WORKER_PROCESSES, MEM_CACHE_SIZE, DB_UPDATE_FREQUENCY and WORKER_STATE_UPDATE_FREQUENCY parameters, as well as postgres's work_mem and shared_buffers.

It had no effect.

Clean up the data

For historical reasons, APIs are registered and deleted frequently in this cluster, so it had accumulated more than 50,000 route and service records.

We suspected that the large amount of data was causing the performance degradation, so we used Erda's data to delete the historical records in Kong. The deletion itself was slow, and Kong's performance dropped sharply while it ran.

After several tests we reached the conclusion that 「simply calling the admin interface degrades Kong's performance」. This closely resembles an issue already reported in the community:

https://github.com/Kong/kong/issues/7543

Separating Kong instances for reads and writes

Once we were sure the admin interface was the cause, we decided to separate the Kong instance serving admin from the one serving business traffic. The hope was that admin calls would no longer affect normal business access: the admin interface could stay slow as long as business access performance was unaffected.

Again, it had no effect.

Migrating postgres to RDS

After the efforts at the Kong level proved fruitless, we also observed that when the admin interface was called, the number of postgres processes increased considerably and CPU utilization rose as well, so we decided to migrate pg to the more professional RDS.

Still no effect.

Roll back

Finally, we rolled back to version 0.14 for some temporary "peace of mind".

With that, the online attempts came to an end for the time being. We had also roughly worked out the conditions for reproducing the problem, so we decided to build an offline environment and keep looking for the cause.

Reproducing the problem

We imported a copy of the postgres data from the problematic Kong into a development environment, to simulate the 「sharp performance drop when the admin interface is called」 and to look for a solution.

Import data

We backed up the postgres data from the problematic cluster and imported it into a new cluster:

psql -h 127.0.0.1 -U kong < kong.sql

We also enabled Kong's prometheus plugin, so that the performance charts can be viewed in grafana:

curl -X POST http://10.97.4.116:8001/plugins --data "name=prometheus"
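Besides the grafana charts, the plugin's text output can be read directly when a quick number is enough. The following is a minimal sketch, assuming the metrics are served at /metrics on the admin port and that a metric named kong_memory_workers_lua_vms_bytes is present; both are assumptions to check against your own Kong version, and sum_metric is a hypothetical helper name:

```shell
#!/bin/sh
# Sum every sample of one metric from Prometheus text-format input on stdin.
# Lines look like: metric_name{label="x"} 12345
# Assumes no trailing timestamp, so the value is the last field.
sum_metric() {
    awk -v name="$1" '
        /^#/ { next }                       # skip HELP/TYPE comment lines
        {
            metric = $1
            sub(/[{].*/, "", metric)        # strip the label set, if any
            if (metric == name) total += $NF
        }
        END { printf "%.0f\n", total }
    '
}

# Usage (endpoint path and metric name are assumptions):
#   curl -s http://10.97.4.116:8001/metrics | sum_metric kong_memory_workers_lua_vms_bytes
```

Summing across label sets gives one figure per sample, which is enough to see whether memory climbs during admin calls.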

Phenomenon one

Calling the root path / of the admin service is just as slow, consistent with what we saw online: when the amount of data is large, a call to the admin / endpoint takes longer.

curl http://10.97.4.116:8001

Phenomenon two

Next, we simulated the poor business access performance seen online after admin interface calls. First we called the admin interface to create a business API; for the test, we created one service and one route:

curl -i -X POST http://10.97.4.116:8001/services/ -d 'name=baidu2' -d 'url=http://www.baidu.com'
curl -i -X POST http://10.97.4.116:8001/services/baidu2/routes \
     -d "name=test2" \
     -d "paths[1]=/baidu2"

After that, curl http://10.97.4.116:8000/baidu2 can be used to simulate a business interface for testing.

Then we prepared an admin interface test script (saved as 1.sh) that creates a service and a route, lists the services in between, and deletes the route.

#!/bin/bash
curl -i -X POST http://10.97.4.116:8001/services/ -d 'name=baidu' -d 'url=http://www.baidu.com'
curl -i -X POST http://10.97.4.116:8001/services/baidu/routes \
     -d "name=test" \
     -d "paths[1]=/baidu"
curl -s http://10.97.4.116:8001/services
curl -i -X DELETE http://10.97.4.116:8001/services/baidu/routes/test

Then call the script repeatedly:

for i in `seq 1 100`; do sh 1.sh ; done

While the script is being called repeatedly, access a business interface. It is very slow, exactly as seen online.

curl  http://10.97.4.116:8000/baidu2

PS: Even when the script is trimmed down so that it triggers only a single write, or only a single delete, the same phenomenon occurs.
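To put numbers on "very slow", the business route can be timed from a second terminal while the admin loop runs. This is a rough sketch built on curl's %{time_total} write-out variable; time_request and summarize are hypothetical helper names, and the sample count in the usage line is arbitrary:

```shell
#!/bin/sh
# Print the total time, in seconds, of one GET request.
time_request() {
    curl -s -o /dev/null -w '%{time_total}\n' "$1"
}

# Read one duration per line on stdin; print count, average and maximum.
summarize() {
    awk '
        NF == 0 { next }                    # ignore blank lines
        { n++; sum += $1; if (n == 1 || $1 > max) max = $1 }
        END {
            if (n == 0) { print "no samples"; exit }
            printf "n=%d avg=%.3f max=%.3f\n", n, sum / n, max
        }
    '
}

# Usage: sample the business route 20 times, once per second.
#   for i in $(seq 1 20); do time_request http://10.97.4.116:8000/baidu2; sleep 1; done | summarize
```

Running it once with the admin loop idle and once with it active gives a before/after comparison of average and worst-case latency.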

Accompanying phenomenon

The CPU and memory of the Kong instance both keep rising, and this does not stop even after the admin interface calls have finished. Once memory has risen far enough, the nginx worker process is OOM-killed and then restarted, which may be the reason for the slow access;
With KONG_NGINX_WORKER_PROCESSES set to 4 and the pod memory set to 4G, overall pod memory is stable at about 2.3G. But when the admin interface is called, pod memory keeps rising past 4G and triggers a worker OOM. So we raised the pod memory to 8G and called the admin interface again: pod memory still rose, but it stopped at 4.11G. This seems to suggest that setting the pod memory (in G) to twice KONG_NGINX_WORKER_PROCESSES solves the problem (but an important question remains: why does a single admin interface call make memory rise this much?);
In addition, when the admin interface is called continuously, memory keeps growing and finally stabilizes at 6.9G.
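The memory curve described above can be recorded by sampling the resident set size of the worker processes while the admin calls run. This is a sketch assuming a Linux container where procps `ps` and `pgrep` are available and where the worker command line contains `nginx: worker process`; total_rss_kb, kb_to_gib and worker_rss are hypothetical helper names:

```shell
#!/bin/sh
# Sum the RSS column (KiB) of "pid rss" lines read from stdin.
total_rss_kb() {
    awk '{ total += $2 } END { printf "%.0f\n", total }'
}

# Convert KiB to GiB with two decimals.
kb_to_gib() {
    awk -v kb="$1" 'BEGIN { printf "%.2f\n", kb / 1024 / 1024 }'
}

# Print "pid rss" for every nginx worker process (assumed process title).
worker_rss() {
    for pid in $(pgrep -f 'nginx: worker process'); do
        ps -o pid= -o rss= -p "$pid"
    done
}

# Usage: one sample every 5 seconds while the admin script runs.
#   while true; do kb_to_gib "$(worker_rss | total_rss_kb)"; sleep 5; done
```

Note that this sums worker RSS only, so the figures will be somewhat lower than the pod-level memory quoted above.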

At this point we can summarize the problem:

Calling the 「Kong admin interface」 makes memory keep rising, which triggers OOM and gets the workers killed, and business access ends up slow.

We went on to investigate what was taking up the memory:

I ran pmap -x [pid] twice to check the memory distribution of a worker process. What changed is the part framed in the second picture; judging from the addresses, the whole block of memory had been replaced. But after dumping the memory and extracting the strings from it, there was no useful information for further investigation.
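Comparing two `pmap -x` snapshots by eye is tedious, so the mappings whose resident size changed can be listed mechanically. This is a sketch that assumes the usual `pmap -x` column order (Address, Kbytes, RSS, Dirty, Mode, Mapping) and two snapshots saved to files; pmap_rss_diff is a hypothetical helper name:

```shell
#!/bin/sh
# Compare two saved `pmap -x` outputs and print every mapping whose RSS
# changed, or which exists only in the second snapshot.
# Output columns: address, RSS before (KiB), RSS after (KiB), mapping name.
pmap_rss_diff() {
    awk '
        # Data rows start with a hexadecimal address; skip header and totals.
        $1 !~ /^[0-9a-fA-F]+$/ || NF < 5 { next }
        FILENAME == ARGV[1] { before[$1] = $3 + 0; next }   # first snapshot
        {
            old = ($1 in before) ? before[$1] : 0
            if (($3 + 0) != old) {
                name = ""
                for (i = 6; i <= NF; i++) name = name (i > 6 ? " " : "") $i
                if (name == "") name = "[unnamed]"
                printf "%s %d %d %s\n", $1, old, $3, name
            }
        }
    ' "$1" "$2"
}

# Usage:
#   pmap -x "$pid" > before.txt    # before calling the admin interface
#   pmap -x "$pid" > after.txt     # after calling it
#   pmap_rss_diff before.txt after.txt
```

Mappings that disappeared between the snapshots are not reported; the sketch only looks for growth and new regions.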

Conclusion

The problem has nothing to do with the Kong upgrade (0.14 --> 2.2.0); it also occurs on a fresh installation of version 2.2.0;
Every worker_state_update_frequency interval, Kong rebuilds the router in memory, and once the rebuild starts, memory goes up. After reading the code, we traced this to the Router.new method, which allocates an lrucache but never calls flush_all. However, after releasing the lrucache the way the latest 2.8.1 version does, the problem still exists;
In other words, some other logic in Kong's Router.new method is causing the memory to rise;

This shows that Kong has a performance bug, and that it still exists in the latest version: once the number of routes and services reaches a certain order of magnitude, calling the admin interface makes the memory of Kong's workers rise rapidly, which triggers OOM and degrades business access performance. A temporary workaround is to reduce NGINX_WORKER_PROCESSES and increase the memory of the Kong pod, so that the memory needed after an admin interface call is available without triggering OOM and the business keeps working normally.

Finally, we will add these findings to the issue https://github.com/Kong/kong/issues/7543. You are welcome to follow along and join the discussion.

Erda is now available under license on the cloud marketplace. Huawei Cloud product link: https://marketplace.huaweicloud.com/contents/a1210d06-82af-4552-915d-d3d9d10a13ea
