We were tossed all night by a Kong performance bug
2022-07-26 13:21:00 【Erda technical team】
This article is about 3,500 words. Estimated reading time: 9 minutes.
In Erda's technical architecture, we chose Kong as our API gateway. It offers high concurrency and low latency, and combined with the Kubernetes Ingress Controller and cloud-native declarative configuration, it supports a rich set of API policies.
Parameter tuning
We tuned the Kong parameters NGINX_WORKER_PROCESSES, MEM_CACHE_SIZE, DB_UPDATE_FREQUENCY and WORKER_STATE_UPDATE_FREQUENCY, as well as the postgres parameters work_mem and shared_buffers. However, it had no effect.
Clean up the data
Separating Kong instances for read and write
Once we were sure the admin interface was the cause, we separated the Kong instances that serve admin calls from those that serve business traffic. The hope was that admin calls would no longer affect normal business access: the admin interface could be slow, as long as business access performance was unaffected.
However, it had no effect.
Migrating postgres to RDS
After the efforts at the Kong level proved fruitless, we also observed that when the admin interface was called during testing, the number of postgres processes increased considerably and CPU utilization rose as well. So we decided to migrate pg to a more professional RDS.
Still no effect.
Roll back
Finally, we rolled back to version 0.14 for some temporary “peace of mind”.
With that, the online attempts came to an end for the time being. They had also roughly identified the conditions for reproducing the problem, so we decided to build an offline environment and keep looking for the cause.
Import data
We backed up the postgres data from the problematic cluster and imported it into a new cluster:
psql -h 127.0.0.1 -U kong < kong.sql
We also enabled Kong's prometheus plugin, so that the performance charts could be viewed in grafana:
curl -X POST http://10.97.4.116:8001/plugins --data "name=prometheus"
Phenomenon one
curl http://10.97.4.116:8001
Phenomenon two
Next we simulated the situation encountered online, where business access performance degrades after the admin interface is called. First we called the admin interface to create a business API. For the test, we created one service and one route:
curl -i -X POST http://10.97.4.116:8001/services/ -d 'name=baidu2' -d 'url=http://www.baidu.com'
curl -i -X POST http://10.97.4.116:8001/services/baidu2/routes \
  -d "name=test2" \
  -d "paths[1]=/baidu2"
We used curl http://10.97.4.116:8000/baidu2 to simulate access to the business interface during the test.
The admin calls were driven by the following script, run as 1.sh in the loop below. It creates a service and a route, lists the services, and then deletes the route:
#!/bin/bash
curl -i -X POST http://10.97.4.116:8001/services/ -d 'name=baidu' -d 'url=http://www.baidu.com'
curl -i -X POST http://10.97.4.116:8001/services/baidu/routes \
  -d "name=test" \
  -d "paths[1]=/baidu"
curl -s http://10.97.4.116:8001/services
curl -i -X DELETE http://10.97.4.116:8001/services/baidu/routes/test
for i in `seq 1 100`; do sh 1.sh ; done
curl http://10.97.4.116:8000/baidu2
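To put numbers on the slowdown, one option is to sample the response time of the business route while the admin loop above is running. The following is a minimal sketch, not the script used in this investigation. It reuses the test address from this article; SAMPLES, RUN_AGAINST_KONG and both function names are hypothetical. It relies only on curl's `%{time_total}` write-out variable.

```shell
#!/bin/bash
# Sketch: sample latency of the proxied route while admin calls are running.
# PROXY_URL is the test address from this article; SAMPLES and the
# RUN_AGAINST_KONG guard are hypothetical names chosen for illustration.
PROXY_URL="${PROXY_URL:-http://10.97.4.116:8000/baidu2}"
SAMPLES="${SAMPLES:-20}"

# Print the total time in seconds that curl spent on one request.
sample_latency() {
  curl -s -o /dev/null -w '%{time_total}\n' "$1"
}

# Read one latency per line on stdin and print min, average and max.
summarize_latencies() {
  awk 'NR == 1 { min = $1; max = $1 }
       { sum += $1; if ($1 < min) min = $1; if ($1 > max) max = $1 }
       END { if (NR > 0) printf "min=%.3f avg=%.3f max=%.3f\n", min, sum / NR, max }'
}

# Only touch the network when explicitly asked to.
if [ "${RUN_AGAINST_KONG:-0}" = "1" ]; then
  for n in $(seq 1 "$SAMPLES"); do
    sample_latency "$PROXY_URL"
  done | summarize_latencies
fi
```

Running it once before the admin loop and once during it (with RUN_AGAINST_KONG=1) gives two summaries that can be compared directly.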
Accompanying phenomenon
The CPU and memory of the Kong instance both keep rising, and this does not stop even after the admin interface calls have finished. Once memory rises far enough, the nginx worker process is OOM-killed and then restarted, which may be the reason for the slow access.
With KONG_NGINX_WORKER_PROCESSES set to 4 and the pod memory set to 4G, the overall pod memory stays stable at 2.3G. However, when the admin interface is called during testing, pod memory keeps rising to more than 4G and triggers a worker OOM, so we raised the pod memory to 8G. When we called the admin interface again, pod memory still rose, but it stopped at 4.11G. This seems to suggest that setting the pod memory to twice the value of KONG_NGINX_WORKER_PROCESSES (8G for 4 workers) solves the problem. An important question remains, though: why does a single admin interface call make memory rise so much? In addition, when we kept calling the admin interface, memory continued to grow and eventually stabilized at 6.9G.
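Memory figures like the ones above can also be followed per worker from inside the pod. The sketch below is one way to do that, assuming a procps-style `ps` is available in the Kong container; the function name and the RUN_IN_KONG_POD guard are hypothetical. RSS counts shared pages once per process, so the total can overstate what the pod is actually charged for.

```shell
#!/bin/bash
# Sketch: track the summed resident memory of the nginx worker processes.
# Assumes a procps-style ps; RUN_IN_KONG_POD is a hypothetical guard.

# Read "RSS_KB COMMAND..." lines on stdin and print the total RSS of all
# nginx worker processes in MiB (integer, rounded down).
sum_worker_rss_mib() {
  awk '/nginx: worker process/ { kb += $1 } END { printf "%d\n", kb / 1024 }'
}

# Print one sample every 5 seconds until interrupted.
if [ "${RUN_IN_KONG_POD:-0}" = "1" ]; then
  while true; do
    printf '%s workers_rss_mib=%s\n' "$(date +%T)" \
      "$(ps -e -o rss= -o args= | sum_worker_rss_mib)"
    sleep 5
  done
fi
```

Left running while the admin interface is called, it shows whether the growth is spread across all workers or concentrated in one.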
At this point, we can abstract the problem:
We continued to investigate what was taking up the memory:
pmap -x [pid]
We checked the memory distribution of the worker process twice. What changed is the part framed in the second picture. Judging from the addresses, that whole block of memory had changed, but after exporting the memory data and extracting the strings from it, we found no useful information for further investigation.

The problem has nothing to do with the Kong upgrade (0.14 --> 2.2.0): using version 2.2.0 directly also shows the problem.
Every worker_state_update_frequency interval, Kong rebuilds the router in memory, and once the rebuild starts, memory goes up. Looking at the code, the problem is in the Router.new method, which allocates an lrucache but never calls flush_all. After we released the lrucache the way the latest 2.8.1 version does, the problem still existed. In other words, the memory rise comes from other logic in Kong's Router.new method.

This shows that Kong has a performance bug, and that it still exists in the latest version. Once routes and services reach a certain order of magnitude, calling the admin interface makes the memory of Kong's workers rise rapidly until OOM, which degrades business access performance. As a temporary workaround, reduce NGINX_WORKER_PROCESSES and increase the memory of the Kong pod, so that the memory needed after an admin interface call is available without triggering OOM and normal business use is ensured.
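On Kubernetes, the workaround can be applied with two kubectl commands. The sketch below is illustrative only: the deployment name, the namespace and the APPLY_TO_CLUSTER guard are hypothetical, and the sizing function encodes nothing more than the rough ratio observed in this article (about 2G of pod memory per worker, here written as Gi). It is a starting point to verify against your own memory curves, not a guarantee.

```shell
#!/bin/bash
# Sketch: lower the worker count and raise the pod memory limit.
# DEPLOYMENT, NAMESPACE and APPLY_TO_CLUSTER are hypothetical names.
WORKERS="${WORKERS:-4}"
DEPLOYMENT="${DEPLOYMENT:-kong}"
NAMESPACE="${NAMESPACE:-default}"

# Rule of thumb taken from the observation in this article:
# pod memory limit (Gi) = 2 x number of nginx workers.
memory_limit_gi() {
  echo $(( $1 * 2 ))
}

# Only change the cluster when explicitly asked to.
if [ "${APPLY_TO_CLUSTER:-0}" = "1" ]; then
  kubectl -n "$NAMESPACE" set env "deployment/$DEPLOYMENT" \
    "KONG_NGINX_WORKER_PROCESSES=$WORKERS"
  kubectl -n "$NAMESPACE" set resources deployment "$DEPLOYMENT" \
    "--limits=memory=$(memory_limit_gi "$WORKERS")Gi"
fi
```

Since continuous admin calls pushed memory to 6.9G with 4 workers in the test above, it is worth leaving more headroom than this ratio suggests if the admin interface is called frequently.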
———
Erda GitHub: https://github.com/erda-project/erda
Erda Cloud official website: https://www.erda.cloud/

This article is from the WeChat official account Erda (gh_0f507c84dfb0).