A runc hang causes Kubernetes nodes to go NotReady
2022-07-05 01:39:00 [UPYUN (又拍云)]
Kubernetes 1.19.3
OS: CentOS 7.9.2009
Kernel: 5.4.94-1.el7.elrepo.x86_64
Docker: 20.10.6
Conclusion first: runc v1.0.0-rc93 has a bug that can cause Docker to hang.
Discovering the problem
An online alert reported that 2-3 K8s nodes were in the NotReady state, and that the state persisted.
- kubectl describe node showed events related to NotReady.
- After logging in to an affected machine and checking the node load, everything looked normal.
- The kubelet log showed that PLEG was taking too long, which caused the node to be marked NotReady.
- docker ps responded normally.
- Checking processes with ps showed several runc init processes. runc is the OCI runtime program that containerd calls when starting a container. The initial suspicion was that Docker had hung.
There are two ways to solve this problem. Let's look at Solution A first.
Solution A
For the Docker hang, a search turned up the following two articles, which describe a similar problem:
Troubleshooting a docker hang [https://www.likakuli.com/posts/docker-hang/]
Docker hang analysis series (1): insufficient pipe capacity [https://juejin.cn/post/6891559762320703495]
Both articles attribute the hang to insufficient pipe capacity: runc init gets stuck writing to a pipe, and relaxing the limit in /proc/sys/fs/pipe-user-pages-soft solves the problem.
So we checked /proc/sys/fs/pipe-user-pages-soft on the affected host and found it set to 16384. We increased it tenfold with echo 163840 > /proc/sys/fs/pipe-user-pages-soft. However, kubelet still did not recover: the PLEG error log kept appearing and the runc init processes did not exit.
Since runc init is created by kubelet through the CRI interface, runc init probably has to exit before kubelet can return. According to the articles, runc init can exit once the contents of the corresponding pipe have been read. Following the UNIX/Linux principle that "everything is a file", we used lsof -p to list the handles opened by runc init, found the numbers of the write-type pipes (there may be several), and ran cat /proc/$pid/fd/$id on each in turn to read the pipe contents. After a few attempts, runc init did exit.
On rechecking, the node state had switched to Ready and the PLEG error log had disappeared. No node went NotReady during a day of observation, so the problem was (temporarily) solved.
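The cat step above can also be sketched in C. This is a minimal sketch, not what was run on the affected hosts: drain_fd is a hypothetical helper name, and unlike cat it stops as soon as the pipe is empty instead of waiting for EOF.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: read and discard everything currently buffered in fd.
 * Returns the number of bytes drained, or -1 on error. */
long drain_fd(int fd)
{
    char buf[4096];
    long total = 0;

    /* Switch to non-blocking mode so the loop stops once the pipe is empty. */
    int flags = fcntl(fd, F_GETFL);
    if (flags == -1 || fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1)
        return -1;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);
        if (n > 0) {
            total += n;             /* freed this much room for the writer */
        } else if (n == 0) {
            break;                  /* every writer has closed its end: EOF */
        } else if (errno == EAGAIN) {
            break;                  /* nothing buffered right now */
        } else if (errno != EINTR) {
            return -1;
        }
    }
    return total;
}
```

To use it against a stuck process, the fd would come from opening /proc/$pid/fd/$id read-only. Because a blocked writer refills the pipe as soon as room appears, the call may need to be repeated, which matches the "few attempts" described above.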
Doubts about Solution A
Although the problem was solved, a careful reading of the documentation for /proc/sys/fs/pipe-user-pages-soft suggests that this parameter does not quite match the root cause.
pipe-user-pages-soft limits the pipe capacity available to a user who has neither CAP_SYS_RESOURCE nor CAP_SYS_ADMIN. With the default value of 16384 pages, such a user can create at most 1024 pipes at the default capacity of 16 pages each.
This raises several doubts:
1. dockerd, containerd, kubelet and the other components run as root, and runc init is still in the container initialization stage, so in theory it should not have used up 1024 pipes. pipe-user-pages-soft should therefore have no effect on the Docker hang, yet the problem disappeared after the parameter was increased. That is unexplained.
2. Pipe capacity is fixed; a user cannot declare the capacity when creating a pipe. On the affected hosts the pipes had indeed been created. If the capacity is fixed, exceeding pipe-user-pages-soft should not make a pipe unwritable. Could it be that newly created pipes get a smaller capacity, so that data which used to fit no longer fits?
3. pipe-user-pages-soft was increased tenfold. Would doubling it have been enough? What is the most appropriate value?
Investigation
The most direct way to locate the problem is to read the source code.
First, check the Linux kernel code related to pipe-user-pages-soft. The online kernel version is 5.4.94-1, so we switched to the corresponding version to search.
static bool too_many_pipe_buffers_soft(unsigned long user_bufs)
{
    unsigned long soft_limit = READ_ONCE(pipe_user_pages_soft);
    return soft_limit && user_bufs > soft_limit;
}

struct pipe_inode_info *alloc_pipe_info(void)
{
    ...
    unsigned long pipe_bufs = PIPE_DEF_BUFFERS; // #define PIPE_DEF_BUFFERS 16
    ...
    if (too_many_pipe_buffers_soft(user_bufs) && is_unprivileged_user()) {
        user_bufs = account_pipe_buffers(user, pipe_bufs, 2);
        pipe_bufs = 2;
    }
    if (too_many_pipe_buffers_hard(user_bufs) && is_unprivileged_user())
        goto out_revert_acct;
    pipe->bufs = kcalloc(pipe_bufs, sizeof(struct pipe_buffer),
                         GFP_KERNEL_ACCOUNT);
    ...
}
When a pipe is created, the kernel uses too_many_pipe_buffers_soft to check whether the current user has exceeded the pipe capacity it is allowed to use. If so, the capacity is reduced from 16 * PAGE_SIZE to 2 * PAGE_SIZE. Running getconf PAGESIZE on the machine shows that PAGESIZE is 4096 bytes. So a pipe is normally 16 * 4096 = 65536 bytes, but once the limit is exceeded it is reduced to 2 * 4096 = 8192 bytes. Data may then no longer fit into the pipe in a single write, which largely confirms the conjecture in doubt 2.
With that, the logic around pipe-user-pages-soft is clear and relatively easy to understand.
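The sizing decision and the arithmetic above can be modeled in user space. This is only a sketch of the quoted snippet, not kernel code: new_pipe_buffers and pipe_capacity_bytes are hypothetical names, and the constants 16 and 2 are taken from the excerpt above.

```c
#include <stdbool.h>

#define PIPE_DEF_BUFFERS     16 /* default page-sized buffers per pipe (from the excerpt) */
#define PIPE_REDUCED_BUFFERS 2  /* reduced size once the soft limit is exceeded (from the excerpt) */

/* Model of the soft-limit check: user_bufs is the per-user total of pipe
 * buffers that the kernel passes to too_many_pipe_buffers_soft(). A
 * soft_limit of 0 means the limit is disabled. */
unsigned long new_pipe_buffers(unsigned long user_bufs,
                               unsigned long soft_limit,
                               bool unprivileged)
{
    bool over_soft = soft_limit != 0 && user_bufs > soft_limit;
    return (over_soft && unprivileged) ? PIPE_REDUCED_BUFFERS
                                       : PIPE_DEF_BUFFERS;
}

/* Capacity in bytes of a pipe made of `buffers` page-sized buffers. */
unsigned long pipe_capacity_bytes(unsigned long buffers,
                                  unsigned long page_size)
{
    return buffers * page_size;
}
```

With a 4096-byte page, pipe_capacity_bytes(16, 4096) gives 65536 and pipe_capacity_bytes(2, 4096) gives 8192. The default soft limit of 16384 pages divided by 16 buffers per pipe is where the figure of 1024 default-sized pipes comes from.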
So the question becomes: why does the pipe usage of the container's root user exceed the limit?
Reproducing it 100% of the time
The first step towards the root cause is usually to reproduce the problem in an offline environment.
Because the online environment had already been repaired urgently with Solution A, the problem could no longer be analyzed online. We needed a way to reproduce it reliably.
After some searching, we found the same problem reported in an issue, which can be reproduced as follows.
https://github.com/containerd/containerd/issues/5261
echo 1 > /proc/sys/fs/pipe-user-pages-soft
while true; do docker run -itd --security-opt=no-new-privileges nginx; done
After running the commands above, runc init gets stuck immediately, consistent with what we saw online. Use lsof -p to view the file handles opened by runc init:
[Image: lsof output for the runc init process; not available]
fd4, fd5 and fd6 are all of type pipe. fd4 and fd6 have the same number, 415841, so they are the same pipe. How can we get the pipe size to verify the conjecture in doubt 2? Linux has no ready-made tool for reading a pipe's size, but the kernel exposes it through fcntl(fd, F_GETPIPE_SZ). The code is as follows:
#define _GNU_SOURCE  // exposes the Linux-specific F_GETPIPE_SZ in <fcntl.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "usage: %s /proc/<pid>/fd/<fd>\n", argv[0]);
        return 1;
    }
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open failed");
        return 1;
    }
    // Ask the kernel for the capacity of the pipe behind this handle.
    long pipe_size = (long)fcntl(fd, F_GETPIPE_SZ);
    if (pipe_size == -1) {
        perror("get pipe size failed");
        close(fd);
        return 1;
    }
    printf("pipe size: %ld\n", pipe_size);
    close(fd);
    return 0;
}
After compiling the program and running it against the handles, we checked the pipe sizes.
fd4 and fd6 correspond to the same pipe, and the capacity obtained is 8192 = 2 * PAGESIZE. So the pipe capacity was indeed reduced to 2 * PAGESIZE because the soft limit had been exceeded.
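To see what an 8192-byte pipe means for a writer, the effect can be imitated without touching the soft limit. The sketch below is an illustration under assumptions, not part of the original investigation: bytes_until_full is a hypothetical helper that shrinks a fresh pipe with F_SETPIPE_SZ and writes to it with no reader until the kernel reports the pipe is full.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* glibc only exposes these Linux-specific commands with _GNU_SOURCE;
 * fall back to the values from the kernel headers if they are missing. */
#ifndef F_SETPIPE_SZ
#define F_SETPIPE_SZ 1031
#endif
#ifndef F_GETPIPE_SZ
#define F_GETPIPE_SZ 1032
#endif

/* Hypothetical helper: resize a new pipe to `requested` bytes, then write
 * until it is full. Stores the capacity the kernel actually granted in
 * *capacity and returns the number of bytes that fit, or -1 on error. */
long bytes_until_full(int requested, int *capacity)
{
    int p[2];
    char chunk[1024];
    long written = 0;

    if (pipe(p) == -1)
        return -1;
    /* Non-blocking write end: a full pipe reports EAGAIN instead of hanging. */
    if (fcntl(p[1], F_SETFL, O_NONBLOCK) == -1 ||
        fcntl(p[1], F_SETPIPE_SZ, requested) == -1 ||
        (*capacity = fcntl(p[1], F_GETPIPE_SZ)) == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    memset(chunk, 'x', sizeof chunk);
    for (;;) {
        ssize_t n = write(p[1], chunk, sizeof chunk);
        if (n > 0) {
            written += n;
        } else if (errno == EAGAIN) {
            break;          /* full: a blocking writer would be stuck here */
        } else if (errno != EINTR) {
            written = -1;
            break;
        }
    }
    close(p[0]);
    close(p[1]);
    return written;
}
```

Calling bytes_until_full(8192, &cap) imitates the reduced pipe. The kernel rounds the request to its own granularity, so cap reports what was actually granted. Anything a writer wants to send beyond the returned byte count has to wait for a reader.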
With the immediate problem resolved by Solution A, let's look at Solution B.
Solution B
https://github.com/opencontainers/runc/pull/2871
The bug was introduced in runc v1.0.0-rc93 and fixed in v1.0.0-rc94 by the PR above. So how do we fix it online? Do all Docker components need to be upgraded?
Upgrading dockerd, containerd and runc together would require moving the business workloads away first; the whole process is fairly complex and risky. In this case the only faulty component is runc, and only newly created containers are affected. So the natural question is whether runc can be upgraded on its own.
Because dockershim has not yet been deprecated in Kubernetes v1.19, the full call chain for running a container is: kubelet → dockerd → containerd → containerd-shim → runc → container. Unlike dockerd and containerd, which are servers running in the background, runc is simply a binary that containerd-shim invokes to start a container. So we only need to upgrade runc, and newly created containers will be started with the new version.
Verification in the test environment showed that runc init no longer gets stuck. Finally, we gradually upgraded runc online to v1.1.1 and set /proc/sys/fs/pipe-user-pages-soft back to its original default value. The runc hang was resolved.
Analysis & summary
What did the PR fix?
Cause of the bug. When a container is started with no-new-privileges, runc has to unload the loaded BPF code and then reload the patched BPF code. By design, the loaded BPF code must be obtained first, and only then can it be used to call the unload interface. To obtain the BPF code, libseccomp provides the seccomp_export_bpf function, and runc passes it a pipe as the fd that receives the code. seccomp_export_bpf is synchronous and blocking, and it writes the code into that fd. So if the pipe is too small, it fills up, the rest of the BPF code cannot be written, and the call gets stuck.
Fix in the PR. Start a goroutine that reads the pipe contents promptly, instead of waiting for all the data to be written before reading.
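The idea behind the fix, a reader that drains the pipe while the writer is still writing, can be illustrated in C. This is an analogy under assumptions, not runc's code: runc is written in Go and uses a goroutine, while this sketch uses a forked child process as the concurrent reader, and write_with_concurrent_reader is a hypothetical name.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Push `len` bytes through a pipe while a child process drains it.
 * Returns 0 if the child received exactly `len` bytes, -1 otherwise. */
int write_with_concurrent_reader(size_t len)
{
    int p[2];
    if (pipe(p) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(p[0]);
        close(p[1]);
        return -1;
    }

    if (pid == 0) {
        /* Reader: consume data as it arrives, until EOF. */
        char buf[4096];
        size_t got = 0;
        close(p[1]);
        for (;;) {
            ssize_t n = read(p[0], buf, sizeof buf);
            if (n > 0)
                got += (size_t)n;
            else if (n == 0)
                break;
            else if (errno != EINTR)
                _exit(1);
        }
        _exit(got == len ? 0 : 1);
    }

    /* Writer: plain blocking writes. They can exceed the pipe capacity
     * because the reader keeps freeing room. */
    char chunk[4096];
    size_t left = len;
    int rc = 0;
    close(p[0]);
    memset(chunk, 'x', sizeof chunk);
    while (left > 0) {
        size_t want = left < sizeof chunk ? left : sizeof chunk;
        ssize_t n = write(p[1], chunk, want);
        if (n > 0) {
            left -= (size_t)n;
        } else if (errno != EINTR) {
            rc = -1;
            break;
        }
    }
    close(p[1]); /* signals EOF to the reader */

    int status;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return (rc == 0 && WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

A call such as write_with_concurrent_reader(1 << 20) sends 1 MiB, far more than any default pipe holds, and still completes because the reader runs concurrently. Without that reader, the same blocking write would stall once the pipe filled up, which is the hang described above.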
Why is the limit exceeded?
The container's root user has UID 0, and so does the host's root user. When the kernel accounts for pipe usage, it treats them as the same user and does not distinguish between them. So when runc init requests a pipe, the kernel determines that the current process is unprivileged and looks up the pipe usage of the user with UID 0. Because that figure is the sum over all UID 0 users (including those in containers), it exceeds the limit in /proc/sys/fs/pipe-user-pages-soft, even though the container root user's own pipe usage does not. This explains the doubt raised earlier about how a root user could exceed the limit.
To summarize: the operating system applies a soft limit through pipe-user-pages-soft. Because the container's root user shares UID 0 with the host's root user, the kernel does not distinguish between them when accounting for pipe usage. Once the total pipe usage of UID 0 exceeds the soft limit, newly allocated pipes get a smaller capacity. With runc 1.0.0-rc93, the pipe is then too small for the data to be written completely, so the write blocks and waits indefinitely. As a result runc init gets stuck, kubelet's PLEG becomes unhealthy, and the node goes NotReady.
The fix: runc reads the pipe contents promptly in a goroutine, which prevents the write from blocking.
References
https://iximiuz.com/en/posts/container-learning-path/
https://medium.com/@mccode/understanding-how-uid-and-gid-work-in-docker-containers-c37a01d01cf
https://man7.org/linux/man-pages/man7/pipe.7.html
https://gist.github.com/cyfdecyf/1ee981611050202d670c
https://github.com/containerd/containerd/issues/5261
https://github.com/opencontainers/runc/pull/2871