A runc hang causes Kubernetes nodes to go NotReady

2022-07-05 01:39:00 Shoot cloud again

Kubernetes: 1.19.3

OS: CentOS 7.9.2009

Kernel: 5.4.94-1.el7.elrepo.x86_64

Docker: 20.10.6

The conclusion first: runc v1.0.0-rc93 has a bug that can cause Docker to hang.

Finding the problem

A production alert showed that 2-3 K8s nodes were in the NotReady state, and that the state persisted.

  • kubectl describe node shows NotReady-related events.

  • After logging in to an affected machine and checking the node load, everything looked normal.

  • The kubelet log shows that PLEG is taking too long, which causes the node to be marked NotReady.

  • docker ps works normally.

  • Running ps to list the processes shows several runc init processes. runc is the OCI runtime that containerd invokes when it starts a container. The initial suspicion was that Docker had hung.

There are two ways to solve this problem. Let's look at Solution A first.

Solution A

For this kind of Docker hang, a search turned up two articles that describe similar problems:

  • Troubleshooting a docker hang [https://www.likakuli.com/posts/docker-hang/]

  • Docker hang analysis series (1): insufficient pipe capacity [https://juejin.cn/post/6891559762320703495]

Both articles attribute the hang to insufficient pipe capacity, which causes runc init to block while writing to a pipe, and both report that relaxing the limit in /proc/sys/fs/pipe-user-pages-soft resolves it.

So we checked /proc/sys/fs/pipe-user-pages-soft on the affected host and found it set to 16384. We increased it tenfold with echo 163840 > /proc/sys/fs/pipe-user-pages-soft. However, kubelet still did not recover: the PLEG error log continued, and the runc init processes did not exit.

Since runc init is created by kubelet through the CRI interface, runc init probably has to exit before kubelet's call can return. According to the articles, reading out the contents of the corresponding pipe is enough to let runc init exit. Because of the UNIX/Linux principle that "everything is a file", the pipe can be read directly: use lsof -p to list the handles opened by runc init, find the numbers of the pipes opened for writing (there may be several), and run cat /proc/$pid/fd/$id on each in turn to read the pipe contents. After a few attempts, runc init did exit.
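The manual lsof and cat steps above can be sketched in a few lines of Python. This is only an illustration under assumptions: the helper names list_pipe_fds and drain_pipe are invented here, the sketch lists every pipe handle rather than only those opened for writing, and reading another process's handles normally requires root.

```python
import os


def list_pipe_fds(pid):
    """Map each fd of `pid` that refers to a pipe to its link target, e.g. 'pipe:[415841]'."""
    fd_dir = f"/proc/{pid}/fd"
    pipes = {}
    for name in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, name))
        except OSError:
            continue  # the fd was closed while we were scanning
        if target.startswith("pipe:"):
            pipes[int(name)] = target
    return pipes


def drain_pipe(pid, fd, chunk_size=65536):
    """Read whatever is currently buffered in the pipe behind /proc/<pid>/fd/<fd>."""
    # O_NONBLOCK keeps the sketch from hanging when the pipe is empty.
    handle = os.open(f"/proc/{pid}/fd/{fd}", os.O_RDONLY | os.O_NONBLOCK)
    chunks = []
    try:
        while True:
            try:
                data = os.read(handle, chunk_size)
            except BlockingIOError:
                break  # nothing buffered right now
            if not data:
                break  # every writer has closed its end
            chunks.append(data)
    finally:
        os.close(handle)
    return b"".join(chunks)
```

A single drain only empties what is buffered at that moment. A blocked writer refills the pipe as soon as space frees up, which matches the "few attempts" needed above.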

On checking again, the node state had switched to Ready and the PLEG error log was gone. After a day of observation, no node went NotReady again, so the problem was (temporarily) solved.

Doubts about Solution A

Although the problem was solved, a careful reading of the documentation for /proc/sys/fs/pipe-user-pages-soft suggests that this parameter does not quite match the root cause of the problem.

pipe-user-pages-soft limits the total pipe capacity available to a user who has neither the CAP_SYS_RESOURCE nor the CAP_SYS_ADMIN capability. With the default value, such a user can create only 1024 pipes at the default capacity of 16 pages each (64 KiB with 4 KiB pages).
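The numbers behind this limit follow from simple arithmetic. A minimal sketch, assuming 4096-byte pages and the kernel's default of 16 buffers per pipe (PIPE_DEF_BUFFERS in the kernel source quoted later); the function names are made up for illustration:

```python
PAGE_SIZE = 4096          # assumed page size, as reported by getconf PAGESIZE
PIPE_DEF_BUFFERS = 16     # default number of pages per pipe


def default_pipe_bytes(page_size=PAGE_SIZE):
    """Capacity in bytes of a pipe created with the default number of pages."""
    return PIPE_DEF_BUFFERS * page_size


def pipes_before_soft_limit(soft_limit_pages):
    """How many default-sized pipes fit under a pipe-user-pages-soft value."""
    return soft_limit_pages // PIPE_DEF_BUFFERS


if __name__ == "__main__":
    print(default_pipe_bytes())             # bytes per default pipe
    print(pipes_before_soft_limit(16384))   # the value found on the host
    print(pipes_before_soft_limit(163840))  # after the tenfold increase
```

Under these assumptions the host's value of 16384 pages allows 1024 default-sized pipes, and the tenfold increase allows 10240.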

This raises several questions:

  • dockerd, containerd, kubelet, and the other components run as root, and runc init is only in the container initialization phase, so in theory it should not use up 1024 pipes. pipe-user-pages-soft should therefore have no effect on the Docker hang, yet the problem went away after the parameter was increased. That is unexplained.

  • A pipe's capacity is fixed; the user cannot declare a capacity when creating a pipe. In production the pipe was indeed created. If the capacity is fixed, exceeding the pipe-user-pages-soft limit should not make the pipe unwritable. Could it be that newly created pipes get a smaller capacity, so that data that used to fit no longer does?

  • pipe-user-pages-soft has been increased tenfold for now. Would doubling it be enough? What is the most appropriate value?

Investigation

The most direct way to locate the problem is to read the source code.

First, look at the Linux kernel code related to pipe-user-pages-soft. The production kernel version is 5.4.94-1, so we switch to the corresponding version and search.

static bool too_many_pipe_buffers_soft(unsigned long user_bufs)
{
        unsigned long soft_limit = READ_ONCE(pipe_user_pages_soft);

        return soft_limit && user_bufs > soft_limit;
}

struct pipe_inode_info *alloc_pipe_info(void)
{
        ...
        unsigned long pipe_bufs = PIPE_DEF_BUFFERS;  // #define PIPE_DEF_BUFFERS        16
        ...

        if (too_many_pipe_buffers_soft(user_bufs) && is_unprivileged_user()) {
                user_bufs = account_pipe_buffers(user, pipe_bufs, 2);
                pipe_bufs = 2;
        }

        if (too_many_pipe_buffers_hard(user_bufs) && is_unprivileged_user())
                goto out_revert_acct;

        pipe->bufs = kcalloc(pipe_bufs, sizeof(struct pipe_buffer),
                             GFP_KERNEL_ACCOUNT);
        ...
}

When a pipe is created, the kernel calls too_many_pipe_buffers_soft to check whether the current user has exceeded the pipe capacity it is allowed to use. If so, the capacity of the new pipe is reduced from 16 pages to 2 pages of PAGE_SIZE each. Running getconf PAGESIZE on the machine shows that the page size is 4096 bytes. In other words, a pipe normally holds 16 * 4096 bytes, but once the limit is exceeded it is shrunk to 2 * 4096 bytes, so data may no longer fit into the pipe in a single write. This largely confirms the guess in question 2.
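The sizing decision can be modeled in a few lines. The following is a simplified sketch, not the kernel's code: it assumes that the accounting step elided from the excerpt first charges the default 16 pages to the user before the soft-limit check, and it ignores the hard limit. The function name alloc_pipe_pages is invented for illustration.

```python
PIPE_DEF_BUFFERS = 16   # default pages per pipe
REDUCED_BUFFERS = 2     # pages granted once the soft limit is exceeded


def alloc_pipe_pages(user_pages_in_use, soft_limit, privileged=False):
    """Return (pages granted to the new pipe, user's total pages afterwards).

    Simplified model: a soft_limit of 0 disables the check, and the
    hard limit is not modeled.
    """
    # Assumed first step: charge the default size to the user.
    total = user_pages_in_use + PIPE_DEF_BUFFERS
    if soft_limit and total > soft_limit and not privileged:
        # Over the soft limit: re-account the pipe as 2 pages instead of 16.
        total = user_pages_in_use + REDUCED_BUFFERS
        return REDUCED_BUFFERS, total
    return PIPE_DEF_BUFFERS, total


if __name__ == "__main__":
    page_size = 4096
    for in_use in (0, 16368, 16384):
        pages, total = alloc_pipe_pages(in_use, soft_limit=16384)
        print(in_use, pages * page_size, total)
```

In this model a user who already holds 16384 pages gets a new pipe of only 2 pages, while a privileged caller still gets the full 16.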

With that, the logic behind pipe-user-pages-soft is clear and relatively easy to understand.

So the question comes back to: why does the pipe usage of the container's root user exceed the limit?

100% reproduction

The first step in finding the root cause is usually to reproduce the problem in an offline environment.

Because production had already received an emergency fix through Solution A, the problem could no longer be analyzed there. We needed a way to reproduce it reliably.

Persistence paid off: the same problem turned up in an issue, and it can be reproduced as follows.

https://github.com/containerd/containerd/issues/5261

echo 1 > /proc/sys/fs/pipe-user-pages-soft
while true; do docker run -itd --security-opt=no-new-privileges nginx; done

After running the commands above, runc init got stuck immediately, consistent with what we saw in production. Use lsof -p to view the file handles opened by runc init:

[Figure: lsof -p output listing the file handles opened by the stuck runc init process]

You can see that fd4, fd5, and fd6 are all pipes. fd4 and fd6 both have the number 415841, so they are the same pipe. How can we get the pipe size to actually verify the guess in question 2? Linux has no ready-made tool for reading a pipe's size, but the kernel exposes it through the system call fcntl(fd, F_GETPIPE_SZ). The code is as follows:

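// F_GETPIPE_SZ is Linux-specific; _GNU_SOURCE exposes it through <fcntl.h>.
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[]) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s /proc/<pid>/fd/<fd>\n", argv[0]);
        return 1;
    }

    // Opening the /proc entry gives us our own handle to the same pipe.
    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) {
        perror("open failed");
        return 1;
    }

    // Returns the pipe capacity in bytes, or -1 on error.
    long pipe_size = (long)fcntl(fd, F_GETPIPE_SZ);
    if (pipe_size == -1) {
        perror("get pipe size failed");
        close(fd);
        return 1;
    }
    printf("pipe size: %ld\n", pipe_size);

    close(fd);
    return 0;
}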

After compiling the program and running it against each handle, focus on fd4 and fd6. The two handles correspond to the same pipe, and the reported capacity is 8192 = 2 * PAGESIZE. So the pipe capacity was indeed reduced to 2 * PAGESIZE because pipe usage exceeded the soft limit.
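The same check can be done without compiling anything. A minimal Python sketch of the fcntl(fd, F_GETPIPE_SZ) call, under these assumptions: the helper name pipe_capacity is invented here, and the numeric fallback 1032 is the Linux value of F_GETPIPE_SZ for Python versions whose fcntl module does not export the constant.

```python
import fcntl
import os

# Python 3.10+ exports the constant; 1032 is its value on Linux.
F_GETPIPE_SZ = getattr(fcntl, "F_GETPIPE_SZ", 1032)


def pipe_capacity(fd_or_path):
    """Return the capacity in bytes of a pipe, given an open fd or a /proc/<pid>/fd/<fd> path."""
    if isinstance(fd_or_path, int):
        return fcntl.fcntl(fd_or_path, F_GETPIPE_SZ)
    # O_NONBLOCK avoids blocking in open() if the pipe currently has no writer.
    fd = os.open(fd_or_path, os.O_RDONLY | os.O_NONBLOCK)
    try:
        return fcntl.fcntl(fd, F_GETPIPE_SZ)
    finally:
        os.close(fd)


if __name__ == "__main__":
    r, w = os.pipe()
    print("pipe size:", pipe_capacity(r))
    os.close(r)
    os.close(w)
```

Calling pipe_capacity with a path such as /proc/<pid>/fd/4 for the stuck runc init process (as root) should report the same number as the C program.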

With Solution A having resolved the immediate problem, let's look at Solution B.

Solution B

https://github.com/opencontainers/runc/pull/2871

The bug was introduced in runc v1.0.0-rc93 and fixed in v1.0.0-rc94 by the PR above. How should production be fixed? Do all Docker components need to be upgraded?

Upgrading dockerd, containerd, and runc together requires moving workloads off the node first. The whole process is relatively complicated and risky. In this case the only faulty component is runc, and only newly created containers are affected. So it is natural to ask whether runc can be upgraded on its own.

Because dockershim had not yet been deprecated in Kubernetes v1.19, the full call chain for running a container is: kubelet → dockerd → containerd → containerd-shim → runc → container. Unlike dockerd and containerd, which are servers running in the background, runc is simply a binary that containerd-shim invokes to start a container. So upgrading runc alone is enough: newly created containers will be started with the new version of runc.

Testing in the test environment confirmed that runc init no longer gets stuck. Finally, we gradually upgraded runc in production to v1.1.1 and restored /proc/sys/fs/pipe-user-pages-soft to its original default value. The runc hang was solved.

Analysis & summary

What does the PR fix?

Cause of the bug. When no-new-privileges is enabled for a container, runc needs to unload bpf code that has already been loaded and then reload the patched bpf code. By bpf's design, the loaded bpf code has to be obtained first, and only then can it be used to call the unload interface. To obtain the bpf code, runc uses the seccomp_export_bpf function (provided by libseccomp) and passes a pipe as the fd argument to receive the code. seccomp_export_bpf is synchronous and blocking, and it writes the code into that fd. So if the pipe is too small, the pipe fills up, the rest of the bpf code cannot be written, and the call hangs.

The fix in the PR. Start a goroutine that reads the contents of the pipe as they arrive, instead of waiting for all the data to be written before reading.
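The difference between the two approaches can be shown with an ordinary pipe. This is a sketch of the general pattern, not runc's code: a Python thread stands in for the goroutine, the payload stands in for the exported bpf code, and the function names are invented. It assumes Linux, where F_SETPIPE_SZ (value 1031) can shrink a pipe to 2 pages.

```python
import fcntl
import os
import threading

F_SETPIPE_SZ = getattr(fcntl, "F_SETPIPE_SZ", 1031)


def new_small_pipe(capacity=8192):
    """Create a pipe and shrink it, mimicking a pipe created over the soft limit."""
    r, w = os.pipe()
    fcntl.fcntl(w, F_SETPIPE_SZ, capacity)
    return r, w


def bytes_accepted_without_reader(payload, capacity=8192):
    """Count how many bytes fit when nobody reads; a blocking writer would hang after that."""
    r, w = new_small_pipe(capacity)
    os.set_blocking(w, False)  # non-blocking so the sketch reports instead of hanging
    written = 0
    try:
        while written < len(payload):
            written += os.write(w, payload[written:])
    except BlockingIOError:
        pass  # the pipe is full
    finally:
        os.close(r)
        os.close(w)
    return written


def export_with_concurrent_reader(payload, capacity=8192):
    """Write the whole payload while a separate thread drains the pipe."""
    r, w = new_small_pipe(capacity)
    received = []

    def drain():
        while True:
            chunk = os.read(r, 4096)
            if not chunk:
                break  # writer closed its end
            received.append(chunk)

    reader = threading.Thread(target=drain)
    reader.start()
    view = memoryview(payload)
    while view:
        view = view[os.write(w, view):]  # blocking write; the reader frees space
    os.close(w)
    reader.join()
    os.close(r)
    return b"".join(received)
```

With no reader, only about one pipe's worth of data is accepted before the writer would block. With the reader running concurrently, a payload of any size goes through.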

Why was the limit exceeded?

The container's root user has UID 0, and the host's root user also has UID 0. When the kernel accounts for pipe usage, it treats them as the same user and does not distinguish between them. So when runc init requests a pipe, the kernel determines that the current user is unprivileged and looks up the pipe usage of the user with UID 0. Because the kernel sums the pipe usage of all UID 0 users (including those in containers), the total exceeds the limit in /proc/sys/fs/pipe-user-pages-soft, even though the pipe usage of the container's root user alone does not. This explains the question raised above about why root's pipe usage can exceed the limit.

To summarize: the operating system applies a soft limit through pipe-user-pages-soft, but because the container's root user has the same UID 0 as the host's root user, the kernel does not distinguish between them when accounting for pipe usage. Once the pipe usage of UID 0 exceeds the soft limit, newly allocated pipes get a smaller capacity. runc 1.0.0-rc93 then cannot write its data completely because the pipe is too small: the write blocks and waits indefinitely, which leaves runc init stuck, puts kubelet's PLEG into an abnormal state, and marks the node NotReady.

The fix: runc uses a goroutine to read the pipe contents promptly, which prevents the write from blocking.

References

https://iximiuz.com/en/posts/container-learning-path/

https://medium.com/@mccode/understanding-how-uid-and-gid-work-in-docker-containers-c37a01d01cf

https://man7.org/linux/man-pages/man7/pipe.7.html

https://gist.github.com/cyfdecyf/1ee981611050202d670c

https://github.com/containerd/containerd/issues/5261

https://github.com/opencontainers/runc/pull/2871

Copyright notice
This article was created by [Shoot cloud again]. Please include the original link when reposting. Thank you.
https://yzsam.com/2022/186/202207050136226766.html