
Troubleshooting Disk Pressure on Kubernetes Nodes

2022-07-27 14:43:00 【New titanium cloud suit】


In this article, you will learn how to handle disk pressure on a Kubernetes node correctly, including what causes it and each step of troubleshooting it.

Whatever application you run, it needs some basic resources. CPU, memory, and disk space are the common ones, and every application uses them. Most engineers have a sound understanding of how to manage CPU and memory, but not everyone takes the time to understand how to use disk properly.

In a Kubernetes environment, this can become catastrophic over time, because once a node is overloaded, Kubernetes starts to "save" itself. It does so by killing pods to reduce the load on the node. If the application does not handle sudden termination gracefully, this can cause problems, or it can leave too few resources to handle a given load.

This article should help you understand and deal with this kind of disk failure.

What is node disk pressure?

Node disk pressure is what the name suggests: the disk attached to a node is under pressure. You are unlikely to run into it, because Kubernetes has built-in measures to avoid it, but it does happen from time to time. Many factors can lead to node disk pressure, but two causes are the most likely.
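One way to confirm that a node is actually reporting disk pressure is to read the DiskPressure condition from the node status. The following is a minimal sketch that assumes kubectl is installed and configured for the cluster; the function name nodes_with_disk_pressure is made up for illustration.

```shell
#!/bin/sh
# Print the names of nodes whose DiskPressure condition is "True".
# Assumes kubectl is installed and pointed at the cluster you want to inspect.
nodes_with_disk_pressure() {
  kubectl get nodes --no-headers \
    -o custom-columns='NAME:.metadata.name,PRESSURE:.status.conditions[?(@.type=="DiskPressure")].status' |
    awk '$2 == "True" { print $1 }'
}
```

An empty result means no node currently reports the condition. Running kubectl describe node on a specific node shows the same condition along with recent events.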

The first is that Kubernetes has not cleaned up unused images in time. By default this should not happen, because Kubernetes regularly checks for unused images and deletes them. It is therefore an unlikely source of node disk pressure, but it is worth keeping in mind.

The cause you are more likely to encounter is accumulated logs. By default, Kubernetes keeps logs in two cases: it keeps the logs of every running container, and it keeps the logs of the most recently exited container to help with troubleshooting. This is an attempt to strike a balance between keeping important logs and deleting useless ones over time. However, if you have a long-running container that produces a large volume of logs, those logs can grow until they exceed the capacity of the node's disk.
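To see which pods contribute most to log growth on a node, you can total the log directories per pod. The sketch below assumes the kubelet keeps container logs under /var/log/pods, with one subdirectory per pod, which is the usual layout but may differ on your nodes; log_usage_by_pod is an illustrative name.

```shell
#!/bin/sh
# Print the total size (in KiB) of each pod's log directory, largest first.
# Run on the node itself. The default path /var/log/pods is an assumption.
log_usage_by_pod() {
  dir="${1:-/var/log/pods}"
  count="${2:-10}"
  # -s prints one total per directory, -k forces 1 KiB units.
  du -s -k "$dir"/*/ 2>/dev/null | sort -n -r | head -n "$count"
}
```

Calling log_usage_by_pod with no arguments prints the ten largest pod log directories under the assumed default path.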

To pin down the exact cause, you need to find out which files take up the most space.

Troubleshooting node disk pressure

To resolve node disk pressure, you first need to work out which files take up the most space. Because Kubernetes runs on Linux, the du command makes this easy. You can either connect to each Kubernetes node manually over SSH or use a DaemonSet (https://www.containiq.com/post/using-kubernetes-daemonsets-effectively).

Deploying and understanding the DaemonSet

To deploy the DaemonSet, you can use its GitHub Gist (https://gist.githubusercontent.com/omerlh/cc5724ffeea17917eb06843dbff987b7/raw/1e58c8850aeeb6d22d8061338f09e5e1534ab638/daemonset.yaml), or you can create a file with the following contents:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: disk-checker
  labels:
    app: disk-checker
spec:
  selector:
    matchLabels:
      app: disk-checker
  template:
    metadata:
      labels:
        app: disk-checker
    spec:
      hostPID: true
      hostIPC: true
      hostNetwork: true
      containers:
      - resources:
          requests:
            cpu: 0.15
        securityContext:
          privileged: true
        image: busybox
        imagePullPolicy: IfNotPresent
        name: disk-checked
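        command: ["/bin/sh"]
        # List every file and directory under /host with its size, sort by
        # size in descending order, and keep the 20 largest entries.
        args: ["-c", "du -a /host | sort -n -r | head -n 20"]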
        volumeMounts:
        - name: host
          mountPath: "/host"
      volumes:
        - name: host
          hostPath:
            path: "/"

Now run the following command:

$ kubectl apply -f https://gist.githubusercontent.com/omerlh/cc5724ffeea17917eb06843dbff987b7/raw/1e58c8850aeeb6d22d8061338f09e5e1534ab638/daemonset.yaml

Before using the DaemonSet for troubleshooting, it is important to understand what it does. If you look at the manifest above, you will see that it is a very simple service. Much of it is boilerplate; the fields to note are command and args. This is where the du command is set up to run and then print the 20 largest results. Further down, you can also see that the host's root filesystem is mounted into the container at /host.
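The pipeline in args can also be tried directly on any Linux machine, which makes it easier to see what each stage contributes. This is a minimal sketch of the same du, sort, and head chain wrapped in a function; top_disk_usage is a hypothetical name, and the -k flag is an addition here to pin the units to KiB.

```shell
#!/bin/sh
# Print the N largest files and directories under PATH, sizes in KiB.
top_disk_usage() {
  path="${1:-/}"
  count="${2:-20}"
  # du -a  : report files as well as directories
  # sort -n -r : numeric sort, largest first
  # head   : keep only the first N lines
  du -a -k "$path" 2>/dev/null | sort -n -r | head -n "$count"
}
```

Because directory sizes include their contents, the first lines are usually parent directories, and the large individual files appear further down the list.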

Using the DaemonSet

First, make sure the DaemonSet was deployed correctly by running kubectl get pods -l app=disk-checker. This should produce output like the following:

$ kubectl get pods -l app=disk-checker


NAME                 READY   STATUS    RESTARTS   AGE
disk-checker-bwkbj   1/1     Running   0          2s

The number of pods you see depends on the number of nodes in the cluster. Once you have confirmed that the pods are running, run kubectl logs -l app=disk-checker to inspect their logs. This may take a while, but you should eventually see a list of files and their sizes, which shows what is taking up space on the node. What to do next depends on which files are using the space: review the DaemonSet's output and determine whether log files, application files, or other files are consuming your disk.
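Two practical details are worth knowing when reading this output. When a label selector is used, kubectl logs prints only the last 10 lines per pod unless --tail is set, so --tail=20 is needed to see all 20 entries. The sizes are also raw block counts, which are hard to read at a glance. The sketch below converts them to larger units, assuming du reports 1 KiB blocks (the usual busybox default); humanize_kib is an illustrative helper, not part of the original DaemonSet.

```shell
#!/bin/sh
# Convert du-style lines ("SIZE_KIB<whitespace>PATH") to readable units.
humanize_kib() {
  awk '{
    size = $1
    path = $0
    sub(/^[ \t]*[0-9]+[ \t]+/, "", path)   # strip the leading size column
    if (size >= 1048576)   printf "%.1fG\t%s\n", size / 1048576, path
    else if (size >= 1024) printf "%.1fM\t%s\n", size / 1024, path
    else                   printf "%dK\t%s\n", size, path
  }'
}

# Usage against the DaemonSet pods (requires cluster access):
#   for pod in $(kubectl get pods -l app=disk-checker -o name); do
#     echo "== $pod"
#     kubectl logs "$pod" --tail=20 | humanize_kib
#   done
```

Looping over the pods one at a time keeps each node's results separate, which makes it clear which node a large file belongs to.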

Possible solutions

Analyzing and understanding the DaemonSet's output is essential, because it tells you how to resolve the problem. There are two likely solutions.

You may find that the problem is caused by application data, so the files cannot be deleted. In that case, you will have to increase the size of the node's disk to make sure there is enough space for the application's files. This is a relatively simple solution, but it increases the cost of running the cluster. A better approach is therefore to look at the structure of the application first and see whether you can reduce its dependence on these files, which lowers its overall demand for disk space.
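Before resizing a disk, it can help to see how much headroom the filesystem has left. The sketch below reads the used percentage from df and compares the remainder with a threshold. The default of 10 mirrors the kubelet's default hard eviction threshold for nodefs.available, which your cluster may override, and the df figure only approximates what the kubelet computes. Both function names are illustrative.

```shell
#!/bin/sh
# Print the percentage of space still available on the filesystem holding PATH.
fs_avail_percent() {
  # df -P prints one portable line per filesystem; column 5 is the used percentage.
  df -P "${1:-/}" | awk 'NR == 2 { sub(/%/, "", $5); print 100 - $5 }'
}

# Report whether available space is below THRESHOLD percent (assumed default: 10).
warn_if_low() {
  avail="$(fs_avail_percent "${1:-/}")"
  threshold="${2:-10}"
  if [ "$avail" -lt "$threshold" ]; then
    echo "low: ${avail}% available"
  else
    echo "ok: ${avail}% available"
  fi
}
```

On a node you would call warn_if_low / directly, or warn_if_low /host from a container that mounts the host filesystem as the DaemonSet above does.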

On the other hand, you may find that your application generates a large number of files that are no longer needed. In that case, the fix is as simple as deleting the unnecessary files. Depending on how your application is set up for availability, you may only need to restart the pod, which causes Kubernetes to clean up the files in the container automatically. Note that this only applies when the files are on ephemeral volumes, not persistent volumes.
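When the files live on a persistent volume or on the node itself, a restart will not remove them, so you need to find them first. The following sketch lists files that have not been modified for a given number of days. It only lists them, and stale_files and the paths in the comments are hypothetical names for illustration.

```shell
#!/bin/sh
# List regular files under DIR that were last modified more than DAYS days ago.
# This only prints the files; review the list before deleting anything.
stale_files() {
  dir="$1"
  days="${2:-7}"
  find "$dir" -type f -mtime +"$days" -print
}

# After reviewing the list, the same filter can delete the files (destructive):
#   find /path/to/app/tmp -type f -mtime +7 -exec rm -f {} +
```

Keeping the listing and the deletion as separate steps makes it harder to remove application data by mistake.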

Conclusion

By now, you should know what node disk pressure means and what your first move should be when you run into it: collect the relevant logs.

You may have to increase the size of the disks in the cluster or clean up unused files. Whatever the problem and its solution turn out to be, you are now better equipped to understand it.
