
Design and practice of a Kubernetes cluster and application monitoring scheme

2022-07-05 02:43:00 Fool Gongliang

Kubernetes monitoring

Once your application is deployed to Kubernetes, it becomes hard to see what is happening inside the containers. Once a container dies, the data inside it may never be recovered, and you may not even be able to read its logs to locate the problem. On top of that, an application usually has many instances, and a user's request is not handled by any particular container, which makes troubleshooting an application on Kubernetes complicated. Beyond the application itself, Kubernetes is infrastructure that governs the life and death of the whole cluster, and any failure in Kubernetes inevitably affects the running application services, so monitoring the health of Kubernetes itself is also critical.

Once your applications are cloud native, you have to care about the running state of every server, of the infrastructure and middleware, of every component and resource object in Kubernetes, and of every application. Of course, "running state" is a vague concept: it depends on what we focus on, and each monitored object expresses its "running state" differently. To monitor the objects we care about, the objects have to cooperate by exposing suitable information about their running state for us to collect and analyze; this is what we call observability.

In cloud native, observability is generally divided into three scopes: Metrics, Tracing, and Logging.

The Kubernetes documentation explains how to monitor and debug a cluster, and how to handle its logs:

https://v1-20.docs.kubernetes.io/docs/tasks/debug-application-cluster/

In this article, "monitoring" refers to Metrics only.

Metrics, Tracing, and Logging are not completely independent of one another; Metrics may also carry Logging and Tracing information.

Monitoring objects

The monitoring data to be collected comes from the monitored objects, and in a Kubernetes cluster we can divide these objects into three parts:

  • Machines: every node machine in the cluster, with metrics such as CPU and memory usage, network and disk IO rates, and so on;
  • Kubernetes object state: the status and metrics of Deployments, Pods, DaemonSets, StatefulSets and other objects;
  • Applications: the status and metrics of each container in a Pod, plus the /metrics endpoint that a container may expose itself.

Prometheus

A complete monitoring stack covers collecting data, storing it, analyzing it, displaying it, and alerting on it, and each part has tools or techniques that address the diverse needs and complexity of a cloud native environment.

Since we need monitoring, we need monitoring tools. A monitoring tool obtains all the important metrics and logs (Metrics can also include some log data) and stores them in a secure, centralized location, so that they can be accessed at any time while working out a fix. Because cloud native applications are deployed on Kubernetes clusters, monitoring Kubernetes gives you deep insight into the cluster's health and performance metrics, its resource counts, and a top-level overview of what is happening inside it. When something goes wrong, the monitoring tool alerts you so that you can start the fix quickly.

Prometheus is a CNCF project that can monitor Kubernetes, nodes, and Prometheus itself; it is the monitoring system mainly recommended by the Kubernetes documentation today, and it provides out-of-the-box monitoring capabilities for the Kubernetes container orchestration platform. The monitoring scheme designed in this article is therefore built around Prometheus.

Here are some of the components and concepts around Prometheus:

  1. Metric Collection: Prometheus uses a pull model and retrieves metrics over HTTP. When Prometheus cannot pull metrics from a target (short-lived jobs, for example), you can use Pushgateway to push metrics to Prometheus instead.
  2. Metric Endpoint: a system that wants to be monitored by Prometheus should expose a /metrics endpoint; Prometheus scrapes this endpoint at a fixed interval (a minimal scrape configuration sketch follows this list).
  3. PromQL: Prometheus ships with PromQL, a very flexible query language for querying the metrics in Prometheus. The Prometheus UI and Grafana also use PromQL queries to visualize metrics.
  4. Prometheus Exporters: many libraries and servers export existing metrics from third-party systems as Prometheus metrics, for systems whose state cannot be instrumented with Prometheus metrics directly.
  5. TSDB (time-series database): Prometheus uses a TSDB to store all of its data efficiently. By default all data is stored locally; to avoid a single point of failure, the Prometheus TSDB can optionally be integrated with remote storage.
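
To make the pull model concrete, here is a minimal sketch of a prometheus.yml scrape configuration (referred to in item 2 above). The job name and target address are placeholders rather than part of the deployment described later; any HTTP endpoint that serves text in the Prometheus exposition format can be scraped like this.

global:
  scrape_interval: 30s              # how often Prometheus pulls metrics from every target

scrape_configs:
  - job_name: 'example-app'         # hypothetical job name
    metrics_path: /metrics          # the endpoint Prometheus scrapes; /metrics is the default
    static_configs:
      - targets: ['10.0.0.10:8080'] # placeholder address of an application instance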

The architecture of a Prometheus-based monitoring scheme on Kubernetes looks like this:

[Figure: Prometheus monitoring architecture on Kubernetes]

[Image source: https://devopscube.com/setup-prometheus-monitoring-on-kubernetes/]

Metrics

There are many kinds of objects to monitor. We call objects of the same type an entity, and each running entity produces many kinds of data. To summarize and collect this data, Prometheus divides the attribute values in an entity into four types: Counter, Gauge, Histogram, and Summary. Each attribute of an entity is called a metric; for example, the cumulative CPU usage of a container is recorded under the metric name container_cpu_usage_seconds_total.

The general format of each metric sample is:

metric_name{ label_name = "label_value", ... }  metric_value

Every object produces data all the time. To tell which object a given sample belongs to, a metric can carry a set of labels (metadata) in addition to its value, as the following example shows.

container_cpu_usage_seconds_total{
    beta_kubernetes_io_arch = "amd64",
    beta_kubernetes_io_os = "linux",
    container = "POD",
    cpu = "total",
    id = "...",
    image = "k8s.gcr.io/pause:3.5",
    instance = "slave1",
    job = "kubernetes-cadvisor",
    kubernetes_io_arch = "amd64",
    kubernetes_io_hostname = "slave1",
    kubernetes_io_os = "linux",
    name = "k8s_POD_pvcpod_default_02ed547b-6279-4346-8918-551b87877e91_0",
    namespace = "default",
    pod = "pvcpod"
}

Once an object produces text structured like this, it can expose a /metrics endpoint for Prometheus to scrape automatically, or push the data to Prometheus through Pushgateway.
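
As a rough sketch, the text returned by a /metrics endpoint uses the Prometheus exposition format, in which # HELP and # TYPE comment lines declare each metric's type; the metric names and values below are only illustrative, not taken from a real cluster:

# HELP http_requests_total Total number of HTTP requests handled.
# TYPE http_requests_total counter
http_requests_total{method="get", code="200"} 1027

# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 45000000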

Next, we will build a complete Prometheus monitoring system on Kubernetes.

Practice

Node monitoring

Reference for this section: https://devopscube.com/node-exporter-kubernetes/

node exporter is written in Golang and runs on Linux systems. It collects all the hardware and OS-level metrics exposed by the kernel, including CPU, memory, network traffic, system load, sockets, machine configuration, and so on.

Readers can refer to https://github.com/prometheus/node_exporter for the full list of metrics enabled or disabled by default.

Since every node in the cluster needs to be monitored, we have to make sure a node exporter instance runs on each node, and that one is automatically scheduled onto any node newly added to the cluster, so node exporter has to be deployed as a DaemonSet.

View all nodes in the cluster :

root@master:~# kubectl get nodes
NAME     STATUS                     ROLES                  AGE     VERSION
master   Ready,SchedulingDisabled   control-plane,master   98d     v1.22.2
salve2   Ready                      <none>                 3h50m   v1.23.3
slave1   Ready                      <none>                 98d     v1.22.2

Bibin Wilson has already packaged the node exporter YAML files for Kubernetes, so we can download them directly:

git clone https://github.com/bibinwilson/kubernetes-node-exporter

Open the daemonset.yaml file in the repository to get a rough idea of its contents.

In the YAML file you can see that node exporter will be deployed into the monitoring namespace and that it carries two labels:

   labels:
     app.kubernetes.io/component: exporter
     app.kubernetes.io/name: node-exporter

So that node exporter can also be scheduled onto the master node, we need to add tolerations to the Pod template:

  template:
    metadata:
      labels:
        app.kubernetes.io/component: exporter
        app.kubernetes.io/name: node-exporter
    spec:
    #  Copy the following part to the corresponding location 
      tolerations:
      - key: "node-role.kubernetes.io/master"
        operator: "Exists"
        effect: "NoSchedule"
      - key: "node.kubernetes.io/unschedulable"
        operator: "Exists"
        effect: "NoSchedule"

To deploy node exporter, first create the namespace:

kubectl create namespace monitoring

Execute the command to deploy node exporter:

kubectl create -f daemonset.yaml

Check the node exporter instances:

root@master:~# kubectl get daemonset -n monitoring
NAME            DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
node-exporter   3         3         3       3            3           <none>          22h

Because the node exporter Pods are scattered across the nodes, to make it easy for Prometheus to collect their Pod IPs we need an Endpoint that gathers them in one place; here we achieve that by creating a Service, which generates the Endpoint automatically.

Look at the service.yaml file in the repository; it is defined as follows:

kind: Service
apiVersion: v1
metadata:
  name: node-exporter
  namespace: monitoring
  annotations:
      prometheus.io/scrape: 'true'
      prometheus.io/port:   '9100'
spec:
  selector:
      app.kubernetes.io/component: exporter
      app.kubernetes.io/name: node-exporter
  ports:
  - name: node-exporter
    protocol: TCP
    port: 9100
    targetPort: 9100

The Service selects the Pods with these labels:

selector:
   app.kubernetes.io/component: exporter
   app.kubernetes.io/name: node-exporter

Create the Service:

kubectl create -f service.yaml

Check the node exporter Pod IPs gathered by the Endpoint:

root@master:~# kubectl get endpoints -n monitoring
NAME                    ENDPOINTS                                       AGE
node-exporter           10.32.0.27:9100,10.36.0.4:9100,10.44.0.3:9100   22h
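
If you want to confirm that one of these instances is actually serving metrics, you can curl a Pod IP directly from a cluster node; the IP below is from this cluster and will differ in yours:

curl 10.32.0.27:9100/metrics | head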

Apart from collecting these various metrics, node exporter does nothing else.

Deploy Prometheus

Reference for this section: https://devopscube.com/setup-prometheus-monitoring-on-kubernetes/

We now have node exporter collecting node-level metrics; the next step is to collect metrics data for the Kubernetes infrastructure itself.

Kubernetes itself provides a great deal of metrics data; three of its endpoints are /metrics/cadvisor, /metrics/resource and /metrics/probes.

Take /metrics/cadvisor as an example: cAdvisor analyzes the memory, CPU, file, and network usage metrics of all containers running on a given node. Refer to https://github.com/google/cadvisor/blob/master/docs/storage/prometheus.md for all the metrics cAdvisor exposes.
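
The config-map.yaml deployed later in this article already contains a scrape job for this data. For reference, the commonly used job definition, which discovers every node and scrapes cAdvisor through the API server proxy, looks roughly like this (a sketch; field names can differ slightly between Prometheus versions):

- job_name: 'kubernetes-cadvisor'
  scheme: https
  tls_config:
    ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
  bearer_token_file: /var/run/secrets/kubernetes.io/serviceaccount/token
  kubernetes_sd_configs:
    - role: node                      # discover every node in the cluster
  relabel_configs:
    - action: labelmap
      regex: __meta_kubernetes_node_label_(.+)
    - target_label: __address__
      replacement: kubernetes.default.svc:443
    - source_labels: [__meta_kubernetes_node_name]
      regex: (.+)
      target_label: __metrics_path__
      replacement: /api/v1/nodes/${1}/proxy/metrics/cadvisor   # proxy to each node's cAdvisor endpoint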

Further reading:

Source of the metrics API types: https://github.com/kubernetes/metrics/blob/master/pkg/apis/metrics/v1beta1/types.go

Kubernetes monitoring architecture design: https://github.com/kubernetes/design-proposals-archive

In this section, the Prometheus we deploy will collect metrics data from Kubernetes in the following ways:

  1. kubernetes-apiservers: gets the metrics exposed by the API servers;
  2. kubernetes-nodes: collects metrics from all Kubernetes nodes;
  3. kubernetes-pods: Pod metrics are discovered automatically if the Pod metadata carries the prometheus.io/scrape and prometheus.io/port annotations (see the annotation sketch after this list);
  4. kubernetes-cadvisor: collects all cAdvisor metrics, which are container related;
  5. kubernetes-service-endpoints: all Service endpoints are scraped if the Service metadata carries the prometheus.io/scrape and prometheus.io/port annotations.
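
For item 3, a Pod only needs annotations like the following on its template to be picked up by the kubernetes-pods job. prometheus.io/scrape and prometheus.io/port are the conventional annotation names used by this kind of configuration; the port value is a placeholder for whatever port your application serves metrics on, and prometheus.io/path is optional in the standard example configuration (check config-map.yaml to confirm which annotations it honors):

  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"    # opt this Pod in to scraping
        prometheus.io/port: "8080"      # placeholder: the port your app exposes metrics on
        prometheus.io/path: "/metrics"  # optional; defaults to /metrics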

Bibin Wilson has also packaged the relevant deployment definition files, so we can download them directly:

git clone https://github.com/bibinwilson/kubernetes-prometheus

Prometheus obtains all available metrics for every node, Pod, Deployment and so on through the Kubernetes API Server. Therefore we need to create an RBAC policy with read-only access to the required APIs and bind it to the monitoring namespace, restricting the Prometheus Pod to read-only access to the API.

Looking at the clusterRole.yaml file, you can see the list of resource objects to be monitored:

- apiGroups: [""]
  resources:
  - nodes
  - nodes/proxy
  - services
  - endpoints
  - pods
  verbs: ["get", "list", "watch"]
- apiGroups:
  - extensions
  resources:
  - ingresses

Create the role and role binding in the cluster:

kubectl create -f clusterRole.yaml

Prometheus can be configured through command-line flags and through a configuration file. The command-line flags configure immutable system parameters (such as the storage location and how much data to keep on disk and in memory), while the configuration file defines everything related to scrape jobs and their instances, as well as which rule files to load, so the configuration file is indispensable when deploying Prometheus.

The Prometheus configuration file is written in YAML; for the detailed rules see: https://prometheus.io/docs/prometheus/latest/configuration/configuration/
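
As a sketch of how the two kinds of configuration are split, the flags below are standard upstream Prometheus flags (the exact set used in prometheus-deployment.yaml may differ slightly), while everything about scrape targets and rules lives in prometheus.yml:

      containers:
        - name: prometheus
          image: prom/prometheus
          args:
            - "--config.file=/etc/prometheus/prometheus.yml"   # which configuration file to load
            - "--storage.tsdb.path=/prometheus/"               # where the TSDB keeps its data
            - "--storage.tsdb.retention.time=12h"              # how long samples are kept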

To map the configuration file into the Prometheus Pod, we put the configuration into a ConfigMap and then mount it into the Pod. The configuration content can be seen in config-map.yaml, which defines many rules for collecting data sources, for example for scraping the Kubernetes cluster and node exporter. A reference snippet:

    scrape_configs:
      - job_name: 'node-exporter'
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
        - source_labels: [__meta_kubernetes_endpoints_name]
          regex: 'node-exporter'
          action: keep

You can open https://raw.githubusercontent.com/bibinwilson/kubernetes-prometheus/master/config-map.yaml to preview this file online.

Create the ConfigMap:

kubectl create -f config-map.yaml

This configuration is very important and needs to be adapted to the actual environment; it is usually maintained by the operations team, so it will not be discussed further here.

Next we will deploy Prometheus. Because the sample file stores Prometheus data in an emptyDir volume, the data is lost whenever the Pod is restarted or rescheduled, so you can change it to a hostPath volume.

Open the prometheus-deployment.yaml file:

Change

          emptyDir: {}

to

          hostPath:
            path: /data/prometheus
            type: Directory  

You can make this change or leave the file as it is.

If you do change it, you need to create the /data/prometheus directory on the node this Pod will be scheduled to.
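
For example, on the target node (assuming the hostPath shown above; because the Prometheus container normally runs as a non-root user, you may also need to relax the directory's permissions or chown it to that user's UID):

mkdir -p /data/prometheus
chmod 777 /data/prometheus    # or chown the directory to the UID the Prometheus container runs as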

Deploy Prometheus:

kubectl create  -f prometheus-deployment.yaml 

View deployment status :

root@master:~# kubectl get deployments --namespace=monitoring
NAME                    READY   UP-TO-DATE   AVAILABLE   AGE
prometheus-deployment   1/1     1            1           23h

To access Prometheus, we need to create a Service:

apiVersion: v1
kind: Service
metadata:
  name: prometheus-service
  namespace: monitoring
  annotations:
      prometheus.io/scrape: 'true'
      prometheus.io/port:   '9090'
  
spec:
  selector: 
    app: prometheus-server
  type: NodePort  
  ports:
    - port: 8080
      targetPort: 9090 
      nodePort: 30000
Create the Service:

kubectl create -f prometheus-service.yaml

Now you can access the Prometheus UI panel through NodePort 30000.

Click Graph, click the icon, select the metric to display, and then click Execute to run the query and show the result.
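
For example, the following PromQL query (only an illustration; any metric that is already being scraped will work) shows per-Pod CPU usage over the last five minutes:

sum(rate(container_cpu_usage_seconds_total{namespace="default"}[5m])) by (pod)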


You can also open Service Discovery to see the metric data sources that Prometheus has collected.


If kube-state-metrics is not installed in your cluster, that data source will be marked in red; in the next section we will deploy this component.

At this point, our monitoring structure looks like this:

[Figure: current monitoring architecture]

Deploy Kube State Metrics

Reference for this section: https://devopscube.com/setup-kube-state-metrics/

Kube State Metrics is a service that talks to the Kubernetes API Server to obtain the details of all API objects, such as Deployments and Pods.

Kube State Metrics provides metrics about Kubernetes objects and resources that cannot be obtained directly from the native Kubernetes monitoring components. Because the metrics Kubernetes provides by itself are not very comprehensive, Kube State Metrics is needed to obtain all the metrics related to Kubernetes objects.

Here are some of the important metrics that can be obtained from Kube State Metrics:

  1. Node status, node capacity (CPU and memory)
  2. Replica-set compliance (desired/available/unavailable/updated status of replicas per deployment)
  3. Pod status (waiting, running, ready, etc)
  4. Ingress metrics
  5. PV, PVC metrics
  6. Daemonset & Statefulset metrics.
  7. Resource requests and limits.
  8. Job & Cronjob metrics

You can see the full list of supported metrics in the documentation here: https://github.com/kubernetes/kube-state-metrics/tree/master/docs
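
Once kube-state-metrics is running (its deployment follows below), these series can be queried like any other metric. For example, the two PromQL sketches below use well-known kube-state-metrics series names (assuming the default metric names from the documentation above):

# Pods that are not in the Running phase, grouped by namespace
sum(kube_pod_status_phase{phase!="Running"}) by (namespace, phase)

# Deployments whose available replicas fall short of the desired count
kube_deployment_spec_replicas - kube_deployment_status_replicas_available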


Bibin Wilson has packaged the relevant deployment definition files for this as well, so we can download them directly:

git clone https://github.com/devopscube/kube-state-metrics-configs.git

Apply all the YAML files directly to create the corresponding resources:

kubectl apply -f kube-state-metrics-configs/
├── cluster-role-binding.yaml
├── cluster-role.yaml
├── deployment.yaml
├── service-account.yaml
└── service.yaml

The command above creates the following resources; this section will not explain each of them:

  1. Service Account
  2. Cluster Role
  3. Cluster Role Binding
  4. Kube State Metrics Deployment
  5. Service

Use the following command to check the deployment status :

kubectl get deployments kube-state-metrics -n kube-system

Then refresh Prometheus Service Discovery and you will see the red mark turn blue; click this data source to see its details.


- job_name: 'kube-state-metrics'
  static_configs:
    - targets: ['kube-state-metrics.kube-system.svc.cluster.local:8080']

This configuration is the access address of kube-state-metrics.

At this point, the Prometheus structure we have deployed looks like this:

[Figure: current monitoring architecture]

Deploy Grafana

Reference for this section: https://devopscube.com/setup-grafana-kubernetes/

With the previous sections deployed, data source collection and data storage are in place; next we deploy Grafana and use it to analyze and visualize the metric data.

Bibin Wilson has packaged the relevant deployment definition files, so we can download them directly:

git clone https://github.com/bibinwilson/kubernetes-grafana.git

First look at the grafana-datasource-config.yaml file; this configuration makes Grafana configure the Prometheus data source automatically.


It contains one very important address:

                "url": "http://prometheus-service.monitoring.svc:8080",

First test this address with curl http://prometheus-service.monitoring.svc:8080 and check whether you get a response. If you see something like:

root@master:~/jk/kubernetes-prometheus# curl http://prometheus-deployment.monitoring.svc:8080
curl: (6) Could not resolve host: prometheus-deployment.monitoring.svc
root@master:~/jk/kubernetes-prometheus# curl http://prometheus-deployment.monitoring.svc.cluster.local:8080
curl: (6) Could not resolve host: prometheus-deployment.monitoring.svc.cluster.local

then perhaps CoreDNS is not installed, or some other reason is preventing you from reaching Prometheus through this address. To avoid extra troubleshooting you can use the IP instead of the domain name.

Check the Service IP of Prometheus:

root@master:~/jk/kubernetes-prometheus# kubectl get svc -n monitoring
NAME                    TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
prometheus-deployment   NodePort    10.105.95.8     <none>        9090:32330/TCP   23h

Test whether access through the Service IP works:

root@master:~/jk/kubernetes-prometheus# curl 10.105.95.8:9090
<a href="/graph">Found</a>.

Change prometheus-deployment.monitoring.svc.cluster.local:8080 in grafana-datasource-config.yaml to the corresponding Service IP, and change the port to 9090.
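
After the change, the data source entry in the file might look roughly like this (a sketch only: the url is the Service IP obtained above, which will differ in your cluster, and the other field names follow Grafana's data source provisioning format, so check them against the actual file):

            {
                "access": "proxy",
                "editable": true,
                "name": "prometheus",
                "type": "prometheus",
                "url": "http://10.105.95.8:9090",
                "version": 1
            }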

Create the configuration:

kubectl create -f grafana-datasource-config.yaml

Open deployment.yaml to look at the definition. In the template, Grafana's data storage also uses an emptyDir volume, so there is a risk of data loss; you can use hostPath or another type of volume instead. The author's configuration for reference:

      volumes:
        - name: grafana-storage
          hostPath:
            path: /data/grafana
            type: Directory   

Deploy Grafana:

kubectl create -f deployment.yaml

Then create Service:

kubectl create -f service.yaml

You can then access Grafana through port 32000.

The account and password are both admin.

At this point, the Prometheus monitoring structure we have deployed looks like this:

[Figure: current monitoring architecture]

When you first log in, everything is empty; we need dashboard templates to build the visual interface and present the data nicely.

On the Grafana website there are many free templates made by the community: https://grafana.com/grafana/dashboards/?search=kubernetes

Start by opening https://grafana.com/grafana/dashboards/8588 and downloading this template, then upload the template file in Grafana and bind it to the corresponding Prometheus data source.


Next you can see the corresponding monitoring dashboard.


You can open Browse, keep importing more templates, and then view the monitoring dashboards they provide.


How applications integrate with Prometheus and Grafana

The previous sections covered monitoring the infrastructure. We can also collect metrics data from middleware such as TiDB and MySQL, or expose custom metrics data from our own programs and then build our own Grafana templates for them. If you are a .NET developer, you can also refer to another of the author's articles to walk through these processes step by step: https://www.cnblogs.com/whuanle/p/14969982.html

Alerting

In a monitoring system, alerting is a top priority; alert handling and push-notification components usually need to be developed according to the company's actual situation.

We suggest reading "My Philosophy on Alerting", based on Rob Ewaschuk's observations at Google: https://docs.google.com/a/boxever.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

When we deployed Prometheus earlier, an alert rule was already defined in config-map.yaml:

  prometheus.rules: |-
    groups:
    - name: devopscube demo alert
      rules:
      - alert: High Pod Memory
        expr: sum(container_memory_usage_bytes) > 1
        for: 1m
        labels:
          severity: slack
        annotations:
          summary: High Memory Usage

An alert rule mainly consists of the following parts (a further example rule follows this list):

  • alert: the name of the alert rule.
  • expr: the alert trigger condition, a PromQL expression used to calculate whether a time series satisfies the condition.
  • for: the evaluation waiting time, an optional parameter. The alert is only sent after the trigger condition has held for this duration; while waiting, a newly triggered alert is in the pending state.
  • labels: custom labels, allowing the user to attach an extra set of labels to the alert.
  • annotations: an extra set of information, such as text describing the alert in detail; the contents of annotations are sent to Alertmanager as parameters when the alert fires.
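
To make these fields concrete, here is one more rule written in the same format (a sketch only, not part of config-map.yaml): it fires when a scrape target has been unreachable for five minutes, using the built-in up metric.

      - alert: InstanceDown
        expr: up == 0                  # up is set to 0 by Prometheus when a scrape fails
        for: 5m                        # only fire after the condition has held for 5 minutes
        labels:
          severity: critical           # custom label, e.g. used for routing or filtering
        annotations:
          summary: "Instance {{ $labels.instance }} is down"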

For more detail, see: https://yunlzheng.gitbook.io/prometheus-book/parti-prometheus-ji-chu/alert/prometheus-alert-rule


You can also see this rule in Grafana.


Next, let's configure alert notifications.

First, create an alert contact point; the author used a DingTalk webhook.


Then find Alert Rules and add a new alert rule.


For how to write alert rules, refer to: https://grafana.com/docs/grafana/latest/alerting/unified-alerting/alerting-rules/create-grafana-managed-rule/

Then open Notification policies and bind the alert rules to the contact point, so that matching alerts are pushed to the specified contact point.


In Alert Rules you can see the delivery record of alert notifications. Because the author's server is overseas, it may not be able to use the DingTalk webhook, so the alert here stays Pending; the author will not dig further, but the reader should get a sense of the general steps.


Copyright notice: this article was written by Fool Gongliang; when reposting, please include a link to the original:
https://yzsam.com/2022/02/202202140901048390.html