
DO280 OpenShift Commands and Troubleshooting: Common Troubleshooting and Chapter Labs

2022-06-23 03:52:00 【IT民工金鱼哥】

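About me: Hi everyone, I'm 金鱼哥, a rising-star creator in the CSDN operations field, a Huawei Cloud expert (云享专家), and an expert blogger in the Alibaba Cloud community.
Certifications: CCNA, HCNP, CSNA (network analyst), Ruankao (China's software qualification exam) junior and intermediate network engineer, RHCSA, RHCE, RHCA, RHCI, ITIL
Motto: Hard work does not guarantee success, but success requires hard work.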

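Common Environment Information

When OCP is installed from RPMs, the OCP services on the master and nodes run as Red Hat Enterprise Linux services. Use the standard sosreport utility on the master and nodes to collect information about the environment, including docker- and OpenShift-related information.

[[email protected] ~]# sosreport -k docker.all=on -k docker.logs=on

The sosreport command creates a compressed archive containing all relevant information and saves it in the /var/tmp directory.

Another useful diagnostic tool is the oc adm diagnostics command, which runs a number of diagnostic checks against an OpenShift cluster, covering networking, logging, the internal registry, and the master and node services. Run oc adm diagnostics --help for usage details.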
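After a sosreport run, the newest archive can be located with a small helper. This is only a sketch: the function name latest_sosreport is hypothetical, and the sosreport-*.tar.xz naming pattern is an assumption that may differ between sos versions.

```shell
# Print the most recently modified sosreport archive in a directory.
# Assumption: archives are named sosreport-*.tar.xz (may vary by sos version).
latest_sosreport() {
    dir=${1:-/var/tmp}
    # ls -t sorts newest first; errors are silenced when nothing matches.
    ls -t "$dir"/sosreport-*.tar.xz 2>/dev/null | head -n 1
}

# Usage: latest_sosreport            # looks in /var/tmp
#        latest_sosreport /some/dir  # looks in another directory
```

The helper prints nothing when no archive is found, so its output can be tested with [ -n "$(latest_sosreport)" ] before copying the file elsewhere.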

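Common Diagnostic Commands

The oc client is the primary tool for detecting and troubleshooting problems in an OpenShift cluster. It has many options for inspecting, diagnosing, and repairing the hosts, nodes, services, and resources managed by the cluster. With the required permissions, you can directly edit the configuration of most managed resources in the cluster.

oc get events

Events allow OpenShift to record information about life-cycle events in the cluster and give a unified view of information about OpenShift components. The oc get events command provides event information for an OpenShift namespace and captures events such as:

- Pod creation and deletion
- The node a pod is scheduled to
- Master and node status

Events are typically used in troubleshooting to get high-level information about failures and problems in the cluster, which you then narrow down with log files and other oc subcommands.

Example: list the events in a specific project.

[[email protected] ~]$ oc get events -n <project>

Events can also be viewed in the web console.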
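When a project has many events, it can help to keep only the warnings. The following is a minimal sketch, assuming the tabular output of oc get events contains the event type as a whitespace-separated word; warning_events is a hypothetical helper name, not an oc subcommand.

```shell
# Show only Warning events for a project, keeping the header row.
# warning_events is a hypothetical helper name, not an oc subcommand.
warning_events() {
    project=$1
    # NR == 1 keeps the column header; the pattern keeps Warning rows.
    oc get events -n "$project" | awk 'NR == 1 || / Warning /'
}

# Usage: warning_events common-troubleshoot
```

Filtering on the text of the row avoids depending on the column position, which differs between oc versions.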

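oc logs

The oc logs command shows the log output of a build, deployment, or pod.

Example 1: view the logs of a pod.

[[email protected] ~]$ oc logs <pod>

Example 2: view the logs of a build.

[[email protected] ~]$ oc logs bc/<build-name>

Use oc logs with the -f option to follow the log output in real time. This is useful, for example, for continuously monitoring the progress of a build and checking for errors.

Logs can also be viewed in the web console.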

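oc rsync

The oc rsync command copies content to or from a directory in a running pod. If a pod has multiple containers, use the -c option to specify the container name; otherwise it defaults to the first container in the pod. It is typically used to transfer log files and configuration files from a container.

Example 1: copy the contents of a pod directory to a local directory.

[[email protected] ~]$ oc rsync <pod>:<pod_dir> <local_dir> -c <container>

Example 2: copy content from a local directory to a directory in the pod.

[[email protected] ~]$ oc rsync <local_dir> <pod>:<pod_dir> -c <container>

oc port-forward

Use the oc port-forward command to forward one or more local ports to a pod. This lets you listen on a specific or random port locally and forward data to a specific port in the pod.

Example: listen on local port 3306 and forward to port 3306 in the pod.

[[email protected] ~]$ oc port-forward <pod> 3306:3306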

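Common Failures

Resource Limit and Quota Problems

For projects with resource limits and quotas, improper resource configuration causes deployments to fail. Use the oc get events and oc describe commands to investigate the cause.

For example, if creating a pod would exceed the project's quota (here, the CPU quota), oc get events reports something like the following:

Warning FailedCreate {hello-1-deploy} Error creating: pods "hello-1" is forbidden:
exceeded quota: project-quota, requested: cpu=250m, used: cpu=750m, limited: cpu=900m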
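The message can be read as simple arithmetic: 750m already used plus 250m requested is 1000m, which is more than the 900m limit. A minimal sketch of that check, using a hypothetical helper name exceeds_quota:

```shell
# Succeed (exit status 0) when used + requested exceeds the quota limit.
# Arguments are CPU millicores without the trailing "m".
# exceeds_quota is a hypothetical helper name used only for illustration.
exceeds_quota() {
    used=$1
    requested=$2
    limit=$3
    [ $((used + requested)) -gt "$limit" ]
}

# Values taken from the event above: used 750m, requested 250m, limit 900m.
if exceeds_quota 750 250 900; then
    echo "quota exceeded: $((750 + 250))m requested in total, limit is 900m"
fi
```

To see the current used and hard values for a project, oc describe quota -n <project> can be used.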

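S2I Build Failures

Use the oc logs command to investigate S2I build failures. For example, to view the logs of the build configuration named hello:

[[email protected] ~]$ oc logs bc/hello

You can adjust the verbosity of the build logs by specifying the BUILD_LOGLEVEL environment variable in the build configuration strategy.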

{
  "sourceStrategy": {
    ...
    "env": [
      {
        "name": "BUILD_LOGLEVEL",
        "value": "5"
      }
    ]
  }
}
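Instead of editing the build configuration by hand, the same variable can be set from the command line. The sketch below assumes a build configuration that already exists; it uses the oc set env and oc start-build subcommands, and raise_build_loglevel is a hypothetical helper name.

```shell
# Set BUILD_LOGLEVEL on a build configuration and start a new build,
# streaming its log. raise_build_loglevel is a hypothetical helper name.
raise_build_loglevel() {
    bc=$1
    level=${2:-5}
    # Only start the build if the environment variable was set successfully.
    oc set env "bc/$bc" "BUILD_LOGLEVEL=$level" &&
        oc start-build "$bc" --follow
}

# Usage: raise_build_loglevel hello      # level defaults to 5
#        raise_build_loglevel hello 2
```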

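ErrImagePull and ImagePullBackOff Errors

These errors are usually caused by an incorrect deployment configuration, a wrong or missing image referenced during deployment, or an improper Docker configuration.

Use the oc get events and oc describe commands to investigate, and fix the error by editing the deployment configuration with oc edit dc/<name>.

Docker Configuration Problems

Incorrect docker configuration on the master and nodes can cause many errors during deployment.

Typically, check the ADD_REGISTRY, INSECURE_REGISTRY, and BLOCK_REGISTRY settings. Use the systemctl status, oc logs, oc get events, and oc describe commands to investigate the problem.

You can change the docker service log level by adding the --log-level parameter to the OPTIONS variable in the /etc/sysconfig/docker configuration file.

Example: set the log level to debug.

OPTIONS='--insecure-registry=172.30.0.0/16 --selinux-enabled --log-level=debug'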
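To confirm which log level is currently configured, the OPTIONS line can be inspected with sed. This is a minimal sketch, assuming the flag is written as --log-level=<value> on the OPTIONS line; docker_log_level is a hypothetical helper name.

```shell
# Print the --log-level value configured in the OPTIONS line of a docker
# sysconfig file, or nothing if the flag is absent.
# docker_log_level is a hypothetical helper name.
docker_log_level() {
    file=${1:-/etc/sysconfig/docker}
    sed -n 's/^OPTIONS=.*--log-level=\([a-z]*\).*/\1/p' "$file"
}

# Usage: docker_log_level
# After editing the file, restart the service: systemctl restart docker
```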

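Master and Node Failures

Run the systemctl status command to investigate problems in the atomic-openshift-master, atomic-openshift-node, etcd, and docker services. Use the journalctl -u <unit> command to view the system logs for those services.

To get more verbose logging from the atomic-openshift-node, atomic-openshift-master-controllers, and atomic-openshift-master-api services, edit the --loglevel option in the respective configuration file and then restart the associated service.

Example: to set the log level of the OpenShift master controllers to debug, edit the /etc/sysconfig/atomic-openshift-master-controllers file.

OPTIONS=--loglevel=4 --listen=https://0.0.0.0:8444

Further notes:

Red Hat OpenShift Container Platform has five levels of log verbosity. Regardless of the log configuration, messages with fatal, error, warning, and some info severities always appear in the logs.

0: errors and warnings only
2: normal information (default)
4: debug-level information
6: API-level debug information (request/response)
8: API debug information with full request bodies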
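After raising the log level and restarting a service, the extra output ends up in the journal. A minimal sketch for reading the recent entries of one unit follows; recent_unit_logs is a hypothetical helper name and the 10-minute window is an arbitrary choice.

```shell
# Show the journal entries of a systemd unit from the last 10 minutes.
# recent_unit_logs is a hypothetical helper name; the time window is an
# arbitrary choice for illustration.
recent_unit_logs() {
    unit=$1
    journalctl -u "$unit" --since "10 minutes ago" --no-pager
}

# Usage (as root on the affected host):
#   recent_unit_logs atomic-openshift-node
#   recent_unit_logs docker
```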

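Pod Scheduling Failures

The OpenShift master schedules pods to run on nodes. Scheduling usually fails either because the nodes are not in the Ready state or because resource limits and quotas prevent the pod from running.

Use the oc get nodes command to verify the status of the nodes. While scheduling is failing, the pod stays in the Pending state, which you can check with the oc get pods -o wide command; this command also shows which node the pod is scheduled to run on. Use the oc get events and oc describe pod commands to check the details of the scheduling failure.

Example 1: pod scheduling fails because of insufficient CPU.

{default-scheduler } Warning FailedScheduling pod (hello-phb4j) failed to
fit in any node
fit failure on node (hello-wx0s): Insufficient cpu
fit failure on node (hello-tgfm): Insufficient cpu
fit failure on node (hello-qwds): Insufficient cpu

Example 2: pod scheduling fails because no node is in the Ready state; investigate with oc describe.

{default-scheduler } Warning FailedScheduling pod (hello-phb4j): no nodes
available to schedule pods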
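When the scheduler reports that no nodes are available, a quick first step is to list the nodes that are not Ready. The following is a sketch under the assumption that the second column of oc get nodes is the status; not_ready_nodes is a hypothetical helper name.

```shell
# List nodes whose STATUS column is not exactly "Ready".
# This also reports values such as "Ready,SchedulingDisabled", which is
# intentional here because such nodes do not accept new pods either.
# not_ready_nodes is a hypothetical helper name.
not_ready_nodes() {
    oc get nodes --no-headers | awk '$2 != "Ready" { print $1, $2 }'
}

# Usage: not_ready_nodes
```

Each reported node can then be inspected with oc describe node <node> and with systemctl status on the node itself.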

Guided Exercise

Environment preparation

[[email protected] ~]$ lab install-prepare setup
[[email protected] ~]$ cd /home/student/do280-ansible
[[email protected] do280-ansible]$ ./install.sh

Note: skip this step if you already have a complete environment.

Preparing for this exercise

[[email protected] ~]$ lab common-troubleshoot setup

Creating the application

[[email protected] ~]$ oc login -u developer -p redhat  https://master.lab.example.com
[[email protected] ~]$ oc new-project common-troubleshoot
[[email protected] ~]$ oc new-app --name=hello -i php:5.4 \
 http://services.lab.example.com/php-helloworld         # create the application from source code
error: multiple images or templates matched "php:5.4": 2

The argument "php:5.4" could apply to the following Docker images, OpenShift image streams, or templates:

* Image stream "php" (tag "5.6") in project "openshift"
  Use --image-stream="openshift/php:5.6" to specify this image or template

* Image stream "php" (tag "7.0") in project "openshift"
  Use --image-stream="openshift/php:7.0" to specify this image or template

Inspecting the details

[[email protected] ~]$ oc describe is php -n openshift
7.1 (latest)
  tagged from registry.lab.example.com/rhscl/php-71-rhel7:latest

  Build and run PHP 7.1 applications on RHEL 7. For more information about using this builder image, including OpenShift considerations, see https://github.com/sclorg/s2i-php-container/blob/master/7.1/README.md.
  Tags: builder, php
  Supports: php:7.1, php
  Example Repo: https://github.com/openshift/cakephp-ex.git

  ! error: Import failed (NotFound): dockerimage.image.openshift.io "registry.lab.example.com/rhscl/php-71-rhel7:latest" not found
      3 days ago
…………
5.5
  tagged from registry.lab.example.com/openshift3/php-55-rhel7:latest

  Build and run PHP 5.5 applications on RHEL 7. For more information about using this builder image, including OpenShift considerations, see https://github.com/sclorg/s2i-php-container/blob/master/5.5/README.md.
  Tags: hidden, builder, php
  Supports: php:5.5, php
  Example Repo: https://github.com/openshift/cakephp-ex.git

  ! error: Import failed (NotFound): dockerimage.image.openshift.io "registry.lab.example.com/openshift3/php-55-rhel7:latest" not found

Conclusion: the output above shows that the required image does not exist in the registry.

Fixing the error

[[email protected] ~]$ oc new-app --name=hello -i php:7.0 http://services.lab.example.com/php-helloworld
[[email protected] ~]$ oc get pod -o wide      # check again: the pod stays in Pending
NAME            READY     STATUS    RESTARTS   AGE       IP        NODE
hello-1-build   0/1       Pending   0          40s       <none>    <none>

Inspecting the details

[[email protected] ~]$ oc log hello-1-build		# view the log
W0301 17:25:02.867828    4584 cmd.go:358] log is DEPRECATED and will be removed in a future version. Use logs instead. 

[[email protected] ~]$ oc get events			# view the events
LAST SEEN   FIRST SEEN   COUNT     NAME                             KIND      SUBOBJECT   TYPE      REASON             SOURCE              MESSAGE
16s         47s          7         hello-1-build.16682daab914ecb6   Pod                   Warning   FailedScheduling   default-scheduler   0/3 nodes are available: 1 MatchNodeSelector, 2 NodeNotReady.
[[email protected] ~]$ oc describe pod hello-1-build	# view the details
……
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  23s (x8 over 1m)  default-scheduler  0/3 nodes are available: 1 MatchNodeSelector, 2 NodeNotReady.

Conclusion: the output above shows that no node is available to schedule this pod.

[[email protected] ~]# oc get nodes # on the master, check the node status further
NAME                     STATUS     ROLES     AGE       VERSION
master.lab.example.com   Ready      master    1d        v1.9.1+a0ce1bc657
node1.lab.example.com    NotReady   compute   1d        v1.9.1+a0ce1bc657
node2.lab.example.com    NotReady   compute   1d        v1.9.1+a0ce1bc657

Conclusion: the output above shows that both compute nodes are in the NotReady state.

Checking the services

[[email protected] ~]# systemctl status atomic-openshift-node.service
[[email protected] ~]# systemctl status atomic-openshift-node.service
[[email protected] ~]# systemctl status docker
[[email protected] ~]# systemctl status docker
[[email protected] ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2021-03-01 17:23:12 CST; 4min 52s ago
     Docs: http://docs.docker.com
 Main PID: 17637 (code=exited, status=0/SUCCESS)

Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.375792111+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.382396227+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.387020843+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.394091193+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.402339410+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.404059183+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.413005258+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.436107140+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.485170808+08:00" level=i...ed"
Mar 01 17:23:12 node1.lab.example.com systemd[1]: Stopped Docker Application Container Engine.
Hint: Some lines were ellipsized, use -l to show in full.

Conclusion: the output above shows that the docker service on the node is not running.
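The per-service checks above can be condensed into one loop. This is a minimal sketch to run as root on a node, using the two unit names that appear in this exercise; check_node_services is a hypothetical helper name.

```shell
# Report the state of the services a node needs in order to become Ready.
# check_node_services is a hypothetical helper name.
check_node_services() {
    for unit in docker atomic-openshift-node; do
        # is-active exits non-zero for inactive units, so ignore the status
        # and keep only the printed state (active, inactive, failed, ...).
        state=$(systemctl is-active "$unit" || true)
        printf '%s: %s\n' "$unit" "$state"
    done
}

# Usage: check_node_services
```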

Starting the service

[[email protected] ~]# systemctl start docker
[[email protected] ~]# systemctl start docker

Verification

[[email protected] ~]# oc get nodes # check the node status again
NAME                     STATUS    ROLES     AGE       VERSION
master.lab.example.com   Ready     master    1d        v1.9.1+a0ce1bc657
node1.lab.example.com    Ready     compute   1d        v1.9.1+a0ce1bc657
node2.lab.example.com    Ready     compute   1d        v1.9.1+a0ce1bc657

[[email protected] ~]$ oc get pods       # confirm that the pod is now scheduled to a node
NAME            READY     STATUS    RESTARTS   AGE
hello-1-build   1/1       Running   0          22m

[[email protected] ~]$ oc describe is    # view the image stream details
Name:			hello
Namespace:		common-troubleshoot
Created:		15 minutes ago
Labels:			app=hello
Annotations:		openshift.io/generated-by=OpenShiftNewApp
Docker Pull Spec:	docker-registry.default.svc:5000/common-troubleshoot/hello
Image Lookup:		local=false
Unique Images:		1
Tags:			1

latest
  no spec tag

  * docker-registry.default.svc:5000/common-troubleshoot/hello@sha256:8d63ed61d6e9c74933fe0d0d8aadceecb71751abf260f10645c19737a3e13354
      10 minutes ago

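Conclusion: the output above shows that the image was pushed to the internal registry and is referenced by the image stream.

Cleaning up the project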

[[email protected] ~]$ oc delete project common-troubleshoot

Comprehensive Lab

Environment preparation

[[email protected] ~]$ lab install-prepare setup
[[email protected] ~]$ cd /home/student/do280-ansible
[[email protected] do280-ansible]$ ./install.sh

Note: skip this step if you already have a complete environment.

Preparing for this exercise

[[email protected] ~]$ lab execute-review setup

Cloning the git project locally

[[email protected] ~]$ cd /home/student/DO280/labs/execute-review/
[[email protected] execute-review]$ git clone http://services.lab.example.com/node-hello

Building the image with docker

[[email protected] execute-review]$ cd node-hello/
[[email protected] node-hello]$ docker build -t node-hello:latest .
[[email protected] node-hello]$ docker images              # list the images
REPOSITORY                                      TAG                 IMAGE ID            CREATED             SIZE
node-hello                                      latest              9b3befb0536b        9 seconds ago       495 MB
registry.lab.example.com/rhscl/nodejs-6-rhel7   latest              fba56b5381b7        3 years ago         489 MB

Tagging the image

[[email protected] node-hello]$ docker tag 9b3befb0536b registry.lab.example.com/node-hello:latest
[[email protected] node-hello]$ docker images
REPOSITORY                                      TAG                 IMAGE ID            CREATED              SIZE
node-hello                                      latest              9b3befb0536b        About a minute ago   495 MB
registry.lab.example.com/node-hello             latest              9b3befb0536b        About a minute ago   495 MB
registry.lab.example.com/rhscl/nodejs-6-rhel7   latest              fba56b5381b7        3 years ago          489 MB

Pushing the image

[[email protected] node-hello]$ docker push registry.lab.example.com/node-hello:latest

Creating the application in the project

[[email protected] ~]$ oc login -u developer -p redhat https://master.lab.example.com
[[email protected] ~]$ oc projects
[[email protected] ~]$ oc project execute-review
[[email protected] ~]$ oc new-app registry.lab.example.com/node-hello --name hello
[[email protected] ~]$ oc get all           # view all resources
NAME                      REVISION   DESIRED   CURRENT   TRIGGERED BY
deploymentconfigs/hello   1          1         1         config,image(hello:latest)

NAME                 DOCKER REPO                                             TAGS      UPDATED
imagestreams/hello   docker-registry.default.svc:5000/execute-review/hello   latest    12 seconds ago

NAME                READY     STATUS             RESTARTS   AGE
po/hello-1-deploy   1/1       Running            0          12s
po/hello-1-zswgc    0/1       ImagePullBackOff   0          9s

NAME         DESIRED   CURRENT   READY     AGE
rc/hello-1   1         1         0         12s

NAME        TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
svc/hello   ClusterIP   172.30.7.229   <none>        3000/TCP,8080/TCP   12s

Troubleshooting ImagePullBackOff

[[email protected] ~]$ oc logs hello-1-zswgc          # view the logs
Error from server (BadRequest): container "hello" in pod "hello-1-zswgc" is waiting to start: trying and failing to pull image
[[email protected] ~]$ oc describe pod hello-1-zswgc   # view the details
[[email protected] ~]$ oc get events --sort-by='.metadata.creationTimestamp'   # view the events

Conclusion: the output above shows that the image pull failed.

Pulling the image manually

[[email protected] ~]$ oc get pod -o wide
NAME             READY     STATUS             RESTARTS   AGE       IP            NODE
hello-1-deploy   1/1       Running            0          32s       10.129.0.93   node2.lab.example.com
hello-1-zswgc    0/1       ImagePullBackOff   0          30s       <none>        node2.lab.example.com 
[[email protected] ~]# docker pull registry.lab.example.com/node-hello
Using default tag: latest
Trying to pull repository registry.lab.example.com/node-hello ... 
All endpoints blocked.

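Conclusion: the output above shows that all endpoints are blocked. In OpenShift, this type of error usually results from an incorrect deployment configuration or an invalid docker configuration.

Fixing the docker configuration

[[email protected] ~]# vi /etc/sysconfig/docker
Change
BLOCK_REGISTRY='--block-registry registry.access.redhat.com --block-registry docker.io --block-registry registry.lab.example.com'
to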
BLOCK_REGISTRY='--block-registry registry.access.redhat.com --block-registry docker.io'
[[email protected] ~]# systemctl restart docker

Note: the same change is also required on node2.
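Because the same edit has to be repeated on each node, it can be scripted instead of made in vi. The sketch below is one possible approach, not the course's method: unblock_registry is a hypothetical helper name, it assumes the BLOCK_REGISTRY line uses the single-quoted format shown above, and the dots in the registry name are treated as regular-expression wildcards by sed.

```shell
# Remove one "--block-registry <registry>" entry from the BLOCK_REGISTRY
# line of a docker sysconfig file, keeping a .bak copy of the original.
# unblock_registry is a hypothetical helper name.
unblock_registry() {
    registry=$1
    file=${2:-/etc/sysconfig/docker}
    # The trailing [ '] requires a space or the closing quote after the
    # registry name, so longer names sharing the same prefix are not touched.
    sed -i.bak "/^BLOCK_REGISTRY=/ s| *--block-registry $registry\([ ']\)|\1|" "$file"
}

# Usage (as root on each node):
#   unblock_registry registry.lab.example.com
#   systemctl restart docker
```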

Updating the pod

[[email protected] ~]$ oc rollout latest hello
[[email protected] ~]$ oc get pods        # confirm
NAME             READY     STATUS    RESTARTS   AGE
hello-1-deploy   0/1       Error     0          10m
hello-2-scrbl    1/1       Running   0          28s

Verification

[[email protected] ~]$ oc logs hello-2-scrbl 
nodejs server running on http://0.0.0.0:3000

Exposing the service

[[email protected] ~]$ oc expose svc hello --hostname=hello.apps.lab.example.com
route "hello" exposed

Testing the service

[[email protected] ~]$ curl http://hello.apps.lab.example.com
Hi! I am running on host -> hello-2-scrbl
[[email protected] ~]$ lab execute-review grade # grade the lab with the script

Cleaning up the lab

[[email protected] ~]$ oc delete project execute-review

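Summary

The RHCA certification requires studying for and passing five exams, so it takes a good deal of time to study and prepare. Keep at it, you can do it 🤪.

That concludes 金鱼哥's overview of Chapter 4, OpenShift Commands and Troubleshooting: Common Troubleshooting and Chapter Labs. I hope it helps those who read it.

Red Hat certification column series:
RHCSA column: 戏说 RHCSA 认证
RHCE column: 戏说 RHCE 认证
This article is part of the RHCA column: RHCA 回忆录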



Copyright notice
This article was written by [IT民工金鱼哥]. Please include a link to the original when reposting. Thank you.
https://blog.csdn.net/qq_41765918/article/details/125377175
