DO280 OpenShift Commands and Troubleshooting: Common Troubleshooting and Chapter Labs

2022-06-23 03:52:00 IT民工金鱼哥

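Common Environment Information

If OCP was installed from RPMs, the OCP-related services on the master and the nodes run as Red Hat Enterprise Linux services. Run the standard sosreport utility on the master and the nodes to collect information about the environment, including docker- and OpenShift-related information.

[root@master ~]# sosreport -k docker.all=on -k docker.logs=on

The sosreport command creates a compressed archive containing all the relevant information and saves it in the /var/tmp directory.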
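
Since /var/tmp can fill up with several archives, a small helper for picking out the most recent one can be handy. The following is only a sketch: the function name latest_sosreport is made up for illustration, and the sosreport-*.tar.xz file name pattern is an assumption that may need adjusting for your sos version.

```shell
# Print the newest sosreport archive in a directory (default: /var/tmp).
# Assumption: archives are named sosreport-*.tar.xz.
latest_sosreport() {
    dir="${1:-/var/tmp}"
    # ls -t sorts by modification time, newest first; errors are hidden
    # so that an empty directory simply yields no output.
    ls -t "$dir"/sosreport-*.tar.xz 2>/dev/null | head -n 1
}

# Usage sketch: list the first entries of the newest archive.
# archive=$(latest_sosreport) && tar -tJf "$archive" | head
```

The function only reads the directory listing, so it is safe to run before deciding which archive to copy off the host.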

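Another useful diagnostic tool is the oc adm diagnostics command, which runs a number of diagnostic checks against an OpenShift cluster, including checks of networking, logging, the internal registry, and the master and node services. Run oc adm diagnostics --help for help.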


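Common Diagnostic Commands

The oc client command is the primary tool for detecting and troubleshooting problems in an OpenShift cluster. It has many options for detecting, diagnosing, and fixing problems with the hosts and nodes, services, and resources managed by the cluster. With the required permissions, you can directly edit the configuration of most managed resources in the cluster.


oc get events

Events let OpenShift record information about lifecycle events in the cluster and give a unified way to view information about OpenShift components. The oc get events command provides event information for an OpenShift namespace and captures events such as:

    • Pod creation and deletion
    • The node a pod was scheduled to
    • Master and node status

Events are typically used in troubleshooting to get high-level information about failures and problems in the cluster. Log files and other oc subcommands are then used to narrow down the cause.

Example: use the following command to list the events in a specific project.

[student@workstation ~]$ oc get events -n <project>

Events can also be viewed in the web console.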
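
On a busy project the event list gets long, and usually only the warnings matter at first. As a rough sketch (the function name warning_events is hypothetical), the output of oc get events can be piped through a filter that keeps the header row and the Warning rows. It matches the word Warning surrounded by whitespace, so a Normal event whose message happens to contain that word would also slip through.

```shell
# Keep the header row plus every event row containing the TYPE "Warning".
# Reads `oc get events` output on stdin, so it can be tried offline.
warning_events() {
    awk 'NR == 1 || /[[:space:]]Warning[[:space:]]/'
}

# Usage sketch (the project name is a placeholder):
# oc get events -n <project> | warning_events
```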


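oc logs

The oc logs command displays the log output of a build, deployment, or pod.

Example 1: view the logs of a pod.

[student@workstation ~]$ oc logs <pod>

Example 2: view the logs of a build.

[student@workstation ~]$ oc logs bc/build-name

Use oc logs with the -f option to follow the log output in real time. This is useful, for example, for continuously monitoring the progress of a build and checking for errors.

Logs can also be viewed in the web console.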


oc rsync

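The oc rsync command copies content to or from a directory in a running pod. If a pod has multiple containers, use the -c option to specify the container; otherwise it defaults to the first container in the pod. It is commonly used to transfer log files and configuration files from a container.

Example 1: copy the contents of a pod directory to a local directory.

[student@workstation ~]$ oc rsync <pod>:<pod_dir> <local_dir> -c <container>

Example 2: copy content from a local directory into a directory in the pod.

[student@workstation ~]$ oc rsync <local_dir> <pod>:<pod_dir> -c <container>

oc port-forward

Use the oc port-forward command to forward one or more local ports to a pod. This lets you listen on a specific or random local port and forward the data to a specific port in the pod.

Example: listen locally on port 3306 and forward to port 3306 in the pod.

[student@workstation ~]$ oc port-forward <pod> 3306:3306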

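Common Failures

Resource Limit and Quota Issues

For projects with resource limits and quotas set, an inappropriate resource configuration causes deployments to fail. Use the oc get events and oc describe commands to find the cause of the failure.

For example, if creating a pod would exceed the project quota, oc get events reports a message like the following (here the CPU quota is exceeded):

Warning FailedCreate {hello-1-deploy} Error creating: pods "hello-1" is forbidden:
exceeded quota: project-quota, requested: cpu=250m, used: cpu=750m, limited: cpu=900m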
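
The arithmetic behind that message is simple: 250m requested plus 750m already used is 1000m, which is 100m over the 900m limit. The sketch below pulls the three numbers out of such a message and prints the overrun. The function names are hypothetical, and it assumes all values are given in millicores with an m suffix, as in the sample; whole-core values such as cpu=1 are not handled.

```shell
# Extract one millicore value from a quota message.
# $1 = field name (requested | used | limited), $2 = message text.
quota_cpu() {
    printf '%s\n' "$2" | sed -n "s/.*$1: cpu=\([0-9][0-9]*\)m.*/\1/p"
}

# Print how many millicores the request would exceed the limit by
# (zero or negative means the request fits).
quota_overrun() {
    req=$(quota_cpu requested "$1")
    used=$(quota_cpu used "$1")
    lim=$(quota_cpu limited "$1")
    # Bail out if the message did not have the expected shape.
    [ -n "$req" ] && [ -n "$used" ] && [ -n "$lim" ] || return 1
    echo $((req + used - lim))
}

msg='exceeded quota: project-quota, requested: cpu=250m, used: cpu=750m, limited: cpu=900m'
quota_overrun "$msg"    # prints 100
```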

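S2I Build Failures

Use the oc logs command to investigate S2I build failures. For example, to view the logs of a build configuration named hello:

[student@workstation ~]$ oc logs bc/hello

You can adjust the verbosity of the build logs by specifying the BUILD_LOGLEVEL environment variable in the build configuration strategy, for example:

{
  "sourceStrategy": {
    ...
    "env": [
      {
        "name": "BUILD_LOGLEVEL",
        "value": "5"
      }
    ]
  }
}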

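ErrImagePull and ImagePullBackOff Errors

These errors are usually caused by an incorrect deployment configuration, by a wrong or missing image referenced during deployment, or by an improper Docker configuration.

Use the oc get events and oc describe commands to investigate, then fix the error by editing the deployment configuration with oc edit dc/ followed by the deployment configuration name.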
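
To see quickly which pods are affected, the STATUS column of oc get pods can be filtered. This is just a sketch; the function name image_pull_failures is hypothetical, and it assumes the default column layout in which STATUS is the third column.

```shell
# Print the names of pods whose STATUS is ErrImagePull or ImagePullBackOff.
# Reads `oc get pods` output on stdin (NAME, READY, STATUS, ...).
image_pull_failures() {
    awk 'NR > 1 && ($3 == "ErrImagePull" || $3 == "ImagePullBackOff") { print $1 }'
}

# Usage sketch:
# oc get pods | image_pull_failures
```

Each name it prints is a candidate for oc describe pod.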


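docker Configuration Problems

An incorrect docker configuration on the master and the nodes can cause many errors during deployment.

Typically, check the ADD_REGISTRY, INSECURE_REGISTRY, and BLOCK_REGISTRY settings. Use the systemctl status, oc logs, oc get events, and oc describe commands to investigate the problem.

You can change the log level of the docker service by adding the --log-level parameter in the /etc/sysconfig/docker configuration file.

Example: set the log level to debug.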

OPTIONS='--insecure-registry=172.30.0.0/16 --selinux-enabled --log-level=debug'

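Master and Node Failures

Run the systemctl status command to look for problems in the atomic-openshift-master, atomic-openshift-node, etcd, and docker services. Use the journalctl -u command to view the system logs for these services.

To get more detailed logging from the atomic-openshift-node, atomic-openshift-master-controllers, and atomic-openshift-master-api services, edit the --loglevel variable in the respective configuration file and then restart the associated service.

Example: to set the OpenShift master controller log level to debug, edit the /etc/sysconfig/atomic-openshift-master-controllers file.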

OPTIONS=--loglevel=4 --listen=https://0.0.0.0:8444
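
If you switch log levels often, the edit can be scripted. The helper below is a minimal sketch (set_loglevel is a made-up name) that rewrites an existing --loglevel=N value in a sysconfig file. It assumes the option is already present in the file and does not restart the service.

```shell
# Rewrite the --loglevel value in a sysconfig file.
# $1 = file, $2 = new level. Output goes through a temp file so that a
# failed sed run does not truncate the original.
set_loglevel() {
    file="$1"; level="$2"
    tmp=$(mktemp) || return 1
    sed "s/--loglevel=[0-9][0-9]*/--loglevel=${level}/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"    # cat keeps the file's owner and mode
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Usage sketch (run as root on the master):
# set_loglevel /etc/sysconfig/atomic-openshift-master-controllers 4 &&
#     systemctl restart atomic-openshift-master-controllers
```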

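Further notes:

Red Hat OpenShift Container Platform has five levels of log verbosity. Regardless of the log configuration, messages with fatal, error, warning, and some info severity always appear in the logs.

  • 0: errors and warnings only
  • 2: normal information (default)
  • 4: debug-level information
  • 6: API-level debug information (request/response)
  • 8: API debug information with full request bodies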

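Pod Scheduling Failures

The OpenShift master schedules pods to run on nodes. Pods commonly fail to run because the node is not in the Ready state, or because of resource limits and quotas.

Use the oc get nodes command to verify the node status. While scheduling is failing, the pod stays in the Pending state, which you can check with oc get pods -o wide; this command also shows which node the pod is scheduled to run on. Use the oc get events and oc describe pod commands to check the details of the scheduling failure.

Example 1: the pod below failed to schedule because of insufficient CPU.

{default-scheduler } Warning FailedScheduling pod (hello-phb4j) failed to fit in any node
fit failure on node (hello-wx0s): Insufficient cpu
fit failure on node (hello-tgfm): Insufficient cpu
fit failure on node (hello-qwds): Insufficient cpu

Example 2: the pod below failed to schedule because no node is in the Ready state; investigate with oc describe.

{default-scheduler } Warning FailedScheduling pod (hello-phb4j): no nodes available to schedule pods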

Guided Exercise

Environment Preparation

[student@workstation ~]$ lab install-prepare setup
[student@workstation ~]$ cd /home/student/do280-ansible
[student@workstation do280-ansible]$ ./install.sh

Note: skip this step if you already have a complete environment.


Exercise Setup

[student@workstation ~]$ lab common-troubleshoot setup

Create the Application

[student@workstation ~]$ oc login -u developer -p redhat https://master.lab.example.com
[student@workstation ~]$ oc new-project common-troubleshoot
[student@workstation ~]$ oc new-app --name=hello -i php:5.4 \
 http://services.lab.example.com/php-helloworld         # create the application from source code
error: multiple images or templates matched "php:5.4": 2

The argument "php:5.4" could apply to the following Docker images, OpenShift image streams, or templates:

* Image stream "php" (tag "5.6") in project "openshift"
  Use --image-stream="openshift/php:5.6" to specify this image or template

* Image stream "php" (tag "7.0") in project "openshift"
  Use --image-stream="openshift/php:7.0" to specify this image or template

View Details

[student@workstation ~]$ oc describe is php -n openshift
7.1 (latest)
  tagged from registry.lab.example.com/rhscl/php-71-rhel7:latest

  Build and run PHP 7.1 applications on RHEL 7. For more information about using this builder image, including OpenShift considerations, see https://github.com/sclorg/s2i-php-container/blob/master/7.1/README.md.
  Tags: builder, php
  Supports: php:7.1, php
  Example Repo: https://github.com/openshift/cakephp-ex.git

  ! error: Import failed (NotFound): dockerimage.image.openshift.io "registry.lab.example.com/rhscl/php-71-rhel7:latest" not found
      3 days ago
…………
5.5
  tagged from registry.lab.example.com/openshift3/php-55-rhel7:latest

  Build and run PHP 5.5 applications on RHEL 7. For more information about using this builder image, including OpenShift considerations, see https://github.com/sclorg/s2i-php-container/blob/master/5.5/README.md.
  Tags: hidden, builder, php
  Supports: php:5.5, php
  Example Repo: https://github.com/openshift/cakephp-ex.git

  ! error: Import failed (NotFound): dockerimage.image.openshift.io "registry.lab.example.com/openshift3/php-55-rhel7:latest" not found

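Conclusion: the output shows that the required image does not exist in the registry.


Fix the Error

[student@workstation ~]$ oc new-app --name=hello -i php:7.0 http://services.lab.example.com/php-helloworld
[student@workstation ~]$ oc get pod -o wide      # check again: the build pod stays in Pending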
NAME READY STATUS RESTARTS AGE IP NODE
hello-1-build 0/1 Pending 0 40s <none> <none>

View Details

[student@workstation ~]$ oc log hello-1-build		# view the log
W0301 17:25:02.867828    4584 cmd.go:358] log is DEPRECATED and will be removed in a future version. Use logs instead. 

[student@workstation ~]$ oc get events			# view the events
LAST SEEN   FIRST SEEN   COUNT     NAME                             KIND      SUBOBJECT   TYPE      REASON             SOURCE              MESSAGE
16s         47s          7         hello-1-build.16682daab914ecb6   Pod                   Warning   FailedScheduling   default-scheduler   0/3 nodes are available: 1 MatchNodeSelector, 2 NodeNotReady.
[student@workstation ~]$ oc describe pod hello-1-build	# view the pod details
……
Events:
  Type     Reason            Age               From               Message
  ----     ------            ----              ----               -------
  Warning  FailedScheduling  23s (x8 over 1m)  default-scheduler  0/3 nodes are available: 1 MatchNodeSelector, 2 NodeNotReady.

Conclusion: the output shows that no node is available to schedule this pod.

[root@master ~]# oc get nodes # on the master, inspect the node status further
NAME                     STATUS     ROLES     AGE       VERSION
master.lab.example.com   Ready      master    1d        v1.9.1+a0ce1bc657
node1.lab.example.com    NotReady   compute   1d        v1.9.1+a0ce1bc657
node2.lab.example.com    NotReady   compute   1d        v1.9.1+a0ce1bc657

Conclusion: the output shows that the compute nodes are unhealthy; neither of them is in the Ready state.
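
With only three hosts the problem is easy to spot by eye, but on a larger cluster a filter helps. The following sketch (not_ready_nodes is a hypothetical name) prints every node whose STATUS does not start with Ready, so a node shown as Ready,SchedulingDisabled is not reported.

```shell
# Print the names of nodes whose STATUS column does not start with "Ready".
# Reads `oc get nodes` output on stdin (NAME, STATUS, ROLES, AGE, VERSION).
not_ready_nodes() {
    awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Usage sketch (on the master):
# oc get nodes | not_ready_nodes
```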


Check the Services

[root@node1 ~]# systemctl status atomic-openshift-node.service
[root@node2 ~]# systemctl status atomic-openshift-node.service
[root@node2 ~]# systemctl status docker
[root@node1 ~]# systemctl status docker
● docker.service - Docker Application Container Engine
   Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
   Active: inactive (dead) since Mon 2021-03-01 17:23:12 CST; 4min 52s ago
     Docs: http://docs.docker.com
 Main PID: 17637 (code=exited, status=0/SUCCESS)

Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.375792111+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.382396227+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.387020843+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.394091193+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.402339410+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.404059183+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.413005258+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.436107140+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.485170808+08:00" level=i...ed"
Mar 01 17:23:12 node1.lab.example.com systemd[1]: Stopped Docker Application Container Engine.
Hint: Some lines were ellipsized, use -l to show in full.

Conclusion: the output shows that the docker service on the nodes is not running.


Start the Service

[root@node1 ~]# systemctl start docker
[root@node2 ~]# systemctl start docker

Verify

[root@master ~]# oc get nodes # check the node status again
NAME STATUS ROLES AGE VERSION
master.lab.example.com Ready master 1d v1.9.1+a0ce1bc657
node1.lab.example.com Ready compute 1d v1.9.1+a0ce1bc657
node2.lab.example.com Ready compute 1d v1.9.1+a0ce1bc657

[student@workstation ~]$ oc get pods       # confirm that the pod is now scheduled to a node
NAME READY STATUS RESTARTS AGE
hello-1-build 1/1 Running 0 22m

[student@workstation ~]$ oc describe is    # view the image stream details
Name:			hello
Namespace:		common-troubleshoot
Created:		15 minutes ago
Labels:			app=hello
Annotations:		openshift.io/generated-by=OpenShiftNewApp
Docker Pull Spec:	docker-registry.default.svc:5000/common-troubleshoot/hello
Image Lookup:		local=false
Unique Images:		1
Tags:			1

latest
  no spec tag

  * docker-registry.default.svc:5000/common-troubleshoot/hello@sha256:8d63ed61d6e9c74933fe0d0d8aadceecb71751abf260f10645c19737a3e13354
      10 minutes ago

Conclusion: the image stream shows that the built image has been pushed to the internal registry.


Delete the Project

[student@workstation ~]$ oc delete project common-troubleshoot

Comprehensive Lab

Environment Preparation

[student@workstation ~]$ lab install-prepare setup
[student@workstation ~]$ cd /home/student/do280-ansible
[student@workstation do280-ansible]$ ./install.sh

Note: skip this step if you already have a complete environment.


Lab Setup

[student@workstation ~]$ lab execute-review setup

Clone the Git Project Locally

[student@workstation ~]$ cd /home/student/DO280/labs/execute-review/
[student@workstation execute-review]$ git clone http://services.lab.example.com/node-hello

Build the Image with docker

[student@workstation execute-review]$ cd node-hello/
[student@workstation node-hello]$ docker build -t node-hello:latest .
[student@workstation node-hello]$ docker images              # list the images
REPOSITORY                                      TAG                 IMAGE ID            CREATED             SIZE
node-hello                                      latest              9b3befb0536b        9 seconds ago       495 MB
registry.lab.example.com/rhscl/nodejs-6-rhel7   latest              fba56b5381b7        3 years ago         489 MB

Retag the Image

[student@workstation node-hello]$ docker tag 9b3befb0536b registry.lab.example.com/node-hello:latest
[student@workstation node-hello]$ docker images
REPOSITORY                                      TAG                 IMAGE ID            CREATED              SIZE
node-hello                                      latest              9b3befb0536b        About a minute ago   495 MB
registry.lab.example.com/node-hello             latest              9b3befb0536b        About a minute ago   495 MB
registry.lab.example.com/rhscl/nodejs-6-rhel7   latest              fba56b5381b7        3 years ago          489 MB

Push the Image

[student@workstation node-hello]$ docker push registry.lab.example.com/node-hello:latest

Select the Project and Create the Application

[student@workstation ~]$ oc login -u developer -p redhat https://master.lab.example.com
[student@workstation ~]$ oc projects
[student@workstation ~]$ oc project execute-review
[student@workstation ~]$ oc new-app registry.lab.example.com/node-hello --name hello
[student@workstation ~]$ oc get all           # view all resources
NAME                      REVISION   DESIRED   CURRENT   TRIGGERED BY
deploymentconfigs/hello   1          1         1         config,image(hello:latest)

NAME                 DOCKER REPO                                             TAGS      UPDATED
imagestreams/hello   docker-registry.default.svc:5000/execute-review/hello   latest    12 seconds ago

NAME                READY     STATUS             RESTARTS   AGE
po/hello-1-deploy   1/1       Running            0          12s
po/hello-1-zswgc    0/1       ImagePullBackOff   0          9s

NAME         DESIRED   CURRENT   READY     AGE
rc/hello-1   1         1         0         12s

NAME        TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)             AGE
svc/hello   ClusterIP   172.30.7.229   <none>        3000/TCP,8080/TCP   12s

Troubleshoot ImagePullBackOff

[student@workstation ~]$ oc logs hello-1-zswgc          # view the logs
Error from server (BadRequest): container "hello" in pod "hello-1-zswgc " is waiting to start: trying and failing to pull image
[student@workstation ~]$ oc describe pod hello-1-zswgc   # view the pod details
[student@workstation ~]$ oc get events --sort-by='.metadata.creationTimestamp'   # view the events

Conclusion: the output shows that the image pull is failing.


Pull the Image Manually

[student@workstation ~]$ oc get pod -o wide
NAME             READY     STATUS             RESTARTS   AGE       IP            NODE
hello-1-deploy   1/1       Running            0          32s       10.129.0.93   node2.lab.example.com
hello-1-zswgc    0/1       ImagePullBackOff   0          30s       <none>        node2.lab.example.com 
[root@node1 ~]# docker pull registry.lab.example.com/node-hello
Using default tag: latest
Trying to pull repository registry.lab.example.com/node-hello ... 
All endpoints blocked.

Conclusion: the output shows that all endpoints are blocked. In OpenShift, this type of error is usually caused by an incorrect deployment configuration or an invalid docker configuration.


Fix the docker Configuration

[root@node1 ~]# vi /etc/sysconfig/docker
Change
BLOCK_REGISTRY='--block-registry registry.access.redhat.com --block-registry docker.io --block-registry registry.lab.example.com'
to
BLOCK_REGISTRY='--block-registry registry.access.redhat.com --block-registry docker.io'
[root@node1 ~]# systemctl restart docker

Note: repeat the same steps on node2.
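
Because the same edit has to be made on every node, it can be worth scripting. The function below is only a sketch under a few assumptions: unblock_registry is a hypothetical name, the registry is given as a plain host name (or host:port), and BLOCK_REGISTRY uses the space-separated "--block-registry <name>" form shown above.

```shell
# Remove one "--block-registry <registry>" entry from the BLOCK_REGISTRY
# line of a docker sysconfig file. $1 = file, $2 = registry host name.
unblock_registry() {
    file="$1"; registry="$2"
    # Escape the dots so that they match literally in the sed pattern.
    esc=$(printf '%s' "$registry" | sed 's/\./\\./g')
    tmp=$(mktemp) || return 1
    # The entry must be followed by a space or a closing quote, which
    # avoids matching a longer registry name with the same prefix.
    sed "/^BLOCK_REGISTRY=/ s| *--block-registry ${esc}\([ '\"]\)|\1|" \
        "$file" > "$tmp" && cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return "$rc"
}

# Usage sketch (run as root on each node):
# unblock_registry /etc/sysconfig/docker registry.lab.example.com &&
#     systemctl restart docker
```

All other lines in the file are left untouched, so OPTIONS, ADD_REGISTRY, and INSECURE_REGISTRY keep their values.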


Redeploy the Pod

[student@workstation ~]$ oc rollout latest hello
[student@workstation ~]$ oc get pods        # confirm
NAME             READY     STATUS    RESTARTS   AGE
hello-1-deploy   0/1       Error     0          10m
hello-2-scrbl    1/1       Running   0          28s

Verify

[student@workstation ~]$ oc logs hello-2-scrbl 
nodejs server running on http://0.0.0.0:3000

Expose the Service

[student@workstation ~]$ oc expose svc hello --hostname=hello.apps.lab.example.com
route "hello" exposed

Test the Service

[student@workstation ~]$ curl http://hello.apps.lab.example.com
Hi! I am running on host -> hello-2-scrbl
[student@workstation ~]$ lab execute-review grade # grade the lab with the script

Clean Up the Lab

[student@workstation ~]$ oc delete project execute-review

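Summary

The RHCA certification requires studying for and passing five exams, which takes a good deal of time to prepare for. Keep at it, you can do it 🤪.

That concludes 金鱼哥's overview and walkthrough of Chapter 4, OpenShift Commands and Troubleshooting: Common Troubleshooting and Chapter Labs. I hope it helps everyone who reads this article.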

Copyright notice
This article was written by [IT民工金鱼哥]. Please include a link to the original when reposting. Thank you.
https://blog.csdn.net/qq_41765918/article/details/125377175