DO280 OpenShift Commands and Troubleshooting: Common Troubleshooting and Chapter Labs
2022-06-23 03:52:00 【IT民工金鱼哥】
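About the author: Hello everyone, I am 金鱼哥, a rising-star creator in the CSDN operations field, a Huawei Cloud Sharing Expert (云享专家), and an expert blogger in the Alibaba Cloud community.
Qualifications: CCNA, HCNP, CSNA (network analyst), junior and intermediate Network Engineer (软考), RHCSA, RHCE, RHCA, RHCI, ITIL
Motto: Hard work does not guarantee success, but success requires hard work. To support me: like, bookmark, or leave a comment.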
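Gathering Environment Information
When OCP is installed from RPMs, the OCP services on the master and node hosts run as Red Hat Enterprise Linux services. Run the standard sosreport utility on the masters and nodes to collect information about the environment, including Docker- and OpenShift-related information.
# sosreport -k docker.all=on -k docker.logs=on
The sosreport command creates a compressed archive containing all the relevant information and saves it in the /var/tmp directory.
Another useful diagnostic tool is the oc adm diagnostics command, which runs several diagnostic checks against an OpenShift cluster, covering the network, logging, the internal registry, and the master and node services, among others. Run oc adm diagnostics --help for details.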
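Common Diagnostic Commands
The oc client is the main tool for detecting and troubleshooting problems in an OpenShift cluster. Its many options let you inspect, diagnose, and fix problems with the hosts and nodes, services, and resources that the cluster manages. With the required permissions, you can directly edit the configuration of most managed resources in the cluster.
oc get events
Events let OpenShift record information about life-cycle events in the cluster, so that information about OpenShift components can be viewed in a uniform way. The oc get events command reports the events in an OpenShift namespace, including:
- Pod creation and deletion
- The node a pod is scheduled to
- The status of master and node hosts
Events are typically the first stop in troubleshooting: they give high-level information about failures and problems in the cluster, which you then narrow down with log files and other oc subcommands.
Example: list the events in a specific project.
$ oc get events -n <project>
Events can also be viewed in the web console.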
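Because an event listing mixes Normal and Warning entries, it can help to filter it before reading. The following is a minimal sketch, not part of the course material: the function name warning_events is hypothetical, and the input is illustrative text rather than live cluster output. On a real cluster the input would come from oc get events -n <project>.

```shell
#!/bin/sh
# Hypothetical helper: keep the header line plus every line that has
# a whitespace-separated field equal to "Warning".
warning_events() {
    awk 'NR == 1 { print; next }
         { for (i = 1; i <= NF; i++) if ($i == "Warning") { print; next } }'
}

# Illustrative input only; the last line is made up for contrast.
printf '%s\n' \
    'LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE' \
    '16s 47s 7 hello-1-build.16682daab914ecb6 Pod Warning FailedScheduling default-scheduler 0/3 nodes are available: 1 MatchNodeSelector, 2 NodeNotReady.' \
    '1m 1m 1 example-pod.0000000000000000 Pod Normal Scheduled default-scheduler (made-up line)' |
    warning_events
```

Matching on a whole field rather than a substring avoids picking up lines where the word only appears inside a longer token.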
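oc logs
The oc logs command shows the log output of a build, a deployment, or a pod.
Example 1: view the logs of a pod.
$ oc logs <pod>
Example 2: view the logs of a build.
$ oc logs bc/<build-name>
Use oc logs with the -f option to follow the log output in real time, which is useful for continuously monitoring the progress of a build and checking for errors.
Logs can also be viewed in the web console.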
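oc rsync
The oc rsync command copies content to or from a directory in a running pod. If the pod has multiple containers, use the -c option to name the container; otherwise the first container in the pod is used. It is typically used to transfer log files and configuration files from a container.
Example 1: copy the contents of a pod directory to a local directory.
$ oc rsync <pod>:<pod_dir> <local_dir> -c <container>
Example 2: copy the contents of a local directory to a directory in the pod.
$ oc rsync <local_dir> <pod>:<pod_dir> -c <container>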
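oc port-forward
The oc port-forward command forwards one or more local ports to a pod. You can listen on a specific or a random local port and have the data forwarded to a given port in the pod.
Example: listen locally on port 3306 and forward to port 3306 in the pod.
$ oc port-forward <pod> 3306:3306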
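Common Failures
Resource Limit and Quota Problems
In projects with resource limits and quotas, a misconfigured resource request causes the deployment to fail. Use oc get events and oc describe to find the cause.
For example, if creating a pod would exceed the project quota (here the CPU quota: 250m requested plus 750m already used is more than the 900m limit), oc get events reports:
Warning FailedCreate {hello-1-deploy} Error creating: pods "hello-1" is forbidden: exceeded quota: project-quota, requested: cpu=250m, used: cpu=750m, limited: cpu=900m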
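The arithmetic behind that message can be sketched as follows. This is only an illustration of the comparison, assuming all three values are given in millicores with an m suffix; the function name exceeds_cpu_quota is hypothetical and is not an oc feature.

```shell
#!/bin/sh
# Hypothetical helper: exit 0 when requested + used is greater than the limit.
# Arguments are CPU values in millicores, for example 250m.
exceeds_cpu_quota() {
    requested=${1%m}   # strip the trailing "m"
    used=${2%m}
    limit=${3%m}
    [ $((requested + used)) -gt "$limit" ]
}

# Values taken from the event message: 250m + 750m = 1000m, above the 900m limit.
if exceeds_cpu_quota 250m 750m 900m; then
    echo "quota exceeded: 250m + 750m > 900m"
fi
```

Either lowering the pod's CPU request or raising the quota brings the sum back under the limit.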
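S2I Build Failures
Use the oc logs command to investigate a failed S2I build. For example, to view the logs of the build configuration named hello:
$ oc logs bc/hello
You can adjust the verbosity of the build logs by setting the BUILD_LOGLEVEL environment variable in the build configuration strategy.
{
  "sourceStrategy": {
    ...
    "env": [
      {
        "name": "BUILD_LOGLEVEL",
        "value": "5"
      }
    ]
  }
}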
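ErrImagePull and ImagePullBackOff Errors
These errors are usually caused by an incorrect deployment configuration, by a wrong or missing image referenced during deployment, or by a misconfigured Docker.
Use oc get events and oc describe to investigate, and fix the error by editing the deployment configuration with oc edit dc/<name>.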
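Docker Configuration Problems
An incorrect Docker configuration on the masters and nodes can cause many errors during deployment.
Check the ADD_REGISTRY, INSECURE_REGISTRY, and BLOCK_REGISTRY settings in particular. Use systemctl status, oc logs, oc get events, and oc describe to investigate.
To change the log level of the docker service, add the --log-level parameter to OPTIONS in the /etc/sysconfig/docker configuration file.
Example: set the log level to debug.
OPTIONS='--insecure-registry=172.30.0.0/16 --selinux-enabled --log-level=debug'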
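Master and Node Failures
Run systemctl status to investigate problems in the atomic-openshift-master, atomic-openshift-node, etcd, and docker services. Use journalctl -u <unit> to view the system log for any of these services.
To get more detailed logging from the atomic-openshift-node, atomic-openshift-master-controllers, and atomic-openshift-master-api services, edit the --loglevel option in the respective configuration file and restart the associated service.
Example: set the log level of the OpenShift master controllers to debug by editing /etc/sysconfig/atomic-openshift-master-controllers.
OPTIONS=--loglevel=4 --listen=https://0.0.0.0:8444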
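Further reading:
Red Hat OpenShift Container Platform has five log verbosity levels. Regardless of the configured level, messages with fatal, error, warning, and some info severities always appear in the logs.
- 0: errors and warnings only
- 2: normal information (default)
- 4: debug-level information
- 6: API-level debug information (request/response)
- 8: API debug information with the full request body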
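Pod Scheduling Failures
The OpenShift master schedules pods to run on nodes. Scheduling usually fails either because the nodes are not in the Ready state or because resource limits and quotas prevent the pod from running.
Use oc get nodes to verify node status. While scheduling fails, the pod stays in the Pending state; check it with oc get pods -o wide, which also shows the node the pod is scheduled to run on. Use oc get events and oc describe pod to see the details of the scheduling failure.
Example 1: the pod below fails to schedule because of insufficient CPU.
{default-scheduler } Warning FailedScheduling pod (hello-phb4j) failed to fit in any node
fit failure on node (hello-wx0s): Insufficient cpu
fit failure on node (hello-tgfm): Insufficient cpu
fit failure on node (hello-qwds): Insufficient cpu
Example 2: the pod below fails to schedule because no node is in the Ready state; investigate with oc describe.
{default-scheduler } Warning FailedScheduling pod (hello-phb4j): no nodes available to schedule pods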
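When a scheduling message lists many nodes, a quick tally of the fit-failure reasons shows whether every node fails for the same cause. The sketch below assumes lines in the "fit failure on node (<node>): <reason>" form shown in Example 1; the function name fit_failure_reasons is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: extract the reason from each "fit failure" line
# and count how many nodes report each reason.
fit_failure_reasons() {
    sed -n 's/.*fit failure on node ([^)]*): //p' | sort | uniq -c
}

# Input copied from Example 1; prints one count per distinct reason.
printf '%s\n' \
    'fit failure on node (hello-wx0s): Insufficient cpu' \
    'fit failure on node (hello-tgfm): Insufficient cpu' \
    'fit failure on node (hello-qwds): Insufficient cpu' |
    fit_failure_reasons
```

A single reason across all nodes, as here, points at the pod's resource request or the cluster capacity rather than at one faulty node.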
Guided Exercise
Environment Preparation
$ lab install-prepare setup
$ cd /home/student/do280-ansible
$ ./install.sh
Note: skip this step if you already have a complete environment.
Exercise Setup
$ lab common-troubleshoot setup
Create the Application
$ oc login -u developer -p redhat https://master.lab.example.com
$ oc new-project common-troubleshoot
$ oc new-app --name=hello -i php:5.4 \
http://services.lab.example.com/php-helloworld # create the application from source code
error: multiple images or templates matched "php:5.4": 2
The argument "php:5.4" could apply to the following Docker images, OpenShift image streams, or templates:
* Image stream "php" (tag "5.6") in project "openshift"
Use --image-stream="openshift/php:5.6" to specify this image or template
* Image stream "php" (tag "7.0") in project "openshift"
Use --image-stream="openshift/php:7.0" to specify this image or template
View the Details
$ oc describe is php -n openshift
7.1 (latest)
tagged from registry.lab.example.com/rhscl/php-71-rhel7:latest
Build and run PHP 7.1 applications on RHEL 7. For more information about using this builder image, including OpenShift considerations, see https://github.com/sclorg/s2i-php-container/blob/master/7.1/README.md.
Tags: builder, php
Supports: php:7.1, php
Example Repo: https://github.com/openshift/cakephp-ex.git
! error: Import failed (NotFound): dockerimage.image.openshift.io "registry.lab.example.com/rhscl/php-71-rhel7:latest" not found
3 days ago
…………
5.5
tagged from registry.lab.example.com/openshift3/php-55-rhel7:latest
Build and run PHP 5.5 applications on RHEL 7. For more information about using this builder image, including OpenShift considerations, see https://github.com/sclorg/s2i-php-container/blob/master/5.5/README.md.
Tags: hidden, builder, php
Supports: php:5.5, php
Example Repo: https://github.com/openshift/cakephp-ex.git
! error: Import failed (NotFound): dockerimage.image.openshift.io "registry.lab.example.com/openshift3/php-55-rhel7:latest" not found
Conclusion: the output shows that the required image does not exist in the registry.
Fix the Error
$ oc new-app --name=hello -i php:7.0 http://services.lab.example.com/php-helloworld
$ oc get pod -o wide # check again: the build pod stays in Pending
NAME READY STATUS RESTARTS AGE IP NODE
hello-1-build 0/1 Pending 0 40s <none> <none>
View the Details
$ oc log hello-1-build # view the log
W0301 17:25:02.867828 4584 cmd.go:358] log is DEPRECATED and will be removed in a future version. Use logs instead.
$ oc get events # view the events
LAST SEEN FIRST SEEN COUNT NAME KIND SUBOBJECT TYPE REASON SOURCE MESSAGE
16s 47s 7 hello-1-build.16682daab914ecb6 Pod Warning FailedScheduling default-scheduler 0/3 nodes are available: 1 MatchNodeSelector, 2 NodeNotReady.
$ oc describe pod hello-1-build # view the pod details
……
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 23s (x8 over 1m) default-scheduler 0/3 nodes are available: 1 MatchNodeSelector, 2 NodeNotReady.
Conclusion: the output shows that no node is available to schedule this pod.
# oc get nodes # on the master, check the state of the nodes
NAME STATUS ROLES AGE VERSION
master.lab.example.com Ready master 1d v1.9.1+a0ce1bc657
node1.lab.example.com NotReady compute 1d v1.9.1+a0ce1bc657
node2.lab.example.com NotReady compute 1d v1.9.1+a0ce1bc657
Conclusion: the nodes are unhealthy; neither compute node is in the Ready state.
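On a larger cluster, picking the unhealthy nodes out of the listing by eye becomes tedious. As a rough sketch, the hypothetical helper not_ready_nodes below prints the name of every node whose STATUS column is not exactly Ready. It assumes the column layout shown above, and the sample input is copied from that output rather than read from a live cluster.

```shell
#!/bin/sh
# Hypothetical helper: skip the header line, then print the NAME column
# for rows whose STATUS column is anything other than exactly "Ready".
# A combined status such as "Ready,SchedulingDisabled" is also reported.
not_ready_nodes() {
    awk 'NR > 1 && $2 != "Ready" { print $1 }'
}

# Sample input taken from the listing above.
printf '%s\n' \
    'NAME STATUS ROLES AGE VERSION' \
    'master.lab.example.com Ready master 1d v1.9.1+a0ce1bc657' \
    'node1.lab.example.com NotReady compute 1d v1.9.1+a0ce1bc657' \
    'node2.lab.example.com NotReady compute 1d v1.9.1+a0ce1bc657' |
    not_ready_nodes
```

With a working cluster login, the same function could be fed directly from oc get nodes.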
Check the Services
# systemctl status atomic-openshift-node.service # run on each node
# systemctl status docker # run on each host; the output below is from node1
● docker.service - Docker Application Container Engine
Loaded: loaded (/usr/lib/systemd/system/docker.service; disabled; vendor preset: disabled)
Active: inactive (dead) since Mon 2021-03-01 17:23:12 CST; 4min 52s ago
Docs: http://docs.docker.com
Main PID: 17637 (code=exited, status=0/SUCCESS)
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.375792111+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.382396227+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.387020843+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.394091193+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.402339410+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.404059183+08:00" level=e...\""
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.413005258+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.436107140+08:00" level=w...nt"
Mar 01 17:23:11 node1.lab.example.com dockerd-current[17637]: time="2021-03-01T17:23:11.485170808+08:00" level=i...ed"
Mar 01 17:23:12 node1.lab.example.com systemd[1]: Stopped Docker Application Container Engine.
Hint: Some lines were ellipsized, use -l to show in full.
Conclusion: the output shows that the docker service on the node is not running.
Start the Service
# systemctl start docker # run on each node where docker is stopped
Verify
# oc get nodes # check the node status again
NAME STATUS ROLES AGE VERSION
master.lab.example.com Ready master 1d v1.9.1+a0ce1bc657
node1.lab.example.com Ready compute 1d v1.9.1+a0ce1bc657
node2.lab.example.com Ready compute 1d v1.9.1+a0ce1bc657
$ oc get pods # confirm that the pod is now scheduled to a node
NAME READY STATUS RESTARTS AGE
hello-1-build 1/1 Running 0 22m
$ oc describe is # view the image stream details
Name: hello
Namespace: common-troubleshoot
Created: 15 minutes ago
Labels: app=hello
Annotations: openshift.io/generated-by=OpenShiftNewApp
Docker Pull Spec: docker-registry.default.svc:5000/common-troubleshoot/hello
Image Lookup: local=false
Unique Images: 1
Tags: 1
latest
no spec tag
* docker-registry.default.svc:5000/common-troubleshoot/hello@sha256:8d63ed61d6e9c74933fe0d0d8aadceecb71751abf260f10645c19737a3e13354
10 minutes ago
Conclusion: the image stream shows that the image has been pushed to the internal registry.
Clean Up the Project
$ oc delete project common-troubleshoot
Comprehensive Lab
Environment Preparation
$ lab install-prepare setup
$ cd /home/student/do280-ansible
$ ./install.sh
Note: skip this step if you already have a complete environment.
Lab Setup
$ lab execute-review setup
Clone the Git Project Locally
$ cd /home/student/DO280/labs/execute-review/
$ git clone http://services.lab.example.com/node-hello
Build the Image with Docker
$ cd node-hello/
$ docker build -t node-hello:latest .
$ docker images # list the images
REPOSITORY TAG IMAGE ID CREATED SIZE
node-hello latest 9b3befb0536b 9 seconds ago 495 MB
registry.lab.example.com/rhscl/nodejs-6-rhel7 latest fba56b5381b7 3 years ago 489 MB
Retag the Image
$ docker tag 9b3befb0536b registry.lab.example.com/node-hello:latest
$ docker images
REPOSITORY TAG IMAGE ID CREATED SIZE
node-hello latest 9b3befb0536b About a minute ago 495 MB
registry.lab.example.com/node-hello latest 9b3befb0536b About a minute ago 495 MB
registry.lab.example.com/rhscl/nodejs-6-rhel7 latest fba56b5381b7 3 years ago 489 MB
Push the Image
$ docker push registry.lab.example.com/node-hello:latest
Create the Application in the Project
$ oc login -u developer -p redhat https://master.lab.example.com
$ oc projects
$ oc project execute-review
$ oc new-app registry.lab.example.com/node-hello --name hello
$ oc get all # list all resources
NAME REVISION DESIRED CURRENT TRIGGERED BY
deploymentconfigs/hello 1 1 1 config,image(hello:latest)
NAME DOCKER REPO TAGS UPDATED
imagestreams/hello docker-registry.default.svc:5000/execute-review/hello latest 12 seconds ago
NAME READY STATUS RESTARTS AGE
po/hello-1-deploy 1/1 Running 0 12s
po/hello-1-zswgc 0/1 ImagePullBackOff 0 9s
NAME DESIRED CURRENT READY AGE
rc/hello-1 1 1 0 12s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
svc/hello ClusterIP 172.30.7.229 <none> 3000/TCP,8080/TCP 12s
Troubleshoot ImagePullBackOff
$ oc logs hello-1-zswgc # view the log
Error from server (BadRequest): container "hello" in pod "hello-1-zswgc" is waiting to start: trying and failing to pull image
$ oc describe pod hello-1-zswgc # view the pod details
$ oc get events --sort-by='.metadata.creationTimestamp' # view the events in chronological order
Conclusion: the output shows that the image pull failed.
Pull the Image Manually
$ oc get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE
hello-1-deploy 1/1 Running 0 32s 10.129.0.93 node2.lab.example.com
hello-1-zswgc 0/1 ImagePullBackOff 0 30s <none> node2.lab.example.com
# docker pull registry.lab.example.com/node-hello # run on a node
Using default tag: latest
Trying to pull repository registry.lab.example.com/node-hello ...
All endpoints blocked.
Conclusion: the output shows that all endpoints are blocked. In OpenShift, this kind of error is usually caused by an incorrect deployment configuration or an invalid Docker configuration.
Fix the Docker Configuration
# vi /etc/sysconfig/docker
Change
BLOCK_REGISTRY='--block-registry registry.access.redhat.com --block-registry docker.io --block-registry registry.lab.example.com'
to
BLOCK_REGISTRY='--block-registry registry.access.redhat.com --block-registry docker.io'
# systemctl restart docker
Note: node2 needs the same change.
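Since the same edit has to be repeated on every node, it can be scripted instead of made by hand in vi. The following is a cautious sketch rather than the course's procedure: the function name unblock_registry is hypothetical, and it only transforms text on standard input so the result can be previewed before touching /etc/sysconfig/docker.

```shell
#!/bin/sh
# Hypothetical helper: remove one "--block-registry <host>" pair from
# the text on stdin. The host must be followed by a space, a single
# quote, or the end of the line, so that a longer host name sharing
# the same prefix is left alone.
unblock_registry() {
    # Escape dots so they match literally in the sed pattern.
    pattern=$(printf '%s' "$1" | sed 's/\./\\./g')
    sed -e "s/ *--block-registry $pattern\([ ']\)/\1/" \
        -e "s/ *--block-registry $pattern\$//"
}

# Preview on the line from the lab; nothing is written to disk here.
line="BLOCK_REGISTRY='--block-registry registry.access.redhat.com --block-registry docker.io --block-registry registry.lab.example.com'"
printf '%s\n' "$line" | unblock_registry registry.lab.example.com
```

Applying the result to the real file still requires root privileges and a docker restart on each node, as in the manual steps.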
Roll Out the Pod Again
$ oc rollout latest hello
$ oc get pods # confirm
NAME READY STATUS RESTARTS AGE
hello-1-deploy 0/1 Error 0 10m
hello-2-scrbl 1/1 Running 0 28s
Verify
$ oc logs hello-2-scrbl
nodejs server running on http://0.0.0.0:3000
Expose the Service
$ oc expose svc hello --hostname=hello.apps.lab.example.com
route "hello" exposed
Test the Service
$ curl http://hello.apps.lab.example.com
Hi! I am running on host -> hello-2-scrbl
$ lab execute-review grade # grade the lab with the script
Clean Up the Lab
$ oc delete project execute-review
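Summary
RHCA certification takes five courses and exams, so it demands a good deal of study and preparation time. Keep at it, you can do it 🤪.

That wraps up 金鱼哥's overview and walkthrough of Chapter 4, OpenShift Commands and Troubleshooting: Common Troubleshooting and Chapter Labs. I hope it helps everyone who reads it.
Red Hat certification series:
RHCSA series: 戏说 RHCSA 认证
RHCE series: 戏说 RHCE 认证
This article is part of the RHCA series: RHCA 回忆录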