After the active RM host lost power, RM HA failover worked, but the YARN UI showed no cluster resources and the application stayed in the ACCEPTED state
2022-06-22 04:05:00 【龟速扣代码】
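Symptoms:
1. The host running the active RM lost power. Ambari showed the RM HA failover completing after about 1 minute, i.e. the standby RM became the active RM.
2. In the YARN UI, the cluster page showed 0 for both memory and cores, although Ambari showed every host in the cluster running normally.
3. The application stayed in the ACCEPTED state in the YARN UI, yet new data kept arriving in the database.
4. After about 15 minutes, the application switched to RUNNING and the YARN UI displayed the cluster resources correctly again.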
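
The second symptom can also be checked without the web UI. The sketch below is an illustration only and was not part of the original investigation. It assumes the ResourceManager REST endpoint `/ws/v1/cluster/metrics` and its `clusterMetrics` fields `totalMB` and `totalVirtualCores`; the function names, the base URL and the sample values are hypothetical.

```python
import json
import urllib.request


def fetch_cluster_metrics(rm_base_url, timeout_s=5.0):
    """Fetch cluster metrics from the RM web service.

    rm_base_url is an assumption, e.g. "http://<active-rm-host>:8088".
    """
    url = rm_base_url.rstrip("/") + "/ws/v1/cluster/metrics"
    with urllib.request.urlopen(url, timeout=timeout_s) as resp:
        return json.load(resp)["clusterMetrics"]


def looks_like_no_registered_nodes(metrics):
    """True when the RM reports zero memory and zero vcores in total."""
    return metrics.get("totalMB", 0) == 0 and metrics.get("totalVirtualCores", 0) == 0


# Hypothetical sample shaped like the state described in symptom 2.
sample = {"totalMB": 0, "totalVirtualCores": 0, "activeNodes": 0, "appsPending": 1}
print(looks_like_no_registered_nodes(sample))
```

Calling `fetch_cluster_metrics` against the new active RM and passing the result to `looks_like_no_registered_nodes` would show whether the RM still sees no resources while the hosts themselves are up.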

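Troubleshooting approach:
First, decide whether the problem is caused by the application or by YARN.
From symptom 3, the application was still processing and storing data, so the application could be ruled out for the time being. The RM failover itself succeeded, but no cluster resources were shown, which suggested a communication problem between the RM and the NMs.
I looked up the RM/NM communication parameters in the official documentation and set them, but this did not solve the problem.
yarn.nodemanager.resourcemanager.connect.max-wait.ms 900000 (default)
yarn.resourcemanager.connect.max-wait.ms 900000
yarn.resourcemanager.container.liveness-monitor.interval-ms 600000
Next, I checked the RM and NM logs at DEBUG level. There was an IPC-related exception logged at INFO, which did not clearly indicate whether it was an error. The application log, however, showed socket connection timeouts, so the next step was to look at IPC and socket related parameters.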
2019-12-26 04:31:16,736 INFO amlauncher.AMLauncher (AMLauncher.java:run(320)) - Error cleaning master
javax.security.sasl.SaslException: DIGEST-MD5: digest response format violation. Mismatched response. [Caused by org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.]
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateException(RPCUtil.java:53)
at org.apache.hadoop.yarn.ipc.RPCUtil.instantiateIOException(RPCUtil.java:80)
at org.apache.hadoop.yarn.ipc.RPCUtil.unwrapAndThrowException(RPCUtil.java:119)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:145)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:422)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:165)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:157)
at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:359)
at com.sun.proxy.$Proxy97.stopContainers(Unknown Source)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.cleanup(AMLauncher.java:143)
at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:318)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.hadoop.ipc.RemoteException(javax.security.sasl.SaslException): DIGEST-MD5: digest response format violation. Mismatched response.
at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1498)
at org.apache.hadoop.ipc.Client.call(Client.java:1444)
at org.apache.hadoop.ipc.Client.call(Client.java:1354)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
at com.sun.proxy.$Proxy96.stopContainers(Unknown Source)
at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.stopContainers(ContainerManagementProtocolPBClientImpl.java:142)
... 15 more
The application log reports a socket connection timeout. The timeout is 20000 ms.
2019-12-26 09:23:47,051 DEBUG [main] RetryInvocationHandler:413 - org.apache.hadoop.net.ConnectTimeoutException: Call From xxxxxxxx-xxx-lab-vm-hdp-04/10.0.1.7 to xxxxxxxx-xxx-lab-vm-hdp-02:8030 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=xxxxxxxx-xxx-lab-vm-hdp-02/10.0.1.5:8030]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout, while invoking ApplicationMasterProtocolPBClientImpl.registerApplicationMaster over rm1. Trying to failover immediately.
org.apache.hadoop.net.ConnectTimeoutException: Call From xxxxxxxx-xxx-lab-vm-hdp-04/10.0.1.7 to xxxxxxxx-xxx-lab-vm-hdp-02:8030 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=xxxxxxxx-xxx-lab-vm-hdp-02/10.0.1.5:8030]; For more details see: http://wiki.apache.org/hadoop/SocketTimeout
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
I then went through the IPC parameters in the official Hadoop documentation and found two in core-default.xml that matched the symptom:
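ipc.client.connect.timeout 20000 (how long the client waits for the socket connection to be established, in ms)
ipc.client.connect.max.retries.on.timeouts 45 (number of retries after a client socket connect timeout)
After lowering these two values, the YARN UI behaved as expected, which confirms that these two parameters were responsible. Along the way, I also found similar problem reports in the community: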
https://issues.apache.org/jira/browse/HADOOP-11252
https://issues.apache.org/jira/browse/YARN-2578
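The explanation given there is that when ipc.client.rpc-timeout.ms is set to 0, the RPC connection itself never times out, so failure detection falls back to TCP-level retries.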
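
One way to read the roughly 15 minute recovery time: with the defaults quoted above, a client that keeps timing out against the powered-off host can spend about 20000 ms on each of 45 attempts before giving up on that address. The sketch below is only that arithmetic. It assumes every attempt blocks for the full connect timeout and ignores any pause between attempts; the function name is hypothetical, and the reduced values are illustrative, not the values used in the original fix.

```python
def worst_case_connect_wait_ms(connect_timeout_ms, max_retries_on_timeouts):
    """Upper bound on time spent on one unreachable address.

    Assumes each attempt blocks for the full connect timeout and that
    there is no extra sleep between attempts.
    """
    return connect_timeout_ms * max_retries_on_timeouts


# Defaults quoted in the text: 20000 ms timeout, 45 retries.
default_ms = worst_case_connect_wait_ms(20000, 45)
print(default_ms / 60000)  # 15.0 (minutes)

# Hypothetical reduced values, for comparison only.
reduced_ms = worst_case_connect_wait_ms(5000, 3)
print(reduced_ms / 1000)  # 15.0 (seconds)
```

The default product, 900000 ms, is the same order as the delay seen in symptom 4, which is consistent with the two `ipc.client.connect.*` parameters being the cause.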
Problem solved.
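
For reference, the two properties would be expressed as `<property>` entries in `core-site.xml`. The sketch below renders such a fragment; the helper name `render_properties` and the values 5000 and 3 are assumptions for illustration, since the values actually chosen are not stated here. On an Ambari-managed cluster the same properties would presumably be edited through Ambari rather than by hand.

```python
import xml.etree.ElementTree as ET


def render_properties(props):
    """Render a Hadoop-style <configuration> fragment from a name/value mapping."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    if hasattr(ET, "indent"):  # pretty-printing is available on Python 3.9+
        ET.indent(root)
    return ET.tostring(root, encoding="unicode")


# Hypothetical reduced values; tune them for the actual cluster.
fragment = render_properties({
    "ipc.client.connect.timeout": 5000,
    "ipc.client.connect.max.retries.on.timeouts": 3,
})
print(fragment)
```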