Flink on YARN cluster mode startup errors and solutions
2022-08-05 05:14:00 【bigdata1024】
Note: to use Flink on YARN mode, make sure the Hadoop cluster has started successfully, and run the Flink on YARN script on one of the YARN nodes.
- If the Hadoop cluster has not been started, executing Flink's bin/yarn-session.sh script reports the error below.
The script gets stuck, endlessly printing retry logs like the ones that follow; it cannot connect to the ResourceManager, which means the Hadoop cluster has not been started.
Solution: start the Hadoop cluster.
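For reference, a typical start-up and sanity-check sequence looks roughly like this. This is a sketch assuming a stock Apache Hadoop install with $HADOOP_HOME set; adjust the paths to your cluster's layout:

```shell
# Start HDFS and YARN; both must be up before launching a Flink session
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

# Sanity checks: the JVM process list should show NameNode/DataNode and
# ResourceManager/NodeManager, and the ResourceManager should answer requests
jps
yarn node -list
```

Once the daemons are up, re-running bin/yarn-session.sh should get past the retry loop shown below.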
2018-03-17 12:30:08,062 INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2018-03-17 12:30:09,231 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-03-17 12:30:10,234 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-03-17 12:30:11,235 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-03-17 12:30:12,238 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-03-17 12:30:13,240 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-03-17 12:30:14,247 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
- ./bin/yarn-session.sh -n 1 -jm 1024 -tm 1024 fails to start
Testing showed the failure was caused by allocating too much memory. Reducing the allocation to 800 let the session start normally.
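Concretely, the retried invocation might look like the following. The 800 is simply the value that worked in this test; on these older Flink versions, -n is the number of TaskManager containers and -jm/-tm are the JobManager and TaskManager memory in MB:

```shell
# Same session command as before, but with smaller memory per component
./bin/yarn-session.sh -n 1 -jm 800 -tm 800
```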
This error means the container exceeded its virtual memory limit. If you are using a virtual machine, or your servers are themselves virtualized, you are likely to hit it: YARN enforces a virtual-memory cap on each container, and during execution the process went over that cap.
Solution:
In etc/hadoop/yarn-site.xml, set the property that controls the virtual-memory check to false, as follows:

<property>
    <name>yarn.nodemanager.vmem-check-enabled</name>
    <value>false</value>
</property>

The detailed error log:
2018-03-17 21:50:10,456 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@hadoop100:55053] has failed, address is now gated for [5000] ms. Reason: [Disassociated]
2018-03-17 21:50:21,680 WARN  org.apache.flink.yarn.cli.FlinkYarnSessionCli - Could not retrieve the current cluster status. Skipping current retrieval attempt ...
java.lang.RuntimeException: Unable to get ClusterClient status from Application Client
    at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:253)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.runInteractiveCli(FlinkYarnSessionCli.java:443)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:720)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:514)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:511)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:511)
Caused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running.
    at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:862)
    at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:248)
    ... 9 more
Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway.
    at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:79)
    at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:857)
    ... 10 more
Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
    at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227)
    at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.result(package.scala:190)
    at scala.concurrent.Await.result(package.scala)
    at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:77)
    ... 11 more
2018-03-17 21:50:21,691 WARN  org.apache.flink.yarn.YarnClusterClient - YARN reported application state FAILED
2018-03-17 21:50:21,692 WARN  org.apache.flink.yarn.YarnClusterClient - Diagnostics: Application application_1521277661809_0006 failed 1 times due to AM Container for appattempt_1521277661809_0006_000001 exited with exitCode: -103
For more detailed output, check application tracking page: http://hadoop100:8088/cluster/app/application_1521277661809_0006 Then, click on links to logs of each attempt.
Diagnostics: Container [pid=6386,containerID=container_1521277661809_0006_01_000001] is running beyond virtual memory limits. Current usage: 250.5 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container.
Dump of the process-tree for container_1521277661809_0006_01_000001 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 6386 6384 6386 6386 (bash) 0 0 108625920 331 /bin/bash -c /usr/local/jdk/bin/java -Xmx424m -Dlog.file=/usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.log -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner 1> /usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.out 2> /usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.err
    |- 6401 6386 6386 6386 (java) 388 72 2287009792 63800 /usr/local/jdk/bin/java -Xmx424m -Dlog.file=/usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.log -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner
Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
Failing this attempt. Failing the application.
The YARN cluster has failed
2018-03-17 21:50:21,693 INFO  org.apache.flink.yarn.YarnClusterClient - Sending shutdown request to the Application Master
2018-03-17 21:50:21,695 WARN  org.apache.flink.yarn.YarnClusterClient - YARN reported application state FAILED
2018-03-17 21:50:21,695 WARN  org.apache.flink.yarn.YarnClusterClient - Diagnostics: Application application_1521277661809_0006 failed 1 times due to AM Container for appattempt_1521277661809_0006_000001 exited with exitCode: -103
[... container diagnostics and process-tree dump repeated, identical to the above ...]
2018-03-17 21:50:21,697 INFO  org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager.
2018-03-17 21:50:21,726 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: hadoop100/192.168.99.100:55053
2018-03-17 21:50:21,733 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@hadoop100:55053] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@hadoop100:55053]] Caused by: [Connection refused: hadoop100/192.168.99.100:55053]
2018-03-17 21:50:31,707 WARN  org.apache.flink.yarn.YarnClusterClient - Error while stopping YARN cluster.
java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds]
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223)
    at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:157)
    at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
    at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169)
    at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
    at scala.concurrent.Await$.ready(package.scala:169)
    at scala.concurrent.Await.ready(package.scala)
    at org.apache.flink.yarn.YarnClusterClient.shutdownCluster(YarnClusterClient.java:377)
    at org.apache.flink.yarn.YarnClusterClient.finalizeCluster(YarnClusterClient.java:347)
    at org.apache.flink.client.program.ClusterClient.shutdown(ClusterClient.java:263)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.runInteractiveCli(FlinkYarnSessionCli.java:466)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:720)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:514)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:511)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
    at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
    at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:511)
2018-03-17 21:50:31,711 INFO  org.apache.flink.yarn.YarnClusterClient - Deleted Yarn properties file at /tmp/.yarn-properties-root
2018-03-17 21:50:31,881 INFO  org.apache.flink.yarn.YarnClusterClient - Application application_1521277661809_0006 finished with state FAILED and final state FAILED at 1521294610146
2018-03-17 21:50:31,882 WARN  org.apache.flink.yarn.YarnClusterClient - Application failed. Diagnostics Application application_1521277661809_0006 failed 1 times due to AM Container for appattempt_1521277661809_0006_000001 exited with exitCode: -103
[... container diagnostics and process-tree dump repeated, identical to the above ...]
2018-03-17 21:50:31,884 WARN  org.apache.flink.yarn.YarnClusterClient - If log aggregation is activated in the Hadoop cluster, we recommend to retrieve the full application log using this command: yarn logs -applicationId application_1521277661809_0006 (It sometimes takes a few seconds until the logs are aggregated)
2018-03-17 21:50:31,885 INFO  org.apache.flink.yarn.YarnClusterClient - YARN Client is shutting down
2018-03-17 21:50:31,909 INFO  org.apache.flink.yarn.ApplicationClient - Stopped Application client.
2018-03-17 21:50:31,911 INFO  org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager Actor[akka.tcp://flink@hadoop100:55053/user/jobmanager#119148826].
2018-03-17 21:50:31,916 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon.
2018-03-17 21:50:31,926 WARN  akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: hadoop100/192.168.99.100:55053
2018-03-17 21:50:31,935 WARN  akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://flink@hadoop100:55053] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://flink@hadoop100:55053]] Caused by: [Connection refused: hadoop100/192.168.99.100:55053]
2018-03-17 21:50:31,935 INFO  akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports.
2018-03-17 21:50:34,979 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - Stopping interactive command line interface, YARN cluster has been stopped.
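What this log shows is the NodeManager's virtual-memory check firing. The cap is the container's physical allocation multiplied by yarn.nodemanager.vmem-pmem-ratio, whose default is 2.1, which is exactly where the "2.2 GB of 2.1 GB virtual memory used" figure comes from. A minimal sketch of that arithmetic, taking the 1 GB physical size from the log line above:

```shell
# vmem cap = physical container memory (GB) * yarn.nodemanager.vmem-pmem-ratio
# Here: 1 GB * 2.1 (YARN's default ratio) = 2.1 GB; the JVM's 2.2 GB of
# virtual memory exceeded that cap, so the container was killed (exit code 143)
pmem_gb=1
ratio=2.1
limit=$(awk -v p="$pmem_gb" -v r="$ratio" 'BEGIN { printf "%.1f", p * r }')
echo "vmem limit: ${limit} GB"
```

So besides setting yarn.nodemanager.vmem-check-enabled to false as above, raising yarn.nodemanager.vmem-pmem-ratio in yarn-site.xml is another standard way to avoid this kill.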