当前位置:网站首页>flink on yarn 集群模式启动报错及解决方案汇总
flink on yarn 集群模式启动报错及解决方案汇总
2022-08-05 05:14:00 【bigdata1024】
注意:想要使用flink on yarn 模式,需要确保hadoop集群启动成功,并且需要在yarn的某一个节点上面执行flink on yarn的脚本
- 没有启动hadoop集群,执行flink的bin/yarn-session.sh脚本会报下面错误
脚本会一直卡在这里,一直输出重试日志,连不上resoucemanager,说明hadoop集群每启动
2018-03-17 12:30:09,231 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
解决方案:启动hadoop集群即可
2018-03-17 12:30:08,062 INFO org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at /0.0.0.0:8032
2018-03-17 12:30:09,231 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-03-17 12:30:10,234 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-03-17 12:30:11,235 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-03-17 12:30:12,238 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-03-17 12:30:13,240 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 4 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
2018-03-17 12:30:14,247 INFO org.apache.hadoop.ipc.Client - Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 5 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1000 MILLISECONDS)
- ./bin/yarn-session.sh -n 1 -jm 1024 -tm 1024 启动失败
经测试发现,是由于分配的内存太大导致的,把分配的内存调小,尝试改为800 即可正常启动
这个报错是虚拟内存超出限制,有可能你用的是虚拟机或者你们的服务器也是虚拟化出来的,可能就会报这个错误
这是因为有虚拟内存的设置,而使用的过程中超出了虚拟内存的限制,所以报错
解决办法:
在etc/hadoop/yarn-site.xml文件中,修改检查虚拟内存的属性为false,如下:
具体报错日志信息<property> <name>yarn.nodemanager.vmem-check-enabled</name> <value>false</value> </property>
2018-03-17 21:50:10,456 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[email protected]:55053] has failed, address is now gated for [5000] ms. Reason: [Disassociated] 2018-03-17 21:50:21,680 WARN org.apache.flink.yarn.cli.FlinkYarnSessionCli - Could not retrieve the current cluster status. Skipping current retrieval attempt ... java.lang.RuntimeException: Unable to get ClusterClient status from Application Client at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:253) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.runInteractiveCli(FlinkYarnSessionCli.java:443) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:720) at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:514) at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:511) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:511) Caused by: org.apache.flink.util.FlinkException: Could not connect to the leading JobManager. Please check that the JobManager is running. at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:862) at org.apache.flink.yarn.YarnClusterClient.getClusterStatus(YarnClusterClient.java:248) ... 9 more Caused by: org.apache.flink.runtime.leaderretrieval.LeaderRetrievalException: Could not retrieve the leader gateway. at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:79) at org.apache.flink.client.program.ClusterClient.getJobManagerGateway(ClusterClient.java:857) ... 10 more Caused by: java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223) at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:227) at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:190) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.result(package.scala:190) at scala.concurrent.Await.result(package.scala) at org.apache.flink.runtime.util.LeaderRetrievalUtils.retrieveLeaderGateway(LeaderRetrievalUtils.java:77) ... 11 more 2018-03-17 21:50:21,691 WARN org.apache.flink.yarn.YarnClusterClient - YARN reported application state FAILED 2018-03-17 21:50:21,692 WARN org.apache.flink.yarn.YarnClusterClient - Diagnostics: Application application_1521277661809_0006 failed 1 times due to AM Container for appattempt_1521277661809_0006_000001 exited with exitCode: -103 For more detailed output, check application tracking page:http://hadoop100:8088/cluster/app/application_1521277661809_0006Then, click on links to logs of each attempt. Diagnostics: Container [pid=6386,containerID=container_1521277661809_0006_01_000001] is running beyond virtual memory limits. Current usage: 250.5 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container. Dump of the process-tree for container_1521277661809_0006_01_000001 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 6386 6384 6386 6386 (bash) 0 0 108625920 331 /bin/bash -c /usr/local/jdk/bin/java -Xmx424m -Dlog.file=/usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.log -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner 1> /usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.out 2> /usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.err |- 6401 6386 6386 6386 (java) 388 72 2287009792 63800 /usr/local/jdk/bin/java -Xmx424m -Dlog.file=/usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.log -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143 Failing this attempt. Failing the application. The YARN cluster has failed 2018-03-17 21:50:21,693 INFO org.apache.flink.yarn.YarnClusterClient - Sending shutdown request to the Application Master 2018-03-17 21:50:21,695 WARN org.apache.flink.yarn.YarnClusterClient - YARN reported application state FAILED 2018-03-17 21:50:21,695 WARN org.apache.flink.yarn.YarnClusterClient - Diagnostics: Application application_1521277661809_0006 failed 1 times due to AM Container for appattempt_1521277661809_0006_000001 exited with exitCode: -103 For more detailed output, check application tracking page:http://hadoop100:8088/cluster/app/application_1521277661809_0006Then, click on links to logs of each attempt. Diagnostics: Container [pid=6386,containerID=container_1521277661809_0006_01_000001] is running beyond virtual memory limits. Current usage: 250.5 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container. Dump of the process-tree for container_1521277661809_0006_01_000001 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 6386 6384 6386 6386 (bash) 0 0 108625920 331 /bin/bash -c /usr/local/jdk/bin/java -Xmx424m -Dlog.file=/usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.log -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner 1> /usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.out 2> /usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.err |- 6401 6386 6386 6386 (java) 388 72 2287009792 63800 /usr/local/jdk/bin/java -Xmx424m -Dlog.file=/usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.log -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143 Failing this attempt. Failing the application. 2018-03-17 21:50:21,697 INFO org.apache.flink.yarn.ApplicationClient - Sending StopCluster request to JobManager. 2018-03-17 21:50:21,726 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: hadoop100/192.168.99.100:55053 2018-03-17 21:50:21,733 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[email protected]:55053] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[email protected]:55053]] Caused by: [Connection refused: hadoop100/192.168.99.100:55053] 2018-03-17 21:50:31,707 WARN org.apache.flink.yarn.YarnClusterClient - Error while stopping YARN cluster. java.util.concurrent.TimeoutException: Futures timed out after [10000 milliseconds] at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:223) at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:157) at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169) at scala.concurrent.Await$$anonfun$ready$1.apply(package.scala:169) at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53) at scala.concurrent.Await$.ready(package.scala:169) at scala.concurrent.Await.ready(package.scala) at org.apache.flink.yarn.YarnClusterClient.shutdownCluster(YarnClusterClient.java:377) at org.apache.flink.yarn.YarnClusterClient.finalizeCluster(YarnClusterClient.java:347) at org.apache.flink.client.program.ClusterClient.shutdown(ClusterClient.java:263) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.runInteractiveCli(FlinkYarnSessionCli.java:466) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:720) at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:514) at org.apache.flink.yarn.cli.FlinkYarnSessionCli$1.call(FlinkYarnSessionCli.java:511) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698) at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41) at org.apache.flink.yarn.cli.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:511) 2018-03-17 21:50:31,711 INFO org.apache.flink.yarn.YarnClusterClient - Deleted Yarn properties file at /tmp/.yarn-properties-root 2018-03-17 21:50:31,881 INFO org.apache.flink.yarn.YarnClusterClient - Application application_1521277661809_0006 finished with state FAILED and final state FAILED at 1521294610146 2018-03-17 21:50:31,882 WARN org.apache.flink.yarn.YarnClusterClient - Application failed. Diagnostics Application application_1521277661809_0006 failed 1 times due to AM Container for appattempt_1521277661809_0006_000001 exited with exitCode: -103 For more detailed output, check application tracking page:http://hadoop100:8088/cluster/app/application_1521277661809_0006Then, click on links to logs of each attempt. Diagnostics: Container [pid=6386,containerID=container_1521277661809_0006_01_000001] is running beyond virtual memory limits. Current usage: 250.5 MB of 1 GB physical memory used; 2.2 GB of 2.1 GB virtual memory used. Killing container. Dump of the process-tree for container_1521277661809_0006_01_000001 : |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE |- 6386 6384 6386 6386 (bash) 0 0 108625920 331 /bin/bash -c /usr/local/jdk/bin/java -Xmx424m -Dlog.file=/usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.log -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner 1> /usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.out 2> /usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.err |- 6401 6386 6386 6386 (java) 388 72 2287009792 63800 /usr/local/jdk/bin/java -Xmx424m -Dlog.file=/usr/local/hadoop/logs/userlogs/application_1521277661809_0006/container_1521277661809_0006_01_000001/jobmanager.log -Dlog4j.configuration=file:log4j.properties org.apache.flink.yarn.YarnApplicationMasterRunner Container killed on request. Exit code is 143 Container exited with a non-zero exit code 143 Failing this attempt. Failing the application. 2018-03-17 21:50:31,884 WARN org.apache.flink.yarn.YarnClusterClient - If log aggregation is activated in the Hadoop cluster, we recommend to retrieve the full application log using this command: yarn logs -applicationId application_1521277661809_0006 (It sometimes takes a few seconds until the logs are aggregated) 2018-03-17 21:50:31,885 INFO org.apache.flink.yarn.YarnClusterClient - YARN Client is shutting down 2018-03-17 21:50:31,909 INFO org.apache.flink.yarn.ApplicationClient - Stopped Application client. 2018-03-17 21:50:31,911 INFO org.apache.flink.yarn.ApplicationClient - Disconnect from JobManager Actor[akka.tcp://[email protected]:55053/user/jobmanager#119148826]. 2018-03-17 21:50:31,916 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Shutting down remote daemon. 2018-03-17 21:50:31,926 WARN akka.remote.transport.netty.NettyTransport - Remote connection to [null] failed with java.net.ConnectException: Connection refused: hadoop100/192.168.99.100:55053 2018-03-17 21:50:31,935 WARN akka.remote.ReliableDeliverySupervisor - Association with remote system [akka.tcp://[email protected]:55053] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://[email protected]:55053]] Caused by: [Connection refused: hadoop100/192.168.99.100:55053] 2018-03-17 21:50:31,935 INFO akka.remote.RemoteActorRefProvider$RemotingTerminator - Remote daemon shut down; proceeding with flushing remote transports. 2018-03-17 21:50:34,979 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - Stopping interactive command line interface, YARN cluster has been stopped.
获取更多大数据资料,视频以及技术交流请加群:
边栏推荐
猜你喜欢
随机推荐
1.3 mysql批量插入数据
[Let's pass 14] A day in the study room
vscode+pytorch使用经验记录(个人记录+不定时更新)
【NFT开发】设计师无技术基础保姆级开发NFT教程在Opensea上全套开发一个NFT项目+构建Web3网站
OFDM Lecture 16 5 -Discrete Convolution, ISI and ICI on DMT/OFDM Systems
【过一下11】随机森林和特征工程
Dashboard Display | DataEase Look at China: Data Presents China's Capital Market
Lecture 4 Backpropagation Essays
ES6 Set、WeakSet
鼠标放上去变成销售效果
Xiaobai, you big bulls are lightly abused
range函数作用
SQL(二) —— join窗口函数视图
UVA10827
CAP+BASE
【技能】长期更新
Multi-threaded query results, add List collection
[Remember 1] June 29, 2022 Brother and brother double pain
【过一下7】全连接神经网络视频第一节的笔记
es6迭代协议