Spark submission parameters -- use of files
2022-07-25 15:15:00 【The south wind knows what I mean】
Project scenario:
We have two clusters, a computing cluster and a storage cluster. The requirement: a Spark job running on the computing cluster reads data from Kafka and writes it into Hive on the storage cluster.
Problem description:
Reading and writing data across clusters mostly works: we tested writing to HBase, and the job on the computing cluster can write into the storage cluster's HBase without problems.
But as soon as the job writes to Hive, the data never lands in the storage cluster's Hive; every run writes into the computing cluster's Hive instead.
What I found hard to understand: when testing from IDEA, the data could be written into the storage cluster's Hive, yet once the job was deployed to the cluster via DolphinScheduler, the writes went astray into the computing cluster's Hive. My resources folder contains the storage cluster's core-site.xml, hdfs-site.xml, and hive-site.xml, and the code also calls a changeNameNode-style method (changeHDFSConf, below). Even so, at runtime the program never switched over to the storage cluster's NameNode.
/**
 * @Author: lzx
 * @Description: switch the job's Hadoop configuration to the storage cluster's HA NameNodes
 * @Date: 2022/5/27
 * @Param session: the SparkSession that has already been built
 * @Param nameSpace: the nameservice (namespace) of the target cluster
 * @Param nn1: ID of NameNode 1
 * @Param nn1Addr: RPC address (host:port) of NameNode 1
 * @Param nn2: ID of NameNode 2
 * @Param nn2Addr: RPC address (host:port) of NameNode 2
 * @return: void
 **/
def changeHDFSConf(session: SparkSession, nameSpace: String, nn1: String, nn1Addr: String, nn2: String, nn2Addr: String): Unit = {
  val sc: SparkContext = session.sparkContext
  sc.hadoopConfiguration.set("fs.defaultFS", s"hdfs://$nameSpace")
  sc.hadoopConfiguration.set("dfs.nameservices", nameSpace)
  sc.hadoopConfiguration.set(s"dfs.ha.namenodes.$nameSpace", s"$nn1,$nn2")
  sc.hadoopConfiguration.set(s"dfs.namenode.rpc-address.$nameSpace.$nn1", nn1Addr)
  sc.hadoopConfiguration.set(s"dfs.namenode.rpc-address.$nameSpace.$nn2", nn2Addr)
  sc.hadoopConfiguration.set(s"dfs.client.failover.proxy.provider.$nameSpace", "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider")
}
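For reference, a minimal sketch of how the method might be invoked. The nameservice name, NameNode IDs, addresses, and port below are placeholders, not values from this post (only node118 is known, from the analysis below, to belong to the storage cluster); substitute your own cluster's settings:

// Hypothetical values for illustration only
changeHDFSConf(
  session = spark,                         // an existing SparkSession
  nameSpace = "storage-ns",                // placeholder nameservice of the storage cluster
  nn1 = "nn1", nn1Addr = "node118:8020",   // 8020 is a common NameNode RPC port
  nn2 = "nn2", nn2Addr = "node119:8020"
)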
Cause analysis:
1. First I opened the Spark UI, went to the Hadoop parameters under Environment, and searched for nn1 to check whether my changeHDFSConf method had taken effect.
2. The value of dfs.namenode.http-address.hr-hadoop.nn1 was still node03 (computing cluster) rather than node118 (storage cluster), which shows the method had not taken effect.
Why did it not take effect?
Configuration conf = new Configuration();
When a Configuration object is created, its constructor by default loads two Hadoop configuration files, hdfs-site.xml and core-site.xml, which hold the parameter values needed to access HDFS.
My code does exactly this, so why were my files not loaded?
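One way to see which resource actually supplied a value is Configuration.getPropertySources, which reports where each property was read from. A minimal sketch (fs.defaultFS is just an example key):

import org.apache.hadoop.conf.Configuration

val conf = new Configuration()
// Returns the chain of resources the value came from, e.g. "core-default.xml"
// or "core-site.xml"; null if the key is unset.
Option(conf.getPropertySources("fs.defaultFS")).foreach(_.foreach(println))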
3. After some analysis I found the reason: when the code is submitted to the cluster for execution, it loads the cluster's own core-site.xml/hdfs-site.xml and simply discards the configuration files packaged with the code.
Solution:
1. In the code, explicitly add your own configuration files so they take the place of the cluster's, letting the job locate the storage cluster:
val hadoopConf: Configuration = new Configuration()
hadoopConf.addResource("hdfs-site.xml")
hadoopConf.addResource("core-site.xml")
If both configuration resources contain the same configuration item and the item in the earlier resource is not marked final, the later resource overrides the earlier one. In the example above, settings in core-site.xml override the same-named settings in core-default.xml. If an item in the first resource (core-default.xml) is marked final, a warning is issued when the second resource is loaded and the override is ignored.
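The override rule is easy to demonstrate in isolation. A minimal, self-contained sketch using two in-memory resources (the key my.key is made up for the demo):

import java.io.ByteArrayInputStream
import org.apache.hadoop.conf.Configuration

val first  = "<configuration><property><name>my.key</name><value>A</value></property></configuration>"
val second = "<configuration><property><name>my.key</name><value>B</value></property></configuration>"

val conf = new Configuration(false) // false: skip loading the default resources
conf.addResource(new ByteArrayInputStream(first.getBytes("UTF-8")))
conf.addResource(new ByteArrayInputStream(second.getBytes("UTF-8")))

// The later resource wins because my.key is not marked <final>true</final>
// in the first one; if it were, the override would be ignored with a warning.
println(conf.get("my.key")) // prints "B"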
2. Step 1 above alone is not enough. As explained, once the job is packaged and run on the cluster, the core-site.xml/hdfs-site.xml under the resources folder are discarded, so .addResource("hdfs-site.xml") cannot find our own files and falls back to the cluster's configuration files.
3. Place the two configuration files in a directory on the submitting machine, and when submitting the Spark job, point to them in the submission parameters:
--files /srv/udp/2.0.0.0/spark/userconf/hdfs-site.xml,/srv/udp/2.0.0.0/spark/userconf/core-site.xml \
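For context, here is how the flag sits in a full spark-submit invocation. Apart from the --files line taken from above, the master, deploy mode, class name, and jar path are placeholders:

spark-submit \
--master yarn \
--deploy-mode cluster \
--class com.example.Kafka2Hive \
--files /srv/udp/2.0.0.0/spark/userconf/hdfs-site.xml,/srv/udp/2.0.0.0/spark/userconf/core-site.xml \
/path/to/your-job.jar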
4. A further note on files shipped with --files:
If the files already reside on the same file system as the cluster being submitted to, the client reports that the source and destination file systems are the same, and no copy is triggered:
INFO Client: Source and destination file systems are the same. Not copying
If they reside on a different file system, the source files are uploaded from the source path to the current file storage system:
INFO Client: Uploading resource
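On the YARN side, the files listed in --files are localized into each container's working directory, so the addResource calls from step 1 can then resolve them. A minimal sketch, assuming YARN cluster mode; loading via a relative local Path targets the working directory explicitly instead of relying on a classpath lookup:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

val hadoopConf = new Configuration()
// The shipped files sit in the container's working directory under their
// original names, so a relative local Path resolves to them.
hadoopConf.addResource(new Path("core-site.xml"))
hadoopConf.addResource(new Path("hdfs-site.xml"))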