2022-06-26 09:38:00
MapReduce & YARN Theory
Preface
- To install ZooKeeper, see Linux - ZooKeeper Cluster Building
- For basic ZooKeeper usage, see ZooKeeper Command and API
- For Hadoop theory, see Hadoop Theory
- For HDFS theory, see HDFS Theory
- To deploy HDFS, see Linux - HDFS Deploy
- For basic HDFS usage, see HDFS Command and API
MapReduce
Hadoop MapReduce is the data processing layer. It processes the large volumes of structured and unstructured data stored in HDFS.
MapReduce processes data in parallel by dividing a job into a set of independent tasks; this parallelism improves both speed and reliability.
Hadoop MapReduce processes data in two stages, the Map stage and the Reduce stage:
- Map stage: the first stage of data processing. This is where we put all the complex logic, business rules, and expensive code.
- Reduce stage: the second stage of processing. This is where we put lightweight processing such as aggregation and summation.
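As a concrete illustration of the two stages, here is a minimal word-count sketch written against the standard org.apache.hadoop.mapreduce API. The class names WordCountMapper and WordCountReducer are made up for this example, and each class is assumed to live in its own source file:

```java
// WordCountMapper.java -- Map stage: emit (word, 1) for every word in an input line
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        for (String token : line.toString().split("\\s+")) {   // split the line into words
            if (!token.isEmpty()) {
                word.set(token);
                context.write(word, ONE);                      // emit (word, 1)
            }
        }
    }
}

// WordCountReducer.java -- Reduce stage: sum the counts emitted for each word
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable total = new IntWritable();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable count : counts) {
            sum += count.get();                                // aggregate the per-word counts
        }
        total.set(sum);
        context.write(word, total);                            // emit (word, total)
    }
}
```

The Mapper carries the heavier per-record logic, while the Reducer only aggregates, which matches the split of responsibilities described above.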
MapReduce Framework

MRv1 Roles
Client
- Works on a per-job basis: the job is the unit of work
- Plans how the job's computation is distributed
- Uploads the job's resources to HDFS
- Finally submits the job to the JobTracker
JobTracker
- The core master daemon, and a single point
- Schedules all jobs
- Monitors the resource load of the whole cluster
TaskTracker
- A slave; manages the resources of its own node
- Heartbeats with the JobTracker to report resources and obtain tasks
Disadvantages
- JobTracker: overloaded, and a single point of failure
- Resource management and compute scheduling are strongly coupled, so other computing frameworks each have to reimplement resource management
- Different frameworks cannot share a global view of cluster resources
MapReduce Execution Flow

- The client submits a MapReduce jar to the JobClient (submission: hadoop jar …)
- The JobClient communicates with the JobTracker over RPC; the JobTracker returns an HDFS address where the jar package should be stored, along with a JobId
- The client copies the resources needed to run the job (including the JAR file, configuration files, and the computed input splits) to an HDFS directory named after the job id (path = HDFS address + JobId)
- The client then submits the job itself: a job description rather than the jar, containing the JobId, the jar's storage location, configuration information, and so on
- The JobTracker initializes the job
- It reads the files to be processed on HDFS and works through the input splits; each split corresponds to one MapperTask
- TaskTrackers obtain tasks (task descriptions) through the heartbeat mechanism
- Each TaskTracker downloads the jar, configuration files, and anything else it needs
- The TaskTracker starts a Java child process
- The child process executes the specific task (MapperTask or ReducerTask) and writes the result to HDFS
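To tie the flow above to actual code, here is a minimal driver sketch for the word-count example from earlier. The driver builds the job description that gets submitted, while the `hadoop jar` command ships the jar itself; the class names and paths are illustrative:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");

        job.setJarByClass(WordCountDriver.class);    // tells the framework which jar to ship
        job.setMapperClass(WordCountMapper.class);
        job.setReducerClass(WordCountReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));    // files to split and read
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // directory for the final output

        // Submit the job and poll until it completes
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged as, say, wordcount.jar, it would be launched with something like `hadoop jar wordcount.jar WordCountDriver /input /output` (paths are placeholders).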
MapReduce Workflow

InputFiles - Input files
The input files hold the data for a MapReduce job. They reside in HDFS, their format is arbitrary, and line-based log files and binary formats can both be used.
InputFormat - Input format
The InputFormat defines how to split and read the input files. It selects the files (or other objects) to be read, and it creates the InputSplits.
Splits - Input splits
A split is the portion of data processed by a single Mapper. One map task is created per split, so the number of map tasks equals the number of splits. The framework divides each split into records, which the Mapper processes.
RecordReader - Record reader
The RecordReader communicates with the split and converts its data into key-value pairs suitable for the Mapper to read. By default it uses TextInputFormat to convert the data into key-value pairs until the file has been read, assigning a byte offset to each line in the file. These key-value pairs are then sent on to the Mapper for further processing.
Mapper - Mapper
The Mapper processes the input records produced by the RecordReader and generates intermediate key-value pairs; this intermediate output can be completely different from the input pairs. The Mapper's output is the full set of intermediate key-value pairs.
The Hadoop framework does not store the Mapper's output on HDFS, because the data is temporary and writing it to HDFS would create unnecessary replicas. Instead, the Mapper passes its output to the Combiner for further processing.
Combiner - Combiner
The Combiner is a mini-reducer that performs local aggregation on the Mapper's output, minimizing the data transferred between Mapper and Reducer. When the combine function finishes, the framework passes its output to the Partitioner for further processing.
Partitioner - Partitioner
The Partitioner comes into play when more than one Reducer is used. It takes the Combiner's output and partitions it.
The output is partitioned on the key: a hash function derives the partition from the key (or a subset of the key).
Each Combiner output record is partitioned by its key, so records with the same key end up in the same partition, and each partition is then sent to a Reducer. Partitioning allows the Mapper's output to be distributed evenly across the Reducers.
Shuffling and Sorting - Shuffle and sort
After partitioning, the output is transferred to the Reducer nodes. Shuffling is this physical movement of data over the network. Once all the Mappers have finished and their output has been collected on the Reducer nodes, the framework merges and sorts the intermediate outputs and supplies them as input to the Reducer.
Reducer - Reducer
The Reducer takes the intermediate key-value pairs produced by the Mappers as input and runs a reducer function on each group to produce the output. The Reducer's output is the final output, and the framework stores it on HDFS.
RecordWriter - Record writer
The RecordWriter writes the output key-value pairs from the Reducer stage to the output files.
OutputFormat - Output format
The OutputFormat defines how the RecordWriter writes these output key-value pairs to the output files. The OutputFormat instances provided by Hadoop write files to HDFS, so the Reducer's final output ends up on HDFS.
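To illustrate the Combiner and Partitioner hooks, here is a small sketch that routes word-count keys to reducers by their first letter and reuses the reducer as a local combiner. The class name FirstLetterPartitioner is made up for this example, and the wiring assumes the hypothetical word-count classes from the earlier sketches:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: words starting with a-m go to partition 0, the rest to partition 1
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString();
        char first = word.isEmpty() ? 'a' : Character.toLowerCase(word.charAt(0));
        int partition = (first >= 'a' && first <= 'm') ? 0 : 1;
        return partition % numPartitions;   // stay within the number of reducers
    }
}
```

In the driver this would be wired in with something like `job.setCombinerClass(WordCountReducer.class)`, `job.setPartitionerClass(FirstLetterPartitioner.class)`, and `job.setNumReduceTasks(2)`. Reusing the reducer as the combiner is safe here only because summation is associative and commutative.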
YARN
YARN is the resource management system newly introduced in Hadoop 2.0; it evolved from MRv1.
YARN consists of a ResourceManager, NodeManagers, and one ApplicationMaster per application.
It separates the resource management and task scheduling that the JobTracker combined in MRv1, handing them to the ResourceManager and ApplicationMaster processes respectively.
YARN Framework

MRv2 Roles
Client: submits MapReduce jobs.
ResourceManager: the main daemon of YARN, responsible for resource allocation and management across all applications. Whenever it receives a processing request, it forwards the request to the corresponding NodeManager and allocates the resources needed to complete it. It has two main components:
- Scheduler: performs scheduling based on the submitted applications and the available resources. It is a pure scheduler, meaning it performs no other tasks such as monitoring or tracking, and it does not guarantee that a failed task will be restarted. The YARN scheduler supports plugins such as the Capacity Scheduler and the Fair Scheduler to partition cluster resources.
- ApplicationManager: responsible for accepting application submissions and negotiating the first container from the ResourceManager. It also restarts the ApplicationMaster container if a task fails.
NodeManager: responsible for a single node in the Hadoop cluster, managing the applications and workflow on that node. Its main job is to stay in sync with the ResourceManager: it registers with the ResourceManager, sends heartbeats carrying the node's health status, monitors resource usage, performs log management, and kills containers when instructed to by the ResourceManager. It is also responsible for creating container processes and starting them at the request of the ApplicationMaster.
ApplicationMaster: an application is a single job submitted to the framework. The ApplicationMaster negotiates resources with the ResourceManager and tracks the status and progress of that single application. It requests containers from the NodeManager by sending a Container Launch Context (CLC), which contains everything the application needs to run. Once the application is launched, it sends health reports to the ResourceManager from time to time.
Container: a collection of physical resources on a single node, such as RAM, CPU cores, and disk. A container is launched by invoking its Container Launch Context (CLC), a record containing environment variables, security tokens, dependencies, and other information.
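To make the container abstraction concrete, here is a heavily simplified sketch of how an ApplicationMaster asks the ResourceManager for one container through the AMRMClient API. In reality this code runs inside the container YARN launches for the ApplicationMaster (which supplies the required tokens); the allocation loop and the NMClient side that actually starts the container are omitted, and the class name is made up for this example:

```java
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.client.api.AMRMClient;
import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ContainerRequestSketch {
    public static void main(String[] args) throws Exception {
        AMRMClient<ContainerRequest> rmClient = AMRMClient.createAMRMClient();
        rmClient.init(new YarnConfiguration());
        rmClient.start();

        // Register this ApplicationMaster with the ResourceManager (host, port, tracking URL)
        rmClient.registerApplicationMaster("", 0, "");

        // A container is described by its physical resources: here 1 GB of RAM and 1 vCore
        Resource capability = Resource.newInstance(1024, 1);
        Priority priority = Priority.newInstance(0);

        // No node or rack preference; let the scheduler decide where the container runs
        rmClient.addContainerRequest(new ContainerRequest(capability, null, null, priority));

        // A real ApplicationMaster would now call allocate() in a loop, receive the
        // allocated containers, and start them through an NMClient.
    }
}
```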
YARN Execution Flow

- The client submits the application
- The ResourceManager allocates a container in which to start the ApplicationMaster
- The ApplicationMaster registers itself with the ResourceManager
- The ApplicationMaster negotiates containers from the ResourceManager
- The ApplicationMaster notifies the NodeManagers to launch the containers
- The application code executes in the containers
- The client contacts the ResourceManager / ApplicationMaster to monitor the application's status
- Once processing is complete, the ApplicationMaster deregisters from the ResourceManager
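The monitoring step above can be done from the command line with `yarn application -list` or `yarn application -status <ApplicationId>`, or programmatically through the YarnClient API. A minimal sketch (the class name ListApplications is made up for this example):

```java
import org.apache.hadoop.yarn.api.records.ApplicationReport;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class ListApplications {
    public static void main(String[] args) throws Exception {
        YarnClient yarnClient = YarnClient.createYarnClient();
        yarnClient.init(new YarnConfiguration());   // reads yarn-site.xml from the classpath
        yarnClient.start();

        // Ask the ResourceManager for a report of every application it knows about
        for (ApplicationReport report : yarnClient.getApplications()) {
            System.out.printf("%s  %-12s  %5.1f%%  %s%n",
                    report.getApplicationId(),
                    report.getYarnApplicationState(),
                    report.getProgress() * 100,
                    report.getName());
        }

        yarnClient.stop();
    }
}
```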
References
This article references:
https://techvidvan.com/tutorials/mapreduce-job-execution-flow/
https://www.geeksforgeeks.org/hadoop-yarn-architecture/
This column is about big data learning. If you find any problems, please point them out so we can learn together!