Notes on MapReduce execution principles

2022-06-26 03:52:00 I love meat

The basic principles of MapReduce are omitted here.


Some notes:
1 One file is sliced into multiple splits, and one split corresponds to one MapTask. Across map outputs, the same partition corresponds to one ReduceTask.
2 When the map-side memory buffer is 80% full, data starts being written to disk. The map task does not stop at that point; it keeps writing to memory while the spill to disk is in progress.
If the input is fully read before the buffer reaches 80%, the data is still written to a disk file, but only once.
3 Reduce tasks start once most (not all) of the map tasks have completed; the fraction is set by `mapreduce.job.reduce.slowstart.completedmaps`.
On the reduce side, data is written to disk when memory is insufficient, and everything is finally merged into one large disk file on which the reduce business logic runs.
4 In the reduce stage, the map output files are read; the index file determines which data to read.
5 Why is the data sorted?
	Because the reduce phase needs to group records, putting records with the same key together for reduction. Sorting in the map phase as well relieves the in-memory sorting pressure in the reduce phase.
	For example, for an aggregation within one partition on the reduce side, a single pass over the sorted keys is enough to aggregate each key. If the data were unordered, all files would have to be traversed.
6 On the map side, the small spill files (three, in the example here) are merged into one large file.
7 Sorting in memory uses quicksort; merging files uses merge sort.
8 Map-side data has an index file. Reduce-side data has no index file, because the reduce-side data is already sorted.
(By contrast, Spark skips the in-memory sort when the operator is not a pre-aggregation operator and the number of downstream partitions is small, which improves performance.)
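
Note 1 can be illustrated with a small sketch of how one map task's output is divided into partitions, one per reduce task. This is only an illustration: `partition_for` and `partition_map_output` are hypothetical names, and `zlib.crc32` stands in for the key hash that a hash-based partitioner would use.

```python
import zlib


def partition_for(key: str, num_reduce_tasks: int) -> int:
    """Map a key to a partition index in [0, num_reduce_tasks)."""
    # zlib.crc32 is a deterministic stand-in for the key's hash code (assumption).
    return zlib.crc32(key.encode("utf-8")) % num_reduce_tasks


def partition_map_output(records, num_reduce_tasks):
    """Group one map task's (key, value) output by partition index."""
    partitions = {p: [] for p in range(num_reduce_tasks)}
    for key, value in records:
        partitions[partition_for(key, num_reduce_tasks)].append((key, value))
    return partitions


if __name__ == "__main__":
    map_output = [("apple", 1), ("pear", 1), ("apple", 2), ("fig", 1)]
    for index, rows in partition_map_output(map_output, 3).items():
        print(index, rows)
```

Because the partition depends only on the key, every record with the same key lands in the same partition, and therefore reaches the same reduce task, no matter which map task produced it.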

Source code version: 2.7.7


Submitting a job
	1 The client parses the MR job and generates the necessary artifacts: the startup script, job.xml, and the jar package (these are submitted to a temporary directory in HDFS).
	2 The client submits the job through a proxy object for the RM, sending the RM an event that submits the application. The event contains (jobid, submitDir).
	3 The RM allocates a NodeManager to start the master program, MRAppMaster. MRAppMaster then has other NodeManagers start YarnChild processes to run the tasks.
	MRAppMaster and YarnChild stay in contact with each other.
	When all tasks have run successfully, the master program notifies the RM and the client. The RM then releases the resources, and the client treats the job as successful.

	MRAppMaster is comparable to the Spark Driver, and YarnChild is comparable to a Spark executor.


Ring buffer

After receiving the start command from MRAppMaster, the NodeManager launches a JVM process, which pulls the required resources from HDFS and runs a MapTask or ReduceTask.
The partitioner is called to tag each key-value pair output by the MapTask with its partition, and the record is written to the ring buffer (the buffer is 100 MB; spilling to disk starts when it is 80% full).

The buffer is 100 MB by default (`mapreduce.task.io.sort.mb`). With the equator as the boundary, record data is written on the right and fixed-size index entries are written on the left (in the 2.7.7 source, each entry is four 4-byte integers: value start, key start, partition, and value length).
When the buffer reaches 80% (`mapreduce.map.sort.spill.percent`), it starts spilling to disk, and the memory is freed as the data lands on disk.
While the spill is running, a new equator is set inside the remaining 20% of memory and writing continues there. If memory fills up again before the spill has finished, the writer blocks and resumes only once the 80% has been written to disk.
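
The threshold behaviour described above can be modelled with a toy buffer. This is a simplified sketch, not the Hadoop implementation: `SpillBuffer` is a hypothetical class, sizes are in arbitrary units rather than bytes, and the spill happens synchronously, whereas the real map task spills in the background while it keeps writing into the remaining 20%.

```python
class SpillBuffer:
    """Toy model of the map-side sort buffer (sizes in arbitrary units)."""

    def __init__(self, capacity=100, spill_percent=0.80):
        self.capacity = capacity
        self.threshold = capacity * spill_percent
        self.used = 0
        self.records = []  # (partition, key, value) currently held in memory
        self.spills = []   # each entry models one sorted spill file on disk

    def collect(self, partition, key, value, size=1):
        """Buffer one record and spill once usage reaches the threshold."""
        self.records.append((partition, key, value))
        self.used += size
        if self.used >= self.threshold:
            self._spill()

    def _spill(self):
        # Sort by partition, then by key, and "write" the run to disk.
        self.spills.append(sorted(self.records, key=lambda r: (r[0], r[1])))
        # The same buffer is reused; no new memory is requested.
        self.records = []
        self.used = 0

    def flush(self):
        """At the end of the map task, write whatever is still buffered."""
        if self.records:
            self._spill()


if __name__ == "__main__":
    buf = SpillBuffer(capacity=10)
    for i in range(19):
        buf.collect(i % 2, "k%02d" % i, i)
    buf.flush()
    print([len(run) for run in buf.spills])
```

With a capacity of 10 units, the 19 records produce two spills at the 80% mark and a final, smaller one from `flush()`. A task whose output never reaches the threshold ends up with exactly one spill, which matches the "written only once" case in the notes.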

A major reason for MR's stability: the memory is requested only once and then reused throughout, so the task never keeps requesting new memory.
The memory is simply overwritten by new writes, and nothing has to be garbage-collected.

Before the data is written to disk, it is sorted with quicksort; that is, the positions of the data within that 80% of memory are exchanged.
1 Sort by partition number first
2 Within a partition, sort by key
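
The two-level ordering above amounts to comparing records by the pair (partition, key). A minimal sketch follows, with two assumptions: `spill_order` is a hypothetical helper that reorders only an index array (a modelling choice made here), and Python's built-in `sorted` is used in place of quicksort, which yields the same ordering.

```python
def spill_order(records):
    """Return record indices ordered by partition first, then by key."""
    return sorted(range(len(records)),
                  key=lambda i: (records[i][0], records[i][1]))


if __name__ == "__main__":
    # Each record is (partition, key, value), in the order the map emitted it.
    records = [(1, "pear", 1), (0, "fig", 1), (1, "apple", 1), (0, "apple", 1)]
    for i in spill_order(records):
        print(records[i])
```

After the sort, each partition forms one contiguous, key-ordered run, so a spill file can be written partition by partition.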

Map-side file merge
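
As the notes say, spill files are already sorted, so they are combined with a merge sort rather than sorted again from scratch. The sketch below shows the idea with `heapq.merge` from the Python standard library; `merge_spills` is a hypothetical name, and in-memory lists stand in for the spill files on disk.

```python
import heapq


def merge_spills(spills):
    """Merge sorted spill runs into a single sorted run (k-way merge)."""
    return list(heapq.merge(*spills, key=lambda r: (r[0], r[1])))


if __name__ == "__main__":
    # Three spills, each already sorted by (partition, key).
    spill_a = [(0, "apple", 1), (1, "pear", 1)]
    spill_b = [(0, "fig", 1), (1, "apple", 1)]
    spill_c = [(0, "apple", 1)]
    for record in merge_spills([spill_a, spill_b, spill_c]):
        print(record)
```

The merge only ever compares the current head of each run, so it streams through the inputs once and does not need to hold all of them in memory.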

Reduce-side shuffle

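Reduce side
Memory is still 100 MB. The threshold that triggers writing to disk is 0.66 (`mapreduce.reduce.shuffle.merge.percent`), and the usable-memory threshold is 0.7 (`mapreduce.reduce.shuffle.input.buffer.percent`).

When reading data, records with the same key are placed in an intermediate container. Reading continues until the next key is different or there is no next record (this works because the data is sorted).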
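
The grouping step described above can be sketched with `itertools.groupby`, which likewise collects consecutive records until the key changes. `reduce_sorted` and the summing reducer are hypothetical names used only for illustration.

```python
from itertools import groupby
from operator import itemgetter


def reduce_sorted(pairs, reducer):
    """Call reducer(key, values) once per run of consecutive equal keys."""
    return [(key, reducer(key, [value for _, value in group]))
            for key, group in groupby(pairs, key=itemgetter(0))]


def sum_values(key, values):
    """Example reducer: add up all values seen for one key."""
    return sum(values)


if __name__ == "__main__":
    sorted_input = [("apple", 1), ("apple", 1), ("fig", 1), ("pear", 1)]
    print(reduce_sorted(sorted_input, sum_values))

    # With unsorted input the same key shows up in several separate groups.
    unsorted_input = [("apple", 1), ("fig", 1), ("apple", 1)]
    print(reduce_sorted(unsorted_input, sum_values))
```

The second call shows why the sort matters: a single pass groups each key correctly only when equal keys are adjacent.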


Copyright notice
This article was written by [I love meat]. Please include the original link when reposting. Thanks.
https://yzsam.com/2022/177/202206260325048378.html