A Spark error I once hit: Failed to allocate a page (67108864 bytes), try again

2022-07-25 15:15:00 【The south wind knows what I mean】

Project scenario:

The business side asked for a join between two tables: a small table (48 million rows) and a large table (2.6 billion rows). This is a typical small-table/large-table join, so the first thing that came to mind was a Broadcast Join. Let's set it up.

Problem description

1. Straight to the code.

-- sc is the small table
select /*+ BROADCASTJOIN(sc) */ 
  sc.courseid,
  csc.courseid
from sale_course sc join course_shopping_cart csc
on sc.courseid=csc.courseid

2. Package it, run it on the cluster, and the errors start.

2022-06-22 19:36:56 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:36:57 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:36:59 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:00 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:00 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:01 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:01 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:01 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:03 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:03 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:04 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:05 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:05 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:05 WARN spark.HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 139818 ms exceeds timeout 120000 ms
2022-06-22 19:37:05 WARN spark.HeartbeatReceiver: Removing executor 5 with no recent heartbeats: 178273 ms exceeds timeout 120000 ms
2022-06-22 19:37:05 WARN spark.HeartbeatReceiver: Removing executor 7 with no recent heartbeats: 162256 ms exceeds timeout 120000 ms
2022-06-22 19:37:05 WARN spark.HeartbeatReceiver: Removing executor 3 with no recent heartbeats: 154289 ms exceeds timeout 120000 ms
2022-06-22 19:37:05 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 2
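As a quick sanity check on the numbers in this log, here is a minimal sketch that converts the page size to MiB and works out how far past the heartbeat timeout each executor was. The regular expressions and function names are my own assumptions about the log layout shown above, not a Spark API.

```python
import re

# Sample lines copied from the log above.
LOG = """\
2022-06-22 19:36:56 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:05 WARN spark.HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 139818 ms exceeds timeout 120000 ms
2022-06-22 19:37:05 WARN spark.HeartbeatReceiver: Removing executor 5 with no recent heartbeats: 178273 ms exceeds timeout 120000 ms
"""

# Hypothetical patterns written for this log layout only.
PAGE_RE = re.compile(r"Failed to allocate a page \((\d+) bytes\)")
HEARTBEAT_RE = re.compile(
    r"Removing executor (\d+) .*?: (\d+) ms exceeds timeout (\d+) ms"
)


def page_size_mib(log):
    """Size of the page Spark failed to allocate, in MiB."""
    match = PAGE_RE.search(log)
    return int(match.group(1)) / (1024 * 1024)


def heartbeat_overruns(log):
    """Map executor id -> seconds of silence beyond the timeout."""
    return {
        int(executor): (int(silent_ms) - int(timeout_ms)) / 1000
        for executor, silent_ms, timeout_ms in HEARTBEAT_RE.findall(log)
    }


print(page_size_mib(LOG))
print(heartbeat_overruns(LOG))
```

So each failed request is for a 64 MiB page, and the executors had been silent for well over the 120-second timeout before the driver removed them.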

3. From the log it looked like insufficient memory, so I printed the GC log for a closer look.

2022-06-22T19:32:04.731+0800: [GC (Allocation Failure) [PSYoungGen: 994157K->47291K(1377280K)] 1061069K->240591K(4076032K), 0.2125657 secs] [Times: user=4.51 sys=0.35, real=0.21 secs] 
2022-06-22T19:32:12.667+0800: [GC (Allocation Failure) [PSYoungGen: 1298524K->69107K(1380352K)] 1491823K->776885K(4079104K), 0.4118997 secs] [Times: user=12.93 sys=1.20, real=0.41 secs] 
2022-06-22T19:32:30.661+0800: [GC (Allocation Failure) [PSYoungGen: 1363073K->305779K(1643520K)] 2070852K->1248436K(4342272K), 0.2067380 secs] [Times: user=6.53 sys=0.68, real=0.21 secs] 
2022-06-22T19:32:49.327+0800: [GC (Allocation Failure) [PSYoungGen: 1583420K->380843K(1685504K)] 2526077K->1558689K(4384256K), 0.2134726 secs] [Times: user=6.50 sys=1.14, real=0.21 secs] 
2022-06-22T19:32:57.628+0800: [GC (Allocation Failure) [PSYoungGen: 1677943K->386985K(1469440K)] 2855790K->1938110K(4168192K), 0.1938505 secs] [Times: user=6.17 sys=0.87, real=0.19 secs] 
2022-06-22T19:33:10.943+0800: [GC (Allocation Failure) [PSYoungGen: 1424669K->489773K(1547776K)] 2975793K->2158027K(4246528K), 0.1824065 secs] [Times: user=6.34 sys=0.27, real=0.19 secs] 
2022-06-22T19:33:18.556+0800: [GC (Allocation Failure) [PSYoungGen: 1523628K->501866K(1313280K)] 4240457K->3578994K(5061120K), 0.1838270 secs] [Times: user=5.74 sys=0.84, real=0.18 secs] 
2022-06-22T19:33:19.956+0800: [GC (Allocation Failure) [PSYoungGen: 1214502K->632842K(1397248K)] 4291630K->3972122K(5145088K), 0.2161871 secs] [Times: user=7.20 sys=0.64, real=0.21 secs] 
2022-06-22T19:33:20.172+0800: [Full GC (Ergonomics) [PSYoungGen: 632842K->0K(1397248K)] [ParOldGen: 3339280K->3514303K(4194304K)] 3972122K->3514303K(5591552K), [Metaspace: 136487K->136476K(1177600K)], 0.6284626 secs] [Times: user=6.74 sys=3.98, real=0.63 secs] 
2022-06-22T19:33:22.153+0800: [GC (Allocation Failure) [PSYoungGen: 726892K->459232K(1398272K)] 4241195K->3973535K(5592576K), 0.0348947 secs] [Times: user=0.96 sys=0.00, real=0.04 secs] 
2022-06-22T19:33:23.347+0800: [GC (Allocation Failure) [PSYoungGen: 1158624K->656153K(1398272K)] 4672927K->4367065K(5592576K), 0.1967581 secs] [Times: user=6.70 sys=0.44, real=0.19 secs] 
2022-06-22T19:33:23.544+0800: [Full GC (Ergonomics) [PSYoungGen: 656153K->131072K(1398272K)] [ParOldGen: 3710911K->4169346K(4194304K)] 4367065K->4300418K(5592576K), [Metaspace: 136485K->136485K(1177600K)], 1.7445365 secs] [Times: user=46.91 sys=10.81, real=1.75 secs] 
2022-06-22T19:33:26.442+0800: [Full GC (Ergonomics) [PSYoungGen: 830464K->524355K(1398272K)] [ParOldGen: 4169346K->4169283K(4194304K)] 4999810K->4693638K(5592576K), [Metaspace: 136485K->136485K(1177600K)], 0.5643075 secs] [Times: user=14.75 sys=0.14, real=0.57 secs] 
2022-06-22T19:33:27.323+0800: [Full GC (Ergonomics) [PSYoungGen: 664059K->589892K(1398272K)] [ParOldGen: 4169283K->4169282K(4194304K)] 4833342K->4759175K(5592576K), [Metaspace: 136485K->136485K(1177600K)], 0.3743719 secs] [Times: user=10.16 sys=0.05, real=0.38 secs] 
2022-06-22T19:33:27.909+0800: [Full GC (Ergonomics) [PSYoungGen: 699392K->655430K(1398272K)] [ParOldGen: 4169282K->4169282K(4194304K)] 4868674K->4824713K(5592576K), [Metaspace: 136485K->136485K(1177600K)], 0.4272478 secs] [Times: user=11.16 sys=0.05, real=0.43 secs] 
2022-06-22T19:33:28.382+0800: [Full GC (Ergonomics) [PSYoungGen: 668779K->655430K(1398272K)] [ParOldGen: 4169282K->4169282K(4194304K)] 4838062K->4824713K(5592576K), [Metaspace: 136486K->136486K(1177600K)], 0.2751700 secs] [Times: user=6.67 sys=0.03, real=0.28 secs] 
2022-06-22T19:33:28.657+0800: [Full GC (Allocation Failure) [PSYoungGen: 655430K->655430K(1398272K)] [ParOldGen: 4169282K->4162677K(4194304K)] 4824713K->4818107K(5592576K), [Metaspace: 136486K->135746K(1177600K)], 0.6008903 secs] [Times: user=17.76 sys=0.08, real=0.60 secs] 
2022-06-22T19:33:29.260+0800: [Full GC (Ergonomics) [PSYoungGen: 659800K->655438K(1398272K)] [ParOldGen: 4162677K->4162674K(4194304K)] 4822477K->4818112K(5592576K), [Metaspace: 135746K->135746K(1177600K)], 1.4037111 secs] [Times: user=46.99 sys=0.27, real=1.40 secs] 
2022-06-22T19:33:30.664+0800: [Full GC (Allocation Failure) [PSYoungGen: 655438K->655431K(1398272K)] [ParOldGen: 4162674K->4162674K(4194304K)] 4818112K->4818105K(5592576K), [Metaspace: 135746K->135746K(1177600K)], 0.1268273 secs] [Times: user=1.35 sys=0.02, real=0.13 secs] 
2022-06-22T19:33:30.792+0800: [Full GC (Ergonomics) [PSYoungGen: 658317K->655447K(1398272K)] [ParOldGen: 4162674K->4162674K(4194304K)] 4820992K->4818121K(5592576K), [Metaspace: 135746K->135746K(1177600K)], 1.2769239 secs] [Times: user=42.48 sys=0.27, real=1.28 secs] 
2022-06-22T19:33:32.069+0800: [Full GC (Allocation Failure) [PSYoungGen: 655447K->655440K(1398272K)] [ParOldGen: 4162674K->4162674K(4194304K)] 4818121K->4818114K(5592576K), [Metaspace: 135746K->135746K(1177600K)], 0.2098295 secs] [Times: user=2.81 sys=0.02, real=0.21 secs] 
2022-06-22T19:33:32.282+0800: [Full GC (Ergonomics) [PSYoungGen: 657391K->655457K(1398272K)] [ParOldGen: 4162674K->4162673
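To put numbers on what the Full GC lines show, here is a small sketch that parses one of them and reports how full each generation still is after the collection. The regular expression is an assumption that fits the ParallelGC log format printed above; the sample line is copied from the log with its Metaspace and timing parts left out.

```python
import re

# One Full GC line from the log above (Metaspace and timing parts omitted).
LINE = (
    "2022-06-22T19:33:30.664+0800: [Full GC (Allocation Failure) "
    "[PSYoungGen: 655438K->655431K(1398272K)] "
    "[ParOldGen: 4162674K->4162674K(4194304K)] "
    "4818112K->4818105K(5592576K)"
)

# Hypothetical pattern: [Name: <before>K-><after>K(<capacity>K)]
GEN_RE = re.compile(r"\[(\w+): (\d+)K->(\d+)K\((\d+)K\)\]")


def occupancy_after_gc(line):
    """Fraction of each generation's capacity still in use after the GC."""
    return {
        name: int(after) / int(capacity)
        for name, _before, after, capacity in GEN_RE.findall(line)
    }


def reclaimed_kb(line):
    """Kilobytes freed in each generation by this collection."""
    return {
        name: int(before) - int(after)
        for name, before, after, _capacity in GEN_RE.findall(line)
    }


print(occupancy_after_gc(LINE))
print(reclaimed_kb(LINE))
```

For this line the old generation is still more than 99% full after a Full GC and nothing was reclaimed from it, which is why the JVM keeps running back-to-back Full GCs.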

Cause analysis:

At this point the problem is clear: out of memory. Increasing the executor memory and driver memory usually solves it.
Still, let me review how broadcast join works.

1. How broadcast join works

Among Spark's join strategies, if the small table is small enough to be held in memory, Broadcast Hash Join can be used. The small table is first collected to the driver and then broadcast to the executors that process the partitions of the large table. Each partition of the large table is then joined locally against the small table, which avoids a shuffle.
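The idea can be illustrated outside Spark with a toy sketch in plain Python. This is not Spark's implementation; the function name, the extra columns (name, user) and the sample rows are made up for illustration. Only courseid comes from the query above.

```python
from collections import defaultdict


def broadcast_hash_join(small_rows, large_partitions, key):
    """Toy broadcast hash join: build once, probe locally in each partition."""
    # "Driver" side: the whole small table is collected into one hash map.
    # This is the step that needs the small table to fit in memory.
    hash_table = defaultdict(list)
    for row in small_rows:
        hash_table[row[key]].append(row)

    # "Executor" side: every partition of the large table probes the same
    # map locally, so rows of the large table never move between partitions.
    joined = []
    for partition in large_partitions:
        for big_row in partition:
            for small_row in hash_table.get(big_row[key], []):
                joined.append((small_row, big_row))
    return joined


# Hypothetical sample data.
sale_course = [
    {"courseid": 1, "name": "spark"},
    {"courseid": 2, "name": "flink"},
]
cart_partitions = [
    [{"courseid": 1, "user": "a"}, {"courseid": 3, "user": "b"}],
    [{"courseid": 2, "user": "c"}, {"courseid": 1, "user": "d"}],
]

result = broadcast_hash_join(sale_course, cart_partitions, "courseid")
for small_row, big_row in result:
    print(small_row["courseid"], small_row["name"], big_row["user"])
```

The whole small table lives in one in-memory map, so the memory cost grows with the size of the small table, not the large one.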

#1. Enable automatic broadcast via a parameter
The automatic broadcast join threshold defaults to 10 MB and is controlled by the spark.sql.autoBroadcastJoinThreshold parameter.
new SparkConf().set("spark.sql.autoBroadcastJoinThreshold","10m")  // enable, with a 10 MB threshold
new SparkConf().set("spark.sql.autoBroadcastJoinThreshold","-1")   // disable
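A rough back-of-the-envelope check shows how far the small table in this post is from that default. The 8 bytes per row is purely an assumption (one 8-byte key and nothing else), and the function names are made up. Spark's planner uses its own size estimate from table statistics, so treat this as a simplified sketch.

```python
# The 10 MB default threshold mentioned above, in bytes.
DEFAULT_THRESHOLD_BYTES = 10 * 1024 * 1024


def estimated_size_bytes(row_count, bytes_per_row):
    """Naive size estimate: rows times an assumed fixed row width."""
    return row_count * bytes_per_row


def fits_auto_broadcast(row_count, bytes_per_row,
                        threshold=DEFAULT_THRESHOLD_BYTES):
    """True if the estimate is within the threshold; -1 disables broadcast."""
    if threshold < 0:
        return False
    return estimated_size_bytes(row_count, bytes_per_row) <= threshold


small_table_rows = 48_000_000  # 4800 x 10,000 rows, from the scenario above
assumed_bytes_per_row = 8      # ASSUMPTION: a single 8-byte courseid

size = estimated_size_bytes(small_table_rows, assumed_bytes_per_row)
print(size / (1024 * 1024), "MiB")
print(fits_auto_broadcast(small_table_rows, assumed_bytes_per_row))
```

Even under this optimistic assumption the table is hundreds of MiB, far above the default threshold, so Spark would not broadcast it automatically; it only happens here because the hint forces it.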

#2. Force a broadcast join
# Using a SQL hint
# sc must be the small table in the join
select /*+ BROADCASTJOIN(sc) */   or  /*+ BROADCAST(sc) */  or  /*+ MAPJOIN(sc) */

2. About my problem

As described above, a broadcast join pulls the small table's data to the driver, so the driver memory cannot be too small; give it too little and the job fails.
However, the problem was still there after I increased the driver memory.
My small table simply has too much data, and the cluster cannot give us that much memory.

Solution:

So what to do?
Then don't use a broadcast join; just use an ordinary join. It is slower, but the hardware resources are what they are, so there is no way around it.
In the end, joining the two tables took two hours QAQ
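For reference, a hedged sketch of what such a run could look like. The broadcast hint has to be removed from the SQL itself, because a hint forces the broadcast regardless of the threshold. The helper below only assembles a spark-submit command line; --driver-memory, --executor-memory and --conf are real spark-submit options, but the memory sizes and the jar name are placeholders, not the values used in this job.

```python
import shlex


def spark_submit_command(app, driver_memory, executor_memory,
                         disable_auto_broadcast=True):
    """Assemble a spark-submit argument list (hypothetical helper)."""
    cmd = [
        "spark-submit",
        "--driver-memory", driver_memory,
        "--executor-memory", executor_memory,
    ]
    if disable_auto_broadcast:
        # -1 turns automatic broadcast joins off, as described earlier.
        cmd += ["--conf", "spark.sql.autoBroadcastJoinThreshold=-1"]
    cmd.append(app)
    return cmd


# Placeholder values for illustration only.
command = spark_submit_command("join_job.jar", "4g", "8g")
print(shlex.join(command))
```

With no hint and automatic broadcast disabled, Spark falls back to a shuffle-based join, which trades the memory pressure for shuffle time.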

Copyright notice
This article was written by [The south wind knows what I mean]. Please include a link to the original when reposting. Thank you.
https://yzsam.com/2022/206/202207251508041191.html