当前位置:网站首页>记一次Spark报错:Failed to allocate a page (67108864 bytes), try again.
记一次Spark报错:Failed to allocate a page (67108864 bytes), try again.
2022-06-27 06:18:00 【南风知我意丿】
项目场景:
业务那边给了个需求,我们这边完成的话需要两个表进行join操作,小表(4800万条)大表(26亿条)。典型的小表和大表join,首先想到的就是 Broadcast Join对其进行一个最佳处理。
问题描述
1,开干。
//sc是小表
select /*+ BROADCASTJOIN(sc) */
sc.courseid,
csc.courseid
from sale_course sc join course_shopping_cart csc
on sc.courseid=csc.courseid
2,打包上集群run,开始出bug
2022-06-22 19:36:56 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:36:57 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:36:59 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:00 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:00 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:01 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:01 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:01 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:03 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:03 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:04 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:05 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:05 WARN memory.TaskMemoryManager: Failed to allocate a page (67108864 bytes), try again.
2022-06-22 19:37:05 WARN spark.HeartbeatReceiver: Removing executor 2 with no recent heartbeats: 139818 ms exceeds timeout 120000 ms
2022-06-22 19:37:05 WARN spark.HeartbeatReceiver: Removing executor 5 with no recent heartbeats: 178273 ms exceeds timeout 120000 ms
2022-06-22 19:37:05 WARN spark.HeartbeatReceiver: Removing executor 7 with no recent heartbeats: 162256 ms exceeds timeout 120000 ms
2022-06-22 19:37:05 WARN spark.HeartbeatReceiver: Removing executor 3 with no recent heartbeats: 154289 ms exceeds timeout 120000 ms
2022-06-22 19:37:05 INFO cluster.YarnClusterSchedulerBackend: Requesting to kill executor(s) 2
3,看了下应该是内存不足,打印一下GC日志再瞅瞅
2022-06-22T19:32:04.731+0800: [GC (Allocation Failure) [PSYoungGen: 994157K->47291K(1377280K)] 1061069K->240591K(4076032K), 0.2125657 secs] [Times: user=4.51 sys=0.35, real=0.21 secs]
2022-06-22T19:32:12.667+0800: [GC (Allocation Failure) [PSYoungGen: 1298524K->69107K(1380352K)] 1491823K->776885K(4079104K), 0.4118997 secs] [Times: user=12.93 sys=1.20, real=0.41 secs]
2022-06-22T19:32:30.661+0800: [GC (Allocation Failure) [PSYoungGen: 1363073K->305779K(1643520K)] 2070852K->1248436K(4342272K), 0.2067380 secs] [Times: user=6.53 sys=0.68, real=0.21 secs]
2022-06-22T19:32:49.327+0800: [GC (Allocation Failure) [PSYoungGen: 1583420K->380843K(1685504K)] 2526077K->1558689K(4384256K), 0.2134726 secs] [Times: user=6.50 sys=1.14, real=0.21 secs]
2022-06-22T19:32:57.628+0800: [GC (Allocation Failure) [PSYoungGen: 1677943K->386985K(1469440K)] 2855790K->1938110K(4168192K), 0.1938505 secs] [Times: user=6.17 sys=0.87, real=0.19 secs]
2022-06-22T19:33:10.943+0800: [GC (Allocation Failure) [PSYoungGen: 1424669K->489773K(1547776K)] 2975793K->2158027K(4246528K), 0.1824065 secs] [Times: user=6.34 sys=0.27, real=0.19 secs]
2022-06-22T19:33:18.556+0800: [GC (Allocation Failure) [PSYoungGen: 1523628K->501866K(1313280K)] 4240457K->3578994K(5061120K), 0.1838270 secs] [Times: user=5.74 sys=0.84, real=0.18 secs]
2022-06-22T19:33:19.956+0800: [GC (Allocation Failure) [PSYoungGen: 1214502K->632842K(1397248K)] 4291630K->3972122K(5145088K), 0.2161871 secs] [Times: user=7.20 sys=0.64, real=0.21 secs]
2022-06-22T19:33:20.172+0800: [Full GC (Ergonomics) [PSYoungGen: 632842K->0K(1397248K)] [ParOldGen: 3339280K->3514303K(4194304K)] 3972122K->3514303K(5591552K), [Metaspace: 136487K->136476K(1177600K)], 0.6284626 secs] [Times: user=6.74 sys=3.98, real=0.63 secs]
2022-06-22T19:33:22.153+0800: [GC (Allocation Failure) [PSYoungGen: 726892K->459232K(1398272K)] 4241195K->3973535K(5592576K), 0.0348947 secs] [Times: user=0.96 sys=0.00, real=0.04 secs]
2022-06-22T19:33:23.347+0800: [GC (Allocation Failure) [PSYoungGen: 1158624K->656153K(1398272K)] 4672927K->4367065K(5592576K), 0.1967581 secs] [Times: user=6.70 sys=0.44, real=0.19 secs]
2022-06-22T19:33:23.544+0800: [Full GC (Ergonomics) [PSYoungGen: 656153K->131072K(1398272K)] [ParOldGen: 3710911K->4169346K(4194304K)] 4367065K->4300418K(5592576K), [Metaspace: 136485K->136485K(1177600K)], 1.7445365 secs] [Times: user=46.91 sys=10.81, real=1.75 secs]
2022-06-22T19:33:26.442+0800: [Full GC (Ergonomics) [PSYoungGen: 830464K->524355K(1398272K)] [ParOldGen: 4169346K->4169283K(4194304K)] 4999810K->4693638K(5592576K), [Metaspace: 136485K->136485K(1177600K)], 0.5643075 secs] [Times: user=14.75 sys=0.14, real=0.57 secs]
2022-06-22T19:33:27.323+0800: [Full GC (Ergonomics) [PSYoungGen: 664059K->589892K(1398272K)] [ParOldGen: 4169283K->4169282K(4194304K)] 4833342K->4759175K(5592576K), [Metaspace: 136485K->136485K(1177600K)], 0.3743719 secs] [Times: user=10.16 sys=0.05, real=0.38 secs]
2022-06-22T19:33:27.909+0800: [Full GC (Ergonomics) [PSYoungGen: 699392K->655430K(1398272K)] [ParOldGen: 4169282K->4169282K(4194304K)] 4868674K->4824713K(5592576K), [Metaspace: 136485K->136485K(1177600K)], 0.4272478 secs] [Times: user=11.16 sys=0.05, real=0.43 secs]
2022-06-22T19:33:28.382+0800: [Full GC (Ergonomics) [PSYoungGen: 668779K->655430K(1398272K)] [ParOldGen: 4169282K->4169282K(4194304K)] 4838062K->4824713K(5592576K), [Metaspace: 136486K->136486K(1177600K)], 0.2751700 secs] [Times: user=6.67 sys=0.03, real=0.28 secs]
2022-06-22T19:33:28.657+0800: [Full GC (Allocation Failure) [PSYoungGen: 655430K->655430K(1398272K)] [ParOldGen: 4169282K->4162677K(4194304K)] 4824713K->4818107K(5592576K), [Metaspace: 136486K->135746K(1177600K)], 0.6008903 secs] [Times: user=17.76 sys=0.08, real=0.60 secs]
2022-06-22T19:33:29.260+0800: [Full GC (Ergonomics) [PSYoungGen: 659800K->655438K(1398272K)] [ParOldGen: 4162677K->4162674K(4194304K)] 4822477K->4818112K(5592576K), [Metaspace: 135746K->135746K(1177600K)], 1.4037111 secs] [Times: user=46.99 sys=0.27, real=1.40 secs]
2022-06-22T19:33:30.664+0800: [Full GC (Allocation Failure) [PSYoungGen: 655438K->655431K(1398272K)] [ParOldGen: 4162674K->4162674K(4194304K)] 4818112K->4818105K(5592576K), [Metaspace: 135746K->135746K(1177600K)], 0.1268273 secs] [Times: user=1.35 sys=0.02, real=0.13 secs]
2022-06-22T19:33:30.792+0800: [Full GC (Ergonomics) [PSYoungGen: 658317K->655447K(1398272K)] [ParOldGen: 4162674K->4162674K(4194304K)] 4820992K->4818121K(5592576K), [Metaspace: 135746K->135746K(1177600K)], 1.2769239 secs] [Times: user=42.48 sys=0.27, real=1.28 secs]
2022-06-22T19:33:32.069+0800: [Full GC (Allocation Failure) [PSYoungGen: 655447K->655440K(1398272K)] [ParOldGen: 4162674K->4162674K(4194304K)] 4818121K->4818114K(5592576K), [Metaspace: 135746K->135746K(1177600K)], 0.2098295 secs] [Times: user=2.81 sys=0.02, real=0.21 secs]
2022-06-22T19:33:32.282+0800: [Full GC (Ergonomics) [PSYoungGen: 657391K->655457K(1398272K)] [ParOldGen: 4162674K->4162673
原因分析:
其实看到这我就知道问题出现那里了,内存不足,调整下executor内存和driver内存,一般就可以解决了
但是还是在复习一下广播join吧
1.广播join原理
Spark join策略中,如果当一张小表足够小并且可以先缓存到内存中,那么可以使用Broadcast Hash Join,其原理就是先
将小表聚合到driver端,再广播到各个大表分区中,那么再次进行join的时候,就相当于大表的各自分区的数据与小表进行本地join,从而规避了shuffle。
#1,通过参数指定自动广播
广播join默认值为10MB,由spark.sql.autoBroadcastJoinThreshold参数控制。
SparkConf().set("spark.sql.autoBroadcastJoinThreshold","10m") //开启
SparkConf().set("spark.sql.autoBroadcastJoinThreshold","-1") //禁用
#2,强制开启广播join
#SQL Hint方式
#sc 必须是join的小表
select /*+ BROADCASTJOIN(sc) */ 或 /*+ BROADCAST(sc) */ 或 /*+ MAPJOIN(sc) */
2,说下我的问题
上面说了广播join 会把小表的数据 拉到driver段,所以driver的内存不能给的太小了,给太小就会报错
但是,我把driver内存调大之后还是没解决
因为我的小表的数据量太大了,我们集群内存又没办法给太多,无奈
解决方案:
那咋办?
那就别广播join了,就普通join吧虽然慢些 但是硬件资源就在那摆着那也没办法
最后两个表join了两个小时QAQ
边栏推荐
- HTAP 快速上手指南
- 【QT小记】QT元对象系统简单认识
- Add widget on qlistwidgetitem
- Run opcua protocol demo on raspberry pie 4B to access kubeedge
- Download CUDA and cudnn
- 【Cocos Creator 3.5.1】input. Use of on
- How win 10 opens the environment variables window
- 飞行器翼尖加速度和控制面的MPC控制
- [getting started] regular expression Basics
- HTAP 深入探索指南
猜你喜欢

飞行器翼尖加速度和控制面的MPC控制

C Primer Plus 第11章_字符串和字符串函数_代码和练习题

The restart status of the openstack instance will change to the error handling method. The openstack built by the container restarts the compute service method of the computing node and prompts the gi

Us camera cloud service scheme: designed for lightweight video production scenes

IDEA一键生成Log日志

yaml文件加密

使用CSDN 开发云搭建导航网站

Crawler learning 5--- anti crawling identification picture verification code (ddddocr and pyteseract measured effect)

Small program of C language practice (consolidate and deepen the understanding of knowledge points)

使用 WordPress快速个人建站指南
随机推荐
My opinion on test team construction
【QT小记】QT元对象系统简单认识
Multithreading basic part part 1
The form verifies the variables bound to the V-model, and the solution to invalid verification
426-二叉树(513.找树左下角的值、112. 路径总和、106.从中序与后序遍历序列构造二叉树、654. 最大二叉树)
tracepoint
Change the status to the corresponding text during MySQL query
程序猿学习抖音短视频制作
Assembly language - Wang Shuang Chapter 3 notes and experiments
创建一个基础WDM驱动,并使用MFC调用驱动
[QT dot] realize the watchdog function to detect whether the external program is running
[QT notes] basic use of qregularexpression in QT
Configuring the help class iconfiguration in C # NETCORE
LeetCode 0086. Separate linked list
汇编语言-王爽 第13章 int指令-笔记
Assembly language - Wang Shuang Chapter 13 int instruction - Notes
JVM的垃圾回收机制
Scala之偏函数Partial Function
Block level elements & inline elements
Multithreading basic part2