当前位置:网站首页>Spark的RDD(弹性分布式数据集)返回大结果集
Spark的RDD(弹性分布式数据集)返回大结果集
2022-07-06 09:29:00 【若苗瞬】
(一)过大的结果集导致Driver端OOM错误
很早之前《从零开始学习大数据平台》就提到了这个问题。
通过collect()返回结果集,数据量太大就会报driver端内存错误。
比如我这里,返回千万数据,就导致Out of memory,或者GC overhead limit exceeded。
Spark的核心弹性分布式数据集,对于rdd.collect()是这么描述的:
java.util.List collect() //Python版本差不多的
返回一个数组包含RDD中所有的元素
.
备注
.
这个方法只能在结果集较小的情况下使用,因为所有的数据都会加载到driver的内存中。
(二)各种解决方案
如果结果集大怎能办?
网上查了一圈,大家的说法有:
- 大的数据集用rdd写入HDFS文件的方法,比如
rdd.saveAsTextFile,rdd.saveAsNewAPIHadoopFile各个节点将结果写入到HDFS的目录中(一大堆文件)。 - 不传到driver上而是
rdd.foreach,rdd.foreachPartition打印到屏幕上。 - 通过可序列化的类型,发送到数据库等等。
(三)分批返回结果集
不过我比较懒,希望整个流程和以前无缝切换,
所以我采用了collectPartitions,分批返回数据。
Java代码如下:
...
if (ColBatch < out_2.getNumPartitions()) {
//如果设置了批量值,并小于分区数,则分批collect
int[] Par = new int[ColBatch];
for (int i = 0; i < out_2.getNumPartitions(); i += Par.length) {
int ParLen = 0;
for (int j = 0; j < Par.length; j++) {
if (i + j < out_2.getNumPartitions()) {
Par[j] = i + j;
ParLen++;
}
}
TmpLine = String.format("当前: %d - %d\n", i, i + ParLen - 1);
System.out.print(TmpLine);
List<Tuple2<String, String>>[] output2 = out_2.collectPartitions(Par);
for (int j = 0; j < ParLen; j++) {
for (Tuple2<String, String> tuple : output2[j]) {
oF002.write(String.format("%s|%s\n", tuple._1(), tuple._2()));
g002++;
}
}
}
} else {
List<Tuple2<String, String>> output1 = out_2.collect();
for (Tuple2<String, String> tuple : output1) {
oF002.write(String.format("%s|%s\n", tuple._1(), tuple._2()));
g002++;
}
}
...
PS:但Python中没有找到collectPartitions方法啊???肿么办!!! 。
边栏推荐
- AcWing:第58场周赛
- QT模拟鼠标事件,实现点击双击移动拖拽等
- QWidget代码设置样式表探讨
- Codeforces Round #803 (Div. 2)A~C
- 使用jq实现全选 反选 和全不选-冯浩的博客
- Codeforces - 1526C1&&C2 - Potions
- Sanic异步框架真的这么强吗?实践中找真理
- QT implementation window gradually disappears qpropertyanimation+ progress bar
- Input can only input numbers, limited input
- Configuration du cadre flask loguru log Library
猜你喜欢

SF smart logistics Campus Technology Challenge (no T4)

(lightoj - 1323) billiard balls (thinking)

Codeforces Round #802(Div. 2)A~D

Discussion on QWidget code setting style sheet

新手必会的静态站点生成器——Gridsome

Flag framework configures loguru logstore

Installation and use of VMware Tools and open VM tools: solve the problems of incomplete screen and unable to transfer files of virtual machines

1005. Maximized array sum after K negations

Tree of life (tree DP)

Raspberry pie 4b64 bit system installation miniconda (it took a few days to finally solve it)
随机推荐
Codeforces round 797 (Div. 3) no f
Click QT button to switch qlineedit focus (including code)
Sword finger offer II 019 Delete at most one character to get a palindrome
875. Leetcode, a banana lover
window11 conda安装pytorch过程中遇到的一些问题
(POJ - 3258) River hopper (two points)
Kubernetes集群部署
605. Planting flowers
Opencv learning log 26 -- detect circular holes and mark them
Research Report on market supply and demand and strategy of double drum magnetic separator industry in China
Some problems encountered in installing pytorch in windows11 CONDA
Specify the format time, and fill in zero before the month and days
Openwrt build Hello ipk
Study notes of Tutu - process
Codeforces Round #801 (Div. 2)A~C
顺丰科技智慧物流校园技术挑战赛(无t4)
The "sneaky" new asteroid will pass the earth safely this week: how to watch it
Summary of game theory
AcWing——第55场周赛
Input can only input numbers, limited input