Returning a large result set from a Spark RDD (Resilient Distributed Dataset)
2022-07-06 16:37:00 【Ruo Miaoshen】
(One) An oversized result set causes an OOM error on the driver
I mentioned this problem a long time ago in "Learn Big Data Platform from Scratch".
When a result set is returned through collect() and the amount of data is too large, the driver reports a memory error.
That is exactly what happened to me here: returning tens of millions of rows ended in either Out of memory or GC overhead limit exceeded.
The Spark Core documentation for resilient distributed datasets describes rdd.collect() as follows:
java.util.List<T> collect() // the Python version is similar
Returns an array containing all the elements in the RDD.
Note: This method should only be used when the result set is expected to be small, because all the data is loaded into the driver's memory.
(Two) Various solutions
What can we do when the result set is large?
I searched online, and the usual suggestions are:
- Write the large RDD to HDFS files with methods such as rdd.saveAsTextFile or rdd.saveAsNewAPIHadoopFile. Each node writes its results into an HDFS directory (as a pile of files).
- Do not return the data to the driver; use rdd.foreach or rdd.foreachPartition to print it instead. On a cluster the output ends up in each executor's stdout, not on the driver's console.
- Send the data to a database or a similar store, using serializable types.
(Three) Return result sets in batches
But I'm lazy, and I wanted the new approach to slot into the existing collect()-based flow without further changes,
so I used collectPartitions to return the data a few partitions at a time.
The Java code is as follows:
...
// Uses java.util.List and scala.Tuple2.
int numPartitions = out_2.getNumPartitions();
if (ColBatch > 0 && ColBatch < numPartitions) {
    // A batch size is set and is smaller than the partition count: collect in batches.
    // (Without the ColBatch > 0 check, a batch size of 0 would loop forever.)
    for (int i = 0; i < numPartitions; i += ColBatch) {
        // The last batch may be shorter, so size the index array to fit;
        // otherwise stale indices from the previous batch would be collected again.
        int ParLen = Math.min(ColBatch, numPartitions - i);
        int[] Par = new int[ParLen];
        for (int j = 0; j < ParLen; j++) {
            Par[j] = i + j;
        }
        TmpLine = String.format("Current partitions: %d - %d\n", i, i + ParLen - 1);
        System.out.print(TmpLine);
        // Only the partitions listed in Par are brought back to the driver.
        List<Tuple2<String, String>>[] output2 = out_2.collectPartitions(Par);
        for (List<Tuple2<String, String>> partition : output2) {
            for (Tuple2<String, String> tuple : partition) {
                oF002.write(String.format("%s|%s\n", tuple._1(), tuple._2()));
                g002++;
            }
        }
    }
} else {
    // No usable batch size: collect everything at once, as before.
    List<Tuple2<String, String>> output1 = out_2.collect();
    for (Tuple2<String, String> tuple : output1) {
        oF002.write(String.format("%s|%s\n", tuple._1(), tuple._2()));
        g002++;
    }
}
...
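The index arithmetic in the loop above is easy to get wrong at the edges (a batch size of 0, or a final batch that is shorter than the rest), so it may help to look at it in isolation. The following is only a minimal sketch in plain Java with no Spark dependency; the class name PartitionBatches and the method plan are hypothetical names made up for this illustration and are not part of Spark or of the original program.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper: plans which partition indices to request in each batch. */
final class PartitionBatches {

    private PartitionBatches() {
    }

    /**
     * Splits the indices 0..numPartitions-1 into consecutive groups of at most batchSize.
     * The last group is shorter when numPartitions is not a multiple of batchSize.
     */
    static List<int[]> plan(int numPartitions, int batchSize) {
        if (numPartitions < 0) {
            throw new IllegalArgumentException("numPartitions must not be negative");
        }
        if (batchSize <= 0) {
            // A zero step would make the loop below spin forever.
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> batches = new ArrayList<>();
        for (int start = 0; start < numPartitions; start += batchSize) {
            // Size each array exactly, so no stale index is left over in a short batch.
            int length = Math.min(batchSize, numPartitions - start);
            int[] ids = new int[length];
            for (int j = 0; j < length; j++) {
                ids[j] = start + j;
            }
            batches.add(ids);
        }
        return batches;
    }

    public static void main(String[] args) {
        // 7 partitions in batches of 3: two full batches and one short one.
        for (int[] ids : plan(7, 3)) {
            System.out.println(Arrays.toString(ids));
        }
    }
}
```

Running main prints [0, 1, 2], then [3, 4, 5], then [6]. Each of these arrays is the kind of argument that out_2.collectPartitions(...) takes in the code above, so the driver only ever holds one batch of partitions at a time.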
PS: I could not find a collectPartitions method in the Python API, though. What now?!