Spark RDD (Resilient Distributed Dataset): returning a large result set
2022-07-06 16:37:00 【Ruo Miaoshen】
(One) An oversized result set causes an OOM error on the driver
A long time ago, in "Learn Big Data Platform from Scratch", I already mentioned this problem.
When a result set is returned through collect(), the driver reports a memory error if the amount of data is too large.
That is exactly what happened here: returning tens of millions of records ended in either Out of memory or GC overhead limit exceeded.
The Spark Core documentation describes rdd.collect() like this:
java.util.List<T> collect() // the Python version is similar
Return an array that contains all of the elements in this RDD.
Note: this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
(Two) Various solutions
What can we do when the result set is large?
I looked it up online, and the usual suggestions are:
- Write the large RDD to HDFS files, for example with rdd.saveAsTextFile or rdd.saveAsNewAPIHadoopFile. Each node writes its results into one HDFS directory (as a pile of files).
- Do not return the data to the driver; process it in place with rdd.foreach or rdd.foreachPartition, for example by printing it. In cluster mode that output goes to the executors' stdout, not to the driver's screen.
- Use serializable types and send the data to a database or a similar store.
(Three) Return the result set in batches
But I'm lazy, and I wanted the whole flow to stay interchangeable with the previous collect() version,
so I used collectPartitions to return the data a few partitions at a time.
The Java code is as follows:
...
int numPartitions = out_2.getNumPartitions();
if (ColBatch > 0 && ColBatch < numPartitions) {
    // A batch size is set and is smaller than the partition count: collect in batches.
    for (int i = 0; i < numPartitions; i += ColBatch) {
        // Partition ids for this batch; the last batch may be shorter,
        // so the array is sized per batch and never carries stale ids.
        int ParLen = Math.min(ColBatch, numPartitions - i);
        int[] Par = new int[ParLen];
        for (int j = 0; j < ParLen; j++) {
            Par[j] = i + j;
        }
        TmpLine = String.format("Current partitions: %d - %d\n", i, i + ParLen - 1);
        System.out.print(TmpLine);
        // Only the partitions listed in Par are pulled back to the driver.
        List<Tuple2<String, String>>[] output2 = out_2.collectPartitions(Par);
        for (List<Tuple2<String, String>> part : output2) {
            for (Tuple2<String, String> tuple : part) {
                oF002.write(String.format("%s|%s\n", tuple._1(), tuple._2()));
                g002++;
            }
        }
    }
} else {
    // No usable batch size: collect everything at once, as before.
    List<Tuple2<String, String>> output1 = out_2.collect();
    for (Tuple2<String, String> tuple : output1) {
        oF002.write(String.format("%s|%s\n", tuple._1(), tuple._2()));
        g002++;
    }
}
...
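The batching in the code above comes down to splitting the partition ids 0..n-1 into consecutive groups. Here is a minimal sketch of just that index arithmetic, with no Spark dependency so it can be run on its own. The class name PartitionBatches and the method batches are hypothetical names used for illustration; they are not part of the original program or of Spark.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Spark or of the original program.
final class PartitionBatches {
    private PartitionBatches() {}

    // Splits partition ids 0..numPartitions-1 into consecutive groups
    // holding at most batchSize ids each; the last group may be shorter.
    static List<int[]> batches(int numPartitions, int batchSize) {
        if (numPartitions < 0 || batchSize <= 0) {
            // A batch size of 0 would never advance the loop below.
            throw new IllegalArgumentException(
                "numPartitions must be >= 0 and batchSize must be > 0");
        }
        List<int[]> result = new ArrayList<>();
        for (int start = 0; start < numPartitions; start += batchSize) {
            int len = Math.min(batchSize, numPartitions - start);
            int[] ids = new int[len];
            for (int j = 0; j < len; j++) {
                ids[j] = start + j;
            }
            result.add(ids);
        }
        return result;
    }

    public static void main(String[] args) {
        // 10 partitions in batches of 4: two full groups and one short group.
        for (int[] ids : batches(10, 4)) {
            System.out.println(Arrays.toString(ids));
        }
    }
}
```

Each int[] produced this way has exactly the ids of one batch, so it could be handed to collectPartitions as is. Rejecting a non-positive batch size up front avoids a loop that never advances.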
PS: but I could not find a collectPartitions method in Python??? What now!!!