Spark's RDD (Resilient Distributed Dataset): returning a large result set
2022-07-06 16:37:00 【Ruo Miaoshen】
(1) An overly large result set causes an OOM error on the driver
This problem was already mentioned a long time ago in "Learn Big Data Platforms from Scratch".
When a result set is returned through collect(), the driver reports a memory error if the amount of data is too large.
That is where I ended up: returning tens of millions of records failed with either Out of memory or GC overhead limit exceeded.
The Spark Core documentation for the resilient distributed dataset describes rdd.collect() as follows:
java.util.List<T> collect() // the Python version is similar
Return an array that contains all of the elements in this RDD.
Note: this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
(2) Possible solutions
What can we do when the result set is large?
I searched online, and the common suggestions are:
- For large datasets, write the RDD to HDFS files with methods such as rdd.saveAsTextFile or rdd.saveAsNewAPIHadoopFile. Each node writes its results into an HDFS directory (as a set of files).
- Do not return the data to the driver. Instead, use rdd.foreach or rdd.foreachPartition to print it to the screen (on a cluster this output goes to each executor's stdout, not to the driver console).
- Send the data to a database or a similar store through a serializable type.
(3) Returning the result set in batches
But I'm lazy, and I wanted the whole flow to switch seamlessly with the existing one,
so I used collectPartitions to return the data a few partitions at a time.
The Java code is as follows:
```java
...
int numPartitions = out_2.getNumPartitions();
if (ColBatch > 0 && ColBatch < numPartitions) {
    // A batch size is set and is smaller than the partition count: collect in batches.
    // The ColBatch > 0 check avoids an endless loop when the batch size is 0.
    for (int i = 0; i < numPartitions; i += ColBatch) {
        // Partition ids for this batch; the last batch may be shorter,
        // so the array is sized exactly and holds no stale ids.
        int ParLen = Math.min(ColBatch, numPartitions - i);
        int[] Par = new int[ParLen];
        for (int j = 0; j < ParLen; j++) {
            Par[j] = i + j;
        }
        TmpLine = String.format("Current: %d - %d\n", i, i + ParLen - 1);
        System.out.print(TmpLine);
        // Only the partitions listed in Par are brought back to the driver.
        List<Tuple2<String, String>>[] output2 = out_2.collectPartitions(Par);
        for (int j = 0; j < ParLen; j++) {
            for (Tuple2<String, String> tuple : output2[j]) {
                oF002.write(String.format("%s|%s\n", tuple._1(), tuple._2()));
                g002++;
            }
        }
    }
} else {
    // No usable batch size: collect the whole result set at once.
    List<Tuple2<String, String>> output1 = out_2.collect();
    for (Tuple2<String, String> tuple : output1) {
        oF002.write(String.format("%s|%s\n", tuple._1(), tuple._2()));
        g002++;
    }
}
...
```
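The partition grouping used above can be separated from Spark and checked on its own. The following is only a minimal sketch under that assumption: the class `PartitionBatches` and the method `batches` are hypothetical names, not part of Spark, and each returned array is what would be passed to `collectPartitions` in one round.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Spark): splits the partition ids
// 0 .. numPartitions-1 into consecutive groups of at most batchSize ids.
final class PartitionBatches {

    static List<int[]> batches(int numPartitions, int batchSize) {
        if (numPartitions < 0) {
            throw new IllegalArgumentException("numPartitions must not be negative");
        }
        if (batchSize <= 0) {
            // A non-positive step would never advance the loop below.
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> result = new ArrayList<>();
        for (int start = 0; start < numPartitions; start += batchSize) {
            // The last group may contain fewer than batchSize ids.
            int len = Math.min(batchSize, numPartitions - start);
            int[] ids = new int[len];
            for (int j = 0; j < len; j++) {
                ids[j] = start + j;
            }
            result.add(ids);
        }
        return result;
    }

    public static void main(String[] args) {
        // Print the groups for 5 partitions with a batch size of 2.
        for (int[] ids : batches(5, 2)) {
            System.out.println(Arrays.toString(ids));
        }
    }
}
```

With 5 partitions and a batch size of 2, this yields the groups [0, 1], [2, 3] and [4]. A smaller batch size lowers the peak driver memory at the cost of more round trips.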
PS: I could not find a collectPartitions method in the Python API. What can be done about that?!
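One option that may help here, though it was not tested in this post: both the Java API (`toLocalIterator()` on a Java RDD) and PySpark (`RDD.toLocalIterator()`) return an iterator that brings the partitions to the driver one at a time, so the driver needs roughly as much memory as the largest partition. The sketch below shows only the consuming side in plain Java, with an in-memory list standing in for the iterator Spark would return. `RowWriter` and `writeAll` are hypothetical names, and `Map.Entry` stands in for `Tuple2`.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical helper: streams rows from an iterator to a writer
// without first materializing them in a list.
final class RowWriter {

    // Writes each (key, value) pair as "key|value\n" and returns the row count.
    static long writeAll(Iterator<? extends Map.Entry<String, String>> rows, Writer out)
            throws IOException {
        long count = 0;
        while (rows.hasNext()) {
            Map.Entry<String, String> row = rows.next();
            out.write(row.getKey() + "|" + row.getValue() + "\n");
            count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; with Spark the iterator would come from toLocalIterator().
        List<Map.Entry<String, String>> sample = new ArrayList<>();
        sample.add(new SimpleEntry<String, String>("a", "1"));
        sample.add(new SimpleEntry<String, String>("b", "2"));

        StringWriter out = new StringWriter();
        long written = writeAll(sample.iterator(), out);
        System.out.print(out);
        System.out.println(written + " rows written");
    }
}
```

Because the loop only ever holds one row at a time, the memory pressure on the driver is determined by how Spark fills the iterator, not by the total size of the result set.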