Spark's RDD (Resilient Distributed Dataset): returning a large result set
2022-07-06 16:37:00 【Ruo Miaoshen】
(1) An oversized result set causes an OOM error on the driver
I mentioned this problem a long time ago in 《Learn big data platform from scratch》.
If you return a result set through collect() and the amount of data is too large, the driver reports an out-of-memory error.
Sure enough, when I returned tens of millions of records, the result was Out of memory or GC overhead limit exceeded.
The Spark Core RDD documentation describes rdd.collect() like this:
java.util.List<T> collect() // the Python version is similar
Return an array that contains all of the elements in this RDD.
Note: this method should only be used if the resulting array is expected to be small, because all the data is loaded into the driver's memory.
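A back-of-the-envelope estimate gives a feel for why this fails. The sketch below is only illustrative: the record count and the per-record heap footprint are assumed numbers, not measurements from the job described here, and DriverHeapEstimate is a hypothetical helper, not part of Spark.

```java
// Rough estimate of the driver heap needed to hold a collect() result.
// Both inputs in main() are assumptions chosen for illustration only.
class DriverHeapEstimate {
    /** Estimated size in bytes: recordCount * bytesPerRecord (throws on overflow). */
    static long estimateBytes(long recordCount, long bytesPerRecord) {
        return Math.multiplyExact(recordCount, bytesPerRecord);
    }

    /** Converts a byte count to GiB (1 GiB = 2^30 bytes). */
    static double toGiB(long bytes) {
        return bytes / (1024.0 * 1024.0 * 1024.0);
    }

    public static void main(String[] args) {
        long records = 20_000_000L;  // assumed: "tens of millions" of rows
        long bytesPerRecord = 200L;  // assumed average on-heap size of one Tuple2<String, String>
        long total = estimateBytes(records, bytesPerRecord);
        System.out.printf("~%.2f GiB on the driver heap%n", toGiB(total));
    }
}
```

Under these assumptions the result set alone needs roughly 3.7 GiB, well above the 1g default of spark.driver.memory, so the driver fails unless its heap is raised or the data arrives in smaller pieces.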
(2) Various solutions
What can we do when the result set is large?
I looked it up online, and the usual suggestions are:
- For large data sets, write the RDD to HDFS files with methods such as rdd.saveAsTextFile or rdd.saveAsNewAPIHadoopFile. Each node writes its results into an HDFS directory (as a pile of files).
- Instead of sending the data to the driver, use rdd.foreach or rdd.foreachPartition to print it to the screen (the output then appears on the executors, not on the driver).
- Send the data to a database or a similar store, using serializable types.
(3) Return the result set in batches
But I'm lazy, and I wanted the whole process to switch over seamlessly from the previous collect()-based one,
so I used collectPartitions to return the data a few partitions at a time.
The Java code is as follows:
    ...
    if (ColBatch > 0 && ColBatch < out_2.getNumPartitions()) {
        // A batch size is set and is smaller than the partition count: collect in batches.
        int numPartitions = out_2.getNumPartitions();
        for (int i = 0; i < numPartitions; i += ColBatch) {
            // The last batch may hold fewer than ColBatch partitions,
            // so size the id array to what remains.
            int ParLen = Math.min(ColBatch, numPartitions - i);
            int[] Par = new int[ParLen];
            for (int j = 0; j < ParLen; j++) {
                Par[j] = i + j;
            }
            TmpLine = String.format("Current partitions: %d - %d\n", i, i + ParLen - 1);
            System.out.print(TmpLine);
            // Only the partitions listed in Par are pulled to the driver.
            List<Tuple2<String, String>>[] output2 = out_2.collectPartitions(Par);
            for (int j = 0; j < ParLen; j++) {
                for (Tuple2<String, String> tuple : output2[j]) {
                    oF002.write(String.format("%s|%s\n", tuple._1(), tuple._2()));
                    g002++;
                }
            }
        }
    } else {
        // No usable batch size: fall back to a single collect().
        List<Tuple2<String, String>> output1 = out_2.collect();
        for (Tuple2<String, String> tuple : output1) {
            oF002.write(String.format("%s|%s\n", tuple._1(), tuple._2()));
            g002++;
        }
    }
    ...
PS: I could not find a collectPartitions method in the Python API. What to do?!
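The index bookkeeping in the Java code above can be separated from Spark and checked on its own. The following is a minimal sketch, not the author's implementation: PartitionBatches and plan are hypothetical names, and the only assumption is that partition ids run from 0 to numPartitions - 1, as they do in the loop above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: splits partition ids 0..numPartitions-1 into
// consecutive batches of at most batchSize ids each.
class PartitionBatches {
    static List<int[]> plan(int numPartitions, int batchSize) {
        if (numPartitions < 0) {
            throw new IllegalArgumentException("numPartitions must not be negative");
        }
        if (batchSize <= 0) {
            // A non-positive step would never advance the loop below.
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> batches = new ArrayList<>();
        for (int start = 0; start < numPartitions; start += batchSize) {
            // The final batch holds only the ids that remain.
            int len = Math.min(batchSize, numPartitions - start);
            int[] ids = new int[len];
            for (int j = 0; j < len; j++) {
                ids[j] = start + j;
            }
            batches.add(ids);
        }
        return batches;
    }

    public static void main(String[] args) {
        // 10 partitions in batches of 4: three batches, the last with 2 ids.
        for (int[] ids : plan(10, 4)) {
            System.out.println(Arrays.toString(ids));
        }
    }
}
```

Each int[] produced this way could be handed to collectPartitions in turn. The driver then holds at most batchSize partitions at once, so the approach only helps if individual partitions are small enough to fit in driver memory.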