Spark RDD (Resilient Distributed Dataset): returning a large result set
2022-07-06 16:37:00 【Ruo Miaoshen】
(1) A result set that is too large causes an OOM error on the driver
I mentioned this problem a long time ago in 《Learn Big Data Platform from Scratch》.
When a result set is returned through collect(), the driver reports a memory error if the amount of data is too large.
That is what happened to me here: returning tens of millions of records ended in "Out of memory" or "GC overhead limit exceeded".
The Spark Core documentation describes rdd.collect() on an RDD like this:
java.util.List collect() // the Python version is similar
Return an array that contains all of the elements in this RDD.
Note: this method should only be used if the resulting array is expected to be small, as all the data is loaded into the driver's memory.
(2) Various solutions
So what can we do when the result set is large?
I searched online, and the usual suggestions are:
- Write the large RDD to files on HDFS, for example with rdd.saveAsTextFile or rdd.saveAsNewAPIHadoopFile. Each node writes its results into an HDFS directory (as a pile of files).
- Do not return the data to the driver; print it with rdd.foreach or rdd.foreachPartition instead (on a cluster the output goes to each executor's stdout, not to the driver console).
- Send the data to a database or similar, using serializable types.
(3) Return the result set in batches
But I'm lazy, and I wanted the whole process to remain a drop-in replacement for what I had before, so I used collectPartitions to return the data one part at a time.
The Java code is as follows:
...
if (ColBatch > 0 && ColBatch < out_2.getNumPartitions()) {
    // A batch size is set and is smaller than the partition count: collect in batches.
    int numPartitions = out_2.getNumPartitions();
    for (int i = 0; i < numPartitions; i += ColBatch) {
        // Partition ids for this batch; the last batch may be shorter.
        int ParLen = Math.min(ColBatch, numPartitions - i);
        int[] Par = new int[ParLen];
        for (int j = 0; j < ParLen; j++) {
            Par[j] = i + j;
        }
        TmpLine = String.format("Current partitions: %d - %d\n", i, i + ParLen - 1);
        System.out.print(TmpLine);
        // Only the partitions listed in Par are brought back to the driver.
        List<Tuple2<String, String>>[] output2 = out_2.collectPartitions(Par);
        for (int j = 0; j < ParLen; j++) {
            for (Tuple2<String, String> tuple : output2[j]) {
                oF002.write(String.format("%s|%s\n", tuple._1(), tuple._2()));
                g002++;
            }
        }
    }
} else {
    // No usable batch size: fall back to a single collect().
    List<Tuple2<String, String>> output1 = out_2.collect();
    for (Tuple2<String, String> tuple : output1) {
        oF002.write(String.format("%s|%s\n", tuple._1(), tuple._2()));
        g002++;
    }
}
...
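The batching arithmetic in the code above can also be tried on its own, without a Spark cluster. The following is only a minimal sketch: the class name PartitionBatches and the method split are hypothetical and not part of Spark or of the original program. It shows how partition ids 0..n-1 can be cut into groups of at most batchSize ids, with a shorter final group.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Spark API): groups partition ids into batches.
public class PartitionBatches {

    // Splits ids 0..numPartitions-1 into consecutive groups of at most batchSize ids.
    public static List<int[]> split(int numPartitions, int batchSize) {
        if (batchSize <= 0) {
            // A zero or negative batch size would never advance the loop below.
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<int[]> batches = new ArrayList<>();
        for (int start = 0; start < numPartitions; start += batchSize) {
            // The last batch only holds the ids that are left.
            int len = Math.min(batchSize, numPartitions - start);
            int[] ids = new int[len];
            for (int j = 0; j < len; j++) {
                ids[j] = start + j;
            }
            batches.add(ids);
        }
        return batches;
    }

    public static void main(String[] args) {
        // 10 partitions in batches of 4 prints:
        // [0, 1, 2, 3]
        // [4, 5, 6, 7]
        // [8, 9]
        for (int[] ids : split(10, 4)) {
            System.out.println(Arrays.toString(ids));
        }
    }
}
```

Each int[] produced this way has the shape that collectPartitions is given in the code above, so the driver only ever holds one batch of partitions at a time.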
PS: I could not find a collectPartitions method in the Python API. What to do?!