Why RDD/Dataset is needed in Spark
2022-07-06 21:43:00 【Big data Xiaochen】
Before RDD and Dataset existed, word count or other big data computations were done in one of these ways:
Native Python collections: for example Python's list, set, and dict. These only work on a single machine and do not support distributed execution. Doing distributed computing with them takes a lot of extra work, such as thread/process communication, fault tolerance, and automatic load balancing, which is tedious. That is why frameworks were born.
MapReduce: it runs inefficiently, and development with it is slow as well.
Then Spark and Flink were born. Borrowing from the design of Scala's native collections, they abstracted new data types: RDD/Dataset.
An RDD is essentially a distributed collection that supports functional operations. It is as simple to use as a local collection, so development is fast. Under the hood it is based on distributed in-memory computation, and can be up to 100 times faster than MapReduce (MR).
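To make the "as simple as a local collection" point concrete, here is a minimal single-machine word count written with plain Python collections in a functional style. It is only a sketch: the sample `lines` and the helper `add_pair` are made up for illustration, and nothing here is distributed. The three steps roughly mirror the `flatMap`, `map`, and `reduceByKey` operations an RDD offers.

```python
from functools import reduce

# Hypothetical sample input; a real job would read from files.
lines = ["spark makes rdd", "rdd is a distributed collection", "spark is fast"]

# flatMap-like step: split every line into words.
words = [word for line in lines for word in line.split()]

# map-like step: pair each word with a count of 1.
pairs = [(word, 1) for word in words]

# reduceByKey-like step: sum the counts for each word.
def add_pair(acc, pair):
    word, n = pair
    acc[word] = acc.get(word, 0) + n
    return acc

word_counts = reduce(add_pair, pairs, {})
print(word_counts)
```

Everything above lives in one process on one machine. The communication, fault tolerance, and load balancing that a distributed version would need are exactly what the framework takes over.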
What is an RDD
Resilient Distributed Dataset
Resilient: both memory and CPU can be scaled out. Intermediate data is kept in memory and can spill to disk if there is not enough memory.
Distributed: both storage and computation are spread across multiple nodes.
Dataset: a large abstract container that is as simple to use as a Python collection and supports functional programming.
Core design points
Immutable: the elements inside the collection cannot be changed, but the collection can be transformed into a new collection.
Partitioned: an RDD is divided into multiple partitions.
Parallel computation: each partition is handled by one task, and the tasks run in parallel.
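The three design points can be sketched in plain Python. This is a toy under stated assumptions, not Spark's implementation: the function names `split_into_partitions` and `map_partitions` are invented for this example, and a local thread pool stands in for tasks that Spark would schedule across cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_partitions(data, num_partitions):
    """Slice a sequence into num_partitions roughly equal, immutable chunks."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        end = start + size + (1 if i < extra else 0)
        partitions.append(tuple(data[start:end]))
        start = end
    return partitions

def map_partitions(partitions, func):
    """Run one task per partition in parallel and return NEW partitions."""
    def task(partition):
        return tuple(func(x) for x in partition)
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(task, partitions))

numbers = tuple(range(10))                        # immutable source data
parts = split_into_partitions(numbers, 3)         # partitioned
doubled = map_partitions(parts, lambda x: x * 2)  # one task per partition

print(parts)
print(doubled)
```

The original `numbers` and `parts` are never modified; the transformation produces a new set of partitions, which is the same contract an RDD transformation follows.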
The 5 main properties of an RDD, as recorded in the source code
A list of partitions: all the data is divided into a reasonable number of partitions.
A compute function: each partition has a function that computes it.
A list of dependencies: when an RDD is transformed into a new RDD, the dependency is recorded as well.
(Optional) A partitioner: when the RDD elements are key-value pairs, you can specify a partitioner that determines how records are grouped into partitions by key. The default is the hash partitioner.
(Optional) Preferred locations: the best location for the computation is recorded (moving code is cheaper than moving data), for example the location of an HDFS block.
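As an illustration of the partitioner idea, the sketch below assigns key-value pairs to partitions by hashing the key and taking the remainder modulo the partition count. The names `hash_partition` and `partition_by_key` are hypothetical, and `zlib.crc32` is an assumption chosen only because it gives the same result on every run; it is not the hash function Spark uses.

```python
import zlib

def hash_partition(key, num_partitions):
    """Map a key to a partition index in the range [0, num_partitions)."""
    # crc32 is a stand-in hash: stable across runs, unlike Python's str hash.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions

def partition_by_key(kv_pairs, num_partitions):
    """Group key-value pairs so that equal keys share one partition."""
    partitions = [[] for _ in range(num_partitions)]
    for key, value in kv_pairs:
        partitions[hash_partition(key, num_partitions)].append((key, value))
    return partitions

kv_pairs = [("spark", 1), ("rdd", 1), ("spark", 1), ("flink", 1), ("rdd", 1)]
buckets = partition_by_key(kv_pairs, 2)
print(buckets)
```

Because the partition index depends only on the key, all records with the same key end up together, which is what a per-key aggregation needs.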
Taken together, the 5 properties of an RDD answer these questions:
Where is the data? Where should it be computed? What are the partitions? Which partitioner is used? What function is used for the computation?
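One way to picture how the five properties fit together is a small data class with one field per property. This is a toy model only: `ToyRDD`, its field names, and the host names `node-a` and `node-b` are all hypothetical and do not correspond to Spark's actual classes.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass(frozen=True)
class ToyRDD:
    partitions: List[tuple]                           # 1. list of partitions
    compute: Callable[[tuple], tuple]                 # 2. function run on each partition
    dependencies: List["ToyRDD"] = field(default_factory=list)  # 3. parent RDDs
    partitioner: Optional[Callable] = None            # 4. optional, for key-value data
    preferred_locations: Optional[List[str]] = None   # 5. optional, one host per partition

    def map(self, func):
        """Return a new ToyRDD and record this one as its dependency."""
        def compute_child(partition):
            return tuple(func(x) for x in self.compute(partition))
        # A map may change the keys, so the partitioner is not carried over.
        return ToyRDD(self.partitions, compute_child, [self],
                      None, self.preferred_locations)

    def collect(self):
        """Compute every partition and gather the results into one list."""
        return [x for p in self.partitions for x in self.compute(p)]

base = ToyRDD(partitions=[(1, 2), (3, 4)], compute=lambda p: p,
              preferred_locations=["node-a", "node-b"])
squared = base.map(lambda x: x * x)
print(squared.collect())
```

Note that `map` does no work by itself. It only builds a new object that remembers its parent and its function, and the computation happens when `collect` is called.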