Why RDD/Dataset is needed in Spark
2022-07-06 21:43:00 【Big data Xiaochen】
Before RDD and Dataset existed, word count and other big-data computations were done in one of two ways:
Native Python collections: for example Python's list, set, and dict (map). These only work on a single machine and are not distributed. To compute in a distributed way you have to do a lot of extra work yourself, such as thread/process communication, fault tolerance, and automatic load balancing, which is tedious. This is why frameworks were born.
MapReduce: slow to run, and slow to develop with.
Then Spark and Flink appeared. Borrowing the design of native Scala collections, they abstracted new data types: RDD/Dataset.
An RDD is essentially a distributed collection that supports functional operations. It is as simple to use as a local collection, so development is fast. Underneath, it is based on distributed in-memory computing, and it is commonly cited as up to 100 times faster than MapReduce (MR).
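To make the "local collection" baseline concrete, here is a minimal single-machine word count in plain Python, written in the functional style that RDD operations borrow. The input lines and the helper name `add_pair` are illustrative assumptions, and the comments only point at the analogous RDD-style steps; nothing here calls Spark.

```python
from collections import Counter
from functools import reduce
from itertools import chain

# Illustrative input; on a real cluster this would be a large distributed file.
lines = ["hello spark", "hello flink", "hello rdd"]

# flatMap-like step: split every line into words and flatten the result.
words = chain.from_iterable(line.split() for line in lines)

# map-like step: pair each word with a count of 1.
pairs = ((word, 1) for word in words)


def add_pair(acc, pair):
    """reduceByKey-like step: add one (word, count) pair into the running totals."""
    word, n = pair
    acc[word] += n
    return acc


counts = reduce(add_pair, pairs, Counter())
print(dict(counts))
```

Everything above runs in one process on one machine. The point of an RDD is to keep this programming style while the framework takes over distribution, fault tolerance, and load balancing.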
What is an RDD
Resilient (elastic) Distributed Dataset
Resilient (elastic): 【memory】 and 【CPU】 can be scaled out. Intermediate data is kept in 【memory】; if there is not enough memory, it can spill to 【disk】.
Distributed: both 【storage】 and 【computation】 are spread across multiple nodes.
Dataset: a large abstract container that is as simple to use as a Python collection and supports 【functional】 programming.
Core design points
Immutable: the elements inside the collection cannot be changed, but the collection can be transformed into a new collection.
Partitioned: an RDD is divided into multiple partitions.
Parallel computing: each partition is processed by one task, and the tasks run in parallel.
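The three design points can be sketched with a toy model in plain Python. This is not Spark code: the function names `split_into_partitions` and `map_partitions` are made up for illustration, tuples stand in for immutable partitions, and a thread pool stands in for the tasks that a cluster would run on separate executors.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a sequence into at most num_partitions contiguous, immutable chunks."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division
    return tuple(tuple(data[i:i + size]) for i in range(0, len(data), size))


def map_partitions(partitions, func):
    """Apply func to every element, one task per partition, returning NEW partitions."""
    def run_task(part):
        return tuple(func(x) for x in part)

    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        # pool.map keeps the results in the same order as the input partitions.
        return tuple(pool.map(run_task, partitions))


numbers = split_into_partitions(list(range(10)), 3)
squares = map_partitions(numbers, lambda x: x * x)

print(numbers)  # the original partitions are left unchanged
print(squares)  # the transformation produced a new collection
```

Because the partitions are tuples, a transformation cannot modify them in place; it can only build a new set of partitions, which mirrors how transforming an RDD yields a new RDD. Threads in CPython do not give real CPU parallelism, so this only models the scheduling shape, not the speedup.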
The five main properties of an RDD, as documented in the source code
A list of partitions: 【all the data】 is split into a reasonable number of partitions.
A compute function: each partition has a 【function】 that computes it.
A list of dependencies: when an RDD is transformed into a new RDD, the dependency on its parent is recorded.
【Optional】 A partitioner: when the RDD's elements are 【key-value pairs】, you can specify a partitioner, which determines how records are grouped by key into different partitions. The default is the hash partitioner.
【Optional】 Preferred locations: the best location to compute each partition is recorded (moving code is cheaper than moving data), for example the location of an HDFS block.
The five properties of an RDD answer these questions:
Where is the data? Where should it be computed? What are the partitions? Which partitioner is used? What function is used to compute each partition?
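As a rough illustration, the five properties listed above can be modeled as the fields of a small Python class. `ToyRDD`, `hash_partitioner`, the sample `blocks`, and the host names are all hypothetical; this is a sketch of the idea, not Spark's actual classes or API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class ToyRDD:
    num_partitions: int                                        # 1. the partitions
    compute: Callable[[int], list]                             # 2. function computing one partition
    dependencies: Tuple["ToyRDD", ...] = ()                    # 3. parent RDDs (lineage)
    partitioner: Optional[Callable[[object], int]] = None      # 4. optional: key -> partition id
    preferred_locations: Optional[Callable[[int], list]] = None  # 5. optional: partition id -> hosts

    def map(self, func):
        """Return a NEW ToyRDD that records this one as its parent."""
        parent = self
        return ToyRDD(
            num_partitions=parent.num_partitions,
            compute=lambda pid: [func(x) for x in parent.compute(pid)],
            dependencies=(parent,),
            preferred_locations=parent.preferred_locations,
        )

    def collect(self):
        """Compute every partition in order and gather the results."""
        return [x for pid in range(self.num_partitions) for x in self.compute(pid)]


def hash_partitioner(num_partitions):
    """Build a function that assigns a key to a partition by its hash."""
    return lambda key: hash(key) % num_partitions


# Hypothetical data: two blocks stored on two made-up hosts.
blocks = [[1, 2, 3], [4, 5]]
hosts = [["node-a"], ["node-b"]]

base = ToyRDD(
    num_partitions=2,
    compute=lambda pid: blocks[pid],
    preferred_locations=lambda pid: hosts[pid],
)
doubled = base.map(lambda x: x * 2)

print(doubled.collect())               # data computed partition by partition
print(doubled.dependencies[0] is base)  # the lineage points back to the parent
print(hash_partitioner(2)(7))          # partition chosen for the integer key 7
```

The example uses an integer key for the partitioner because Python randomizes string hashes per process, so string keys would land in different partitions from run to run.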