Why RDD/Dataset is needed in Spark
2022-07-06 21:43:00 【Big data Xiaochen】
Before RDD and Dataset existed, there were two ways to do a word count or other big data computation:
Native Python collections: types such as Python's list, set, and dict work only on a single machine and have no distributed support. To compute across machines you have to do a lot of extra work yourself, such as thread/process communication, fault tolerance, and automatic load balancing, which is tedious. Frameworks were created to take over that work.
MapReduce: it runs inefficiently, and development with it is slow as well.
Then came Spark and Flink, which borrowed the design of native Scala collections and abstracted out new data types: RDD and Dataset.
An RDD is essentially a distributed collection that supports functional operations, so it is as easy to use as a local collection and development is fast. Underneath, it is based on distributed in-memory computing, which can be up to 100 times faster than MR (MapReduce).
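
To make the "functional operations on a collection" idea concrete, here is a minimal single-machine sketch of a word count that uses only native Python collections. The sample lines and the helper name `word_count` are made up for illustration. This is not Spark code, although the RDD API offers similarly named operations such as `flatMap` and `reduceByKey`.

```python
from collections import Counter
from itertools import chain


def word_count(lines):
    """Count words in an iterable of text lines on a single machine."""
    # "flatMap"-like step: split every line into words and flatten the result.
    words = chain.from_iterable(line.split() for line in lines)
    # "reduceByKey"-like step: Counter sums the occurrences of each word.
    return dict(Counter(words))


# Hypothetical sample input, used only for this illustration.
sample_lines = ["spark flink spark", "hadoop spark"]
counts = word_count(sample_lines)
print(counts)
```

Everything here runs in one process. What a framework adds is the ability to run the same kind of pipeline over data spread across many machines.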

What is an RDD
Resilient (elastic) Distributed Dataset
Resilient (elastic): both 【memory】 and 【CPU】 can be scaled out. Intermediate data is kept in 【memory】; if there is not enough memory, it can spill to 【disk】.
Distributed: both 【storage】 and 【computation】 are spread across multiple nodes.
Dataset: a large abstract container that is as simple to use as a Python collection and supports 【functional】 programming.

Core design points
Immutable: the elements inside the collection cannot be changed, but the collection can be transformed into a new collection.
Partitioned: an RDD is divided into several partitions.
Parallel computing: each partition is handled by one task, and the tasks run in parallel.
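
The three design points above can be sketched on a single machine as follows. This is a toy illustration, not how Spark is implemented: threads stand in for tasks running on cluster nodes, tuples stand in for immutable partitions, and the helper names `split_into_partitions` and `map_partitions` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Split a sequence into num_partitions roughly equal slices."""
    size, extra = divmod(len(data), num_partitions)
    partitions, start = [], 0
    for i in range(num_partitions):
        # The first `extra` partitions receive one additional element.
        end = start + size + (1 if i < extra else 0)
        partitions.append(tuple(data[start:end]))  # tuples are immutable
        start = end
    return partitions


def map_partitions(partitions, func):
    """Apply func to every element, one task per partition, in parallel."""
    def task(partition):
        # A new tuple is built; the input partition is never modified.
        return tuple(func(x) for x in partition)

    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        # pool.map returns results in the same order as the partitions.
        return list(pool.map(task, partitions))


numbers = tuple(range(1, 8))
parts = split_into_partitions(numbers, 3)
doubled = map_partitions(parts, lambda x: x * 2)
print(parts)
print(doubled)
```

The original `parts` are left untouched after the transformation; `doubled` is a new collection, which mirrors the "cannot be changed, but can be transformed" point.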
The 5 main properties of an RDD, as recorded in the source code

A list of partitions: 【all the data】 is split into a reasonable number of partitions.
A compute function: each partition is computed by a 【function】.
A list of dependencies: when an RDD is transformed into a new RDD, the dependency on its parent is recorded.
【Optional】 A partitioner: when the RDD's elements are 【key-value pairs】, you can specify a partitioner that determines how records are grouped into partitions by key. The default is the hash partitioner.
【Optional】 Preferred locations: the best place to run the computation is recorded (moving code is cheaper than moving data), for example the locations of HDFS blocks.
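
The default hash-based partitioning of key-value pairs mentioned above can be sketched roughly like this. It is an illustration under stated assumptions: `zlib.crc32` is used only because it gives the same result on every run, and it is not the hash function Spark itself uses. The helper names and sample pairs are hypothetical.

```python
import zlib


def hash_partition(key, num_partitions):
    """Return the partition index for a key using a stable hash."""
    # Assumption of this sketch: crc32 stands in for the real hash function.
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def partition_by_key(pairs, num_partitions):
    """Group (key, value) pairs into num_partitions buckets by key hash."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash_partition(key, num_partitions)].append((key, value))
    return buckets


# Hypothetical key-value pairs, e.g. the (word, 1) pairs of a word count.
pairs = [("spark", 1), ("flink", 1), ("spark", 1), ("hadoop", 1)]
buckets = partition_by_key(pairs, 2)
print(buckets)
```

The property that matters is that every pair with the same key lands in the same partition, so a per-key aggregation can then run inside each partition independently.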

The 5 properties of an RDD answer these questions:
Where is the data? Where should it be computed? What are the partitions? Which partitioner is used? What function is used for the computation?
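
As a rough mental model, the five properties can be pictured as five fields on an object. The `ToyRDD` class below is a hypothetical single-machine sketch, not Spark's actual class layout; the field names and the node names `node-a` and `node-b` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass(frozen=True)
class ToyRDD:
    partitions: List[tuple]                      # 1. list of partitions
    compute: Callable[[tuple], tuple]            # 2. function run on each partition
    dependencies: List["ToyRDD"] = field(default_factory=list)  # 3. parent RDDs
    partitioner: Optional[Callable] = None       # 4. optional, for key-value data
    preferred_locations: Optional[List[str]] = None  # 5. optional, e.g. block hosts

    def map(self, func):
        """Return a new ToyRDD; the current one is left unchanged."""
        parent_compute = self.compute
        return ToyRDD(
            partitions=self.partitions,
            compute=lambda part: tuple(func(x) for x in parent_compute(part)),
            dependencies=[self],          # record where the new RDD came from
            partitioner=None,             # map may change keys, so drop it
            preferred_locations=self.preferred_locations,
        )

    def collect(self):
        """Run the compute function on every partition and flatten the results."""
        return [x for part in self.partitions for x in self.compute(part)]


base = ToyRDD(
    partitions=[(1, 2), (3, 4)],
    compute=lambda part: part,
    preferred_locations=["node-a", "node-b"],
)
squared = base.map(lambda x: x * x)
print(squared.collect())
```

Calling `map` does no work by itself. It only builds a new object that records its parent and a composed function, and the computation happens when `collect` is called.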