Why RDD/Dataset is needed in Spark
2022-07-06 21:43:00 【Big data Xiaochen】
Before RDD and Dataset existed, word count and other big data computations could be done in the following ways:
Native Python collections: for example Python's list, set, and map (dict). These only work on a single machine and do not support distributed execution. To do distributed computing with them, you would have to do a lot of extra work yourself, such as thread/process communication, fault tolerance, and automatic load balancing, which is tedious. That is why frameworks were created.
MapReduce: it runs inefficiently, and development efficiency is also low.
Then Spark and Flink appeared. Borrowing from the design of native Scala collections, they abstracted a new data type: RDD/Dataset.
An RDD is essentially a distributed collection that supports functional operations. It is as simple to use as a local collection, so development is fast. Underneath, it relies on distributed in-memory computing, which can be up to 100 times faster than MapReduce (MR).
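To make the comparison concrete, here is a minimal single-machine sketch of word count written with native Python collections, in the functional style that RDDs borrow. The function name `word_count` and the sample lines are made up for illustration. On a cluster, the same shape of pipeline would be expressed with RDD operations instead, and the framework would take care of distribution and fault tolerance.

```python
from collections import Counter
from itertools import chain


def word_count(lines):
    """Count words in an iterable of strings on a single machine."""
    # Flatten step: split every line into words and chain them together.
    words = chain.from_iterable(line.split() for line in lines)
    # Aggregate step: Counter sums the occurrences of each word.
    return dict(Counter(words))


# Hypothetical sample input, for illustration only.
sample_lines = ["spark flink spark", "hadoop spark"]
counts = word_count(sample_lines)
print(counts)
```

This works only as long as all the lines fit on one machine, which is exactly the limitation described above.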
What is an RDD
Resilient Distributed Dataset
Resilient (elastic): 【memory】 and 【CPU】 can be scaled out. Intermediate data is kept in 【memory】; if there is not enough memory, it can spill to 【disk】.
Distributed: both 【storage】 and 【computation】 are spread across multiple nodes.
Dataset: a large abstract container that is as simple to use as a Python collection and supports 【functional】 programming.
Core design points
Immutable: the elements inside the collection cannot be changed, but the collection can be transformed into a new collection.
Partitioned: an RDD is split into multiple partitions.
Parallel computing: each partition is processed by one task, and the tasks run in parallel.
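The three design points can be sketched with a toy, single-process class. This is only an illustration, not how Spark implements RDDs: the class name `ToyRDD`, its methods, and the use of a thread pool to stand in for cluster tasks are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


class ToyRDD:
    """A toy stand-in for an RDD (illustrative only, not Spark's code)."""

    def __init__(self, partitions):
        # Store tuples of tuples so the elements cannot be reassigned.
        self._partitions = tuple(tuple(p) for p in partitions)

    @classmethod
    def from_list(cls, data, num_partitions):
        # Partitioned: cut the data into contiguous slices.
        size = -(-len(data) // num_partitions)  # ceiling division
        return cls(data[i * size:(i + 1) * size] for i in range(num_partitions))

    def num_partitions(self):
        return len(self._partitions)

    def map(self, func):
        # Immutable: build and return a new ToyRDD; this one is left untouched.
        def run_task(partition):
            return tuple(func(x) for x in partition)

        # Parallel: one task per partition, executed on a thread pool.
        workers = max(1, len(self._partitions))
        with ThreadPoolExecutor(max_workers=workers) as pool:
            return ToyRDD(pool.map(run_task, self._partitions))

    def collect(self):
        # Gather the elements of all partitions back into one local list.
        return [x for p in self._partitions for x in p]


numbers = ToyRDD.from_list([1, 2, 3, 4, 5, 6], num_partitions=3)
squares = numbers.map(lambda x: x * x)
print(numbers.collect(), squares.collect())
```

Calling `map` leaves `numbers` unchanged and produces `squares` as a separate collection, which is the sense in which the original collection is immutable yet can be converted into a new one.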
The 5 main properties of an RDD, as documented in the source code
A list of partitions: 【all the data】 is divided into a reasonable number of partitions.
A compute function: each partition has a 【function】 that computes it.
A list of dependencies: when an RDD is transformed into a new RDD, the dependency is recorded as well.
【Optional】 A partitioner: when the RDD's elements are 【key-value pairs】, you can specify a partitioner that determines how records are distributed into partitions by key. The default is the hash partitioner.
【Optional】 Preferred locations: the best location for computing each partition is recorded (moving code is cheaper than moving data), for example the locations of HDFS blocks.
The 5 properties of an RDD actually answer these questions:
Where is the data? Where should it be computed? What are the partitions? Which partitioner is used? Which function is used for the computation?
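As a rough sketch, the five properties can be written down as fields of a small Python record, together with a hash-based partitioner for key-value pairs. All names here (`RDDDescription`, `hash_partition`, `partition_by_key`, and the host names) are hypothetical and chosen for illustration; they are not Spark's classes or API.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


def hash_partition(key, num_partitions):
    """Pick a partition index for a key, in the spirit of a hash partitioner."""
    # In Python, % with a positive divisor never returns a negative index.
    return hash(key) % num_partitions


def partition_by_key(pairs, num_partitions):
    """Distribute (key, value) pairs into buckets; equal keys share a bucket."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in pairs:
        buckets[hash_partition(key, num_partitions)].append((key, value))
    return buckets


@dataclass(frozen=True)
class RDDDescription:
    """Toy record of the five properties; field names are illustrative."""
    partitions: tuple                        # 1. what are the partitions?
    compute: Callable                        # 2. which function computes a partition?
    dependencies: tuple = ()                 # 3. which parent RDDs does it derive from?
    partitioner: Optional[Callable] = None   # 4. optional: how are keys assigned?
    preferred_locations: dict = field(default_factory=dict)  # 5. optional: where to compute?


# Integer keys keep the example deterministic (small ints hash to themselves).
pairs = [(1, "a"), (2, "b"), (4, "c"), (1, "d")]
buckets = partition_by_key(pairs, 3)

rdd = RDDDescription(
    partitions=tuple(tuple(b) for b in buckets),
    compute=lambda part: [value.upper() for _, value in part],
    partitioner=lambda key: hash_partition(key, 3),
    # Hypothetical host names standing in for HDFS block locations.
    preferred_locations={0: ["node-a"], 1: ["node-b"], 2: ["node-c"]},
)
print([rdd.compute(p) for p in rdd.partitions])
```

Both records with key 1 end up in the same bucket, which is the property a partitioner has to guarantee for grouping by key to work.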