Spark source code analysis (I): RDD collection data - partition data allocation
2022-06-26 05:53:00 【Little five】
What is an RDD? A Resilient Distributed Dataset.
The question to answer: how does an RDD distribute the elements of a collection data source across its partitions?
For example, given the collection (1,2,3,4) and numSlices=3 partitions, which elements end up in which partition?
Reading the source code gives a quick answer.

Step into the makeRDD function: internally it just calls the parallelize function, passing along the collection and the number of partitions.

The parallelize function creates a ParallelCollectionRDD object.
Next, step into the ParallelCollectionRDD class.
The object of the same name contains the slice method, whose doc comment reads:
Slice a collection into numSlices sub-collections. One extra thing we do here is to treat Range collections specially, encoding the slices as other Ranges to minimize memory cost. This makes it efficient to run Spark over RDDs representing large sets of numbers. And if the collection is an inclusive Range, we use inclusive range for the last slice.

Inside the slice function, right after the nested positions function is defined, the collection is pattern matched:
case 1: a Range. If the range is inclusive, the last slice is also made inclusive.
case 2: a NumericRange, used for ranges of Long, Double, BigInteger, etc.
case 3: anything else -> convert to an array and cut it with the positions function
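The three branches can be sketched in plain Scala without Spark on the classpath. This is only an illustration of the match order: the `SliceDispatch` object and the `branchFor` helper with its string labels are hypothetical names, not part of Spark.

```scala
import scala.collection.immutable.NumericRange

object SliceDispatch {
  // Hypothetical helper: reports which branch of the match a collection would take.
  // The order matters: Range is tested first, then NumericRange, then everything else.
  def branchFor(seq: Seq[Any]): String = seq match {
    case _: Range           => "Range"
    case _: NumericRange[_] => "NumericRange"
    case _                  => "generic"
  }

  def main(args: Array[String]): Unit = {
    println(branchFor(1 to 4))         // an Int range is a Range
    println(branchFor(1L to 4L))       // a Long range is a NumericRange
    println(branchFor(Seq(1, 2, 3, 4))) // an ordinary Seq falls through to the last case
  }
}
```

Note that `sc.makeRDD(1 to 4, 3)` and `sc.makeRDD(Seq(1, 2, 3, 4), 3)` therefore take different branches, even though they hold the same elements.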

The positions function takes (length of the collection, number of partitions) and iterates over [0, numSlices) (until is closed on the left and open on the right).
For each index i it computes start = i * length / numSlices and end = (i + 1) * length / numSlices with integer division, which yields the partitioning rule.
// For example, (1,2,3,4,5) with numSlices=3 -> iterate over i = 0, 1, 2
// This produces three index ranges: [0,1) [1,3) [3,5)
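The index arithmetic is easy to check in isolation. Below is a minimal standalone sketch of that calculation, written to follow the logic described above; the wrapping `PositionsSketch` object is a hypothetical name added so the snippet runs on its own.

```scala
object PositionsSketch {
  // Returns one (start, end) pair per partition; each pair is a half-open range [start, end).
  def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] =
    (0 until numSlices).iterator.map { i =>
      val start = ((i * length) / numSlices).toInt       // integer division rounds down
      val end   = (((i + 1) * length) / numSlices).toInt
      (start, end)
    }

  def main(args: Array[String]): Unit = {
    println(positions(5, 3).toList) // List((0,1), (1,3), (3,5))
    println(positions(4, 3).toList) // List((0,1), (1,2), (2,4))
  }
}
```

Because each end equals the next start, the ranges cover every index exactly once, and any remainder is pushed toward the later partitions.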
The array is then cut with slice at those index ranges.
The result is (1) (2,3) (4,5).
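Putting the two steps together for the generic (non-range) case gives roughly the following. This is a simplified sketch, not Spark's code: the `SliceSketch` object is a hypothetical name, and it uses `toIndexedSeq` instead of an array so that no `ClassTag` is needed.

```scala
object SliceSketch {
  // Same index arithmetic as described above: half-open ranges [start, end).
  private def positions(length: Long, numSlices: Int): Iterator[(Int, Int)] =
    (0 until numSlices).iterator.map { i =>
      (((i * length) / numSlices).toInt, (((i + 1) * length) / numSlices).toInt)
    }

  // Generic case only: copy into an indexed sequence once, then cut at each range.
  def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
    require(numSlices >= 1, "Positive number of partitions required")
    val indexed = seq.toIndexedSeq
    positions(indexed.length, numSlices).map { case (start, end) =>
      indexed.slice(start, end)
    }.toList
  }

  // Formats the partitions in the article's notation, e.g. "(1) (2,3) (4,5)".
  def render[T](slices: Seq[Seq[T]]): String =
    slices.map(_.mkString("(", ",", ")")).mkString(" ")

  def main(args: Array[String]): Unit = {
    println(render(slice(Seq(1, 2, 3, 4, 5), 3))) // (1) (2,3) (4,5)
    println(render(slice(Seq(1, 2, 3, 4), 3)))    // (1) (2) (3,4)
  }
}
```

The second call answers the opening question: (1,2,3,4) with numSlices=3 is stored as (1) (2) (3,4).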