Spark Bucket Table Join
2022-07-27 14:23:00 【wankunde】
Generating bucket tables
Creating the bucket tables
- Method 1
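spark.sql("DROP TABLE IF EXISTS user1_bucket")
spark.sql("DROP TABLE IF EXISTS user2_bucket")
// 99 rows (ids 1..99), each with a random name from "wankun-0" to "wankun-99"
val r = new scala.util.Random()
val df = spark.range(1, 100).map(i => (i, s"wankun-${r.nextInt(100)}")).toDF("id", "name")
// Save as a table with 10 buckets on `name`, sorted by `name` inside each file
df.write.
bucketBy(10, "name").
sortBy("name").
mode("overwrite").
saveAsTable("user1_bucket")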
scala> spark.sql("show create table user1_bucket").show(false)
+------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `default`.`user1_bucket` (
`id` BIGINT,
`name` STRING)
USING parquet
CLUSTERED BY (name)
SORTED BY (name)
INTO 10 BUCKETS
|
+------------------------------------------------------------------------------------------------------------------------------------+
- Method 2
CREATE TABLE user2_bucket (
`id` BIGINT,
`name` STRING)
USING parquet
CLUSTERED BY (name)
INTO 10 BUCKETS;
INSERT OVERWRITE TABLE user2_bucket
SELECT id, concat("wankun-",cast(rand()*100 as int)) as name
FROM range(1, 100);
Generated output files
[1] $ hdfs dfs -ls /user/hive/warehouse/user1_bucket
21/05/12 17:11:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 21 items
-rw-r--r-- 1 wakun supergroup 0 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/_SUCCESS
-rw-r--r-- 1 wakun supergroup 812 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00000.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 815 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00001.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 833 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00002.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 797 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00003.c000.snappy.parquet
...
-rw-r--r-- 1 wakun supergroup 817 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00009.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 812 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00001-b657899a-401b-406b-b45d-ac4df1d72e14_00000.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 815 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00001-b657899a-401b-406b-b45d-ac4df1d72e14_00001.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 833 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00001-b657899a-401b-406b-b45d-ac4df1d72e14_00002.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 831 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00001-b657899a-401b-406b-b45d-ac4df1d72e14_00003.c000.snappy.parquet
....
-rw-r--r-- 1 wakun supergroup 788 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00001-b657899a-401b-406b-b45d-ac4df1d72e14_00008.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 817 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00001-b657899a-401b-406b-b45d-ac4df1d72e14_00009.c000.snappy.parquet
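Rows with the same key always land in the same bucket. The bucket id is the numeric suffix before `.c000` in the file name: the `part-00000` and `part-00001` files ending in `_00000` both belong to bucket 0 and hold the same set of `name` values.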
scala> spark.read.parquet("/user/hive/warehouse/user1_bucket/part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00000.c000.snappy.parquet").show(false)
+---+---------+
|id |name |
+---+---------+
|21 |wankun-57|
|12 |wankun-73|
|37 |wankun-73|
|10 |wankun-89|
|17 |wankun-89|
|35 |wankun-89|
+---+---------+
scala> spark.read.parquet("/user/hive/warehouse/user1_bucket/part-00001-b657899a-401b-406b-b45d-ac4df1d72e14_00000.c000.snappy.parquet").show(false)
+---+---------+
|id |name |
+---+---------+
|70 |wankun-57|
|61 |wankun-73|
|86 |wankun-73|
|59 |wankun-89|
|66 |wankun-89|
|84 |wankun-89|
+---+---------+
scala> spark.read.parquet("/user/hive/warehouse/user1_bucket/part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00001.c000.snappy.parquet").show(false)
+---+---------+
|id |name |
+---+---------+
|1 |wankun-0 |
|8 |wankun-12|
|13 |wankun-12|
|16 |wankun-29|
|27 |wankun-85|
|34 |wankun-85|
+---+---------+
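As a rough illustration of why equal keys share a bucket, the sketch below assigns a bucket as the non-negative modulo of a hash of the key. It is a simplified model only: Spark applies its own Murmur3-based hash to the bucket columns, so the `bucketId` helper and the use of `scala.util.hashing.MurmurHash3.stringHash` are stand-ins (assumptions) and will not reproduce the bucket numbers seen in the files above.

```scala
import scala.util.hashing.MurmurHash3

object BucketSketch {
  // Hypothetical helper: maps a key to a bucket id in [0, numBuckets).
  // Stand-in for Spark's internal hash; the resulting ids differ from Spark's.
  def bucketId(key: String, numBuckets: Int): Int = {
    val h = MurmurHash3.stringHash(key)
    // Non-negative modulo, because the hash value may be negative.
    ((h % numBuckets) + numBuckets) % numBuckets
  }

  def main(args: Array[String]): Unit = {
    val numBuckets = 10
    // A few (id, name) rows taken from the output above.
    val rows = Seq((21L, "wankun-57"), (70L, "wankun-57"), (12L, "wankun-73"), (61L, "wankun-73"))
    // The bucket depends only on the name, so rows with equal names are grouped together.
    rows.groupBy { case (_, name) => bucketId(name, numBuckets) }
      .toSeq.sortBy(_._1)
      .foreach { case (b, rs) => println(f"bucket $b%05d -> ${rs.mkString(", ")}") }
  }
}
```

Because the bucket is a pure function of the key and the bucket count, two tables bucketed on the same column with the same number of buckets place matching keys in buckets with the same id.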
Testing bucket table join vs. non-bucket table join
set spark.sql.autoBroadcastJoinThreshold=-1;
-- bucket table join
SELECT t1.id, t1.name, t2.id as id2, t2.name as name2
FROM user1_bucket t1
JOIN user2_bucket t2
ON t1.name = t2.name;
-- non bucket table join
SELECT t1.id, t1.name, t2.id as id2, t2.name as name2
FROM user1 t1
JOIN user2 t2
ON t1.name = t2.name;
Bucket join process
Non-bucket join process
Comparing the two, the join on bucket tables skips the shuffle step that the non-bucket join needs.
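One way to see the difference for yourself is to print both physical plans in spark-shell. The sketch below assumes the `spark` session and the `user1_bucket` and `user2_bucket` tables from the earlier steps. The plain tables `user1` and `user2` are not created anywhere in this article, so the snippet builds them as unbucketed copies, which is an assumption. `joinSql` and `countExchanges` are hypothetical helpers introduced only for this example.

```scala
// Run in spark-shell, where `spark` is already defined.
// Assumption: user1/user2 are plain (unbucketed) copies of the bucket tables.
spark.table("user1_bucket").write.mode("overwrite").saveAsTable("user1")
spark.table("user2_bucket").write.mode("overwrite").saveAsTable("user2")

// Disable broadcast joins so that neither query falls back to a broadcast join.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

// Hypothetical helper: the join query from the article, parameterized by table names.
def joinSql(left: String, right: String): String =
  s"""SELECT t1.id, t1.name, t2.id AS id2, t2.name AS name2
     |FROM $left t1 JOIN $right t2 ON t1.name = t2.name""".stripMargin

// Hypothetical helper: number of plan lines that mention a shuffle (Exchange) node.
def countExchanges(planText: String): Int =
  planText.split("\n").count(_.contains("Exchange"))

val bucketPlan =
  spark.sql(joinSql("user1_bucket", "user2_bucket")).queryExecution.executedPlan.toString
val nonBucketPlan =
  spark.sql(joinSql("user1", "user2")).queryExecution.executedPlan.toString

println(bucketPlan)
println(nonBucketPlan)
println(s"Exchange lines: bucket=${countExchanges(bucketPlan)}, non-bucket=${countExchanges(nonBucketPlan)}")
```

If the bucket layout is used, the first plan is expected to contain no `Exchange` node, while the second should show an `Exchange hashpartitioning(name, ...)` on each side of the join. The exact plan text varies with the Spark version and settings such as adaptive query execution.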