Spark Bucket Table Join
2022-07-27 15:38:00 【wankunde】
Generating bucket tables
Create the bucket tables
- Method one
spark.sql("DROP TABLE IF EXISTS user1_bucket")
spark.sql("DROP TABLE IF EXISTS user2_bucket")
val r = new scala.util.Random()
val df = spark.range(1, 100).map(i => (i, s"wankun-${r.nextInt(100)}")).toDF("id", "name")
df.write.
bucketBy(10, "name").
sortBy("name").
mode("overwrite").
saveAsTable("user1_bucket")
scala> spark.sql("show create table user1_bucket").show(false)
+------------------------------------------------------------------------------------------------------------------------------------+
|createtab_stmt |
+------------------------------------------------------------------------------------------------------------------------------------+
|CREATE TABLE `default`.`users` (
`id` BIGINT,
`name` STRING)
USING parquet
CLUSTERED BY (name)
SORTED BY (name)
INTO 10 BUCKETS
|
+------------------------------------------------------------------------------------------------------------------------------------+
- Method two
CREATE TABLE user2_bucket (
`id` BIGINT,
`name` STRING)
USING parquet
CLUSTERED BY (name)
INTO 10 BUCKETS;
INSERT OVERWRITE TABLE user2_bucket
SELECT id, concat("wankun-",cast(rand()*100 as int)) as name
FROM range(1, 100);
Generated result files
[1] $ hdfs dfs -ls /user/hive/warehouse/user1_bucket
21/05/12 17:11:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Found 21 items
-rw-r--r-- 1 wakun supergroup 0 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/_SUCCESS
-rw-r--r-- 1 wakun supergroup 812 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00000.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 815 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00001.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 833 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00002.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 797 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00003.c000.snappy.parquet
...
-rw-r--r-- 1 wakun supergroup 817 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00009.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 812 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00001-b657899a-401b-406b-b45d-ac4df1d72e14_00000.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 815 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00001-b657899a-401b-406b-b45d-ac4df1d72e14_00001.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 833 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00001-b657899a-401b-406b-b45d-ac4df1d72e14_00002.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 831 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00001-b657899a-401b-406b-b45d-ac4df1d72e14_00003.c000.snappy.parquet
....
-rw-r--r-- 1 wakun supergroup 788 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00001-b657899a-401b-406b-b45d-ac4df1d72e14_00008.c000.snappy.parquet
-rw-r--r-- 1 wakun supergroup 817 2021-05-12 17:05 /user/hive/warehouse/user1_bucket/part-00001-b657899a-401b-406b-b45d-ac4df1d72e14_00009.c000.snappy.parquet
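In the listing above, the file names appear to follow the pattern part-<task>-<uuid>_<bucket>.c000.snappy.parquet, so two writer tasks times 10 buckets would account for the 20 data files (plus _SUCCESS). A minimal sketch for pulling the bucket id out of such names is shown here. The object BucketFileNames, the helper bucketIdOf, and the regex are illustrative assumptions, not part of the Spark API.

```scala
object BucketFileNames {
  // Hypothetical pattern for "part-<task>-<uuid>_<bucket>.c000.<ext>" file names
  private val BucketFile = """part-(\d+)-.*_(\d+)\.c\d+\..*""".r

  /** Returns the bucket id encoded in a file name, or None if the name does not match. */
  def bucketIdOf(fileName: String): Option[Int] = fileName match {
    case BucketFile(_, bucket) => Some(bucket.toInt)
    case _                     => None
  }

  def main(args: Array[String]): Unit = {
    val files = Seq(
      "part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00000.c000.snappy.parquet",
      "part-00001-b657899a-401b-406b-b45d-ac4df1d72e14_00000.c000.snappy.parquet",
      "part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00001.c000.snappy.parquet",
      "_SUCCESS"
    )
    // Pair each data file with its bucket id; _SUCCESS has no bucket id and is dropped
    val byBucket = files.flatMap(f => bucketIdOf(f).map(id => id -> f)).groupBy(_._1)
    byBucket.toSeq.sortBy(_._1).foreach { case (bucket, fs) =>
      val count = fs.size
      println(s"bucket $bucket: $count file(s)")
    }
  }
}
```

Grouping by the extracted id makes it easy to see that each bucket has one file per writer task.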
Rows with the same key must land in the same bucket
scala> spark.read.parquet("/user/hive/warehouse/user1_bucket/part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00000.c000.snappy.parquet").show(false)
+---+---------+
|id |name |
+---+---------+
|21 |wankun-57|
|12 |wankun-73|
|37 |wankun-73|
|10 |wankun-89|
|17 |wankun-89|
|35 |wankun-89|
+---+---------+
scala> spark.read.parquet("/user/hive/warehouse/user1_bucket/part-00001-b657899a-401b-406b-b45d-ac4df1d72e14_00000.c000.snappy.parquet").show(false)
+---+---------+
|id |name |
+---+---------+
|70 |wankun-57|
|61 |wankun-73|
|86 |wankun-73|
|59 |wankun-89|
|66 |wankun-89|
|84 |wankun-89|
+---+---------+
scala> spark.read.parquet("/user/hive/warehouse/user1_bucket/part-00000-b657899a-401b-406b-b45d-ac4df1d72e14_00001.c000.snappy.parquet").show(false)
+---+---------+
|id |name |
+---+---------+
|1 |wankun-0 |
|8 |wankun-12|
|13 |wankun-12|
|16 |wankun-29|
|27 |wankun-85|
|34 |wankun-85|
+---+---------+
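The files above suggest that a given name only ever appears under one bucket id. Conceptually, the bucket id is a hash of the bucket column taken as a non-negative modulo of the bucket count. The sketch below illustrates that idea only: it uses scala.util.hashing.MurmurHash3.stringHash as a stand-in hash (an assumption; Spark's own hash is different), so the ids it prints will not match the bucket ids in the listing. BucketAssignmentSketch, pmod, and bucketOf are hypothetical names.

```scala
import scala.util.hashing.MurmurHash3

object BucketAssignmentSketch {
  /** Non-negative modulo, so negative hash values still map into [0, n). */
  def pmod(a: Int, n: Int): Int = ((a % n) + n) % n

  /** Stand-in bucket function (assumption): NOT Spark's exact hash. */
  def bucketOf(key: String, numBuckets: Int): Int =
    pmod(MurmurHash3.stringHash(key), numBuckets)

  def main(args: Array[String]): Unit = {
    val numBuckets = 10
    // A few (id, name) rows taken from the output above
    val rows = Seq(21L -> "wankun-57", 70L -> "wankun-57", 12L -> "wankun-73", 61L -> "wankun-73")
    // Rows sharing a name always hash to the same bucket, whichever task wrote them
    rows.groupBy { case (_, name) => bucketOf(name, numBuckets) }
      .toSeq.sortBy(_._1)
      .foreach { case (bucket, rs) =>
        val shown = rs.mkString(", ")
        println(s"bucket $bucket -> $shown")
      }
  }
}
```

Because the bucket depends only on the key and the bucket count, two tables bucketed the same way on the join column place matching keys under the same bucket id.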
Test: bucket table join vs. non-bucket table join
set spark.sql.autoBroadcastJoinThreshold=-1;
-- bucket table join
SELECT t1.id, t1.name, t2.id as id2, t2.name as name2
FROM user1_bucket t1
JOIN user2_bucket t2
ON t1.name = t2.name;
-- non bucket table join
SELECT t1.id, t1.name, t2.id as id2, t2.name as name2
FROM user1 t1
JOIN user2 t2
ON t1.name = t2.name;
Bucket join process
Non-bucket join process
By comparison, the join on bucket tables needs one fewer shuffle than the join on non-bucket tables.
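One way to check for the shuffle is to inspect the physical plan of the join. The snippet below is a sketch meant for spark-shell, where `spark` is predefined as in the earlier snippets; it assumes user1_bucket and user2_bucket exist as created above. The textual check for "Exchange" is a rough heuristic, not an official API for detecting shuffles.

```scala
// Intended for spark-shell, where `spark` is already in scope.
// Assumes user1_bucket and user2_bucket were created as shown earlier.

// Disable broadcast joins so the small tables are not simply broadcast
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")

val t1 = spark.table("user1_bucket")
val t2 = spark.table("user2_bucket")
val bucketJoin = t1.join(t2, t1("name") === t2("name"))

// Print the physical plan; look for Exchange nodes beneath the join
bucketJoin.explain()

// Rough textual check (assumption: shuffle nodes are rendered with "Exchange" in their name)
val planText = bucketJoin.queryExecution.executedPlan.toString
val hasExchange = planText.contains("Exchange")
println(s"Exchange present in plan: $hasExchange")
```

Running the same check against the non-bucket tables gives a direct comparison of the two plans.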