当前位置：网站首页>Explain in detail the functions and underlying implementation logic of the groups sets statement in SQL

Explain in detail the functions and underlying implementation logic of the groups sets statement in SQL

2022-07-06 06:44:00 【Huawei cloud developer Alliance】

Abstract ： This article first briefly introduces Grouping Sets Usage of , And then to Spark SQL As an entry point , In depth analysis of Grouping Sets Implementation mechanism .

This article is shared from Huawei cloud community 《 In depth understanding of SQL Medium Grouping Sets sentence 》, author ： Yuan Runzi .

Preface

SQL in Group By Everyone is familiar with the sentence , Group data according to specified rules , Often and Aggregate functions Use it together .

such as , Consider having a table dealer, The data in the table is as follows ：

id (Int)	city (String)	car_model (String)	quantity (Int)
100	Fremont	Honda Civic	10
100	Fremont	Honda Accord	15
100	Fremont	Honda CRV	7
200	Dublin	Honda Civic	20
200	Dublin	Honda Accord	10
200	Dublin	Honda CRV	3
300	San Jose	Honda Civic	5
300	San Jose	Honda Accord	8

If you execute SQL sentence SELECT id, sum(quantity) FROM dealer GROUP BY id ORDER BY id, We will get the following results ：

 +---+-------------+
 | id|sum(quantity)|
 +---+-------------+
 |100|           32|
 |200|           33|
 |300|           13|
 +---+-------------+

Above SQL Statement means to press id Columns to group , Then pair quantity To sum .

Group By In addition to the simple usage above , There are more advanced uses , Common is Grouping Sets、RollUp and Cube, They are OLAP More commonly used when . among ,RollUp and Cube It's all about Grouping Sets Based on , therefore , I understand Grouping Sets, I understand RollUp and Cube .

This article first briefly introduces Grouping Sets Usage of , And then to Spark SQL As an entry point , In depth analysis of Grouping Sets Implementation mechanism .

Spark SQL yes Apache Spark A sub module of big data processing framework , Used to process structured information . It can be SQL Sentence translation multiple tasks in Spark Execute on cluster , Allow users to pass directly through SQL To process data , Greatly improved usability .

Grouping Sets brief introduction

Spark SQL In official documents SQL Syntax One section right Grouping Sets The statement is described below ：

Groups the rows for each grouping set specified after GROUPING SETS. （... Some examples ） This clause is a shorthand for a UNION ALL where each leg of the UNION ALL operator performs aggregation of each grouping set specified in the GROUPING SETS clause. （... Some examples ）

That is to say ,Grouping Sets The function of the statement is to specify several grouping set As Group By Grouping rules , Then combine the results . Its effect and , First of all, separate these grouping set Conduct Group By After grouping , Re pass Union All Combine the results , It's the same .

such as , about dealer surface ,Group By Grouping Sets ((city, car_model), (city), (car_model), ()) and Union All((Group By city, car_model), (Group By city), (Group By car_model), Global aggregation ) The effect is the same ：

First look at Grouping Sets Implementation results of version ：

 spark-sql> SELECT city, car_model, sum(quantity) AS sum FROM dealer 
          > GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ()) 
          > ORDER BY city, car_model;
 +--------+------------+---+
 |   city|   car_model|sum|
 +--------+------------+---+
 |    null|        null| 78|
 |    null|Honda Accord| 33|
 |    null|   Honda CRV| 10|
 |    null| Honda Civic| 35|
 | Dublin|        null| 33|
 | Dublin|Honda Accord| 10|
 | Dublin|   Honda CRV|  3|
 | Dublin| Honda Civic| 20|
 | Fremont|        null| 32|
 | Fremont|Honda Accord| 15|
 | Fremont|   Honda CRV|  7|
 | Fremont| Honda Civic| 10|
 |San Jose|        null| 13|
 |San Jose|Honda Accord|  8|
 |San Jose| Honda Civic|  5|
 +--------+------------+---+

Look again Union All Implementation results of version ：

 spark-sql> (SELECT city, car_model, sum(quantity) AS sum FROM dealer GROUP BY city, car_model) UNION ALL 
          > (SELECT city, NULL as car_model, sum(quantity) AS sum FROM dealer GROUP BY city) UNION ALL 
          > (SELECT NULL as city, car_model, sum(quantity) AS sum FROM dealer GROUP BY car_model) UNION ALL 
          > (SELECT NULL as city, NULL as car_model, sum(quantity) AS sum FROM dealer) 
          > ORDER BY city, car_model;
 +--------+------------+---+
 |   city|   car_model|sum|
 +--------+------------+---+
 |    null|        null| 78|
 |    null|Honda Accord| 33|
 |    null|   Honda CRV| 10|
 |    null| Honda Civic| 35|
 | Dublin|        null| 33|
 | Dublin|Honda Accord| 10|
 | Dublin|   Honda CRV|  3|
 | Dublin| Honda Civic| 20|
 | Fremont|        null| 32|
 | Fremont|Honda Accord| 15|
 | Fremont|   Honda CRV|  7|
 | Fremont| Honda Civic| 10|
 |San Jose|        null| 13|
 |San Jose|Honda Accord|  8|
 |San Jose| Honda Civic|  5|
 +--------+------------+---+

The query results of the two versions are exactly the same .

Grouping Sets Implementation plan of

In terms of execution results ,Grouping Sets Version and Union All Version of SQL It is equivalent. , but Grouping Sets The version is more concise .

that ,Grouping Sets It's just Union All An abbreviation of , Or grammar sugar ？

In order to further explore Grouping Sets Whether the underlying implementation of is consistent with Union All It's consistent , We can take a look at the implementation plan of the two .

First , We go through explain extended Check it out. Union All Version of Optimized Logical Plan:

 spark-sql> explain extended (SELECT city, car_model, sum(quantity) AS sum FROM dealer GROUP BY city, car_model) UNION ALL(SELECT city, NULL as car_model, sum(quantity) AS sum FROM dealer GROUP BY city) UNION ALL (SELECT NULL as city, car_model, sum(quantity) AS sum FROM dealer GROUP BY car_model) UNION ALL (SELECT NULL as city, NULL as car_model, sum(quantity) AS sum FROM dealer) ORDER BY city, car_model;
 == Parsed Logical Plan ==
 ...
 == Analyzed Logical Plan ==
 ...
 == Optimized Logical Plan ==
 Sort [city#93 ASC NULLS FIRST, car_model#94 ASC NULLS FIRST], true
 +- Union false, false
    :- Aggregate [city#93, car_model#94], [city#93, car_model#94, sum(quantity#95) AS sum#79L]
    :  +- Project [city#93, car_model#94, quantity#95]
    :     +- HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#92, city#93, car_model#94, quantity#95], Partition Cols: []]
    :- Aggregate [city#97], [city#97, null AS car_model#112, sum(quantity#99) AS sum#81L]
    :  +- Project [city#97, quantity#99]
    :     +- HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#96, city#97, car_model#98, quantity#99], Partition Cols: []]
    :- Aggregate [car_model#102], [null AS city#113, car_model#102, sum(quantity#103) AS sum#83L]
    :  +- Project [car_model#102, quantity#103]
    :     +- HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#100, city#101, car_model#102, quantity#103], Partition Cols: []]
    +- Aggregate [null AS city#114, null AS car_model#115, sum(quantity#107) AS sum#86L]
       +- Project [quantity#107]
          +- HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#104, city#105, car_model#106, quantity#107], Partition Cols: []]
 == Physical Plan ==
 ...

From the above Optimized Logical Plan It can be seen clearly Union All Implementation logic of version ：

Execute each subquery statement , Calculate the query results . among , The logic of each query statement is like this ：
- stay HiveTableRelation Node pair dealer Table scanning .
- stay Project The node selects the columns related to the query statement results , For example, for sub query statements SELECT NULL as city, NULL as car_model, sum(quantity) AS sum FROM dealer, Just keep quantity Just column .
- stay Aggregate Node to complete quantity Column pair aggregation . In the above Plan in ,Aggregate Immediately following is the column used for grouping , such as Aggregate [city#902] It means according to city Column to group .
stay Union The node completes the union of each sub query result .
Last , stay Sort The node finishes sorting the data , Above Plan in Sort [city#93 ASC NULLS FIRST, car_model#94 ASC NULLS FIRST] It means according to city and car_model Sort columns in ascending order .

Next , We go through explain extended Check it out. Grouping Sets Version of Optimized Logical Plan：

 spark-sql> explain extended SELECT city, car_model, sum(quantity) AS sum FROM dealer GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ()) ORDER BY city, car_model;
 == Parsed Logical Plan ==
 ...
 == Analyzed Logical Plan ==
 ...
 == Optimized Logical Plan ==
 Sort [city#138 ASC NULLS FIRST, car_model#139 ASC NULLS FIRST], true
 +- Aggregate [city#138, car_model#139, spark_grouping_id#137L], [city#138, car_model#139, sum(quantity#133) AS sum#124L]
    +- Expand [[quantity#133, city#131, car_model#132, 0], [quantity#133, city#131, null, 1], [quantity#133, null, car_model#132, 2], [quantity#133, null, null, 3]], [quantity#133, city#138, car_model#139, spark_grouping_id#137L]
       +- Project [quantity#133, city#131, car_model#132]
          +- HiveTableRelation [`default`.`dealer`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#130, city#131, car_model#132, quantity#133], Partition Cols: []]
 == Physical Plan ==
 ...

from Optimized Logical Plan Look at ,Grouping Sets The version should be much simpler ！ The specific execution logic is as follows ：

stay HiveTableRelation Node pair dealer Table scanning .
stay Project The node selects the columns related to the query statement results .
Next Expand Nodes are the key , After the data passes through this node , There's more spark_grouping_id Column . from Plan You can see that ,Expand The node contains Grouping Sets Every one of them grouping set Information , such as [quantity#133, city#131, null, 1] The corresponding is (city) this grouping set. and , Every grouping set Corresponding spark_grouping_id The values of the columns are fixed , such as (city) Corresponding spark_grouping_id by 1.
stay Aggregate Node to complete quantity Column pair aggregation , The grouping rule is city, car_model, spark_grouping_id. Be careful , Data after Aggregate After node ,spark_grouping_id The column was deleted ！
Last , stay Sort The node finishes sorting the data .

from Optimized Logical Plan Look at , although Union All Version and Grouping Sets The effect of the version is consistent , But there are huge differences in their underlying implementations .

among ,Grouping Sets Version of Plan The most important thing is Expand node , at present , We only know that after the data passes through it , There's more spark_grouping_id Column . And from the end result ,spark_grouping_id It's just Spark SQL Internal implementation details of , Not for users . that ：

Expand What is the implementation logic of , Why can we achieve Union All The effect of ？
Expand What is the output data of the node ？
spark_grouping_id What is the function of columns ？

adopt Physical Plan, We found that Expand The operator name corresponding to the node is also Expand:

 == Physical Plan ==
 AdaptiveSparkPlan isFinalPlan=false
 +- Sort [city#138 ASC NULLS FIRST, car_model#139 ASC NULLS FIRST], true, 0
    +- Exchange rangepartitioning(city#138 ASC NULLS FIRST, car_model#139 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=422]
       +- HashAggregate(keys=[city#138, car_model#139, spark_grouping_id#137L], functions=[sum(quantity#133)], output=[city#138, car_model#139, sum#124L])
          +- Exchange hashpartitioning(city#138, car_model#139, spark_grouping_id#137L, 200), ENSURE_REQUIREMENTS, [plan_id=419]
             +- HashAggregate(keys=[city#138, car_model#139, spark_grouping_id#137L], functions=[partial_sum(quantity#133)], output=[city#138, car_model#139, spark_grouping_id#137L, sum#141L])
                +- Expand [[quantity#133, city#131, car_model#132, 0], [quantity#133, city#131, null, 1], [quantity#133, null, car_model#132, 2], [quantity#133, null, null, 3]], [quantity#133, city#138, car_model#139, spark_grouping_id#137L]
                   +- Scan hive default.dealer [quantity#133, city#131, car_model#132], HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#130, city#131, car_model#132, quantity#133], Partition Cols: []]

With the previous questions , Next, let's go deep into Spark SQL Of Expand Operator source code to find the answer .

Expand The realization of operators

Expand The operator is in Spark SQL The implementation in the source code is ExpandExec class （Spark SQL The names of operator implementation classes in are XxxExec The format of , among Xxx Is the specific operator name , such as Project The implementation class of the operator is ProjectExec）, The core code is as follows ：

 /**
  * Apply all of the GroupExpressions to every input row, hence we will get
  * multiple output rows for an input row.
  * @param projections The group of expressions, all of the group expressions should
  *                   output the same schema specified bye the parameter `output`
  * @param output     The output Schema
  * @param child       Child operator
  */
 case class ExpandExec(
     projections: Seq[Seq[Expression]],
     output: Seq[Attribute],
     child: SparkPlan)
   extends UnaryExecNode with CodegenSupport {
 
  ...
   //  Key points 1, take child.output, That is, the output data of the upstream operator schema,
   //  Bound to an expression array exprs, To calculate the output data 
   private[this] val projection =
    (exprs: Seq[Expression]) => UnsafeProjection.create(exprs, child.output)
 
   // doExecute() Method is Expand Where the operator executes logic 
   protected override def doExecute(): RDD[InternalRow] = {
     val numOutputRows = longMetric("numOutputRows")
 
     //  Processing the output data of upstream operators ,Expand The input data of the operator starts from iter Iterator gets 
     child.execute().mapPartitions { iter =>
       //  Key points 2,projections Corresponding Grouping Sets Each of them grouping set The expression of ,
       //  Expression output data schema by this.output,  such as  (quantity, city, car_model, spark_grouping_id)
       //  The logic here is to generate a UnsafeProjection object , Through the apply The method will lead to Expand The output data of the operator 
       val groups = projections.map(projection).toArray
       new Iterator[InternalRow] {
         private[this] var result: InternalRow = _
         private[this] var idx = -1  // -1 means the initial state
         private[this] var input: InternalRow = _
 
         override final def hasNext: Boolean = (-1 < idx && idx < groups.length) || iter.hasNext
 
         override final def next(): InternalRow = {
           //  Key points 3, For each record of input data , Are reused N Time , among N The size of corresponds to projections Size of array ,
           //  That is to say Grouping Sets In the specified grouping set The number of 
           if (idx <= 0) {
             // in the initial (-1) or beginning(0) of a new input row, fetch the next input tuple
             input = iter.next()
             idx = 0
          }
           //  Key points 4, For each record of input data , adopt UnsafeProjection Calculate the output data ,
           //  Every grouping set Corresponding UnsafeProjection Will be the same input Do a calculation 
           result = groups(idx)(input)
           idx += 1
 
           if (idx == groups.length && iter.hasNext) {
             idx = 0
          }
 
           numOutputRows += 1
           result
        }
      }
    }
  }
  ...
 }

ExpandExec The implementation of is not complicated , Want to understand how it works , The key is to understand what is mentioned in the above source code 4 A key point .

Key points 1 and Key points 2 It's the foundation , Key points 2 Medium groups It's a UnsafeProjection[N] An array type , Each of them UnsafeProjection On behalf of Grouping Sets Specified in the statement grouping set, It's defined like this ：

 // A projection that returns UnsafeRow.
 abstract class UnsafeProjection extends Projection {
   override def apply(row: InternalRow): UnsafeRow
 }
 
 // The factory object for `UnsafeProjection`.
 object UnsafeProjection
     extends CodeGeneratorWithInterpretedFallback[Seq[Expression], UnsafeProjection] {
   // Returns an UnsafeProjection for given sequence of Expressions, which will be bound to
   // `inputSchema`.
   def create(exprs: Seq[Expression], inputSchema: Seq[Attribute]): UnsafeProjection = {
     create(bindReferences(exprs, inputSchema))
  }
  ...
 }

UnsafeProjection It is similar to the function of column projection , among , apply Method is based on the parameters passed during creation exprs and inputSchema, Column projection of input records , Get the output record .

such as , Ahead GROUPING SETS ((city, car_model), (city), (car_model), ()) Example , Its corresponding groups That's true ：

among ,AttributeReference Expression of type , At the time of calculation , It will directly reference the value of the corresponding column of the input data ;Iteral Expression of type , At the time of calculation , The value is fixed .

Key points 3 and Key points 4 yes Expand The essence of operators ,ExpandExec Through these two paragraphs of logic , Record every input , Expand （Expand） become N Output records .

Key points 4 in groups(idx)(input) Equate to groups(idx).apply(input) .

Or the front GROUPING SETS ((city, car_model), (city), (car_model), ()) As an example , The effect is this ：

Come here , We have made it clear Expand How the operator works , Look back at the previous mentioned 3 A question , It's not difficult to answer ：

Expand What is the implementation logic of , Why can we achieve Union All The effect of ？
if Union All It is to aggregate first and then unite , that Expand Is to unite first and then aggregate .Expand utilize groups Inside N An expression evaluates each input record , Expanded into N Output records . When we aggregate later , Can achieve and Union All The same effect .
Expand What is the output data of the node ？
stay schema On ,Expand The output data will be more than the input data spark_grouping_id Column ; On the record number , Is the number of input data records N times .
spark_grouping_id What is the function of columns ？
spark_grouping_id For each grouping set Number , such , Even in Expand The stage combines the data first , stay Aggregate Stage （ hold spark_grouping_id Add to the grouping rule ） It can also ensure that the data can be in accordance with each grouping set Aggregate separately , Ensure the correctness of the results .

Query performance comparison

From the above, we can see that ,Grouping Sets and Union All Two versions of SQL Statements have the same effect , But there are huge differences in their implementation plans . below , We will compare the performance differences between the two versions .

spark-sql After execution SQL The time-consuming information will be printed after the statement , We have two versions of SQL Separately 10 Time , Get the following information ：

 // Grouping Sets  Version execution 10 Time consuming information 
 // SELECT city, car_model, sum(quantity) AS sum FROM dealer GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ()) ORDER BY city, car_model;
 Time taken: 0.289 seconds, Fetched 15 row(s)
 Time taken: 0.251 seconds, Fetched 15 row(s)
 Time taken: 0.259 seconds, Fetched 15 row(s)
 Time taken: 0.258 seconds, Fetched 15 row(s)
 Time taken: 0.296 seconds, Fetched 15 row(s)
 Time taken: 0.247 seconds, Fetched 15 row(s)
 Time taken: 0.298 seconds, Fetched 15 row(s)
 Time taken: 0.286 seconds, Fetched 15 row(s)
 Time taken: 0.292 seconds, Fetched 15 row(s)
 Time taken: 0.282 seconds, Fetched 15 row(s)
 
 // Union All  Version execution 10 Time consuming information 
 // (SELECT city, car_model, sum(quantity) AS sum FROM dealer GROUP BY city, car_model) UNION ALL (SELECT city, NULL as car_model, sum(quantity) AS sum FROM dealer GROUP BY city) UNION ALL (SELECT NULL as city, car_model, sum(quantity) AS sum FROM dealer GROUP BY car_model) UNION ALL (SELECT NULL as city, NULL as car_model, sum(quantity) AS sum FROM dealer) ORDER BY city, car_model;
 Time taken: 0.628 seconds, Fetched 15 row(s)
 Time taken: 0.594 seconds, Fetched 15 row(s)
 Time taken: 0.591 seconds, Fetched 15 row(s)
 Time taken: 0.607 seconds, Fetched 15 row(s)
 Time taken: 0.616 seconds, Fetched 15 row(s)
 Time taken: 0.64 seconds, Fetched 15 row(s)
 Time taken: 0.623 seconds, Fetched 15 row(s)
 Time taken: 0.625 seconds, Fetched 15 row(s)
 Time taken: 0.62 seconds, Fetched 15 row(s)
 Time taken: 0.62 seconds, Fetched 15 row(s)

You can work out ,Grouping Sets Version of SQL The average time taken is 0.276s;Union All Version of SQL The average time taken is 0.616s, It's the former 2.2 times ！

therefore ,Grouping Sets Version of SQL It is not only more concise in expression , It is also more efficient in performance .

RollUp and Cube

Group By In advanced usage , also RollUp and Cube Two are commonly used .

First , Let's take a look at RollUp sentence .

Spark SQL In official documents SQL Syntax One section right RollUp The statement is described below ：

Specifies multiple levels of aggregations in a single statement. This clause is used to compute aggregations based on multiple grouping sets. ROLLUP is a shorthand for GROUPING SETS. （... Some examples ）

In official documents , hold RollUp Described as Grouping Sets Abbreviation , The equivalence rule is ：RollUp(A, B, C) == Grouping Sets((A, B, C), (A, B), (A), ()).

such as ,Group By RollUp(city, car_model) Is equivalent to Group By Grouping Sets((city, car_model), (city), ()).

below , We go through expand extended look down RollUp edition SQL Of Optimized Logical Plan：

 spark-sql> explain extended SELECT city, car_model, sum(quantity) AS sum FROM dealer GROUP BY ROLLUP(city, car_model) ORDER BY city, car_model;
 == Parsed Logical Plan ==
 ...
 == Analyzed Logical Plan ==
 ...
 == Optimized Logical Plan ==
 Sort [city#2164 ASC NULLS FIRST, car_model#2165 ASC NULLS FIRST], true
 +- Aggregate [city#2164, car_model#2165, spark_grouping_id#2163L], [city#2164, car_model#2165, sum(quantity#2159) ASsum#2150L]
    +- Expand [[quantity#2159, city#2157, car_model#2158, 0], [quantity#2159, city#2157, null, 1], [quantity#2159, null, null, 3]], [quantity#2159, city#2164, car_model#2165, spark_grouping_id#2163L]
       +- Project [quantity#2159, city#2157, car_model#2158]
          +- HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#2156, city#2157, car_model#2158, quantity#2159], Partition Cols: []]
 == Physical Plan ==
 ...

From the above Plan It can be seen that ,RollUp The underlying implementation is also used Expand operator , explain RollUp It's really based on Grouping Sets Realized . and Expand [[quantity#2159, city#2157, car_model#2158, 0], [quantity#2159, city#2157, null, 1], [quantity#2159, null, null, 3]] Also shows RollUp Conform to the equivalence rule .

below , We follow the same line of thinking , look down Cube sentence .

Spark SQL In official documents SQL Syntax One section right Cube The statement is described below ：

CUBE clause is used to perform aggregations based on combination of grouping columns specified in the GROUP BYclause. CUBE is a shorthand for GROUPING SETS. (... Some examples )

Again , Official documents put Cube Described as Grouping Sets Abbreviation , The equivalence rule is ：Cube(A, B, C) == Grouping Sets((A, B, C), (A, B), (A, C), (B, C), (A), (B), (C), ()).

such as ,Group By Cube(city, car_model) Is equivalent to Group By Grouping Sets((city, car_model), (city), (car_model), ()).

below , We go through expand extended look down Cube edition SQL Of Optimized Logical Plan：

 spark-sql> explain extended SELECT city, car_model, sum(quantity) AS sum FROM dealer GROUP BY CUBE(city, car_model) ORDER BY city, car_model;
 == Parsed Logical Plan ==
 ...
 == Analyzed Logical Plan ==
 ...
 == Optimized Logical Plan ==
 Sort [city#2202 ASC NULLS FIRST, car_model#2203 ASC NULLS FIRST], true
 +- Aggregate [city#2202, car_model#2203, spark_grouping_id#2201L], [city#2202, car_model#2203, sum(quantity#2197) ASsum#2188L]
    +- Expand [[quantity#2197, city#2195, car_model#2196, 0], [quantity#2197, city#2195, null, 1], [quantity#2197, null, car_model#2196, 2], [quantity#2197, null, null, 3]], [quantity#2197, city#2202, car_model#2203, spark_grouping_id#2201L]
       +- Project [quantity#2197, city#2195, car_model#2196]
          +- HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#2194, city#2195, car_model#2196, quantity#2197], Partition Cols: []]
 == Physical Plan ==
 ...

From the above Plan It can be seen that ,Cube The bottom layer is also used Expand operator , explain Cube It's based on Grouping Sets Realization , And it also conforms to the equivalence rule .

therefore ,RollUp and Cube It can be seen as Grouping Sets The grammar sugar of , The underlying implementation and performance are the same .

Last

This paper focuses on Group By Advanced usage Groupings Sets The function and underlying implementation of the statement .

although Groupings Sets The function of , adopt Union All Can also be realized , But the former is not the grammatical sugar of the latter , Their underlying implementation is completely different .Grouping Sets It adopts the idea of combining first and then aggregating , adopt spark_grouping_id Columns to ensure the correctness of data ;Union All Then we adopt the idea of first aggregation and then combination .Grouping Sets stay SQL Statement expression and performance have greater advantages .

Group By The other two advanced uses of RollUp and Cube It can be seen as Grouping Sets The grammar sugar of , Their bottom layer is based on Expand Operator implementation , In terms of performance and direct use Grouping Sets It's the same , But in SQL More concise expression .