深度剖析SQL中的Grouping Sets语句1-电子发烧友网

前言

SQL 中 Group By 语句大家都很熟悉， 根据指定的规则对数据进行分组 ，常常和聚合函数一起使用。

比如，考虑有表 dealer，表中数据如下：

id (Int)	city (String)	car_model (String)	quantity (Int)
100	Fremont	Honda Civic	10
100	Fremont	Honda Accord	15
100	Fremont	Honda CRV	7
200	Dublin	Honda Civic	20
200	Dublin	Honda Accord	10
200	Dublin	Honda CRV	3
300	San Jose	Honda Civic	5
300	San Jose	Honda Accord	8

如果执行 SQL 语句 SELECT id, sum(quantity) FROM dealer GROUP BY id ORDER BY id，会得到如下结果：

+---+-------------+
| id|sum(quantity)|
+---+-------------+
|100|           32|
|200|           33|
|300|           13|
+---+-------------+

上述 SQL 语句的意思就是对数据按 id 列进行分组，然后在每个分组内对 quantity 列进行求和。

Group By 语句除了上面的简单用法之外，还有更高级的用法，常见的是 Grouping Sets、RollUp 和 Cube，它们在 OLAP 时比较常用。其中，RollUp 和 Cube 都是以 Grouping Sets 为基础实现的，因此，弄懂了 Grouping Sets，也就理解了 RollUp 和 Cube 。

本文首先简单介绍 Grouping Sets 的用法，然后以 Spark SQL 作为切入点，深入解析 Grouping Sets 的实现机制。

Spark SQL 是 Apache Spark 大数据处理框架的一个子模块，用来处理结构化信息。它可以将 SQL 语句翻译多个任务在 Spark 集群上执行， 允许用户直接通过 SQL 来处理数据 ，大大提升了易用性。

Grouping Sets 简介

Spark SQL 官方文档中 SQL Syntax 一节对 Grouping Sets 语句的描述如下：

Groups the rows for each grouping set specified after GROUPING SETS. （... 一些举例） This clause is a shorthand for a UNION ALLwhere each leg of the UNION ALL operator performs aggregation of each grouping set specified in the GROUPING SETS clause. （... 一些举例）

也即，Grouping Sets 语句的作用是指定几个 grouping set 作为 Group By 的分组规则，然后再将结果联合在一起。它的效果和， 先分别对这些 grouping set 进行 Group By 分组之后，再通过 Union All 将结果联合起来 ，是一样的。

比如，对于 dealer 表，Group By Grouping Sets ((city, car_model), (city), (car_model), ()) 和 Union All((Group By city, car_model), (Group By city), (Group By car_model), 全局聚合) 的效果是相同的：

先看 Grouping Sets 版的执行结果：

spark-sql> SELECT city, car_model, sum(quantity) AS sum FROM dealer 
         > GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ()) 
         > ORDER BY city, car_model;
+--------+------------+---+
|    city|   car_model|sum|
+--------+------------+---+
|    null|        null| 78|
|    null|Honda Accord| 33|
|    null|   Honda CRV| 10|
|    null| Honda Civic| 35|
|  Dublin|        null| 33|
|  Dublin|Honda Accord| 10|
|  Dublin|   Honda CRV|  3|
|  Dublin| Honda Civic| 20|
| Fremont|        null| 32|
| Fremont|Honda Accord| 15|
| Fremont|   Honda CRV|  7|
| Fremont| Honda Civic| 10|
|San Jose|        null| 13|
|San Jose|Honda Accord|  8|
|San Jose| Honda Civic|  5|
+--------+------------+---+

再看 Union All 版的执行结果：

spark-sql> (SELECT city, car_model, sum(quantity) AS sum FROM dealer GROUP BY city, car_model) UNION ALL 
         > (SELECT city, NULL as car_model, sum(quantity) AS sum FROM dealer GROUP BY city) UNION ALL 
         > (SELECT NULL as city, car_model, sum(quantity) AS sum FROM dealer GROUP BY car_model) UNION ALL 
         > (SELECT NULL as city, NULL as car_model, sum(quantity) AS sum FROM dealer) 
         > ORDER BY city, car_model;
+--------+------------+---+
|    city|   car_model|sum|
+--------+------------+---+
|    null|        null| 78|
|    null|Honda Accord| 33|
|    null|   Honda CRV| 10|
|    null| Honda Civic| 35|
|  Dublin|        null| 33|
|  Dublin|Honda Accord| 10|
|  Dublin|   Honda CRV|  3|
|  Dublin| Honda Civic| 20|
| Fremont|        null| 32|
| Fremont|Honda Accord| 15|
| Fremont|   Honda CRV|  7|
| Fremont| Honda Civic| 10|
|San Jose|        null| 13|
|San Jose|Honda Accord|  8|
|San Jose| Honda Civic|  5|
+--------+------------+---+

两版的查询结果完全一样。

Grouping Sets 的执行计划

从执行结果上看，Grouping Sets 版本和 Union All 版本的 SQL 是等价的，但 Grouping Sets 版本更加简洁。

那么，Grouping Sets 仅仅只是 Union All 的一个缩写，或者语法糖吗 ？

为了进一步探究 Grouping Sets 的底层实现是否和 Union All 是一致的，我们可以来看下两者的执行计划。

首先，我们通过 explain extended 来查看 Union All 版本的 Optimized Logical Plan :

spark-sql> explain extended (SELECT city, car_model, sum(quantity) AS sum FROM dealer GROUP BY city, car_model) UNION ALL (SELECT city, NULL as car_model, sum(quantity) AS sum FROM dealer GROUP BY city) UNION ALL (SELECT NULL as city, car_model, sum(quantity) AS sum FROM dealer GROUP BY car_model) UNION ALL (SELECT NULL as city, NULL as car_model, sum(quantity) AS sum FROM dealer) ORDER BY city, car_model;
== Parsed Logical Plan ==
...
== Analyzed Logical Plan ==
...
== Optimized Logical Plan ==
Sort [city#93 ASC NULLS FIRST, car_model#94 ASC NULLS FIRST], true
+- Union false, false
   :- Aggregate [city#93, car_model#94], [city#93, car_model#94, sum(quantity#95) AS sum#79L]
   :  +- Project [city#93, car_model#94, quantity#95]
   :     +- HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#92, city#93, car_model#94, quantity#95], Partition Cols: []]
   :- Aggregate [city#97], [city#97, null AS car_model#112, sum(quantity#99) AS sum#81L]
   :  +- Project [city#97, quantity#99]
   :     +- HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#96, city#97, car_model#98, quantity#99], Partition Cols: []]
   :- Aggregate [car_model#102], [null AS city#113, car_model#102, sum(quantity#103) AS sum#83L]
   :  +- Project [car_model#102, quantity#103]
   :     +- HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#100, city#101, car_model#102, quantity#103], Partition Cols: []]
   +- Aggregate [null AS city#114, null AS car_model#115, sum(quantity#107) AS sum#86L]
      +- Project [quantity#107]
         +- HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#104, city#105, car_model#106, quantity#107], Partition Cols: []]
== Physical Plan ==
...

从上述的 Optimized Logical Plan 可以清晰地看出 Union All 版本的执行逻辑：

执行每个子查询语句，计算得出查询结果。其中，每个查询语句的逻辑是这样的：
- 在 HiveTableRelation 节点对 dealer 表进行全表扫描。
- 在 Project 节点选出与查询语句结果相关的列，比如对于子查询语句 SELECT NULL as city, NULL as car_model, sum(quantity) AS sum FROM dealer，只需保留 quantity 列即可。
- 在 Aggregate 节点完成 quantity 列对聚合运算。在上述的 Plan 中，Aggregate 后面紧跟的就是用来分组的列，比如 Aggregate [city#902] 就表示根据 city 列来进行分组。
在 Union 节点完成对每个子查询结果的联合。
最后，在 Sort 节点完成对数据的排序，上述 Plan 中 Sort [city#93 ASC NULLS FIRST, car_model#94 ASC NULLS FIRST] 就表示根据 city 和 car_model 列进行升序排序。

接下来，我们通过 explain extended 来查看 Grouping Sets 版本的 Optimized Logical Plan：

spark-sql> explain extended SELECT city, car_model, sum(quantity) AS sum FROM dealer GROUP BY GROUPING SETS ((city, car_model), (city), (car_model), ()) ORDER BY city, car_model;
== Parsed Logical Plan ==
...
== Analyzed Logical Plan ==
...
== Optimized Logical Plan ==
Sort [city#138 ASC NULLS FIRST, car_model#139 ASC NULLS FIRST], true
+- Aggregate [city#138, car_model#139, spark_grouping_id#137L], [city#138, car_model#139, sum(quantity#133) AS sum#124L]
   +- Expand [[quantity#133, city#131, car_model#132, 0], [quantity#133, city#131, null, 1], [quantity#133, null, car_model#132, 2], [quantity#133, null, null, 3]], [quantity#133, city#138, car_model#139, spark_grouping_id#137L]
      +- Project [quantity#133, city#131, car_model#132]
         +- HiveTableRelation [`default`.`dealer`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#130, city#131, car_model#132, quantity#133], Partition Cols: []]
== Physical Plan ==
...

从 Optimized Logical Plan 来看，Grouping Sets 版本要简洁很多！具体的执行逻辑是这样的：

在 HiveTableRelation 节点对 dealer 表进行全表扫描。
在 Project 节点选出与查询语句结果相关的列。
接下来的 Expand 节点是关键，数据经过该节点后，多出了 spark_grouping_id 列。从 Plan 中可以看出来，Expand 节点包含了 Grouping Sets 里的各个 grouping set 信息，比如 [quantity#133, city#131, null, 1] 对应的就是 (city) 这一 grouping set。而且，每个 grouping set 对应的 spark_grouping_id 列的值都是固定的，比如 (city) 对应的 spark_grouping_id 为 1。
在 Aggregate 节点完成 quantity 列对聚合运算，其中分组的规则为 city, car_model, spark_grouping_id。注意，数据经过 Aggregate 节点后，spark_grouping_id 列被删除了！
最后，在 Sort 节点完成对数据的排序。

从 Optimized Logical Plan 来看，虽然 Union All 版本和 Grouping Sets 版本的效果一致，但它们的底层实现有着巨大的差别。

其中，Grouping Sets 版本的 Plan 中最关键的是 Expand 节点，目前，我们只知道数据经过它之后，多出了 spark_grouping_id 列。而且从最终结果来看，spark_grouping_id只是 Spark SQL 的内部实现细节，对用户并不体现。那么：

Expand 的实现逻辑是怎样的，为什么能达到 Union All 的效果？
Expand 节点的输出数据是怎样的 ？
spark_grouping_id 列的作用是什么 ？

通过 Physical Plan，我们发现 Expand 节点对应的算子名称也是 Expand:

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [city#138 ASC NULLS FIRST, car_model#139 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(city#138 ASC NULLS FIRST, car_model#139 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=422]
      +- HashAggregate(keys=[city#138, car_model#139, spark_grouping_id#137L], functions=[sum(quantity#133)], output=[city#138, car_model#139, sum#124L])
         +- Exchange hashpartitioning(city#138, car_model#139, spark_grouping_id#137L, 200), ENSURE_REQUIREMENTS, [plan_id=419]
            +- HashAggregate(keys=[city#138, car_model#139, spark_grouping_id#137L], functions=[partial_sum(quantity#133)], output=[city#138, car_model#139, spark_grouping_id#137L, sum#141L])
               +- Expand [[quantity#133, city#131, car_model#132, 0], [quantity#133, city#131, null, 1], [quantity#133, null, car_model#132, 2], [quantity#133, null, null, 3]], [quantity#133, city#138, car_model#139, spark_grouping_id#137L]
                  +- Scan hive default.dealer [quantity#133, city#131, car_model#132], HiveTableRelation [`default`.`dealer`, ..., Data Cols: [id#130, city#131, car_model#132, quantity#133], Partition Cols: []]

带着前面的几个问题，接下来我们深入 Spark SQL 的 Expand 算子源码寻找答案。

声明：本文内容及配图由入驻作者撰写或者入驻合作网站授权转载。文章观点仅代表作者本人，不代表电子发烧友网立场。文章及其配图仅供工程师学习之用，如有内容侵权或者其他违规问题，请联系本站处理。举报投诉

数据

数据

+关注

关注
8

文章
7002

浏览量
88942
SQL

SQL

+关注

关注
1

文章
762

浏览量
44117
函数

函数

+关注

关注
3

文章
4327

浏览量
62571

在Delphi中动态地使用SQL查询语句

程序中的查询语句。假定程序的窗体中有一个名为Query1的TQuery构件，在程序运行过程中需要改变它的SQL查询

发表于 05-10 11:10

SQL语句的两种嵌套方式

一般情况下，SQL语句是嵌套在宿主语言（如C语言）中的。有两种嵌套方式：1.调用层接口（CLI）：提供一些库，库中的函数和方法实现

发表于 05-23 08:51

区分SQL语句与主语言语句

为了区分SQL语句与主语言语句，所有SQL 语句必须加前缀EXEC SQL处理过程：含嵌入式

发表于 10-28 08:44

为什么要动态sql语句？

为什么要动态sql语句？因为动态sql语句能够提供一些比较友好的机制1、可以使得一些在编译过程中

发表于 12-20 06:00

数据库SQL语句电子教程

电子发烧友为您提供了数据库SQL语句电子教程，帮助您了解数据库 SQL语句，学习读懂数据库SQL语句

发表于 07-14 17:09 •0次下载

sql语句实例讲解

SQL是用来存取关系数据库的语言，具有查询、操纵、定义和控制关系型数据库的四方面功能。常见的关系数据库有Oracle，SQLServer，DB2，Sybase。开源不收费的有MYSQL，SQLLite等。今天我们主要以MYSQL为例子，讲解SQL常用的

发表于 11-17 12:39 •9137次阅读

如何使用navicat或PHPMySQLAdmin导入SQL语句

很多朋友问我们怎么导入SQL语句，这是新人最需要知道的东西，现制作图文教程，希望对新手有所帮助，顺便文末附SQL语句导入导出大全，高手可以提供更加详细的教程。

发表于 04-10 15:06 •2次下载

VS中SQL命令语句的详细资料免费下载

本文档的主要内容详细介绍的是微软VS（Microsoft Visual Studio）中SQL命令语句的详细资料免费下载

发表于 10-09 11:45 •8次下载

如何使用SQL修复语句程序说明

本文档的主要内容详细介绍的是如何使用SQL修复语句程序说明。

发表于 10-31 15:09 •5次下载

嵌入式SQL语句

为了区分SQL语句与主语言语句，所有SQL 语句必须加前缀EXEC SQL处理过程：含嵌入式

发表于 10-21 11:51 •4次下载

Group By高级用法Groupings Sets语句的功能和底层实现

SQL 中 Group By 语句大家都很熟悉，根据指定的规则对数据进行分组，常常和聚合函数一起使用。

发表于 07-04 10:26 •3219次阅读

Java中如何解析、格式化、生成SQL语句？

昨天在群里看到有小伙伴问，Java里如何解析SQL语句然后格式化SQL，是否有现成类库可以使用？

发表于 04-10 11:59 •972次阅读

深度剖析SQL中的Grouping Sets语句2

SQL 中 `Group By` 语句大家都很熟悉， **根据指定的规则对数据进行分组** ，常常和**聚合函数**一起使用。

发表于 05-10 17:44 •591次阅读

sql查询语句大全及实例

SQL（Structured Query Language）是一种专门用于数据库管理系统的标准交互式数据库查询语言。它被广泛应用于数据库管理和数据操作领域。在本文中，我们将为您详细介绍SQL查询语句

发表于 11-17 15:06 •1482次阅读

oracle执行sql查询语句的步骤是什么

Oracle数据库是一种常用的关系型数据库管理系统，具有强大的SQL查询功能。Oracle执行SQL查询语句的步骤包括编写SQL语句、解析

发表于 12-06 10:49 •953次阅读

搜索历史

深度剖析SQL中的Grouping Sets语句1

前言

Grouping Sets 简介

Grouping Sets 的执行计划

评论

在Delphi中动态地使用SQL查询语句

SQL语句的两种嵌套方式

区分SQL语句与主语言语句

为什么要动态sql语句？

数据库SQL语句电子教程

sql语句实例讲解

如何使用navicat或PHPMySQLAdmin导入SQL语句

VS中SQL命令语句的详细资料免费下载

如何使用SQL修复语句程序说明

嵌入式SQL语句

Group By高级用法Groupings Sets语句的功能和底层实现

Java中如何解析、格式化、生成SQL语句？

深度剖析SQL中的Grouping Sets语句2

sql查询语句大全及实例

oracle执行sql查询语句的步骤是什么