12月 23 2014

PostgreSQL上选择MAX（COUNT）的数据出来

昨天晚上，和同事讨论了一个SQL的问题，是如何选择根据某字段分组，然后取出MAX COUNT(XX) 值的数据出来。例如数据是：

test=# select * from tgroup;
id | age | point
----+-----+-------
  1 |   1 |    11
  2 |   1 |    32
  3 |   2 |    32
  4 |   2 |    13
  5 |   2 |    33
  6 |   2 |    38
  7 |   3 |    38
  8 |   2 |    38
  9 |   2 |    38
 10 |   2 |    38
 11 |   2 |    38
(11 rows)
 
test=#

现在要选择出根据 POINT 分组里包含个数最大的值。大概意思是：MAX(COUNT(*)) FROM tgroup GROUP BY POINT;
所以，一开始，我们的SQL语句是（失败）：

1	select MAX(COUNT(*)) FROM tgroup group by point;

然而，我们得出的错误是：

1	aggregate function calls cannot be nested

原来，聚集函数是不能嵌套调用的。

然后又想到能不能使用子查询来完成。SQL如下（成功）：

1	select max(sum) from ( select count(*) as sum from tgroup group by point) as t;

再来一条，不使用子查询，而是使用ORDER BY 结合 LIMIT 来完成，SQL语句如下（成功):

1	select count(*) as sum from tgroup group by point order by sum desc limit 1;

最后使用PG里的CTE表达式最容易理解（成功）：

1	with cte as (select count(*) over (partition by point) from tgroup) select max(count) from cte;

那他们的性能是何呢？测试了一千一百五十多万的数据。每条SQL性能如何？

test=# select count(*) from tgroup;
  count  
----------
11534336
(1 row)
 
test=#

使用子查询的性能（POINT没有索引）

test=# select max(sum) from ( select count(*) as sum from tgroup group by point) as t;
   max  
---------
6291456
(1 row)
 
Time: 3055.716 ms

ORDER BY 结合 LIMIT（POINT没有索引）

test=# select count(*) as sum from tgroup group by point order by sum desc limit 1;
   sum  
---------
6291456
(1 row)
 
Time: 3047.152 ms
test=#

使用CTE表达式（POINT没有索引）

test=#  with cte as (select count(*) over (partition by point) from tgroup) select max(count) from cte;
   max  
---------
6291456
(1 row)
 
Time: 25675.005 ms

后面为POINT添加索引，速度只有CTE表达式的加快了一倍（添加索引其实也不太科学，POINT的数据分布不均匀，重复的数据比较多，因为是通过 insert into select 的方式来生成大量数据的，只是想看一下添加索引后的效果）:

test=#  with cte as (select count(*) over (partition by point) from tgroup) select max(count) from cte;
   max  
---------
6291456
(1 row)
 
Time: 11735.775 ms
test=#

其他两种方式，并没有什么变化。看执行计划，其他两种依然是使用Seq Scan的方式，而添加了索引后，CTE的方式使用了 CTE Scan + IndexOnlyScan的方式。

test=# explain  with cte as (select count(*) over (partition by point) from tgroup) select max(count) from cte;
                                                QUERY PLAN                                               
----------------------------------------------------------------------------------------------------------
 Aggregate  (cost=955503.54..955503.55 rows=1 width=8)
   CTE cte
     ->  WindowAgg  (cost=0.43..695980.98 rows=11534336 width=4)
           ->  Index Only Scan using tgroup_point on tgroup  (cost=0.43..522965.94 rows=11534336 width=4)
   ->  CTE Scan on cte  (cost=0.00..230686.72 rows=11534336 width=8)
(5 rows)
 
Time: 0.909 ms

总结

看来没有 WHERE 或其他条件过滤数据而且数据量非常大的情况下，不适宜使用CTE表达式，因为它本质是一个一次性视图，生成一张这么大的视图，性能也快不到哪里去(可能使用物化视图会好点,不过没有测试过）。在大量数据情况下，还是使用普通的全表扫描比使用生成CTE再全表扫描来得快。这也应验了之前翻译篇文章的强调：CTE表达式的作用，真的不是为了加快查询速度，而仅仅是为了方便。冏 ~~。

PostgreSQL DBA

Zhiyong Yang

PostgreSQL上选择MAX（COUNT）的数据出来

使用子查询的性能（POINT没有索引）

ORDER BY 结合 LIMIT（POINT没有索引）

使用CTE表达式（POINT没有索引）

总结