使用PostgreSQL无限递归 SELECT 评论系统

假设有个评论系统,要求支持无限层级的回复,就像一棵树那样

1
2
3
4
5
文章
/ \
/ \
评论1 评论2
....

注意可以有任意个子树以及做任意个叶子

大意的表结构

1
2
3
4
5
6
7
8
create table comments (
comment_id serial primary key,
parent_id bigint,
bug_id bigint not null,
author varchar(20) not null,
comment text not null,
foreign key (parent_id) references comments(comment_id)
);
1
2
3
4
5
6
7
8
9
10
11
test=# select * from comments;
comment_id | parent_id | bug_id | author | comment
------------+-----------+--------+--------+---------------------
1 | | 1 | Fran | 这个bug的成因是什么
2 | 1 | 1 | Ollie | 我觉得是一个空指针
3 | 2 | 1 | Fran | 不,我查过了
4 | 1 | 1 | Kukla | 我们需要查无效输入
5 | 4 | 1 | Ollie | 是的,那是个问题
6 | 4 | 1 | Fran | 好,查一下吧
7 | 6 | 1 | Kukla | 解决了
(7 rows)

SQL语句

利用递归查询,可以查某篇文章评论组成的树结构。其中 depth是树的深度,显示的时候,按已经排序好的层次及相应的父结点显示出来就可以了。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
test=# with RECURSIVE commenttree (comment_id, bug_id, parent_id, author, comment, depth) as (select comment_id, bug_id, parent_id, author, comment , 0 as depth from comments where parent_id is null union all select c.comment_id, c.bug_id, c.parent_id, c.author, c.comment, ct.depth+1 as depth from commenttree as ct join comments as c on (ct.comment_id = c.parent_id)) select * from commenttree where bug_id = 1;;
comment_id | bug_id | parent_id | author | comment | depth
------------+--------+-----------+--------+---------------------+-------
1 | 1 | | Fran | 这个bug的成因是什么 | 0
2 | 1 | 1 | Ollie | 我觉得是一个空指针 | 1
4 | 1 | 1 | Kukla | 我们需要查无效输入 | 1
3 | 1 | 2 | Fran | 不,我查过了 | 2
5 | 1 | 4 | Ollie | 是的,那是个问题 | 2
6 | 1 | 4 | Fran | 好,查一下吧 | 2
7 | 1 | 6 | Kukla | 解决了 | 3
(7 rows)
test=#

注意PostgreSQL里,必须加 RECURSIVE 才能支持递归。

内容来源资料:
[1]《SQL反模式》

PostgreSQL使用 postgres_fdw 进行跨库操作

说明

postgres_fdw 模块提供了 远程-数据 包装器 postgres_fdw,它能够用于访问一些保存在外部 PostgreSQL 服务器的数据。

该模块提供的功能很大程度上与以前的dblink模块重叠。但是postgres_fdw 提供了更加透明和符合标准的语法来访问远程表,并且能够在某些情况下有更好的性能。

安装extension

cd contrib/postgres_fdw
make install

安装到数据库

1
2
3
test=# CREATE EXTENSION postgres_fdw;
CREATE EXTENSION
test=#

检查是否安装成功

1
2
3
4
5
6
7
8
test=# \dx postgres_fdw;
List of installed extensions
Name | Version | Schema | Description
--------------+---------+--------+----------------------------------------------------
postgres_fdw | 1.0 | public | foreign-data wrapper for remote PostgreSQL servers
(1 row)
test=#

创建并查看一个远程服务器

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
test=# \dx postgres_fdw;
List of installed extensions
Name | Version | Schema | Description
--------------+---------+--------+----------------------------------------------------
postgres_fdw | 1.0 | public | foreign-data wrapper for remote PostgreSQL servers
(1 row)
test=# CREATE SERVER postgres_server FOREIGN DATA WRAPPER postgres_fdw OPTIONS (host '10.0.0.10', port '5432', dbname 'test');
CREATE SERVER
test=# \des;
List of foreign servers
Name | Owner | Foreign-data wrapper
-----------------+----------+----------------------
postgres_server | postgres | postgres_fdw
(1 row)
test=#

创建一个远程映射用户

1
CREATE USER MAPPING FOR PUBLIC SERVER postgres_server OPTIONS (user 'postgres', password 'yang');

user,password 是远程数据库上的用户名和密码

创建一个远程映射表

1
2
3
test=# CREATE FOREIGN TABLE tcost_foreign (path integer, cost numeric) SERVER postgres_server OPTIONS (schema_name 'public', table_name 'tcost');
CREATE FOREIGN TABLE
test=#

操作

像它在本地表一样操作就可以了。

资料:
[1] 官网

PostgreSQL窗口函数中 ROWS 和 RANGE 模式的区别

示例表数据如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
test=# select * from tcost ;
path | cost
------+-------
111 | 23.3
111 | 33.4
111 | 3.4
222 | 3.4
222 | 33.4
222 | 333.4
32 | 3.4
32 | 0.4
32 | 0.04
(9 rows)
test=#

ROWS

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
test=# select path, cost, sum(cost) over (order by cost desc) as sum_cost, sum(cost) over (order by cost desc rows between current row and 2 following ) as row from tcost;
path | cost | sum_cost | row
------+-------+----------+-------
222 | 333.4 | 333.4 | 400.2
111 | 33.4 | 400.2 | 90.1
222 | 33.4 | 400.2 | 60.1
111 | 23.3 | 423.5 | 30.1
111 | 3.4 | 433.7 | 10.2
32 | 3.4 | 433.7 | 7.2
222 | 3.4 | 433.7 | 3.84
32 | 0.4 | 434.1 | 0.44
32 | 0.04 | 434.14 | 0.04
(9 rows)
test=#

可以看到 max2 的值都是由 current row (当前行) and 2 following(紧接着2行)的sum()结果出来的。

400.2 = 333.4 + 33.4 + 33.4 得出的。
90.1 = 33.4 + 33.4 + 23.3 得出的。

注意,上面那条SQL没有写 partition by ,那默认情况下就是以整个表来表示窗口化的,即只有一个窗口。现在试着,添加上partition by 语句的结果看看:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
test=# select path, cost, sum(cost) over (order by cost desc) as sum_cost, sum(cost) over (partition by path order by cost desc rows between current row and 2 following ) as row from tcost;
path | cost | sum_cost | row
------+-------+----------+-------
32 | 3.4 | 433.7 | 3.84
32 | 0.4 | 434.1 | 0.44
32 | 0.04 | 434.14 | 0.04
111 | 33.4 | 400.2 | 60.1
111 | 23.3 | 423.5 | 26.7
111 | 3.4 | 433.7 | 3.4
222 | 333.4 | 333.4 | 370.2
222 | 33.4 | 400.2 | 36.8
222 | 3.4 | 433.7 | 3.4
(9 rows)
test=#

当 partition by path时,可以看到,这些ROWS 模式,都是在当前所在的窗口来进行的,并不会跨窗口来进行。
所以,在这里强调一下,ROWS表示的是物理行。

RANGE

看看,当是RANGE时的结果

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
test=# select path, cost, sum(cost) over (order by cost desc) as sum_cost, sum(cost) over (order by cost desc range between current row and UNBOUNDED following ) as range from tcost;
path | cost | sum_cost | range
------+-------+----------+--------
222 | 333.4 | 333.4 | 434.14
111 | 33.4 | 400.2 | 100.74
222 | 33.4 | 400.2 | 100.74
111 | 23.3 | 423.5 | 33.94
111 | 3.4 | 433.7 | 10.64
32 | 3.4 | 433.7 | 10.64
222 | 3.4 | 433.7 | 10.64
32 | 0.4 | 434.1 | 0.44
32 | 0.04 | 434.14 | 0.04
(9 rows)
test=#

可以看到,RANGE时,相同数据的会被合并到一起再来进行计算,也表明,列中具有相同值的range的值也是相同的,并且结果是它们合并后进行计算后的结果。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
test=# select path, cost, sum(cost) over (order by cost desc) as sum_cost, sum(cost) over (order by cost desc range between current row and UNBOUNDED following ) as range,sum(cost) over (order by cost desc rows between current row and UNBOUNDED following ) as row from tcost;
path | cost | sum_cost | range | row
------+-------+----------+--------+--------
222 | 333.4 | 333.4 | 434.14 | 434.14
111 | 33.4 | 400.2 | 100.74 | 100.74
222 | 33.4 | 400.2 | 100.74 | 67.34
111 | 23.3 | 423.5 | 33.94 | 33.94
111 | 3.4 | 433.7 | 10.64 | 10.64
32 | 3.4 | 433.7 | 10.64 | 7.24
222 | 3.4 | 433.7 | 10.64 | 3.84
32 | 0.4 | 434.1 | 0.44 | 0.44
32 | 0.04 | 434.14 | 0.04 | 0.04
(9 rows)
test=#

这里可以非常明显看到RANGE和ROWS的区别。
ROWS:是按物理行来进行区分的
RANGE:是按数值进行逻辑区分的

RANGE 和 ROWS 在PostgreSQL中的语法

1
2
[ RANGE | ROWS ] frame_start
[ RANGE | ROWS ] BETWEEN frame_start AND frame_end

frame_start 和 frame_end可以是:

1
2
3
4
5
UNBOUNDED PRECEDING
value PRECEDING
CURRENT ROW
value FOLLOWING
UNBOUNDED FOLLOWING

特别注意:value PRECEDING和value FOLLOWING 当前只允许ROWS模式。
RANGE模式后面只能接 UNBOUNDED FOLLOWING。

默认的框架选项是RANGE UNBOUNDED PRECEDING,该选项与 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW相同。有ORDER BY, 它设置框架从分区的开始一直到与当前行相同的最后一行。没有ORDER BY, 那么就是当前分区的所有行都包含在框架中,因为所有行都会成为当前行的相同行。

RANGE时,请注意有没有 order by 的区别

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
test=# select path, cost, sum(cost) over (order by cost desc) as sum_cost, sum(cost) over (range between current row and UNBOUNDED following ) as no_order_by_range,sum(cost) over (order by cost desc range between current row and UNBOUNDED following ) as has_order_by_range from tcost;
path | cost | sum_cost | no_order_by_range | has_order_by_range
------+-------+----------+-------------------+--------------------
222 | 333.4 | 333.4 | 434.14 | 434.14
111 | 33.4 | 400.2 | 434.14 | 100.74
222 | 33.4 | 400.2 | 434.14 | 100.74
111 | 23.3 | 423.5 | 434.14 | 33.94
111 | 3.4 | 433.7 | 434.14 | 10.64
32 | 3.4 | 433.7 | 434.14 | 10.64
222 | 3.4 | 433.7 | 434.14 | 10.64
32 | 0.4 | 434.1 | 434.14 | 0.44
32 | 0.04 | 434.14 | 434.14 | 0.04
(9 rows)
test=#
没有ORDER BY, 那么就是当前分区的所有行都包含在框架中,因为所有行都会成为当前行的相同行。

ROWS时,请注意有没有 order by 的区别

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
test=# select path, cost, sum(cost) over (order by cost desc) as sum_cost, sum(cost) over (rows between current row and UNBOUNDED following ) as no_order_by_rows,sum(cost) over (order by cost desc rows between current row and UNBOUNDED following ) as has_order_by_rows from tcost;
path | cost | sum_cost | no_order_by_rows | has_order_by_rows
------+-------+----------+------------------+-------------------
222 | 333.4 | 333.4 | 434.14 | 434.14
111 | 33.4 | 400.2 | 100.74 | 100.74
222 | 33.4 | 400.2 | 67.34 | 67.34
111 | 23.3 | 423.5 | 33.94 | 33.94
111 | 3.4 | 433.7 | 10.64 | 10.64
32 | 3.4 | 433.7 | 7.24 | 7.24
222 | 3.4 | 433.7 | 3.84 | 3.84
32 | 0.4 | 434.1 | 0.44 | 0.44
32 | 0.04 | 434.14 | 0.04 | 0.04
(9 rows)
test=#

有没有ORDER BY,都是一样的,因为ROWS是按物理分行的,而不是按逻辑分行的。

总结

1
2
3
4
5
6
7
ROWS:是按物理行来进行窗口级别里再次进行范围选择的。
RANGE:是按逻辑行来进行窗口级别里再次进行范围选择的。RANGE时,相同行会被合并成同一条数据再进行计算,相同行窗口计算时的结果也是相同的。
是否是相同行,是根据ORDER BY排序时的结果决定的。
ORDER BY时:同行是说在ORDER BY排序时不唯一的行。【即具有相同数值的行】
不同行是说ORDER BY排序时具有不同的数值的行。
没有ORDER BY:那么就是当前分区的所有行都包含在框架中,因为所有行都会成为当前行的相同行。【特别要注意最后一句的意思】