窗口函数
问题
什么是窗口函数?有哪些常用窗口函数?窗口函数和 GROUP BY 有什么区别?
答案
一、窗口函数的概念
窗口函数对一组(窗口)行执行计算,但 不折叠行(不同于 GROUP BY 会合并行)。
-- GROUP BY:每组只返回一行
SELECT department, AVG(salary) FROM employees GROUP BY department;
-- 窗口函数:每行都保留,同时附上聚合结果
SELECT name, department, salary,
AVG(salary) OVER (PARTITION BY department) AS dept_avg
FROM employees;
核心区别
GROUP BY 合并行,窗口函数 保留行并附加计算列。
二、窗口函数语法
函数名(参数) OVER (
[PARTITION BY 分组列] -- 分窗口(类似 GROUP BY,但不折叠)
[ORDER BY 排序列 [ASC|DESC]] -- 窗口内排序
[ROWS/RANGE frame_spec] -- 窗口帧范围
)
三、常用窗口函数
1. 排名函数
SELECT name, department, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn,
RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rnk,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS dense_rnk
FROM employees;
| 函数 | 并列处理 | 示例(分数:100, 100, 90) |
|---|---|---|
| ROW_NUMBER() | 不跳号,强制唯一 | 1, 2, 3 |
| RANK() | 并列同号,跳号 | 1, 1, 3 |
| DENSE_RANK() | 并列同号,不跳号 | 1, 1, 2 |
典型应用:取每组 Top N
-- 每个部门薪资最高的 3 个人
SELECT * FROM (
SELECT name, department, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
FROM employees
) t WHERE rn <= 3;
2. 偏移函数
SELECT date, revenue,
LAG(revenue, 1) OVER (ORDER BY date) AS prev_day, -- 前一行
LEAD(revenue, 1) OVER (ORDER BY date) AS next_day, -- 后一行
FIRST_VALUE(revenue) OVER (ORDER BY date) AS first_day, -- 第一行
LAST_VALUE(revenue) OVER (
ORDER BY date
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS last_day -- 最后一行
FROM daily_revenue;
| 函数 | 说明 |
|---|---|
LAG(col, n, default) | 当前行的前 n 行值 |
LEAD(col, n, default) | 当前行的后 n 行值 |
FIRST_VALUE(col) | 窗口第一行值 |
LAST_VALUE(col) | 窗口最后一行值 |
NTH_VALUE(col, n) | 窗口第 n 行值 |
典型应用:计算环比
SELECT date, revenue,
revenue - LAG(revenue, 1) OVER (ORDER BY date) AS day_diff,
ROUND(
(revenue - LAG(revenue, 1) OVER (ORDER BY date))::numeric
/ LAG(revenue, 1) OVER (ORDER BY date) * 100, 2
) AS growth_pct
FROM daily_revenue;
3. 聚合窗口函数
普通聚合函数(SUM, AVG, COUNT, MAX, MIN)都可以作为窗口函数使用:
SELECT date, revenue,
SUM(revenue) OVER (ORDER BY date) AS running_total, -- 累计求和
AVG(revenue) OVER (
ORDER BY date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
) AS moving_avg_7d, -- 7 天移动平均
COUNT(*) OVER (PARTITION BY department) AS dept_count -- 部门人数
FROM daily_revenue;
4. 分布函数
SELECT name, salary,
NTILE(4) OVER (ORDER BY salary) AS quartile, -- 分为 4 组
PERCENT_RANK() OVER (ORDER BY salary) AS pct_rank, -- 百分比排名
CUME_DIST() OVER (ORDER BY salary) AS cum_dist -- 累积分布
FROM employees;
四、窗口帧(Frame)
窗口帧定义了计算范围:
-- ROWS:按物理行偏移
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW -- 前 2 行到当前行
-- RANGE:按值范围
RANGE BETWEEN INTERVAL '7 days' PRECEDING AND CURRENT ROW -- 前 7 天到当前
-- 常用帧
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- 从头到当前(累计)
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING -- 全部行
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING -- 前后各 1 行
LAST_VALUE 的陷阱
默认窗口帧是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,所以 LAST_VALUE 默认返回的是当前行(不是真正的最后一行)。
需要显式指定 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING。
五、WINDOW 子句(命名窗口)
多个函数共用同一个窗口定义时,使用命名窗口避免重复:
SELECT name, department, salary,
ROW_NUMBER() OVER w AS rn,
RANK() OVER w AS rnk,
AVG(salary) OVER w AS dept_avg
FROM employees
WINDOW w AS (PARTITION BY department ORDER BY salary DESC);
常见面试问题
Q1: ROW_NUMBER、RANK、DENSE_RANK 的区别?
答案:
以成绩 [100, 100, 90, 80] 为例:
| 值 | ROW_NUMBER | RANK | DENSE_RANK |
|---|---|---|---|
| 100 | 1 | 1 | 1 |
| 100 | 2 | 1 | 1 |
| 90 | 3 | 3 | 2 |
| 80 | 4 | 4 | 3 |
- ROW_NUMBER:唯一序号,并列随机分先后
- RANK:并列同号,后面跳号
- DENSE_RANK:并列同号,后面不跳号
Q2: 窗口函数的执行顺序?
答案:
FROM → WHERE → GROUP BY → HAVING → SELECT(窗口函数在这里) → ORDER BY → LIMIT
窗口函数在 WHERE 和 GROUP BY 之后执行,所以不能在 WHERE 中使用窗口函数的结果。需要用子查询或 CTE 包一层。
Q3: 如何用窗口函数实现连续登录天数?
答案:
WITH login_groups AS (
SELECT user_id, login_date,
login_date - ROW_NUMBER() OVER (
PARTITION BY user_id ORDER BY login_date
)::int AS grp
FROM user_logins
)
SELECT user_id, MIN(login_date) AS start_date,
MAX(login_date) AS end_date,
COUNT(*) AS consecutive_days
FROM login_groups
GROUP BY user_id, grp
HAVING COUNT(*) >= 3; -- 连续登录 3 天以上
原理:连续日期减去行号后,差值相同的属于同一连续段。