窗口函数
问题
什么是 SQL 窗口函数?ROW_NUMBER、RANK、DENSE_RANK 有什么区别?LAG 和 LEAD 怎么用?
答案
窗口函数(Window Function) 是 SQL 中最强大的分析工具。它能在不减少行数的情况下,对每一行计算出一个和"它周围的行"相关的值——比如排名、累计求和、与上一行的差值等。
普通聚合函数(如 SUM、AVG)会把多行"压缩"成一行,窗口函数则保留原始每一行,在旁边"多算一列"。
-- 对比:普通聚合 vs 窗口函数
-- 普通聚合:3 个部门 → 3 行
SELECT department, SUM(salary) FROM employees GROUP BY department;
-- 窗口函数:5 个员工 → 仍然 5 行,但每行多了一列部门薪资总和
SELECT name, department, salary,
SUM(salary) OVER (PARTITION BY department) AS dept_total
FROM employees;
窗口函数的语法
函数名() OVER (
[PARTITION BY 分区列] -- 可选:按什么分组(类似 GROUP BY)
[ORDER BY 排序列 [ASC|DESC]] -- 可选:窗口内按什么排序
[ROWS/RANGE 窗口范围] -- 可选:定义窗口的行范围
)
- PARTITION BY:把数据分成多个"窗口",每个窗口内独立计算
- ORDER BY:窗口内的行的排序顺序
- ROWS/RANGE:定义计算范围(从哪一行到哪一行)
示例数据
| id | name | department | salary |
|---|---|---|---|
| 1 | 张三 | 技术部 | 15000 |
| 2 | 李四 | 市场部 | 12000 |
| 3 | 王五 | 技术部 | 18000 |
| 4 | 赵六 | 市场部 | 12000 |
| 5 | 孙七 | 技术部 | 20000 |
| 6 | 周八 | 人事部 | 13000 |
排名函数
ROW_NUMBER / RANK / DENSE_RANK
这三个函数都用来排名,区别在于处理并列的方式:
SELECT name, department, salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
RANK() OVER (ORDER BY salary DESC) AS rank_val,
DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank_val
FROM employees;
结果:
| name | salary | row_num | rank_val | dense_rank_val |
|---|---|---|---|---|
| 孙七 | 20000 | 1 | 1 | 1 |
| 王五 | 18000 | 2 | 2 | 2 |
| 张三 | 15000 | 3 | 3 | 3 |
| 周八 | 13000 | 4 | 4 | 4 |
| 李四 | 12000 | 5 | 5 | 5 |
| 赵六 | 12000 | 6 | 5 | 5 |
注意李四和赵六薪资相同(并列),三种函数的处理不同:
| 函数 | 并列处理 | 下一名 | 助记 |
|---|---|---|---|
ROW_NUMBER | 强制不同编号(1,2,3,4,5,6) | 连续 | "行号"——每行给个唯一编号 |
RANK | 相同排名(1,2,3,4,5,5) | 跳过(下一个是 7) | "比赛排名"——并列第5,没有第6 |
DENSE_RANK | 相同排名(1,2,3,4,5,5) | 不跳过(下一个是 6) | "密集排名"——并列第5,第6继续 |
分区排名
配合 PARTITION BY,在每个分区内独立排名:
-- 每个部门内按薪资排名
SELECT name, department, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;
| name | department | salary | dept_rank |
|---|---|---|---|
| 孙七 | 技术部 | 20000 | 1 |
| 王五 | 技术部 | 18000 | 2 |
| 张三 | 技术部 | 15000 | 3 |
| 李四 | 市场部 | 12000 | 1 |
| 赵六 | 市场部 | 12000 | 2 |
| 周八 | 人事部 | 13000 | 1 |
"找每个 X 中 Y 最大/第 N 名的记录"——套路:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY X ORDER BY Y DESC) AS rn
FROM table
) t
WHERE rn = 1; -- 第1名;rn = 2 就是第2名
偏移函数:LAG 和 LEAD
LAG(column, n):获取当前行前面第 n 行的值(默认 n=1)LEAD(column, n):获取当前行后面第 n 行的值(默认 n=1)
这两个函数是计算环比、同比的核心工具。
-- 每月销售额与上月对比
SELECT
month,
revenue,
LAG(revenue, 1) OVER (ORDER BY month) AS prev_month_revenue,
revenue - LAG(revenue, 1) OVER (ORDER BY month) AS month_change
FROM monthly_sales;
结果:
| month | revenue | prev_month_revenue | month_change |
|---|---|---|---|
| 2024-01 | 100000 | NULL | NULL |
| 2024-02 | 120000 | 100000 | 20000 |
| 2024-03 | 110000 | 120000 | -10000 |
| 2024-04 | 150000 | 110000 | 40000 |
-- 计算环比增长率
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
ROUND(
(revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0
/ LAG(revenue) OVER (ORDER BY month),
2) AS growth_rate_pct
FROM monthly_sales;
聚合窗口函数
普通聚合函数也可以作为窗口函数使用,配合 OVER() 实现累计计算和移动平均。
累计求和
SELECT
month,
revenue,
-- 从第一行到当前行的累计和
SUM(revenue) OVER (ORDER BY month) AS running_total
FROM monthly_sales;
| month | revenue | running_total |
|---|---|---|
| 2024-01 | 100000 | 100000 |
| 2024-02 | 120000 | 220000 |
| 2024-03 | 110000 | 330000 |
| 2024-04 | 150000 | 480000 |
移动平均
SELECT
month,
revenue,
-- 最近 3 个月的移动平均(当前行 + 前 2 行)
AVG(revenue) OVER (
ORDER BY month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_3
FROM monthly_sales;
窗口范围(Frame)说明
-- ROWS BETWEEN 定义窗口的行范围
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- 从第一行到当前行(默认)
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW -- 前 2 行到当前行
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING -- 前 1 行到后 1 行
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING -- 当前行到最后一行
NTILE 分桶
将数据均分为 N 组:
-- 将员工薪资分为 4 个等级(四分位)
SELECT name, salary,
NTILE(4) OVER (ORDER BY salary) AS quartile
FROM employees;
| name | salary | quartile |
|---|---|---|
| 李四 | 12000 | 1 |
| 赵六 | 12000 | 1 |
| 周八 | 13000 | 2 |
| 张三 | 15000 | 2 |
| 王五 | 18000 | 3 |
| 孙七 | 20000 | 4 |
FIRST_VALUE / LAST_VALUE
获取窗口中第一个/最后一个值:
-- 每个部门薪资最高的人是谁
SELECT name, department, salary,
FIRST_VALUE(name) OVER (PARTITION BY department ORDER BY salary DESC) AS top_earner
FROM employees;
LAST_VALUE 默认的窗口范围是 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,所以"最后一个值"其实就是当前行自己!
要真正获取窗口中的最后一个值,需要改窗口范围:
SELECT name, department, salary,
LAST_VALUE(name) OVER (
PARTITION BY department ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING -- 必须指定为全窗口
) AS lowest_earner
FROM employees;
NTH_VALUE
获取窗口中第 N 行的值:
-- 每个部门薪资第 2 高的人的薪资
SELECT name, department, salary,
NTH_VALUE(salary, 2) OVER (
PARTITION BY department ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS second_highest_salary
FROM employees;
PERCENT_RANK 和 CUME_DIST
这两个函数用于计算相对排名位置,常用于百分位分析。
SELECT name, salary,
-- PERCENT_RANK: (rank - 1) / (total - 1),值域 [0, 1]
ROUND(PERCENT_RANK() OVER (ORDER BY salary) * 100, 2) AS percent_rank,
-- CUME_DIST: 小于等于当前值的行数 / 总行数,值域 (0, 1]
ROUND(CUME_DIST() OVER (ORDER BY salary) * 100, 2) AS cume_dist
FROM employees;
| name | salary | percent_rank | cume_dist |
|---|---|---|---|
| 李四 | 12000 | 0.00 | 33.33 |
| 赵六 | 12000 | 0.00 | 33.33 |
| 周八 | 13000 | 40.00 | 50.00 |
| 张三 | 15000 | 60.00 | 66.67 |
| 王五 | 18000 | 80.00 | 83.33 |
| 孙七 | 20000 | 100.00 | 100.00 |
PERCENT_RANK = 80%→ 此人薪资超过了 80% 的员工CUME_DIST = 83.33%→ 83.33% 的员工薪资 ≤ 此人
ROWS vs RANGE
窗口 Frame 有两种模式,理解差异很重要:
| 模式 | 含义 | 处理相同值 |
|---|---|---|
ROWS | 按物理行计算 | 相同值的行各自独立 |
RANGE | 按逻辑值范围计算 | 相同值的行视为一组 |
-- 假设数据:salary = [12000, 12000, 13000, 15000]
-- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- 第1行(12000):sum = 12000
-- 第2行(12000):sum = 24000(包含前2行)
-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- 第1行(12000):sum = 24000(包含所有 salary ≤ 12000 的行)
-- 第2行(12000):sum = 24000(同上,因为值相同,范围相同)
- 有
ORDER BY时,默认是RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW - 没有
ORDER BY时,默认是ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING(整个分区)
数据分析实战场景
场景一:留存分析
-- 计算每个注册日期的次日留存率和 7 日留存率
WITH new_users AS (
SELECT user_id, DATE(MIN(login_date)) AS register_date
FROM login_log
GROUP BY user_id
),
retention AS (
SELECT
n.register_date,
COUNT(DISTINCT n.user_id) AS new_user_count,
COUNT(DISTINCT CASE
WHEN DATEDIFF(l.login_date, n.register_date) = 1 THEN n.user_id
END) AS day1_retained,
COUNT(DISTINCT CASE
WHEN DATEDIFF(l.login_date, n.register_date) = 7 THEN n.user_id
END) AS day7_retained
FROM new_users n
LEFT JOIN login_log l ON n.user_id = l.user_id
GROUP BY n.register_date
)
SELECT *,
ROUND(day1_retained * 100.0 / new_user_count, 2) AS day1_rate,
ROUND(day7_retained * 100.0 / new_user_count, 2) AS day7_rate,
-- 用窗口函数看留存率趋势
AVG(ROUND(day1_retained * 100.0 / new_user_count, 2))
OVER (ORDER BY register_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS day1_rate_7d_avg
FROM retention
ORDER BY register_date;
场景二:同比环比分析
-- 每月销售额 + 环比 + 同比
WITH monthly AS (
SELECT
DATE_FORMAT(order_date, '%Y-%m') AS month,
SUM(amount) AS revenue
FROM orders
GROUP BY DATE_FORMAT(order_date, '%Y-%m')
)
SELECT
month,
revenue,
-- 环比(与上月对比)
LAG(revenue, 1) OVER (ORDER BY month) AS prev_month,
ROUND((revenue - LAG(revenue, 1) OVER (ORDER BY month)) * 100.0 /
NULLIF(LAG(revenue, 1) OVER (ORDER BY month), 0), 2) AS mom_pct,
-- 同比(与去年同月对比)
LAG(revenue, 12) OVER (ORDER BY month) AS prev_year_same_month,
ROUND((revenue - LAG(revenue, 12) OVER (ORDER BY month)) * 100.0 /
NULLIF(LAG(revenue, 12) OVER (ORDER BY month), 0), 2) AS yoy_pct,
-- 累计年度营收
SUM(revenue) OVER (
PARTITION BY LEFT(month, 4) ORDER BY month
) AS ytd_revenue
FROM monthly
ORDER BY month;
场景三:连续 N 天问题
-- 查出连续登录 3 天及以上的用户和他们的连续登录段
WITH daily_login AS (
-- 去重:每个用户每天只保留一条
SELECT DISTINCT user_id, DATE(login_date) AS login_date FROM login_log
),
ranked AS (
SELECT *,
-- 给每个用户的登录日期编号
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
FROM daily_login
),
grouped AS (
SELECT *,
-- 连续日期减去行号 = 相同的值(分组标识)
DATE_SUB(login_date, INTERVAL rn DAY) AS grp
FROM ranked
)
SELECT
user_id,
MIN(login_date) AS streak_start,
MAX(login_date) AS streak_end,
COUNT(*) AS consecutive_days
FROM grouped
GROUP BY user_id, grp
HAVING COUNT(*) >= 3
ORDER BY user_id, streak_start;
对于连续日期/连续数字类问题,核心思路是:
- 用
ROW_NUMBER()给记录编号 日期(或数字)- 行号 = 分组标识- 按分组标识
GROUP BY,COUNT(*)就是连续长度
这个技巧叫做 "差值分组法",是数据分析面试的高频题型。
场景四:移动平均与趋势分析
-- 7 日移动平均 DAU
SELECT
login_date,
dau,
-- 7 日简单移动平均
ROUND(AVG(dau) OVER (
ORDER BY login_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
), 0) AS dau_7d_avg,
-- 与 7 日平均的偏差
dau - ROUND(AVG(dau) OVER (
ORDER BY login_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
), 0) AS deviation
FROM (
SELECT DATE(login_date) AS login_date, COUNT(DISTINCT user_id) AS dau
FROM login_log
GROUP BY DATE(login_date)
) daily_dau
ORDER BY login_date;
场景五:分组占比分析
-- 每个部门的员工薪资占部门总薪资的比例
SELECT
name, department, salary,
SUM(salary) OVER (PARTITION BY department) AS dept_total,
ROUND(salary * 100.0 / SUM(salary) OVER (PARTITION BY department), 2) AS salary_pct,
-- 薪资从高到低的累计占比(帕累托分析)
ROUND(
SUM(salary) OVER (PARTITION BY department ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) * 100.0 /
SUM(salary) OVER (PARTITION BY department), 2
) AS cumulative_pct
FROM employees
ORDER BY department, salary DESC;
常见面试问题
Q1: 写一条 SQL:查出每个部门薪资前 3 名的员工
答案:
SELECT * FROM (
SELECT
name, department, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
FROM employees
) t
WHERE rn <= 3;
如果允许并列,用 DENSE_RANK 替换 ROW_NUMBER。
Q2: ROW_NUMBER、RANK、DENSE_RANK 的区别?
答案:
处理薪资 [100, 90, 90, 80] 的排名结果:
| 函数 | 排名结果 | 特点 |
|---|---|---|
| ROW_NUMBER | 1, 2, 3, 4 | 强制唯一,不并列 |
| RANK | 1, 2, 2, 4 | 并列后跳过 |
| DENSE_RANK | 1, 2, 2, 3 | 并列后不跳过 |
选择建议:
- 取 Top N 且不要重复 →
ROW_NUMBER - 取 Top N 允许并列且要跳号 →
RANK - 取 Top N 允许并列且不跳号 →
DENSE_RANK
Q3: 计算每个用户连续登录的最大天数
答案:
经典的"连续"问题,核心思路:用 ROW_NUMBER 给日期编号,如果日期是连续的,那么 日期 - 行号 是一个常数。
WITH login_ranked AS (
SELECT DISTINCT user_id, login_date,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
FROM login_log
),
login_groups AS (
SELECT user_id, login_date,
-- 连续日期的 login_date - rn 相同(构成一个组)
DATE_SUB(login_date, INTERVAL rn DAY) AS grp
FROM login_ranked
)
SELECT user_id, MAX(consecutive_days) AS max_consecutive_days
FROM (
SELECT user_id, grp, COUNT(*) AS consecutive_days
FROM login_groups
GROUP BY user_id, grp
) t
GROUP BY user_id;
Q4: 计算每月销售额的环比增长率
答案:
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
CASE
WHEN LAG(revenue) OVER (ORDER BY month) IS NULL THEN NULL
WHEN LAG(revenue) OVER (ORDER BY month) = 0 THEN NULL
ELSE ROUND(
(revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0
/ LAG(revenue) OVER (ORDER BY month), 2
)
END AS growth_rate_pct
FROM monthly_sales;
Q5: 窗口函数和 GROUP BY 有什么区别?
答案:
| 对比 | GROUP BY + 聚合 | 窗口函数 |
|---|---|---|
| 结果行数 | 减少(每组一行) | 保持不变(每行一行) |
| 原始列 | 只能显示分组列和聚合结果 | 可以显示所有原始列 |
| 排名功能 | 不支持 | 支持(ROW_NUMBER 等) |
| 偏移访问 | 不支持 | 支持(LAG/LEAD) |
| 累计计算 | 需要自连接 | 原生支持 |
-- GROUP BY:3 行(每个部门一行)
SELECT department, SUM(salary) FROM employees GROUP BY department;
-- 窗口函数:6 行(每个员工一行,但多一列部门薪资总和)
SELECT name, department, salary,
SUM(salary) OVER (PARTITION BY department) AS dept_total
FROM employees;
Q6: 面试题——删除表中重复数据,只保留 id 最小的那条
答案:
-- 先找出需要删除的行(每组中 rn > 1 的行)
DELETE FROM users WHERE id IN (
SELECT id FROM (
SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
FROM users
) t
WHERE rn > 1
);
Q7: 窗口函数能和 WHERE / HAVING 一起用吗?
答案:
窗口函数在 SQL 执行顺序中位于 SELECT 阶段,在 WHERE 和 HAVING 之后执行,所以:
-- ❌ 错误:不能在 WHERE 中使用窗口函数
SELECT *, ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn
FROM employees
WHERE rn <= 3; -- 报错!rn 在 WHERE 执行时还不存在
-- ✅ 正确:用子查询包一层
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn
FROM employees
) t
WHERE rn <= 3;
SQL 执行顺序:FROM → WHERE → GROUP BY → HAVING → SELECT(窗口函数) → ORDER BY → LIMIT
Q8: 面试题——计算每个产品的累计销售额
答案:
-- 按日期累计每个产品的销售额
SELECT
product_id,
order_date,
amount,
SUM(amount) OVER (
PARTITION BY product_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_amount
FROM sales
ORDER BY product_id, order_date;
Q9: 面试题——求中位数
答案:
-- 方法 1:用 PERCENT_RANK(PostgreSQL)
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees;
-- 方法 2:用 ROW_NUMBER(通用)
SELECT AVG(salary) AS median_salary FROM (
SELECT salary,
ROW_NUMBER() OVER (ORDER BY salary) AS rn,
COUNT(*) OVER () AS total
FROM employees
) t
WHERE rn IN (FLOOR((total + 1) / 2.0), CEIL((total + 1) / 2.0));
-- 总数为奇数取中间那个,偶数取中间两个的平均
Q10: 面试题——找出薪资在本部门排名前 20% 的员工
答案:
-- 方法 1:NTILE 分 5 桶,取第 5 桶(前 20%)
SELECT * FROM (
SELECT *, NTILE(5) OVER (PARTITION BY department ORDER BY salary DESC) AS quintile
FROM employees
) t
WHERE quintile = 1;
-- 方法 2:PERCENT_RANK
SELECT * FROM (
SELECT *, PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS pct_rank
FROM employees
) t
WHERE pct_rank <= 0.2;
窗口函数速查表
| 函数 | 用途 | 需要 ORDER BY |
|---|---|---|
ROW_NUMBER() | 行号(唯一) | ✅ |
RANK() | 排名(跳号) | ✅ |
DENSE_RANK() | 排名(不跳号) | ✅ |
NTILE(n) | 分 n 桶 | ✅ |
LAG(col, n) | 前 n 行的值 | ✅ |
LEAD(col, n) | 后 n 行的值 | ✅ |
FIRST_VALUE(col) | 第一行的值 | ✅ |
LAST_VALUE(col) | 最后一行的值 | ✅(注意 Frame) |
NTH_VALUE(col, n) | 第 n 行的值 | ✅(注意 Frame) |
PERCENT_RANK() | 百分位排名 | ✅ |
CUME_DIST() | 累计分布 | ✅ |
SUM() OVER() | 窗口求和 | 可选 |
AVG() OVER() | 窗口平均 | 可选 |
COUNT() OVER() | 窗口计数 | 可选 |
MAX() OVER() | 窗口最大 | 可选 |
MIN() OVER() | 窗口最小 | 可选 |
相关链接
- MySQL 窗口函数
- PostgreSQL 窗口函数教程
- PostgreSQL 窗口函数列表
- SQL Window Functions Cheat Sheet
- CTE 公共表表达式 - 复杂窗口查询常搭配 CTE 使用
- 聚合函数与 GROUP BY - 窗口函数与 GROUP BY 的对比
- 复杂聚合 - ROLLUP、CUBE 等多维聚合