跳到主要内容

窗口函数

问题

什么是 SQL 窗口函数?ROW_NUMBER、RANK、DENSE_RANK 有什么区别?LAG 和 LEAD 怎么用?

答案

窗口函数(Window Function) 是 SQL 中最强大的分析工具。它能在不减少行数的情况下,对每一行计算出一个和"它周围的行"相关的值——比如排名、累计求和、与上一行的差值等。

普通聚合函数(如 SUMAVG)会把多行"压缩"成一行,窗口函数则保留原始每一行,在旁边"多算一列"。

-- 对比:普通聚合 vs 窗口函数
-- 普通聚合:3 个部门 → 3 行
SELECT department, SUM(salary) FROM employees GROUP BY department;

-- 窗口函数:5 个员工 → 仍然 5 行,但每行多了一列部门薪资总和
SELECT name, department, salary,
SUM(salary) OVER (PARTITION BY department) AS dept_total
FROM employees;

窗口函数的语法

函数名() OVER (
[PARTITION BY 分区列] -- 可选:按什么分组(类似 GROUP BY)
[ORDER BY 排序列 [ASC|DESC]] -- 可选:窗口内按什么排序
[ROWS/RANGE 窗口范围] -- 可选:定义窗口的行范围
)
  • PARTITION BY:把数据分成多个"窗口",每个窗口内独立计算
  • ORDER BY:窗口内的行的排序顺序
  • ROWS/RANGE:定义计算范围(从哪一行到哪一行)

示例数据

idnamedepartmentsalary
1张三技术部15000
2李四市场部12000
3王五技术部18000
4赵六市场部12000
5孙七技术部20000
6周八人事部13000

排名函数

ROW_NUMBER / RANK / DENSE_RANK

这三个函数都用来排名,区别在于处理并列的方式:

SELECT name, department, salary,
ROW_NUMBER() OVER (ORDER BY salary DESC) AS row_num,
RANK() OVER (ORDER BY salary DESC) AS rank_val,
DENSE_RANK() OVER (ORDER BY salary DESC) AS dense_rank_val
FROM employees;

结果:

namesalaryrow_numrank_valdense_rank_val
孙七20000111
王五18000222
张三15000333
周八13000444
李四12000555
赵六12000655

注意李四和赵六薪资相同(并列),三种函数的处理不同:

函数并列处理下一名助记
ROW_NUMBER强制不同编号(1,2,3,4,5,6连续"行号"——每行给个唯一编号
RANK相同排名(1,2,3,4,5,5跳过(下一个是 7)"比赛排名"——并列第5,没有第6
DENSE_RANK相同排名(1,2,3,4,5,5不跳过(下一个是 6)"密集排名"——并列第5,第6继续

分区排名

配合 PARTITION BY,在每个分区内独立排名:

-- 每个部门内按薪资排名
SELECT name, department, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS dept_rank
FROM employees;
namedepartmentsalarydept_rank
孙七技术部200001
王五技术部180002
张三技术部150003
李四市场部120001
赵六市场部120002
周八人事部130001
经典面试题模式

"找每个 X 中 Y 最大/第 N 名的记录"——套路:

SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY X ORDER BY Y DESC) AS rn
FROM table
) t
WHERE rn = 1; -- 第1名;rn = 2 就是第2名

偏移函数:LAG 和 LEAD

  • LAG(column, n):获取当前行前面第 n 行的值(默认 n=1)
  • LEAD(column, n):获取当前行后面第 n 行的值(默认 n=1)

这两个函数是计算环比、同比的核心工具。

-- 每月销售额与上月对比
SELECT
month,
revenue,
LAG(revenue, 1) OVER (ORDER BY month) AS prev_month_revenue,
revenue - LAG(revenue, 1) OVER (ORDER BY month) AS month_change
FROM monthly_sales;

结果:

monthrevenueprev_month_revenuemonth_change
2024-01100000NULLNULL
2024-0212000010000020000
2024-03110000120000-10000
2024-0415000011000040000
-- 计算环比增长率
SELECT
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
ROUND(
(revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0
/ LAG(revenue) OVER (ORDER BY month),
2) AS growth_rate_pct
FROM monthly_sales;

聚合窗口函数

普通聚合函数也可以作为窗口函数使用,配合 OVER() 实现累计计算移动平均

累计求和

SELECT 
month,
revenue,
-- 从第一行到当前行的累计和
SUM(revenue) OVER (ORDER BY month) AS running_total
FROM monthly_sales;
monthrevenuerunning_total
2024-01100000100000
2024-02120000220000
2024-03110000330000
2024-04150000480000

移动平均

SELECT 
month,
revenue,
-- 最近 3 个月的移动平均(当前行 + 前 2 行)
AVG(revenue) OVER (
ORDER BY month
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
) AS moving_avg_3
FROM monthly_sales;

窗口范围(Frame)说明

-- ROWS BETWEEN 定义窗口的行范围
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW -- 从第一行到当前行(默认)
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW -- 前 2 行到当前行
ROWS BETWEEN 1 PRECEDING AND 1 FOLLOWING -- 前 1 行到后 1 行
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING -- 当前行到最后一行

NTILE 分桶

将数据均分为 N 组:

-- 将员工薪资分为 4 个等级(四分位)
SELECT name, salary,
NTILE(4) OVER (ORDER BY salary) AS quartile
FROM employees;
namesalaryquartile
李四120001
赵六120001
周八130002
张三150002
王五180003
孙七200004

FIRST_VALUE / LAST_VALUE

获取窗口中第一个/最后一个值:

-- 每个部门薪资最高的人是谁
SELECT name, department, salary,
FIRST_VALUE(name) OVER (PARTITION BY department ORDER BY salary DESC) AS top_earner
FROM employees;
LAST_VALUE 的陷阱

LAST_VALUE 默认的窗口范围是 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW,所以"最后一个值"其实就是当前行自己!

要真正获取窗口中的最后一个值,需要改窗口范围:

SELECT name, department, salary,
LAST_VALUE(name) OVER (
PARTITION BY department ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING -- 必须指定为全窗口
) AS lowest_earner
FROM employees;

NTH_VALUE

获取窗口中第 N 行的值:

-- 每个部门薪资第 2 高的人的薪资
SELECT name, department, salary,
NTH_VALUE(salary, 2) OVER (
PARTITION BY department ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
) AS second_highest_salary
FROM employees;

PERCENT_RANK 和 CUME_DIST

这两个函数用于计算相对排名位置,常用于百分位分析。

SELECT name, salary,
-- PERCENT_RANK: (rank - 1) / (total - 1),值域 [0, 1]
ROUND(PERCENT_RANK() OVER (ORDER BY salary) * 100, 2) AS percent_rank,
-- CUME_DIST: 小于等于当前值的行数 / 总行数,值域 (0, 1]
ROUND(CUME_DIST() OVER (ORDER BY salary) * 100, 2) AS cume_dist
FROM employees;
namesalarypercent_rankcume_dist
李四120000.0033.33
赵六120000.0033.33
周八1300040.0050.00
张三1500060.0066.67
王五1800080.0083.33
孙七20000100.00100.00
业务含义
  • PERCENT_RANK = 80% → 此人薪资超过了 80% 的员工
  • CUME_DIST = 83.33% → 83.33% 的员工薪资 ≤ 此人

ROWS vs RANGE

窗口 Frame 有两种模式,理解差异很重要:

模式含义处理相同值
ROWS按物理行计算相同值的行各自独立
RANGE按逻辑值范围计算相同值的行视为一组
-- 假设数据:salary = [12000, 12000, 13000, 15000]

-- ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- 第1行(12000):sum = 12000
-- 第2行(12000):sum = 24000(包含前2行)

-- RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
-- 第1行(12000):sum = 24000(包含所有 salary ≤ 12000 的行)
-- 第2行(12000):sum = 24000(同上,因为值相同,范围相同)
ORDER BY 时的默认 Frame
  • ORDER BY 时,默认是 RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
  • 没有 ORDER BY 时,默认是 ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING(整个分区)

数据分析实战场景

场景一:留存分析

-- 计算每个注册日期的次日留存率和 7 日留存率
WITH new_users AS (
SELECT user_id, DATE(MIN(login_date)) AS register_date
FROM login_log
GROUP BY user_id
),
retention AS (
SELECT
n.register_date,
COUNT(DISTINCT n.user_id) AS new_user_count,
COUNT(DISTINCT CASE
WHEN DATEDIFF(l.login_date, n.register_date) = 1 THEN n.user_id
END) AS day1_retained,
COUNT(DISTINCT CASE
WHEN DATEDIFF(l.login_date, n.register_date) = 7 THEN n.user_id
END) AS day7_retained
FROM new_users n
LEFT JOIN login_log l ON n.user_id = l.user_id
GROUP BY n.register_date
)
SELECT *,
ROUND(day1_retained * 100.0 / new_user_count, 2) AS day1_rate,
ROUND(day7_retained * 100.0 / new_user_count, 2) AS day7_rate,
-- 用窗口函数看留存率趋势
AVG(ROUND(day1_retained * 100.0 / new_user_count, 2))
OVER (ORDER BY register_date ROWS BETWEEN 6 PRECEDING AND CURRENT ROW) AS day1_rate_7d_avg
FROM retention
ORDER BY register_date;

场景二:同比环比分析

-- 每月销售额 + 环比 + 同比
WITH monthly AS (
SELECT
DATE_FORMAT(order_date, '%Y-%m') AS month,
SUM(amount) AS revenue
FROM orders
GROUP BY DATE_FORMAT(order_date, '%Y-%m')
)
SELECT
month,
revenue,
-- 环比(与上月对比)
LAG(revenue, 1) OVER (ORDER BY month) AS prev_month,
ROUND((revenue - LAG(revenue, 1) OVER (ORDER BY month)) * 100.0 /
NULLIF(LAG(revenue, 1) OVER (ORDER BY month), 0), 2) AS mom_pct,
-- 同比(与去年同月对比)
LAG(revenue, 12) OVER (ORDER BY month) AS prev_year_same_month,
ROUND((revenue - LAG(revenue, 12) OVER (ORDER BY month)) * 100.0 /
NULLIF(LAG(revenue, 12) OVER (ORDER BY month), 0), 2) AS yoy_pct,
-- 累计年度营收
SUM(revenue) OVER (
PARTITION BY LEFT(month, 4) ORDER BY month
) AS ytd_revenue
FROM monthly
ORDER BY month;

场景三:连续 N 天问题

-- 查出连续登录 3 天及以上的用户和他们的连续登录段
WITH daily_login AS (
-- 去重:每个用户每天只保留一条
SELECT DISTINCT user_id, DATE(login_date) AS login_date FROM login_log
),
ranked AS (
SELECT *,
-- 给每个用户的登录日期编号
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
FROM daily_login
),
grouped AS (
SELECT *,
-- 连续日期减去行号 = 相同的值(分组标识)
DATE_SUB(login_date, INTERVAL rn DAY) AS grp
FROM ranked
)
SELECT
user_id,
MIN(login_date) AS streak_start,
MAX(login_date) AS streak_end,
COUNT(*) AS consecutive_days
FROM grouped
GROUP BY user_id, grp
HAVING COUNT(*) >= 3
ORDER BY user_id, streak_start;
连续问题的核心技巧

对于连续日期/连续数字类问题,核心思路是:

  1. ROW_NUMBER() 给记录编号
  2. 日期(或数字)- 行号 = 分组标识
  3. 按分组标识 GROUP BYCOUNT(*) 就是连续长度

这个技巧叫做 "差值分组法",是数据分析面试的高频题型。

场景四:移动平均与趋势分析

-- 7 日移动平均 DAU
SELECT
login_date,
dau,
-- 7 日简单移动平均
ROUND(AVG(dau) OVER (
ORDER BY login_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
), 0) AS dau_7d_avg,
-- 与 7 日平均的偏差
dau - ROUND(AVG(dau) OVER (
ORDER BY login_date
ROWS BETWEEN 6 PRECEDING AND CURRENT ROW
), 0) AS deviation
FROM (
SELECT DATE(login_date) AS login_date, COUNT(DISTINCT user_id) AS dau
FROM login_log
GROUP BY DATE(login_date)
) daily_dau
ORDER BY login_date;

场景五:分组占比分析

-- 每个部门的员工薪资占部门总薪资的比例
SELECT
name, department, salary,
SUM(salary) OVER (PARTITION BY department) AS dept_total,
ROUND(salary * 100.0 / SUM(salary) OVER (PARTITION BY department), 2) AS salary_pct,
-- 薪资从高到低的累计占比(帕累托分析)
ROUND(
SUM(salary) OVER (PARTITION BY department ORDER BY salary DESC
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) * 100.0 /
SUM(salary) OVER (PARTITION BY department), 2
) AS cumulative_pct
FROM employees
ORDER BY department, salary DESC;

常见面试问题

Q1: 写一条 SQL:查出每个部门薪资前 3 名的员工

答案

SELECT * FROM (
SELECT
name, department, salary,
ROW_NUMBER() OVER (PARTITION BY department ORDER BY salary DESC) AS rn
FROM employees
) t
WHERE rn <= 3;

如果允许并列,用 DENSE_RANK 替换 ROW_NUMBER

Q2: ROW_NUMBER、RANK、DENSE_RANK 的区别?

答案

处理薪资 [100, 90, 90, 80] 的排名结果:

函数排名结果特点
ROW_NUMBER1, 2, 3, 4强制唯一,不并列
RANK1, 2, 2, 4并列后跳过
DENSE_RANK1, 2, 2, 3并列后不跳过

选择建议:

  • 取 Top N 且不要重复 → ROW_NUMBER
  • 取 Top N 允许并列且要跳号 → RANK
  • 取 Top N 允许并列且不跳号 → DENSE_RANK

Q3: 计算每个用户连续登录的最大天数

答案

经典的"连续"问题,核心思路:用 ROW_NUMBER 给日期编号,如果日期是连续的,那么 日期 - 行号 是一个常数。

WITH login_ranked AS (
SELECT DISTINCT user_id, login_date,
ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY login_date) AS rn
FROM login_log
),
login_groups AS (
SELECT user_id, login_date,
-- 连续日期的 login_date - rn 相同(构成一个组)
DATE_SUB(login_date, INTERVAL rn DAY) AS grp
FROM login_ranked
)
SELECT user_id, MAX(consecutive_days) AS max_consecutive_days
FROM (
SELECT user_id, grp, COUNT(*) AS consecutive_days
FROM login_groups
GROUP BY user_id, grp
) t
GROUP BY user_id;

Q4: 计算每月销售额的环比增长率

答案

SELECT 
month,
revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
CASE
WHEN LAG(revenue) OVER (ORDER BY month) IS NULL THEN NULL
WHEN LAG(revenue) OVER (ORDER BY month) = 0 THEN NULL
ELSE ROUND(
(revenue - LAG(revenue) OVER (ORDER BY month)) * 100.0
/ LAG(revenue) OVER (ORDER BY month), 2
)
END AS growth_rate_pct
FROM monthly_sales;

Q5: 窗口函数和 GROUP BY 有什么区别?

答案

对比GROUP BY + 聚合窗口函数
结果行数减少(每组一行)保持不变(每行一行)
原始列只能显示分组列和聚合结果可以显示所有原始列
排名功能不支持支持(ROW_NUMBER 等)
偏移访问不支持支持(LAG/LEAD)
累计计算需要自连接原生支持
-- GROUP BY:3 行(每个部门一行)
SELECT department, SUM(salary) FROM employees GROUP BY department;

-- 窗口函数:6 行(每个员工一行,但多一列部门薪资总和)
SELECT name, department, salary,
SUM(salary) OVER (PARTITION BY department) AS dept_total
FROM employees;

Q6: 面试题——删除表中重复数据,只保留 id 最小的那条

答案

-- 先找出需要删除的行(每组中 rn > 1 的行)
DELETE FROM users WHERE id IN (
SELECT id FROM (
SELECT id, ROW_NUMBER() OVER (PARTITION BY email ORDER BY id) AS rn
FROM users
) t
WHERE rn > 1
);

Q7: 窗口函数能和 WHERE / HAVING 一起用吗?

答案

窗口函数在 SQL 执行顺序中位于 SELECT 阶段,在 WHERE 和 HAVING 之后执行,所以:

-- ❌ 错误:不能在 WHERE 中使用窗口函数
SELECT *, ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn
FROM employees
WHERE rn <= 3; -- 报错!rn 在 WHERE 执行时还不存在

-- ✅ 正确:用子查询包一层
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (ORDER BY salary DESC) AS rn
FROM employees
) t
WHERE rn <= 3;

SQL 执行顺序:FROM → WHERE → GROUP BY → HAVING → SELECT(窗口函数) → ORDER BY → LIMIT

Q8: 面试题——计算每个产品的累计销售额

答案

-- 按日期累计每个产品的销售额
SELECT
product_id,
order_date,
amount,
SUM(amount) OVER (
PARTITION BY product_id
ORDER BY order_date
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) AS cumulative_amount
FROM sales
ORDER BY product_id, order_date;

Q9: 面试题——求中位数

答案

-- 方法 1:用 PERCENT_RANK(PostgreSQL)
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees;

-- 方法 2:用 ROW_NUMBER(通用)
SELECT AVG(salary) AS median_salary FROM (
SELECT salary,
ROW_NUMBER() OVER (ORDER BY salary) AS rn,
COUNT(*) OVER () AS total
FROM employees
) t
WHERE rn IN (FLOOR((total + 1) / 2.0), CEIL((total + 1) / 2.0));
-- 总数为奇数取中间那个,偶数取中间两个的平均

Q10: 面试题——找出薪资在本部门排名前 20% 的员工

答案

-- 方法 1:NTILE 分 5 桶,取第 5 桶(前 20%)
SELECT * FROM (
SELECT *, NTILE(5) OVER (PARTITION BY department ORDER BY salary DESC) AS quintile
FROM employees
) t
WHERE quintile = 1;

-- 方法 2:PERCENT_RANK
SELECT * FROM (
SELECT *, PERCENT_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS pct_rank
FROM employees
) t
WHERE pct_rank <= 0.2;

窗口函数速查表

函数用途需要 ORDER BY
ROW_NUMBER()行号(唯一)
RANK()排名(跳号)
DENSE_RANK()排名(不跳号)
NTILE(n)分 n 桶
LAG(col, n)前 n 行的值
LEAD(col, n)后 n 行的值
FIRST_VALUE(col)第一行的值
LAST_VALUE(col)最后一行的值✅(注意 Frame)
NTH_VALUE(col, n)第 n 行的值✅(注意 Frame)
PERCENT_RANK()百分位排名
CUME_DIST()累计分布
SUM() OVER()窗口求和可选
AVG() OVER()窗口平均可选
COUNT() OVER()窗口计数可选
MAX() OVER()窗口最大可选
MIN() OVER()窗口最小可选

相关链接