跳到主要内容

聚合函数与 GROUP BY

问题

SQL 中的聚合函数有哪些?GROUP BY 怎么用?HAVING 和 WHERE 有什么区别?

答案

聚合函数(Aggregate Functions) 用来将多行数据"聚合"为一个值——比如求总数、平均值、最大值等。配合 GROUP BY 分组后,可以实现"每个分类的统计结果",这是数据分析中最核心的能力。


五大聚合函数

函数作用示例是否忽略 NULL
COUNT()计数有多少条记录COUNT(*) 不忽略,COUNT(col) 忽略
SUM()求和总销售额
AVG()平均值平均客单价
MAX()最大值最高薪资
MIN()最小值最低价格

用示例数据演示:

idnamedepartmentsalary
1张三技术部15000
2李四市场部12000
3王五技术部18000
4赵六市场部NULL
5孙七技术部20000
-- 总人数
SELECT COUNT(*) AS total FROM employees; -- 5(所有行)
SELECT COUNT(salary) AS has_salary FROM employees; -- 4(忽略 NULL)

-- 总薪资、平均薪资
SELECT SUM(salary) AS total_salary FROM employees; -- 65000
SELECT AVG(salary) AS avg_salary FROM employees; -- 16250(65000 / 4,忽略 NULL)

-- 最高和最低薪资
SELECT MAX(salary), MIN(salary) FROM employees; -- 20000, 12000
AVG 与 NULL 的陷阱

AVG(salary) 计算的是非 NULL 值的平均,不是所有行的平均。上例中 AVG = 65000 / 4 = 16250,而不是 65000 / 5 = 13000。

如果想把 NULL 当作 0 来计算:

SELECT AVG(COALESCE(salary, 0)) FROM employees;  -- 65000 / 5 = 13000

GROUP BY 分组

GROUP BY 将数据按照指定列的值分成组,然后对每组分别计算聚合结果。

-- 每个部门的员工数量和平均薪资
SELECT
department,
COUNT(*) AS emp_count,
AVG(salary) AS avg_salary,
MAX(salary) AS max_salary
FROM employees
GROUP BY department;

结果:

departmentemp_countavg_salarymax_salary
技术部317666.6720000
市场部212000.0012000
GROUP BY 规则

使用 GROUP BY 后,SELECT 中只能出现:

  1. GROUP BY 中的列
  2. 聚合函数

不能出现其他列(因为一个分组对应多行,数据库不知道该显示哪行的值)。

-- ❌ 错误:name 不在 GROUP BY 中,也不是聚合函数
SELECT department, name, COUNT(*) FROM employees GROUP BY department;

-- ✅ 正确
SELECT department, COUNT(*) FROM employees GROUP BY department;

多列分组

-- 按部门和职级分组统计
SELECT department, level, COUNT(*) AS cnt, AVG(salary) AS avg_sal
FROM employees
GROUP BY department, level;

HAVING 过滤分组

HAVING 用来对 GROUP BY 的分组结果进行过滤,可以使用聚合函数。

-- 找出平均薪资超过 15000 的部门
SELECT department, AVG(salary) AS avg_salary
FROM employees
GROUP BY department
HAVING avg_salary > 15000;

-- 找出人数超过 2 的部门
SELECT department, COUNT(*) AS cnt
FROM employees
GROUP BY department
HAVING cnt > 2;

WHERE + GROUP BY + HAVING 完整流程

-- 需求:在薪资 > 10000 的员工中,找出平均薪资 > 15000 的部门
SELECT department, AVG(salary) AS avg_salary, COUNT(*) AS cnt
FROM employees
WHERE salary > 10000 -- 1. 先过滤掉薪资 <= 10000 的员工
GROUP BY department -- 2. 按部门分组
HAVING avg_salary > 15000; -- 3. 过滤掉平均薪资 <= 15000 的部门

执行顺序:WHERE → GROUP BY → HAVING → SELECT


实用聚合技巧

COUNT 的几种用法

-- 总行数(包含 NULL 行)
SELECT COUNT(*) FROM orders;

-- 有值的行数(排除 NULL)
SELECT COUNT(ship_date) FROM orders; -- 已发货的订单数

-- 去重计数
SELECT COUNT(DISTINCT user_id) FROM orders; -- 有多少个不同的用户下过单

-- 条件计数(统计满足特定条件的行数)
SELECT
COUNT(*) AS total,
COUNT(CASE WHEN status = 'completed' THEN 1 END) AS completed,
COUNT(CASE WHEN status = 'cancelled' THEN 1 END) AS cancelled
FROM orders;
-- 或者用 SUM
SELECT
SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS completed,
SUM(CASE WHEN status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled
FROM orders;

COALESCE 处理 NULL

-- COALESCE 返回第一个非 NULL 的值
SELECT name, COALESCE(salary, 0) AS salary FROM employees;
-- 赵六的 NULL 显示为 0

-- 多个备选值
SELECT COALESCE(nickname, name, 'Anonymous') AS display_name FROM users;
-- 优先用昵称,没有昵称用真名,都没有则显示 Anonymous

GROUP_CONCAT / STRING_AGG 分组拼接

-- MySQL:把每个部门的员工名用逗号拼接起来
SELECT department, GROUP_CONCAT(name ORDER BY name SEPARATOR ', ') AS members
FROM employees
GROUP BY department;
-- 技术部: 孙七, 王五, 张三
-- 市场部: 李四, 赵六

-- PostgreSQL:用 STRING_AGG
SELECT department, STRING_AGG(name, ', ' ORDER BY name) AS members
FROM employees
GROUP BY department;
GROUP_CONCAT 长度限制

MySQL 的 GROUP_CONCAT 默认最大长度是 1024 字节,超过会被截断。需要调整 group_concat_max_len 参数:

SET SESSION group_concat_max_len = 100000;

聚合时的 DISTINCT

-- COUNT(DISTINCT ...) 在聚合中去重
SELECT department,
COUNT(*) AS total_orders,
COUNT(DISTINCT customer_id) AS unique_customers
FROM orders
GROUP BY department;

-- SUM(DISTINCT ...) 对去重后的值求和(少见但有效)
SELECT SUM(DISTINCT salary) FROM employees;
-- 如果有两个人薪资相同,只计算一次

条件聚合(CASE WHEN + 聚合函数)

条件聚合是数据分析中最实用的技巧之一,可以在一条 SQL 中同时统计多个维度。

-- 一条 SQL 同时统计多个指标
SELECT
DATE_FORMAT(order_date, '%Y-%m') AS month,
COUNT(*) AS total_orders,
-- 各状态的订单数
SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) AS completed_orders,
SUM(CASE WHEN status = 'cancelled' THEN 1 ELSE 0 END) AS cancelled_orders,
SUM(CASE WHEN status = 'refunded' THEN 1 ELSE 0 END) AS refunded_orders,
-- 完成率
ROUND(
SUM(CASE WHEN status = 'completed' THEN 1 ELSE 0 END) * 100.0 / COUNT(*), 2
) AS completion_rate,
-- 各渠道的金额
SUM(CASE WHEN channel = 'app' THEN amount ELSE 0 END) AS app_amount,
SUM(CASE WHEN channel = 'web' THEN amount ELSE 0 END) AS web_amount,
SUM(CASE WHEN channel = 'mini_program' THEN amount ELSE 0 END) AS mini_amount
FROM orders
GROUP BY DATE_FORMAT(order_date, '%Y-%m')
ORDER BY month;
条件聚合 vs 多个子查询

上面这条 SQL 只需产表一次就能得到所有指标。如果分别写多条 SELECT + WHERE,数据库需要扫描多次表,效率低很多。

WITH ROLLUP 多级汇总

-- GROUP BY ... WITH ROLLUP 会自动加上小计和总计行
SELECT department, level, COUNT(*) AS cnt, SUM(salary) AS total_salary
FROM employees
GROUP BY department, level WITH ROLLUP;

结果:

departmentlevelcnttotal_salary
市场部P5112000
市场部P61NULL
市场部NULL212000
技术部P6115000
技术部P7238000
技术部NULL353000
NULLNULL565000

其中 NULL 表示汇总行。GROUPING() 函数可以区分是 NULL 数据还是汇总行。

更详细的多维汇总请参考 复杂聚合


数据分析实战场景

场景一:用户行为漏斗分析

-- 电商漏斗:浏览 → 加购 → 下单 → 支付
SELECT
COUNT(DISTINCT CASE WHEN event = 'view' THEN user_id END) AS view_users,
COUNT(DISTINCT CASE WHEN event = 'add_cart' THEN user_id END) AS cart_users,
COUNT(DISTINCT CASE WHEN event = 'order' THEN user_id END) AS order_users,
COUNT(DISTINCT CASE WHEN event = 'pay' THEN user_id END) AS pay_users,
-- 各步骤转化率
ROUND(
COUNT(DISTINCT CASE WHEN event = 'add_cart' THEN user_id END) * 100.0 /
NULLIF(COUNT(DISTINCT CASE WHEN event = 'view' THEN user_id END), 0), 2
) AS view_to_cart_rate,
ROUND(
COUNT(DISTINCT CASE WHEN event = 'pay' THEN user_id END) * 100.0 /
NULLIF(COUNT(DISTINCT CASE WHEN event = 'order' THEN user_id END), 0), 2
) AS order_to_pay_rate
FROM user_events
WHERE event_date BETWEEN '2024-01-01' AND '2024-01-31';
NULLIF 防除零

NULLIF(x, 0) 在 x = 0 时返回 NULL,避免除零报错。100 / NULL = NULL,比报错更友好。

场景二:按时间粒度聚合

-- 按天、周、月统计
SELECT
DATE(created_at) AS day,
COUNT(*) AS daily_orders,
SUM(amount) AS daily_revenue
FROM orders
GROUP BY DATE(created_at)
ORDER BY day;

-- 按周统计(YEARWEEK 返回年+周数)
SELECT
YEARWEEK(created_at, 1) AS year_week, -- 1 表示周一开始
MIN(DATE(created_at)) AS week_start,
COUNT(*) AS weekly_orders
FROM orders
GROUP BY YEARWEEK(created_at, 1)
ORDER BY year_week;

-- 按小时统计(适合分析流量高峰)
SELECT
HOUR(created_at) AS hour,
COUNT(*) AS hourly_orders
FROM orders
WHERE DATE(created_at) = '2024-01-15'
GROUP BY HOUR(created_at)
ORDER BY hour;

场景三:同比环比计算

-- 计算每月营收和环比增长
SELECT
curr.month,
curr.revenue,
prev.revenue AS prev_month_revenue,
ROUND((curr.revenue - COALESCE(prev.revenue, 0)) / NULLIF(prev.revenue, 0) * 100, 2) AS mom_growth_pct
FROM (
SELECT DATE_FORMAT(order_date, '%Y-%m') AS month, SUM(amount) AS revenue
FROM orders
GROUP BY DATE_FORMAT(order_date, '%Y-%m')
) curr
LEFT JOIN (
SELECT DATE_FORMAT(order_date, '%Y-%m') AS month, SUM(amount) AS revenue
FROM orders
GROUP BY DATE_FORMAT(order_date, '%Y-%m')
) prev ON prev.month = DATE_FORMAT(
DATE_SUB(STR_TO_DATE(CONCAT(curr.month, '-01'), '%Y-%m-%d'), INTERVAL 1 MONTH), '%Y-%m'
)
ORDER BY curr.month;

-- 窗口函数版本更简洁(见 SQL 分析进阶)
SELECT month, revenue,
LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
ROUND((revenue - LAG(revenue) OVER (ORDER BY month))
/ NULLIF(LAG(revenue) OVER (ORDER BY month), 0) * 100, 2) AS mom_growth_pct
FROM (
SELECT DATE_FORMAT(order_date, '%Y-%m') AS month, SUM(amount) AS revenue
FROM orders GROUP BY DATE_FORMAT(order_date, '%Y-%m')
) monthly;

场景四:分桶统计

-- 按订单金额分桶,统计各区间的订单数
SELECT
CASE
WHEN amount < 100 THEN '0-100'
WHEN amount < 500 THEN '100-500'
WHEN amount < 1000 THEN '500-1000'
WHEN amount < 5000 THEN '1000-5000'
ELSE '5000+'
END AS amount_range,
COUNT(*) AS order_count,
SUM(amount) AS total_amount,
ROUND(AVG(amount), 2) AS avg_amount
FROM orders
GROUP BY
CASE
WHEN amount < 100 THEN '0-100'
WHEN amount < 500 THEN '100-500'
WHEN amount < 1000 THEN '500-1000'
WHEN amount < 5000 THEN '1000-5000'
ELSE '5000+'
END
ORDER BY MIN(amount); -- 按实际金额排序,而非字母排序

场景五:用户留存率计算

-- 7 日留存率:某天注册的用户中,第 7 天还登录的占比
SELECT
DATE(u.register_date) AS cohort_date,
COUNT(DISTINCT u.id) AS new_users,
COUNT(DISTINCT CASE
WHEN DATEDIFF(l.login_date, u.register_date) = 1 THEN u.id
END) AS day1_retained,
COUNT(DISTINCT CASE
WHEN DATEDIFF(l.login_date, u.register_date) = 3 THEN u.id
END) AS day3_retained,
COUNT(DISTINCT CASE
WHEN DATEDIFF(l.login_date, u.register_date) = 7 THEN u.id
END) AS day7_retained,
ROUND(
COUNT(DISTINCT CASE WHEN DATEDIFF(l.login_date, u.register_date) = 7 THEN u.id END) * 100.0 /
COUNT(DISTINCT u.id), 2
) AS day7_retention_rate
FROM users u
LEFT JOIN login_log l ON u.id = l.user_id
WHERE u.register_date BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY DATE(u.register_date)
ORDER BY cohort_date;

常见面试问题

Q1: 写一个 SQL:查出每个部门薪资第 2 高的员工

答案

-- 使用窗口函数(推荐,见 SQL 分析进阶)
SELECT * FROM (
SELECT *,
DENSE_RANK() OVER (PARTITION BY department ORDER BY salary DESC) AS rk
FROM employees
) t
WHERE rk = 2;

如果不能用窗口函数,可以用子查询:

-- 子查询解法
SELECT * FROM employees e1
WHERE 1 = (
SELECT COUNT(DISTINCT e2.salary)
FROM employees e2
WHERE e2.department = e1.department AND e2.salary > e1.salary
);

Q2: COUNT(*) vs COUNT(1) vs COUNT(column) 有什么区别?

答案

写法计数规则NULL 处理
COUNT(*)所有行包含 NULL 行
COUNT(1)所有行(等价于 COUNT(*)包含 NULL 行
COUNT(column)该列非 NULL 的行排除 NULL
COUNT(DISTINCT column)该列不重复的非 NULL 值排除 NULL

COUNT(*)COUNT(1) 在 MySQL 中性能完全一样,优化器会统一处理。

Q3: GROUP BY 后能否使用 SELECT 中的别名?

答案

这取决于数据库:

  • MySQL:✅ 可以(MySQL 扩展了标准 SQL)
  • PostgreSQL:❌ 不可以
  • 标准 SQL:❌ 不可以(因为 GROUP BY 在 SELECT 之前执行)
-- MySQL 允许
SELECT YEAR(created_at) AS yr, COUNT(*) FROM orders GROUP BY yr;

-- PostgreSQL 需要重复表达式
SELECT YEAR(created_at), COUNT(*) FROM orders GROUP BY YEAR(created_at);
-- 或者用位置编号
SELECT YEAR(created_at), COUNT(*) FROM orders GROUP BY 1;

Q4: 面试题——查出重复的邮箱

答案

-- 使用 GROUP BY + HAVING
SELECT email, COUNT(*) AS cnt
FROM users
GROUP BY email
HAVING cnt > 1;

-- 只要邮箱列表
SELECT email FROM users GROUP BY email HAVING COUNT(*) > 1;

Q5: 面试题——统计每月的注册人数

答案

-- MySQL
SELECT
DATE_FORMAT(created_at, '%Y-%m') AS month,
COUNT(*) AS registrations
FROM users
GROUP BY month
ORDER BY month;

-- PostgreSQL
SELECT
TO_CHAR(created_at, 'YYYY-MM') AS month,
COUNT(*) AS registrations
FROM users
GROUP BY month
ORDER BY month;

Q6: WHERE 和 HAVING 的区别?

答案

对比WHEREHAVING
执行时机GROUP BY 之前GROUP BY 之后
作用对象原始行分组后的组
是否能用聚合函数❌ 不能✅ 可以
性能先过滤,减少计算量后过滤,已经完成聚合
-- ❌ 错误:WHERE 中不能用聚合函数
SELECT department, COUNT(*) FROM employees WHERE COUNT(*) > 2 GROUP BY department;

-- ✅ COUNT 条件放到 HAVING
SELECT department, COUNT(*) AS cnt FROM employees GROUP BY department HAVING cnt > 2;

-- ✅ 非聚合条件尽量放 WHERE(性能更好)
SELECT department, COUNT(*) FROM employees
WHERE salary > 10000 -- 先过滤低薪员工
GROUP BY department
HAVING COUNT(*) > 2; -- 再过滤小部门
性能建议

能用 WHERE 的条件就不要放到 HAVING。WHERE 先执行,减少了 GROUP BY 需要处理的数据量。

Q7: 面试题——查出每个城市不同品类的销售额

答案

-- 使用条件聚合实现交叉表
SELECT
city,
SUM(CASE WHEN category = '电子' THEN amount ELSE 0 END) AS electronics,
SUM(CASE WHEN category = '服装' THEN amount ELSE 0 END) AS clothing,
SUM(CASE WHEN category = '食品' THEN amount ELSE 0 END) AS food,
SUM(amount) AS total
FROM orders
GROUP BY city
ORDER BY total DESC;

Q8: GROUP BY 的执行原理?

答案

MySQL 中 GROUP BY 主要有两种执行方式:

  1. 利用索引:如果 GROUP BY 的列上有索引,可以直接扫描索引完成分组(EXPLAIN 中显示 Using index),效率最高
  2. 临时表:没有合适索引时,MySQL 创建临时表,扫描数据后写入临时表分组(EXPLAIN 显示 Using temporary
  3. 文件排序:如果还需要排序,会额外进行文件排序(EXPLAIN 显示 Using filesort
-- 优化技巧:给 GROUP BY 列建索引
ALTER TABLE orders ADD INDEX idx_date (order_date);

-- 查看执行计划
EXPLAIN SELECT DATE(order_date), COUNT(*) FROM orders GROUP BY DATE(order_date);

Q9: 面试题——在一条 SQL 中同时求平均值和中位数

答案

-- 平均值简单,中位数需要技巧

-- MySQL 8.0+ 用窗口函数求中位数
SELECT
AVG(salary) AS avg_salary,
-- 中位数
AVG(mid_salary) AS median_salary
FROM (
SELECT salary,
salary AS mid_salary,
ROW_NUMBER() OVER (ORDER BY salary) AS rn,
COUNT(*) OVER () AS total
FROM employees
) t
WHERE rn IN (FLOOR((total + 1) / 2), CEIL((total + 1) / 2));

-- MySQL 5.7 中位数写法(更复杂)
SELECT AVG(salary) AS median_salary FROM (
SELECT e1.salary FROM employees e1
JOIN employees e2 ON 1 = 1
GROUP BY e1.salary
HAVING SUM(CASE WHEN e2.salary <= e1.salary THEN 1 ELSE 0 END) >= COUNT(*) / 2
AND SUM(CASE WHEN e2.salary >= e1.salary THEN 1 ELSE 0 END) >= COUNT(*) / 2
) t;

-- PostgreSQL 有内置的百分位函数
SELECT
AVG(salary) AS avg_salary,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary) AS median_salary
FROM employees;

Q10: 如何统计"连续"事件?

答案

-- 统计每个用户的最长连续登录天数
-- 核心思路:login_date - ROW_NUMBER = 分组标记(连续日期减去递增编号,差值相同)
SELECT user_id, MAX(streak) AS max_streak FROM (
SELECT user_id, grp, COUNT(*) AS streak
FROM (
SELECT user_id, login_date,
DATE_SUB(login_date, INTERVAL ROW_NUMBER() OVER (
PARTITION BY user_id ORDER BY login_date
) DAY) AS grp
FROM (SELECT DISTINCT user_id, DATE(login_date) AS login_date FROM login_log) t
) t2
GROUP BY user_id, grp
) t3
GROUP BY user_id
ORDER BY max_streak DESC;

更多窗口函数技巧请参考 窗口函数


相关链接