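Aggregation Analysis

Question

What types of aggregations does the Elasticsearch aggregation framework offer, and how can they be used for real-time data analysis?

Answer

Aggregation Framework Overview

The Elasticsearch (ES) aggregation framework provides near-real-time analytics: it computes statistics over the documents matched by a search. Aggregations fall into three families:

Type	Description	SQL analogue
Bucket aggregations	Group documents by a criterion	`GROUP BY`
Metric aggregations	Compute a statistic for each group	`SUM`, `AVG`, `COUNT`
Pipeline aggregations	Compute over the output of other aggregations	Subqueries / window functions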

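// Basic structure
POST /orders/_search
{
  "size": 0,  // return no hits, only aggregation results
  "aggs": {
    "<aggregation_name>": {
      "<aggregation_type>": { ... },
      "aggs": {
        "<sub_aggregation_name>": { ... }  // aggregations can be nested
      }
    }
  }
}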

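Bucket Aggregations

Bucket aggregations group documents into "buckets", much like SQL's GROUP BY.

terms aggregation

Groups by field value. It works on keyword, numeric, date, boolean and ip fields; a text field is not supported unless fielddata is enabled:

{
  "size": 0,
  "aggs": {
    "brand_distribution": {
      "terms": {
        "field": "brand",
        "size": 10,               // return the top 10 brands
        "order": { "_count": "desc" },
        "min_doc_count": 5        // only return buckets with at least 5 documents
      }
    }
  }
}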

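Accuracy of the terms aggregation

In a distributed index each shard returns only its own top terms, so after the coordinating node merges them the counts, and sometimes the ranking, can be slightly off; the response reports doc_count_error_upper_bound as an upper bound on that error. Raising shard_size (default: size * 1.5 + 10) improves accuracy at the cost of more memory and network traffic.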
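
The merge behaviour described above can be imitated in a few lines of Python. This is only a sketch: the two "shards" are made-up lists of terms, and `merged_top_terms` is a hypothetical helper, not part of any Elasticsearch client. It shows that a shard_size that is too small can produce both a wrong winner and a wrong count, while a larger one recovers the exact result.

```python
from collections import Counter


def default_shard_size(size: int) -> int:
    """Default shard_size documented for the terms aggregation."""
    return int(size * 1.5 + 10)


def merged_top_terms(shards, size, shard_size):
    """Merge each shard's top `shard_size` terms, then keep the global top `size`."""
    merged = Counter()
    for shard in shards:
        # Each shard only reports its own most frequent terms.
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


# Hypothetical data: "a" is the true winner (5 + 5 = 10), "c" has 3 + 6 = 9.
shards = [
    ["a"] * 5 + ["b"] * 4 + ["c"] * 3,
    ["c"] * 6 + ["a"] * 5 + ["d"] * 4,
]

too_small = merged_top_terms(shards, size=1, shard_size=1)
large_enough = merged_top_terms(shards, size=1, shard_size=3)
print(too_small)      # the merged result when every shard reports one term
print(large_enough)   # the merged result when every shard reports all its terms
```

With one term per shard, shard 1 reports only "a" and shard 2 reports only "c", so the coordinating step never sees the second half of "a".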

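date_histogram aggregation

Buckets documents by time interval:

{
  "size": 0,
  "aggs": {
    "monthly_sales": {
      "date_histogram": {
        "field": "createdAt",
        "calendar_interval": "month",   // one bucket per month
        "format": "yyyy-MM",
        "min_doc_count": 0,             // include empty buckets
        "extended_bounds": {            // time range that must always be covered
          "min": "2024-01-01",
          "max": "2024-12-31"
        }
      },
      "aggs": {                         // further statistics inside each bucket
        "total_revenue": { "sum": { "field": "amount" } }
      }
    }
  }
}

Interval parameter	Description
`calendar_interval`	Calendar-aware interval (minute, hour, day, week, month, quarter, year); bucket length varies, e.g. months of 28 to 31 days
`fixed_interval`	Fixed-length interval (e.g. 1h, 30m, 7d); months and years are not available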

range aggregation

Buckets by custom ranges (from is inclusive, to is exclusive):

{
  "aggs": {
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "key": "cheap", "to": 100 },
          { "key": "mid", "from": 100, "to": 500 },
          { "key": "expensive", "from": 500 }
        ]
      }
    }
  }
}
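
To make the boundary behaviour concrete, here is a small Python sketch that assigns a price to ranges with an inclusive `from` and an exclusive `to`. The `RANGES` list mirrors the three ranges above; `range_bucket` is a hypothetical helper written for illustration only.

```python
RANGES = [
    {"key": "cheap", "to": 100},
    {"key": "mid", "from": 100, "to": 500},
    {"key": "expensive", "from": 500},
]


def range_bucket(value, ranges=RANGES):
    """Return the keys of every range containing value (from inclusive, to exclusive)."""
    keys = []
    for r in ranges:
        lower_ok = "from" not in r or value >= r["from"]
        upper_ok = "to" not in r or value < r["to"]
        if lower_ok and upper_ok:
            keys.append(r["key"])
    return keys


for price in (99.99, 100, 499.99, 500):
    print(price, range_bucket(price))
```

A price of exactly 100 lands in the middle range and exactly 500 in the top range, so adjacent ranges that share a boundary do not double-count.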

histogram aggregation

Buckets by a fixed numeric interval (the numeric counterpart of date_histogram):

{
  "aggs": {
    "price_histogram": {
      "histogram": {
        "field": "price",
        "interval": 100,         // one bucket per 100 units of price
        "min_doc_count": 1
      }
    }
  }
}
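
Each value is rounded down to the start of its interval. A minimal Python sketch of that rounding rule follows; `histogram_bucket_key` is a hypothetical name, and the optional `offset` argument corresponds to the histogram's offset setting (0 when unset).

```python
import math


def histogram_bucket_key(value, interval, offset=0.0):
    """Key of the bucket a value falls into: round down to the interval start."""
    return math.floor((value - offset) / interval) * interval + offset


for price in (99.99, 100, 250, -1):
    print(price, histogram_bucket_key(price, 100))
```

Negative values round toward minus infinity, so -1 falls into the bucket starting at -100 rather than the one starting at 0.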

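filter / filters aggregation

Buckets by custom filter conditions: filter produces a single bucket for one condition, while filters produces one bucket per named condition: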

{
  "aggs": {
    "status_breakdown": {
      "filters": {
        "filters": {
          "active":   { "term": { "status": "active" } },
          "inactive": { "term": { "status": "inactive" } },
          "pending":  { "term": { "status": "pending" } }
        }
      }
    }
  }
}

Metric Aggregations

Metric aggregations compute statistics over a set of documents.

Common metric aggregations

{
  "size": 0,
  "aggs": {
    "avg_price":  { "avg":  { "field": "price" } },
    "max_price":  { "max":  { "field": "price" } },
    "min_price":  { "min":  { "field": "price" } },
    "total_sales": { "sum": { "field": "amount" } },
    "order_count": { "value_count": { "field": "orderId" } },
    "unique_users": { "cardinality": { "field": "userId" } }  // distinct count (approximate)
  }
}

stats aggregation

Returns count, min, max, avg and sum in one pass:

{
  "aggs": {
    "price_stats": {
      "stats": { "field": "price" }
    }
    // Returns: { count: 1000, min: 10, max: 9999, avg: 299.5, sum: 299500 }
  }
}

percentiles aggregation

Computes percentiles such as P50, P95 and P99. The values are approximate (TDigest by default):

{
  "aggs": {
    "response_time_percentiles": {
      "percentiles": {
        "field": "responseTime",
        "percents": [50, 90, 95, 99]
      }
    }
  }
}
// Returns, under "values": { "50.0": 120, "90.0": 350, "95.0": 500, "99.0": 1200 }

top_hits aggregation

Returns the top N documents inside each bucket:

{
  "aggs": {
    "by_brand": {
      "terms": { "field": "brand", "size": 5 },
      "aggs": {
        "top_products": {
          "top_hits": {
            "size": 3,
            "sort": [{ "sales": { "order": "desc" } }],
            "_source": ["title", "price", "sales"]
          }
        }
      }
    }
  }
}

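Pipeline Aggregations

Pipeline aggregations run a second round of computation over the results of other aggregations instead of over documents. Parent pipelines such as derivative and cumulative_sum are declared inside the multi-bucket aggregation they operate on and reference a metric in the same bucket; sibling pipelines such as avg_bucket are declared next to it and use a path like monthly_sales>revenue.

Common pipeline aggregations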

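{
  "size": 0,
  "aggs": {
    "monthly_sales": {
      "date_histogram": {
        "field": "createdAt",
        "calendar_interval": "month"
      },
      "aggs": {
        "revenue": { "sum": { "field": "amount" } },
        // Parent pipelines sit inside the histogram and reference
        // the metric of the same bucket by name.
        // Pipeline: running total
        "cumulative_revenue": {
          "cumulative_sum": { "buckets_path": "revenue" }
        },
        // Pipeline: month-over-month change
        "monthly_change": {
          "derivative": { "buckets_path": "revenue" }
        },
        // Pipeline: moving average (moving_fn replaces moving_avg,
        // which was deprecated and then removed in Elasticsearch 8)
        "moving_avg_revenue": {
          "moving_fn": {
            "buckets_path": "revenue",
            "window": 3,
            "script": "MovingFunctions.unweightedAvg(values)"
          }
        }
      }
    }
  }
}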

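Pipeline aggregation	Purpose
`derivative`	Derivative (change from the previous bucket)
`cumulative_sum`	Running total
`moving_fn`	Moving-window function such as a moving average (replaces the removed `moving_avg`)
`bucket_sort`	Sort, and optionally truncate, the buckets
`bucket_selector`	Filter buckets with a script
`avg_bucket`	Average of a metric across all buckets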
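
What these pipelines compute can be reproduced by hand on a list of per-month revenue values. The Python sketch below is illustrative only: the revenue figures are invented, and `moving_average` assumes the window covers the buckets before the current one and excludes the current bucket, which is my understanding of the default moving-window behaviour rather than something stated in this document.

```python
def cumulative_sum(values):
    """Running total, one output per bucket."""
    total, out = 0, []
    for v in values:
        total += v
        out.append(total)
    return out


def derivative(values):
    """Difference from the previous bucket; the first bucket has no value."""
    if not values:
        return []
    return [None] + [b - a for a, b in zip(values, values[1:])]


def moving_average(values, window):
    """Mean of up to `window` preceding buckets (assumption: current bucket excluded)."""
    out = []
    for i in range(len(values)):
        previous = values[max(0, i - window):i]
        out.append(sum(previous) / len(previous) if previous else None)
    return out


monthly_revenue = [100, 150, 120, 180]  # hypothetical sums per month
print(cumulative_sum(monthly_revenue))
print(derivative(monthly_revenue))
print(moving_average(monthly_revenue, window=2))
```

The derivative and the moving average both leave the first bucket empty, which is why charts built from them usually start one bucket later than the underlying histogram.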

Nested Aggregations in Practice

Example: multi-dimensional e-commerce analysis

POST /orders/_search
{
  "size": 0,
  "query": {
    "range": { "createdAt": { "gte": "2024-01-01", "lt": "2025-01-01" } }
  },
  "aggs": {
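    // Level 1: bucket by category
    "by_category": {
      "terms": { "field": "category", "size": 20 },
      "aggs": {
        // Level 2: monthly buckets within each category
        "monthly": {
          "date_histogram": {
            "field": "createdAt",
            "calendar_interval": "month"
          },
          "aggs": {
            // Level 3 metrics: per-month statistics
            "revenue": { "sum": { "field": "amount" } },
            // Count on a regular field: aggregating on _id is disabled by default in recent versions
            "order_count": { "value_count": { "field": "orderId" } },
            "avg_order_value": { "avg": { "field": "amount" } }
          }
        },
        // Category-level statistics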
        "total_revenue": { "sum": { "field": "amount" } },
        "top_products": {
          "top_hits": {
            "size": 3,
            "sort": [{ "amount": "desc" }],
            "_source": ["title", "amount"]
          }
        }
      }
    }
  }
}
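
The response to a nested request like this is itself nested: each category bucket carries its own list of monthly buckets. As a sketch of how a client might consume it, the Python below flattens that structure into rows. The `sample_response` dictionary is a hand-written, hypothetical fragment that follows the general response layout (aggregations, buckets, key, doc_count, value); it is not real output.

```python
def flatten_category_months(response):
    """Turn by_category > monthly buckets into flat (category, month) rows."""
    rows = []
    for category in response["aggregations"]["by_category"]["buckets"]:
        for month in category["monthly"]["buckets"]:
            rows.append({
                "category": category["key"],
                # key_as_string is the formatted date; key is epoch milliseconds.
                "month": month.get("key_as_string", month["key"]),
                "orders": month["doc_count"],
                "revenue": month["revenue"]["value"],
            })
    return rows


# Hypothetical response fragment, for illustration only.
sample_response = {
    "aggregations": {
        "by_category": {
            "buckets": [
                {
                    "key": "phones",
                    "doc_count": 3,
                    "monthly": {
                        "buckets": [
                            {"key": 1704067200000,
                             "key_as_string": "2024-01-01T00:00:00.000Z",
                             "doc_count": 2, "revenue": {"value": 1500.0}},
                            {"key": 1706745600000,
                             "key_as_string": "2024-02-01T00:00:00.000Z",
                             "doc_count": 1, "revenue": {"value": 700.0}},
                        ]
                    },
                }
            ]
        }
    }
}

rows = flatten_category_months(sample_response)
for row in rows:
    print(row)
```

Flat rows like these are easier to load into a table or chart than the nested bucket tree.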

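Aggregation Performance Tuning

Technique	Notes
`size: 0`	Return no hits, only aggregation results
Use `filter`	Narrow the document set before aggregating
`keyword` type	Aggregate and sort string fields on keyword; text is unsupported unless fielddata is enabled, which is memory-hungry
`execution_hint: "map"`	Switches the terms execution strategy away from global ordinals; worth considering only when the query matches few documents
`shard_size`	Raise moderately to improve terms accuracy
`collect_mode: "breadth_first"`	Reduces memory in multi-level nested aggregations by pruning upper-level buckets before sub-aggregations are computed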

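Common Interview Questions

Q1: How do ES aggregations differ from SQL GROUP BY?

Answer:

Dimension	ES aggregations	SQL GROUP BY
Nesting	Multi-level nested aggregations are supported	Requires subqueries or CTEs
Combined with search	Aggregations run over the documents matched by the query	GROUP BY is a separate clause
Freshness	Near real-time	Depends on the database
Distribution	Natively distributed across shards	Often single-node, depending on the database
Approximation	cardinality uses an HLL-based approximation	COUNT DISTINCT is exact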

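Q2: Is cardinality (distinct count) exact?

Answer:

No. The cardinality aggregation uses the HyperLogLog++ (HLL++) algorithm, which is approximate. Counts below precision_threshold are expected to be close to exact; above it the relative error typically stays within a few percent and depends on the data.

The advantage is that memory is bounded by precision_threshold rather than by data volume, which suits distinct counts over very large datasets such as unique-visitor (UV) statistics. The precision_threshold parameter tunes accuracy (default 3000, maximum 40000; larger values are more accurate but use more memory).

If an exact distinct count is required, a composite aggregation can be paged through so that every bucket is visited and counted.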
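
The paging approach can be sketched as a loop that keeps passing the previous page's after_key back as after until an empty page arrives. In the Python below, `search` is a hypothetical callable that takes a request body and returns the response as a dict (in practice it would wrap whatever client is in use), the source name "value" is arbitrary, and `fake_search` is an in-memory stand-in so the loop can run without a cluster.

```python
def exact_distinct_count(search, field, page_size=1000):
    """Count distinct values of `field` by paging through a composite aggregation."""
    after, total = None, 0
    while True:
        composite = {
            "size": page_size,
            "sources": [{"value": {"terms": {"field": field}}}],
        }
        if after is not None:
            composite["after"] = after  # resume after the last bucket of the previous page
        body = {"size": 0, "aggs": {"distinct": {"composite": composite}}}
        agg = search(body)["aggregations"]["distinct"]
        total += len(agg["buckets"])
        after = agg.get("after_key")
        if not agg["buckets"] or after is None:
            return total


def fake_search(body, _values=("u1", "u2", "u2", "u3", "u3", "u3")):
    """In-memory stand-in for a search call; only understands the body built above."""
    composite = body["aggs"]["distinct"]["composite"]
    keys = sorted(set(_values))
    if "after" in composite:
        keys = [k for k in keys if k > composite["after"]["value"]]
    page = keys[:composite["size"]]
    agg = {"buckets": [{"key": {"value": k}, "doc_count": _values.count(k)} for k in page]}
    if page:
        agg["after_key"] = {"value": page[-1]}
    return {"aggregations": {"distinct": agg}}


distinct_users = exact_distinct_count(fake_search, "userId", page_size=2)
print(distinct_users)
```

The trade-off is cost: every distinct value becomes a bucket that has to travel to the client, so this is practical for occasional exact reports rather than for dashboards that refresh constantly.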

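Q3: Why can't aggregations use text fields?

Answer:

A text field is indexed as analyzed terms rather than as the original value. By default a terms aggregation on a text field is rejected because fielddata is disabled; if fielddata is enabled, documents are bucketed by individual tokens instead of by the original text, which is rarely meaningful and costs heap memory.

Solution: use multi-fields to define a keyword sub-field alongside the text field: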

{
  "title": {
    "type": "text",
    "fields": {
      "keyword": { "type": "keyword" }
    }
  }
}
// Search on title, aggregate on title.keyword

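Q4: How does the `buckets_path` syntax work?

Answer:

buckets_path lets a pipeline aggregation reference the result of another aggregation:

agg_name: references an aggregation at the same level
parent_agg>child_agg: references a nested aggregation (> separates levels)
agg_name.metric: selects one value of a multi-value metric, for example price_stats.avg
agg_name[key]: selects an entry by key, for example one percentile of a percentiles aggregation
_count: a special path that refers to the bucket's document count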
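
As a rough illustration of how such a path is read, the Python sketch below resolves a path against a single bucket of a response. It is deliberately simplified: `resolve_in_bucket` is a hypothetical helper, it handles only the `>` separator, the `.metric` suffix and `_count`, it ignores bracketed keys, and the bucket is invented sample data.

```python
def resolve_in_bucket(bucket, path):
    """Resolve a simplified buckets_path inside one bucket of a response."""
    if path == "_count":
        return bucket["doc_count"]
    *parents, last = path.split(">")
    agg_name, _, metric = last.partition(".")
    node = bucket
    for name in parents:
        node = node[name]  # descend through single-bucket aggregations
    # Single-value metrics expose "value"; multi-value metrics are picked by name.
    return node[agg_name][metric or "value"]


# Hypothetical bucket of a date_histogram with three sub-aggregations.
bucket = {
    "doc_count": 42,
    "revenue": {"value": 1200.0},
    "price_stats": {"count": 42, "min": 5.0, "max": 90.0, "avg": 30.0, "sum": 1260.0},
    "paid": {"doc_count": 30, "revenue": {"value": 900.0}},
}

for p in ("_count", "revenue", "price_stats.avg", "paid>revenue"):
    print(p, resolve_in_bucket(bucket, p))
```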
