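Mapping and Analyzers

Question

What is a Mapping in Elasticsearch? How does an analyzer work? How do you handle Chinese word segmentation?

Answer

Mapping Concept

A Mapping is Elasticsearch's equivalent of a table schema: it defines the name, data type, and indexing behavior of every field in an Index, and therefore determines how data is stored and searched.

// Define the Mapping for a product Index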
PUT /products
{
  "mappings": {
    "properties": {
      "title":       { "type": "text", "analyzer": "ik_max_word" },
      "description": { "type": "text", "analyzer": "ik_smart" },
      "price":       { "type": "float" },
      "brand":       { "type": "keyword" },
      "tags":        { "type": "keyword" },
      "createdAt":   { "type": "date", "format": "yyyy-MM-dd HH:mm:ss||epoch_millis" },
      "location":    { "type": "geo_point" },
      "specs":       { "type": "object" },
      "isOnSale":    { "type": "boolean" }
    }
  }
}

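Core Field Types

Text Types

Type	Analyzed	Full-text search	Exact match	Typical use
`text`	✅	✅ (match)	❌	Article titles, product descriptions
`keyword`	❌	❌	✅ (term)	Brands, tags, statuses

text vs keyword is the most important distinction

text: the value is split into terms by an analyzer, and those terms go into the inverted index. Used for full-text search (match queries)
keyword: the value is not analyzed; the whole value is indexed as a single term. Used for exact matching, sorting, and aggregations (term queries)

This comes up frequently in interviews!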
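The difference can be sketched with a toy inverted index. This is only an illustration: the `analyze` function below is a hypothetical stand-in for a real analyzer, and `build_index` is not how Elasticsearch stores data internally.

```python
import re
from collections import defaultdict

def analyze(value: str) -> list[str]:
    """Toy stand-in for an analyzer: lowercase and split on non-word characters."""
    return [t for t in re.split(r"\W+", value.lower()) if t]

def build_index(docs: dict[int, str], analyzed: bool) -> dict[str, set[int]]:
    """Build an inverted index: term -> set of document ids."""
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, value in docs.items():
        # text: one entry per term; keyword: the whole value is a single term
        terms = analyze(value) if analyzed else [value]
        for term in terms:
            index[term].add(doc_id)
    return index

docs = {1: "Apple iPhone 15", 2: "Apple Watch"}
text_index = build_index(docs, analyzed=True)      # behaves like a text field
keyword_index = build_index(docs, analyzed=False)  # behaves like a keyword field

print(sorted(text_index["apple"]))             # [1, 2]
print(sorted(keyword_index.get("apple", [])))  # []
print(sorted(keyword_index["Apple Watch"]))    # [2]
```

A lookup for `apple` only succeeds against the analyzed index, while the keyword-style index only answers for the exact original value.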

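Numeric Types

Type	Range	Use case
`byte`	-128 to 127	Very small values
`short`	-32768 to 32767	Small values
`integer`	-2^31 to 2^31-1	Regular integers
`long`	-2^63 to 2^63-1	Large integers
`float`	32-bit floating point	General decimals
`double`	64-bit floating point	High-precision decimals
`scaled_float`	Floating point stored as an integer via a fixed `scaling_factor`	Prices (e.g. with a factor of 100, 19.99 is stored as 1999)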
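The arithmetic behind `scaled_float` can be sketched as follows. The function names are hypothetical, and the sketch does not model how Elasticsearch breaks exact ties when rounding.

```python
def encode_scaled_float(value: float, scaling_factor: int) -> int:
    """Multiply by the scaling factor and round to the nearest integer."""
    return round(value * scaling_factor)

def decode_scaled_float(stored: int, scaling_factor: int) -> float:
    """Recover the (possibly rounded) value from its stored integer form."""
    return stored / scaling_factor

stored = encode_scaled_float(19.99, 100)
print(stored)                            # 1999
print(decode_scaled_float(stored, 100))  # 19.99
print(encode_scaled_float(19.994, 100))  # 1999: precision beyond the factor is lost
```

Choosing the factor is a trade-off: a larger factor keeps more decimal places but makes the stored integers larger.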

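Other Types

Type	Description
`date`	Dates; supports multiple formats
`boolean`	Boolean values
`object`	JSON object (stored flattened)
`nested`	Nested objects (each indexed as a separate hidden document, preserving the association between its fields)
`geo_point`	Latitude/longitude coordinates
`ip`	IPv4/IPv6 addresses
`completion`	Auto-completion suggestions

object vs nested

This is a common interview topic: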

// object type: stored flattened, so the association between fields inside each object is lost
{
  "comments": [
    { "user": "Alice", "text": "good" },
    { "user": "Bob", "text": "bad" }
  ]
}
// Internally, ES stores this as:
// comments.user: ["Alice", "Bob"]
// comments.text: ["good", "bad"]
// → a search for user=Alice AND text=bad matches incorrectly!

// nested type: each object is indexed as a separate hidden document
{
  "mappings": {
    "properties": {
      "comments": {
        "type": "nested",
        "properties": {
          "user": { "type": "keyword" },
          "text": { "type": "text" }
        }
      }
    }
  }
}
// With a nested query, user=Alice AND text=bad correctly does not match
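The false match can be reproduced with a small simulation. This is a sketch of the matching semantics only; `flatten`, `match_object`, and `match_nested` are hypothetical helpers, not Elasticsearch APIs.

```python
comments = [
    {"user": "Alice", "text": "good"},
    {"user": "Bob", "text": "bad"},
]

def flatten(objects):
    """Mimic object-type storage: one multi-valued field per key."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(key, []).append(value)
    return flat

def match_object(objects, **conditions):
    """Each condition is checked against the flattened field independently."""
    flat = flatten(objects)
    return all(value in flat.get(key, []) for key, value in conditions.items())

def match_nested(objects, **conditions):
    """All conditions must hold inside the same object."""
    return any(
        all(obj.get(key) == value for key, value in conditions.items())
        for obj in objects
    )

print(flatten(comments))  # {'user': ['Alice', 'Bob'], 'text': ['good', 'bad']}
print(match_object(comments, user="Alice", text="bad"))  # True (a false positive)
print(match_nested(comments, user="Alice", text="bad"))  # False
print(match_nested(comments, user="Bob", text="bad"))    # True
```

The flattened form has no way to tell which `user` value belonged with which `text` value, which is exactly the information nested documents keep.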

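Dynamic Mapping vs Explicit Mapping

Approach	Description	Use case
Dynamic Mapping	ES infers field types automatically	Rapid prototyping
Explicit Mapping	Every field is defined by hand	Production

// Control dynamic Mapping behavior
PUT /my_index
{
  "mappings": {
    "dynamic": "strict",  // strict: reject documents containing undefined fields; true: add new fields automatically; false: ignore new fields (kept in _source but not indexed)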
    "properties": {
      "title": { "type": "text" }
    }
  }
}

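Always use an explicit Mapping in production

Dynamic Mapping maps strings to text with a keyword sub-field and infers numbers as long or float, which may not be what you want. Once a field's type is set, it cannot be changed in place; you have to rebuild the index.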
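The three `dynamic` settings can be sketched as a small simulation. This is a simplification under stated assumptions: `infer_type` and `index_document` are hypothetical helpers, only JSON scalars are handled, and features such as date detection are not modeled.

```python
def infer_type(value):
    """Rough version of dynamic type inference for JSON scalars."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int in Python
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        return {"type": "text", "fields": {"keyword": {"type": "keyword", "ignore_above": 256}}}
    raise TypeError(f"unsupported value in this sketch: {value!r}")

def index_document(properties, doc, dynamic):
    """Return the field definitions after indexing doc under the given dynamic setting."""
    updated = dict(properties)
    for field, value in doc.items():
        if field in updated:
            continue
        if dynamic == "strict":
            raise ValueError(f"strict mapping: field [{field}] is not defined")
        if dynamic == "true":
            updated[field] = infer_type(value)
        # dynamic == "false": the field is ignored and the mapping stays unchanged
    return updated

props = {"title": {"type": "text"}}
doc = {"title": "ES", "price": 9.9, "stock": 3}
print(index_document(props, doc, "true")["price"])  # {'type': 'float'}
print(sorted(index_document(props, doc, "false")))  # ['title']
```

With `"strict"`, the same call raises an error instead of silently growing the mapping, which is why it suits production indices.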

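Common Mapping Parameters

Parameter	Description	Example
`index`	Whether the field is indexed for search	`"index": false` (not searchable)
`analyzer`	Analyzer used at index time (and at search time unless `search_analyzer` is set)	`"analyzer": "ik_max_word"`
`search_analyzer`	Analyzer used at search time	`"search_analyzer": "ik_smart"`
`doc_values`	Whether the field supports sorting/aggregations	`"doc_values": false` (saves disk space)
`store`	Whether the original value is stored separately	`"store": true`
`copy_to`	Copies the field value into a target field	`"copy_to": "full_text"`
`fields`	Multi-fields (one field indexed in several ways)	Support both full-text search and exact matching

Multi-fields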

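A single field can be indexed in several ways at once:

{
  "title": {
    "type": "text",
    "analyzer": "ik_max_word",
    "fields": {
      "keyword": {          // title.keyword: exact matching
        "type": "keyword",
        "ignore_above": 256
      },
      "pinyin": {           // title.pinyin: pinyin search (requires a pinyin analysis plugin)
        "type": "text",
        "analyzer": "pinyin"
      }
    }
  }
}

At search time you can then use title (full-text search), title.keyword (exact matching/aggregations), and title.pinyin (pinyin search).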

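Analyzers

An analyzer splits text into terms and is the core component of full-text search.

The Three Parts of an Analyzer

Component	Role	Example
Character Filter	Preprocesses the raw text	Stripping HTML tags, replacing characters
Tokenizer	Splits the text into terms	Splitting on whitespace or on word boundaries
Token Filter	Transforms the terms	Lowercasing, removing stop words, adding synonyms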
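The three stages run in a fixed order: character filters, then the tokenizer, then token filters. A minimal sketch of that pipeline is shown below. The regular expression, the whitespace tokenizer, and the stop-word list are illustrative assumptions, not Elasticsearch's built-in implementations.

```python
import re

STOP_WORDS = {"a", "is", "the"}  # illustrative list, not Elasticsearch's default

def char_filter(text: str) -> str:
    """Character filter: crude HTML tag stripping on the raw text."""
    return re.sub(r"<[^>]+>", " ", text)

def tokenizer(text: str) -> list[str]:
    """Tokenizer: split the filtered text on whitespace."""
    return text.split()

def token_filters(tokens: list[str]) -> list[str]:
    """Token filters: lowercase every token, then drop stop words."""
    lowered = [t.lower() for t in tokens]
    return [t for t in lowered if t not in STOP_WORDS]

def analyze(text: str) -> list[str]:
    """Run the stages in order: char filter -> tokenizer -> token filters."""
    return token_filters(tokenizer(char_filter(text)))

print(analyze("<b>Elasticsearch</b> is a Search Engine"))
# ['elasticsearch', 'search', 'engine']
```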

Built-in Analyzers

Analyzer	How it splits	Example input → output
standard	On word boundaries (the default)	"Hello World" → `[hello, world]`
simple	On non-letter characters	"Hello-World 123" → `[hello, world]`
whitespace	On whitespace	"Hello World" → `[Hello, World]`
keyword	No splitting (whole value)	"Hello World" → `[Hello World]`
language	Language-specific	english, cjk, etc.
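To compare the built-in analyzers on one input, the following rough approximations may help. They are assumptions for ASCII text only; the real `standard` analyzer follows Unicode word-boundary rules that a regular expression like this does not capture.

```python
import re

def standard(text):
    """Approximation: runs of letters or digits, lowercased."""
    return [t.lower() for t in re.findall(r"[A-Za-z0-9]+", text)]

def simple(text):
    """Runs of letters only, lowercased; digits are dropped."""
    return [t.lower() for t in re.findall(r"[A-Za-z]+", text)]

def whitespace(text):
    """Split on whitespace; case and punctuation are preserved."""
    return text.split()

def keyword(text):
    """The whole input becomes a single term."""
    return [text]

sample = "Hello-World 123"
print(standard(sample))    # ['hello', 'world', '123']
print(simple(sample))      # ['hello', 'world']
print(whitespace(sample))  # ['Hello-World', '123']
print(keyword(sample))     # ['Hello-World 123']
```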

Testing Analysis Output

// Test an analyzer with the _analyze API
POST /_analyze
{
  "analyzer": "standard",
  "text": "Elasticsearch is a search engine"
}
// Result: [elasticsearch, is, a, search, engine]

// Test Chinese text (standard splits it into single characters)
POST /_analyze
{
  "analyzer": "standard",
  "text": "我爱北京天安门"
}
// Result: [我, 爱, 北, 京, 天, 安, 门]  ← split into single characters, which is useless for search!

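Chinese Word Segmentation

The standard analyzer splits Chinese into single characters, which gives poor search results. A dedicated Chinese analyzer is needed.

IK Analyzer

IK Analysis is one of the most widely used Chinese analysis plugins for ES. It provides two modes:

Mode	Description	Example
`ik_max_word`	Finest-grained segmentation (emits as many word combinations as possible)	"中华人民共和国" → `[中华人民共和国, 中华人民, 中华, 华人, 人民共和国, 人民, 共和国, 共和, 国]`
`ik_smart`	Coarsest-grained segmentation (smart splitting)	"中华人民共和国" → `[中华人民共和国]`

The exact tokens depend on the IK version and its dictionary.

Best practice: index with ik_max_word, search with ik_smart

Index with ik_max_word: emit as many terms as possible to build a more complete inverted index
Search with ik_smart: search at the natural granularity of the user's input and avoid excess noise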

{
  "title": {
    "type": "text",
    "analyzer": "ik_max_word",           // index-time analyzer
    "search_analyzer": "ik_smart"        // search-time analyzer
  }
}
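Why this pairing helps recall can be sketched with a toy dictionary-based segmenter. This is not IK's actual algorithm: `DICTIONARY`, `fine_grained`, and `coarse` are hypothetical, and the dictionary holds only the words needed for the example.

```python
DICTIONARY = {"中华人民共和国", "中华人民", "中华", "华人", "人民共和国", "人民", "共和国", "共和", "国"}

def fine_grained(text):
    """Emit every dictionary word found in the text (ik_max_word-like)."""
    found = [
        (start, end)
        for start in range(len(text))
        for end in range(start + 1, len(text) + 1)
        if text[start:end] in DICTIONARY
    ]
    found.sort(key=lambda span: (span[0], -(span[1] - span[0])))  # by position, longest first
    return [text[s:e] for s, e in found]

def coarse(text):
    """Greedy longest match from left to right (ik_smart-like)."""
    tokens, pos = [], 0
    while pos < len(text):
        for end in range(len(text), pos, -1):
            # fall back to a single character when no dictionary word matches
            if text[pos:end] in DICTIONARY or end == pos + 1:
                tokens.append(text[pos:end])
                pos = end
                break
    return tokens

doc = "中华人民共和国"
query_terms = coarse("人民")                       # ['人民']
print(len(fine_grained(doc)))                      # 9
print(coarse(doc))                                 # ['中华人民共和国']
print(set(query_terms) <= set(fine_grained(doc)))  # True: found when indexed finely
print(set(query_terms) <= set(coarse(doc)))        # False: missed when indexed coarsely
```

A query for 人民 only finds the document when the index side was segmented finely, while the query itself stays at a single, unambiguous term.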

IK Custom Dictionaries

You can add custom dictionaries so that IK recognizes domain-specific terms:

<!-- IKAnalyzer.cfg.xml -->
<properties>
  <entry key="ext_dict">custom/mydict.dic</entry>
  <entry key="ext_stopwords">custom/stopword.dic</entry>
  <entry key="remote_ext_dict">http://example.com/dict.txt</entry>
</properties>

Custom Analyzers

Combine Character Filters, a Tokenizer, and Token Filters to build a custom analyzer:

PUT /my_index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "html_strip": {
          "type": "html_strip"  // strip HTML tags
        }
      },
      "tokenizer": {
        "ik_tokenizer": {
          "type": "ik_max_word"
        }
      },
      "filter": {
        "my_synonym": {
          "type": "synonym",
          "synonyms": ["ES,Elasticsearch", "手机,手机设备"]
        },
        "my_stopwords": {
          "type": "stop",
          "stopwords": ["的", "了", "和", "是"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "char_filter": ["html_strip"],
          "tokenizer": "ik_tokenizer",
          "filter": ["lowercase", "my_synonym", "my_stopwords"]
        }
      }
    }
  }
}
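The order of the token filters in `my_analyzer` matters: lowercasing runs before synonym expansion, so an input token `ES` can match a synonym rule. A rough sketch of that chain is below. It assumes already-tokenized input, returns a flat list rather than positioned tokens, and lowercases the rule terms itself; `build_synonym_map` and `apply_filters` are hypothetical helpers.

```python
SYNONYM_RULES = ["ES,Elasticsearch", "手机,手机设备"]
STOP_WORDS = {"的", "了", "和", "是"}

def build_synonym_map(rules):
    """Map every term to the full list of terms in its rule."""
    mapping = {}
    for rule in rules:
        # lowercase rule terms so they line up with lowercased tokens
        group = [term.strip().lower() for term in rule.split(",")]
        for term in group:
            mapping[term] = group
    return mapping

def apply_filters(tokens, synonym_map):
    """Run the token filters in the order lowercase -> synonym -> stop."""
    lowered = [t.lower() for t in tokens]
    expanded = [syn for t in lowered for syn in synonym_map.get(t, [t])]
    return [t for t in expanded if t not in STOP_WORDS]

synonyms = build_synonym_map(SYNONYM_RULES)
print(apply_filters(["ES", "是", "搜索引擎"], synonyms))
# ['es', 'elasticsearch', '搜索引擎']
```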

Mapping Change Strategy

The type of an existing field cannot be changed in place. To change it:

// 1. Create a new Index
PUT /products_v2 { "mappings": { ... } }

// 2. Migrate the data
POST /_reindex
{
  "source": { "index": "products_v1" },
  "dest": { "index": "products_v2" }
}

// 3. Switch the alias (zero downtime)
POST /_aliases
{
  "actions": [
    { "remove": { "index": "products_v1", "alias": "products" } },
    { "add": { "index": "products_v2", "alias": "products" } }
  ]
}
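When index versions change regularly, the alias switch body can be generated rather than written by hand. The helper below is a hypothetical sketch that only builds the JSON shown above; sending it to the cluster is left out.

```python
import json

def alias_swap_actions(alias: str, old_index: str, new_index: str) -> dict:
    """Build an _aliases request body that moves alias from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }

body = alias_swap_actions("products", "products_v1", "products_v2")
print(json.dumps(body, indent=2))
```

Keeping both actions in a single request is what makes the switch seamless: clients querying the `products` alias never see a moment when it points at no index.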

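Common Interview Questions

Q1: What is the difference between text and keyword?

Answer:

Dimension	`text`	`keyword`
Analysis	✅ Split into terms by an analyzer	❌ Not analyzed; indexed as a whole
Full-text search	✅ `match` query	❌
Exact match	❌	✅ `term` query
Sorting	❌ (unless `fielddata` is enabled, which is not recommended)	✅
Aggregations	❌ (unless `fielddata` is enabled, which is not recommended)	✅
Typical fields	Article bodies, product descriptions	Brands, tags, statuses, email addresses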

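Q2: What is the difference between IK's ik_max_word and ik_smart?

Answer:

ik_max_word: splits as finely as possible and emits every word combination it can find. For example, "中华人民共和国" yields 9 terms
ik_smart: coarsest-grained smart segmentation with as little overlap as possible. For example, "中华人民共和国" yields a single term
Best practice: index with ik_max_word (better recall) and search with ik_smart (better precision)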

Q3: How do you implement search suggestions (autocomplete)?

Answer:

Use the completion field type together with the Suggest API:

// Mapping
{ "suggest": { "type": "completion", "analyzer": "ik_max_word" } }

// Index a document
PUT /products/_doc/1
{ "suggest": { "input": ["iPhone 15", "苹果手机", "Apple iPhone"] } }

// Suggestion query
POST /products/_search
{
  "suggest": {
    "product_suggest": {
      "prefix": "iPh",
      "completion": { "field": "suggest", "size": 5 }
    }
  }
}
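Conceptually, a prefix suggestion is a lookup for stored inputs that start with the typed text. The sketch below shows that idea with a sorted list and binary search. It is a stand-in for the concept only, not the data structure Elasticsearch uses, and `PrefixSuggester` is a hypothetical class.

```python
import bisect

class PrefixSuggester:
    """Sorted-list prefix lookup with case-insensitive matching."""

    def __init__(self):
        self._entries = []  # sorted list of (lowercased input, original input)

    def add(self, *inputs):
        for text in inputs:
            bisect.insort(self._entries, (text.lower(), text))

    def suggest(self, prefix, size=5):
        key = prefix.lower()
        # first entry whose lowercased input is >= the prefix
        start = bisect.bisect_left(self._entries, (key,))
        results = []
        for lowered, original in self._entries[start:]:
            if not lowered.startswith(key) or len(results) == size:
                break
            results.append(original)
        return results

suggester = PrefixSuggester()
suggester.add("iPhone 15", "苹果手机", "Apple iPhone")
print(suggester.suggest("iPh"))  # ['iPhone 15']
print(suggester.suggest("app"))  # ['Apple iPhone']
```

Note that "Apple iPhone" is not returned for the prefix `iPh`: matching starts at the beginning of each input, which is why several inputs are indexed for the same product.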

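Q4: Can a field's type be changed after the Mapping is created?

Answer:

No. The type of an existing field cannot be changed, because existing data has already been indexed under the old type. Your options are:

Add fields: new fields can be added at any time
Change a type: create a new Index and migrate the data with _reindex
Use an alias: switch over through an alias for zero downtime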

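Q5: What is the difference between object and nested?

Answer:

object: the fields of inner objects are stored flattened. For an array of objects, the values of same-named fields are merged, so the association between fields within each object is lost and queries can produce false matches
nested: each inner object is stored as a separate hidden document, which preserves the association between its fields. It must be queried with the dedicated nested query

Rule of thumb: if the fields of each object in the array must be queried together as a unit, use nested. Otherwise object is enough (and performs better).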
