4. Text Analysis Components

予早 2025-10-07 00:27:49

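Text analysis components

The text analysis components consist of analyzers (Analyzer) and tokenizers (Tokenizer).

An analyzer converts raw text into normalized terms (tokens) for indexing or searching. It consists of three parts, applied in this order:

Character filters: preprocess the text (e.g. strip HTML tags, replace special characters).
Tokenizer: splits the text into individual terms (e.g. on whitespace or punctuation).
Token filters: post-process the terms (e.g. lowercase them, remove stop words, expand synonyms).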
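
The three stages above can be sketched in plain Python. This is only a toy model of the pipeline order, not how Elasticsearch implements it; the function names and the stop word list are made up for illustration.

```python
import re
from typing import Callable, List

def strip_html(text: str) -> str:
    """Character filter: replace anything that looks like an HTML tag with a space."""
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text: str) -> List[str]:
    """Tokenizer: split the text on runs of whitespace."""
    return text.split()

def lowercase(tokens: List[str]) -> List[str]:
    """Token filter: lowercase every term."""
    return [t.lower() for t in tokens]

def remove_stopwords(tokens: List[str]) -> List[str]:
    """Token filter: drop stop words (hypothetical three-word list)."""
    stop = {"the", "a", "is"}
    return [t for t in tokens if t not in stop]

def analyze(
    text: str,
    char_filters: List[Callable[[str], str]],
    tokenizer: Callable[[str], List[str]],
    token_filters: List[Callable[[List[str]], List[str]]],
) -> List[str]:
    """Run character filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

print(analyze("<p>The QUICK fox</p>", [strip_html],
              whitespace_tokenizer, [lowercase, remove_stopwords]))
```

The order matters: the stop word filter only matches "The" because the lowercase filter ran before it.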

Analyzers

Built-in analyzers

https://www.elastic.co/docs/reference/text-analysis/analyzer-reference

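Name	Description
standard	The default analyzer. The standard analyzer splits text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.
simple	The simple analyzer splits text into terms whenever it encounters a non-letter character. It lowercases all terms.
stop	The stop analyzer is like the simple analyzer, but also supports removing stop words.
whitespace	The whitespace analyzer splits text into terms whenever it encounters a whitespace character. It does not lowercase terms.
language	Elasticsearch provides many language-specific analyzers, such as the english or french analyzers.
pattern	The pattern analyzer uses a regular expression to split text into terms. It supports lowercasing and stop word removal.
keyword	The keyword analyzer is a no-op analyzer: it accepts whatever text it is given and outputs exactly the same text as a single term.
fingerprint	The fingerprint analyzer is a specialized analyzer that produces a fingerprint which can be used for duplicate detection.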
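
To make the differences in the table concrete, here is a rough Python approximation of the simple, whitespace and keyword analyzers. The `*_like` functions are stand-ins written for this note and only mimic the documented behaviour; the real analyzers run inside Elasticsearch and may differ on edge cases.

```python
import re
from typing import List

def simple_like(text: str) -> List[str]:
    """Approximates `simple`: split at non-letter characters, then lowercase.
    [^\\W\\d_] matches word characters that are neither digits nor underscore."""
    return [t.lower() for t in re.findall(r"[^\W\d_]+", text)]

def whitespace_like(text: str) -> List[str]:
    """Approximates `whitespace`: split on whitespace only, keep case."""
    return text.split()

def keyword_like(text: str) -> List[str]:
    """Approximates `keyword`: emit the whole input as a single term."""
    return [text]

sample = "The 2 QUICK Brown-Foxes"
print(simple_like(sample))      # digits and the hyphen act as separators
print(whitespace_like(sample))  # case and punctuation are preserved
print(keyword_like(sample))     # one term, unchanged
```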

Third-party analyzers

IK analyzer

https://github.com/infinilabs/analysis-ik

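The IK analyzer provides two modes:

ik_max_word: splits the text at the finest granularity, producing as many terms as possible
ik_smart: splits the text at the coarsest granularity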

# 1. Install the IK analysis plugin; its version must match the ES version
./elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-ik/8.19.3
# 2. List the installed plugins
./elasticsearch-plugin list
# 3. Restart ES

GET _analyze
{
  "analyzer": "ik_smart",
  "text": ["大王叫我来巡山"]
}
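
The same request can be issued from a script to compare the two IK modes side by side. The sketch below uses only the Python standard library and assumes an unauthenticated node at http://localhost:9200, which is a guess about the local setup; a secured cluster would also need HTTPS and credentials. The helper names are invented for this note.

```python
import json
import urllib.request
from typing import List

ES_URL = "http://localhost:9200"  # assumption: local node without authentication

def build_analyze_request(analyzer: str, text: str,
                          base_url: str = ES_URL) -> urllib.request.Request:
    """Build a POST /_analyze request for the given analyzer and text."""
    body = json.dumps({"analyzer": analyzer, "text": text},
                      ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def analyze_tokens(analyzer: str, text: str, base_url: str = ES_URL) -> List[str]:
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [t["token"] for t in payload["tokens"]]
```

With a node running and the plugin installed, calling `analyze_tokens("ik_smart", "大王叫我来巡山")` and then `analyze_tokens("ik_max_word", "大王叫我来巡山")` shows how much finer the second mode splits the same sentence.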

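IK analyzer extension configuration

The IK configuration file, IKAnalyzer.cfg.xml, accepts extension dictionaries and extension stopword dictionaries, either as local files or as remote dictionaries.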

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionaries here -->
    <entry key="ext_dict"></entry>
    <!-- Configure your own extension stopword dictionaries here -->
    <entry key="ext_stopwords"></entry>
    <!-- Configure a remote extension dictionary here -->
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
    <!-- Configure a remote extension stopword dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>
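
A quick way to check which dictionaries a configuration file actually enables is to parse it. The sketch below reads the Java properties XML with the standard library. The sample dictionary paths are hypothetical, and treating `;` as the separator between several dictionaries is my reading of the IK README rather than something shown above.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Hypothetical sample; the two .dic paths are made up for illustration.
SAMPLE_CFG = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <entry key="ext_dict">custom/mydict.dic;custom/extra.dic</entry>
    <entry key="ext_stopwords"></entry>
    <!-- <entry key="remote_ext_dict">words_location</entry> -->
</properties>
"""

def read_ik_entries(xml_bytes: bytes) -> Dict[str, List[str]]:
    """Map each <entry key="..."> to its list of dictionary locations.
    Commented-out entries are skipped by the parser; empty entries give []."""
    root = ET.fromstring(xml_bytes)
    entries: Dict[str, List[str]] = {}
    for entry in root.iter("entry"):
        raw = (entry.text or "").strip()
        # Assumption: several dictionaries are separated by ';'
        entries[entry.get("key")] = [p.strip() for p in raw.split(";") if p.strip()]
    return entries

print(read_ik_entries(SAMPLE_CFG))
```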

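Using a remote extension dictionary for domain-specific search

Fund product names are usually not ordinary vocabulary. Suppose there is a fund product named 赢图得升666号C类份额, and search on this name needs to be improved.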

POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "赢图得升666号C类份额"
}

Without an extension dictionary, the product name is split into single characters:

{
  "tokens": [
    {
      "token": "赢",
      "start_offset": 0,
      "end_offset": 1,
      "type": "CN_CHAR",
      "position": 0
    },
    {
      "token": "图",
      "start_offset": 1,
      "end_offset": 2,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "得",
      "start_offset": 2,
      "end_offset": 3,
      "type": "CN_CHAR",
      "position": 2
    },
    {
      "token": "升",
      "start_offset": 3,
      "end_offset": 4,
      "type": "CN_CHAR",
      "position": 3
    },
    {
      "token": "666",
      "start_offset": 4,
      "end_offset": 7,
      "type": "ARABIC",
      "position": 4
    },
    {
      "token": "号",
      "start_offset": 7,
      "end_offset": 8,
      "type": "COUNT",
      "position": 5
    },
    {
      "token": "c",
      "start_offset": 8,
      "end_offset": 9,
      "type": "ENGLISH",
      "position": 6
    },
    {
      "token": "类",
      "start_offset": 9,
      "end_offset": 10,
      "type": "CN_CHAR",
      "position": 7
    },
    {
      "token": "份额",
      "start_offset": 10,
      "end_offset": 12,
      "type": "CN_WORD",
      "position": 8
    }
  ]
}

Create ik_dict/ik_remote_dict.txt with one word per line:

赢图得升
赢图
得升

Run the following in the ik_dict directory to serve the dictionary over HTTP:

python -m http.server 8000 > http.log 2>&1
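
As far as I understand the IK README, the plugin polls the remote dictionary URL and reloads it when the Last-Modified or ETag response header changes. The stock http.server already sends Last-Modified for files. The sketch below is an optional variant that also sends an ETag derived from the file content; the handler and helper names are my own, and the polling behaviour is an assumption to verify against the plugin version in use.

```python
import hashlib
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from pathlib import Path

def content_etag(path: Path) -> str:
    """Return a quoted ETag that changes whenever the file content changes."""
    digest = hashlib.sha256(path.read_bytes()).hexdigest()[:16]
    return f'"{digest}"'

class DictHandler(SimpleHTTPRequestHandler):
    """Static file handler that adds an ETag header for regular files."""

    def end_headers(self) -> None:
        target = Path(self.translate_path(self.path))
        if target.is_file():
            self.send_header("ETag", content_etag(target))
        super().end_headers()

def serve(port: int = 8000) -> None:
    """Serve the current directory (run this from ik_dict) until interrupted."""
    ThreadingHTTPServer(("", port), DictHandler).serve_forever()
```

Calling `serve()` from the ik_dict directory would replace the `python -m http.server 8000` command; the dictionary URL stays the same.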

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionaries here -->
    <entry key="ext_dict"></entry>
    <!-- Configure your own extension stopword dictionaries here -->
    <entry key="ext_stopwords"></entry>
    <!-- Configure a remote extension dictionary here -->
    <entry key="remote_ext_dict">http://localhost:8000/ik_remote_dict.txt</entry>
    <!-- Configure a remote extension stopword dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
</properties>

After ES is restarted so that the new configuration takes effect, the same ik_max_word request returns the dictionary words instead of single characters:

{
  "tokens": [
    {
      "token": "赢图得升",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "赢图",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "得升",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "666",
      "start_offset": 4,
      "end_offset": 7,
      "type": "ARABIC",
      "position": 3
    },
    {
      "token": "号",
      "start_offset": 7,
      "end_offset": 8,
      "type": "COUNT",
      "position": 4
    },
    {
      "token": "c",
      "start_offset": 8,
      "end_offset": 9,
      "type": "ENGLISH",
      "position": 5
    },
    {
      "token": "类",
      "start_offset": 9,
      "end_offset": 10,
      "type": "CN_CHAR",
      "position": 6
    },
    {
      "token": "份额",
      "start_offset": 10,
      "end_offset": 12,
      "type": "CN_WORD",
      "position": 7
    }
  ]
}
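
The effect of the dictionary is easiest to see by diffing the token lists of the two responses. The sketch below uses trimmed copies of the responses shown above (only the token field is kept); the helper names are made up for this note.

```python
from typing import Dict, List

# Trimmed copies of the two _analyze responses from this section.
BEFORE = {"tokens": [{"token": t} for t in
                     ["赢", "图", "得", "升", "666", "号", "c", "类", "份额"]]}
AFTER = {"tokens": [{"token": t} for t in
                    ["赢图得升", "赢图", "得升", "666", "号", "c", "类", "份额"]]}

def tokens_of(response: Dict) -> List[str]:
    """Extract the token strings from an _analyze response."""
    return [t["token"] for t in response["tokens"]]

def token_diff(before: Dict, after: Dict) -> Dict[str, List[str]]:
    """List tokens that appear only after, and only before, the change."""
    old, new = tokens_of(before), tokens_of(after)
    return {
        "gained": [t for t in new if t not in old],
        "lost": [t for t in old if t not in new],
    }

print(token_diff(BEFORE, AFTER))
```

The three dictionary words replace the four single-character tokens, so a query for 赢图 or 得升 can now match the product name as a whole term.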

Pinyin analyzer

https://github.com/infinilabs/analysis-pinyin

# 1. Install the pinyin analysis plugin; its version must match the ES version
./elasticsearch-plugin install https://get.infini.cloud/elasticsearch/analysis-pinyin/8.19.3
# 2. List the installed plugins
./elasticsearch-plugin list
# 3. Restart ES

Tokenizers

https://www.elastic.co/docs/reference/text-analysis/tokenizer-reference
