Elasticsearch (5): Integrating the IK Analyzer - 张种恩的技术小栈

Analyzing the Problem with Queries

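As mentioned earlier, we are currently using the standard analyzer that Elasticsearch provides by default. For Chinese text it splits every Chinese character into its own term, which is clearly not what we want. Let's first look at how the standard analyzer tokenizes text.
Send a POST request to 127.0.0.1:9200/_analyze with the following request body: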

{
	"analyzer":"standard",
	"text":"This library is used to not only read Maven project object model files"
}

Response:

{
    "tokens": [
        {
            "token": "this",
            "start_offset": 0,
            "end_offset": 4,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "library",
            "start_offset": 5,
            "end_offset": 12,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "is",
            "start_offset": 13,
            "end_offset": 15,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "used",
            "start_offset": 16,
            "end_offset": 20,
            "type": "<ALPHANUM>",
            "position": 3
        },
        {
            "token": "to",
            "start_offset": 21,
            "end_offset": 23,
            "type": "<ALPHANUM>",
            "position": 4
        },
        {
            "token": "not",
            "start_offset": 24,
            "end_offset": 27,
            "type": "<ALPHANUM>",
            "position": 5
        },
        {
            "token": "only",
            "start_offset": 28,
            "end_offset": 32,
            "type": "<ALPHANUM>",
            "position": 6
        },
        {
            "token": "read",
            "start_offset": 33,
            "end_offset": 37,
            "type": "<ALPHANUM>",
            "position": 7
        },
        {
            "token": "maven",
            "start_offset": 38,
            "end_offset": 43,
            "type": "<ALPHANUM>",
            "position": 8
        },
        {
            "token": "project",
            "start_offset": 44,
            "end_offset": 51,
            "type": "<ALPHANUM>",
            "position": 9
        },
        {
            "token": "object",
            "start_offset": 52,
            "end_offset": 58,
            "type": "<ALPHANUM>",
            "position": 10
        },
        {
            "token": "model",
            "start_offset": 59,
            "end_offset": 64,
            "type": "<ALPHANUM>",
            "position": 11
        },
        {
            "token": "files",
            "start_offset": 65,
            "end_offset": 70,
            "type": "<ALPHANUM>",
            "position": 12
        }
    ]
}

As you can see, the standard analyzer tokenizes English text correctly.
Now let's test Chinese by changing the request body to the following:

{
	"analyzer":"standard",
	"text":"我是程序员"
}

Response:

{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "程",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "序",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "员",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        }
    ]
}

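Just as we said earlier, it splits the Chinese string into individual characters, whereas in Chinese "程序" (program) and "程序员" (programmer) should clearly each be analyzed as a single word.
To get the result we want, the analyzer needs good support for Chinese. Many analyzers support Chinese word segmentation, such as the word segmenter, Paoding (庖丁解牛), Pangu (盘古分词), and Ansj, but the one we use most often is the IK analyzer introduced below.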
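
The two _analyze requests above can also be sent from a script. The following is a minimal sketch using only the Python standard library; the function names and the ES_URL constant are illustrative assumptions, and the address is simply the local node used in the examples above.

```python
import json
import urllib.request

# Assumption: the local node used in the examples above.
ES_URL = "http://127.0.0.1:9200"


def build_analyze_request(analyzer, text, base_url=ES_URL):
    """Build (but do not send) a POST request for the _analyze endpoint."""
    body = json.dumps({"analyzer": analyzer, "text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_analyze",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def analyze(analyzer, text, base_url=ES_URL):
    """Send the request and return only the token strings from the response."""
    request = build_analyze_request(analyzer, text, base_url)
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [item["token"] for item in payload["tokens"]]


# Usage (requires a running Elasticsearch node):
#   analyze("standard", "我是程序员")
```

Separating request construction from sending keeps the request body easy to inspect without a running node.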

Introduction to the IK Analyzer

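IKAnalyzer is an open-source, lightweight Chinese word segmentation toolkit written in Java. Since version 1.0 was released in December 2006, IKAnalyzer has gone through 3 major versions. Originally it was a Chinese segmentation component built around the open-source Lucene project, combining dictionary-based segmentation with grammar analysis algorithms. Starting with version 3.0, IKAnalyzer became a general-purpose segmentation component for Java, independent of the Lucene project, while still providing a default optimized implementation for Lucene.
The features of IK Analyzer 3.0 are as follows: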

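Uses a distinctive "forward iterative finest-grained segmentation algorithm" and can process 600,000 characters per second.
Uses a multi-subprocessor analysis mode that supports segmenting English letters (IP addresses, email addresses, URLs), numbers (dates, common Chinese numerals and measure words, Roman numerals, scientific notation), and Chinese vocabulary (personal names, place names).
Support for mixed Chinese and English text is not very good; handling it is cumbersome and requires an additional query. On the other hand, dictionary storage is optimized, with support for personal entries and a smaller memory footprint.
Supports user-defined dictionary extensions.
Provides IKQueryParser, a query analyzer optimized for Lucene full-text search; it uses an ambiguity analysis algorithm to optimize the permutations of query keywords, which can greatly improve Lucene's hit rate.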

Installing the IK Analyzer

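1. Download.
GitHub: https://github.com/medcl/elasticsearch-analysis-ik/releases
Baidu Netdisk: https://pan.baidu.com/s/1461yg2y6LvlDyr-a02VHAA
2. Unzip the downloaded zip package, rename the extracted elasticsearch directory to ik-analyzer, and place it in the plugins directory of the Elasticsearch installation. The plugin release should match the version of Elasticsearch you are running.
3. Restart the ES service for the plugin to take effect.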
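
After the restart, one way to confirm that the plugin was loaded is to look at the output of GET /_cat/plugins?format=json. The sketch below only parses such output; the helper name is hypothetical, the sample JSON is made up for illustration (including the placeholder version), and the component name "analysis-ik" is assumed to be what the plugin reports.

```python
import json


def has_plugin(cat_plugins_json, component="analysis-ik"):
    """Return True if any row of the _cat/plugins JSON output lists the component."""
    rows = json.loads(cat_plugins_json)
    return any(row.get("component") == component for row in rows)


# Hypothetical sample of what GET /_cat/plugins?format=json might return.
sample = '[{"name": "node-1", "component": "analysis-ik", "version": "x.y.z"}]'

print(has_plugin(sample))  # True
print(has_plugin("[]"))    # False
```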

Testing the IK Analyzer

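The IK analyzer provides two segmentation modes, ik_smart and ik_max_word. ik_smart performs minimal segmentation (the fewest tokens), while ik_max_word performs the finest-grained segmentation. Let's look at how they differ.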

Minimal segmentation

Send a POST request to 127.0.0.1:9200/_analyze with the following request body:

{
	"analyzer":"ik_smart",
	"text":"我是程序员"
}

Response:

{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "程序员",
            "start_offset": 2,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 2
        }
    ]
}

Finest-grained segmentation

Send a POST request to 127.0.0.1:9200/_analyze with the following request body:

{
	"analyzer":"ik_max_word",
	"text":"我是程序员"
}

Response:

{
    "tokens": [
        {
            "token": "我",
            "start_offset": 0,
            "end_offset": 1,
            "type": "CN_CHAR",
            "position": 0
        },
        {
            "token": "是",
            "start_offset": 1,
            "end_offset": 2,
            "type": "CN_CHAR",
            "position": 1
        },
        {
            "token": "程序员",
            "start_offset": 2,
            "end_offset": 5,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "程序",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "员",
            "start_offset": 4,
            "end_offset": 5,
            "type": "CN_CHAR",
            "position": 4
        }
    ]
}
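
The difference between the two modes can be illustrated with a toy dictionary-based segmenter. This is only a sketch, not IK's actual implementation: the two-word dictionary and both function names are assumptions, smart_cut is a plain greedy longest match, and max_word_cut does not reproduce every token that ik_max_word emits (for example, it omits the trailing "员" seen in the response above).

```python
def smart_cut(text, dictionary):
    """Greedy forward longest match: at each position the longest word wins."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for end in range(len(text), i, -1):
            if text[i:end] in dictionary or end == i + 1:
                tokens.append(text[i:end])
                i = end
                break
    return tokens


def max_word_cut(text, dictionary):
    """Emit every dictionary word at every start position (longest first),
    plus single characters that no emitted word covers."""
    tokens, covered_until = [], 0
    for start in range(len(text)):
        words = [
            text[start:end]
            for end in range(len(text), start, -1)
            if text[start:end] in dictionary
        ]
        if words:
            tokens.extend(words)
            covered_until = max(covered_until, start + len(words[0]))
        elif start >= covered_until:
            tokens.append(text[start])
    return tokens


# Hypothetical two-entry dictionary, for illustration only.
toy_dictionary = {"程序", "程序员"}

print(smart_cut("我是程序员", toy_dictionary))     # ['我', '是', '程序员']
print(max_word_cut("我是程序员", toy_dictionary))  # ['我', '是', '程序员', '程序']
```

The coarse mode yields one token per stretch of text, while the fine-grained mode also emits the shorter words nested inside longer ones, which increases recall at the cost of a larger index.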

Query Tests

Specifying the IK analyzer when creating the index

Delete the existing index, then send a PUT request to 127.0.0.1:9200/blog with the following request body:

{
    "mappings": {
        "article": {
            "properties": {
                "id": {
                    "type": "long",
                    "store": true
                },
                "title": {
                    "type": "text",
                    "store": true,
                    "index": true,
                    "analyzer": "ik_smart"
                },
                "content": {
                    "type": "text",
                    "store": true,
                    "index": true,
                    "analyzer": "ik_smart"
                }
            }
        }
    }
}

The analyzer property of each field specifies the analyzer that field uses.
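
Because the analyzer is set per field, the mapping body can also be generated rather than written by hand, which makes it easy to switch between ik_smart and ik_max_word. A minimal sketch follows; the build_mapping helper and its parameters are hypothetical, and the output mirrors the request body shown above.

```python
import json


def build_mapping(type_name, text_fields, analyzer="ik_smart"):
    """Build an index-creation body with one stored text field per name."""
    properties = {"id": {"type": "long", "store": True}}
    for field in text_fields:
        properties[field] = {
            "type": "text",
            "store": True,
            "index": True,
            "analyzer": analyzer,
        }
    return {"mappings": {type_name: {"properties": properties}}}


body = build_mapping("article", ["title", "content"])
print(json.dumps(body, indent=4))
```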

Add the test data again as follows:

term query

Send a POST request to 127.0.0.1:9200/blog/article/_search with the following request body:

{
	"query":{
		"term":{
			"title":"发表"
		}
	}
}

Response:

{
    "took": 1,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 0.2876821,
        "hits": [
            {
                "_index": "blog",
                "_type": "article",
                "_id": "1",
                "_score": 0.2876821,
                "_source": {
                    "id": 1,
                    "title": "香港主持人发表肮脏言论：宁愿做英国狗屎上的苍蝇",
                    "content": "“我下辈子再投胎，我宁愿做英国狗拉出来的那坨狗屎上面的那粒苍蝇。”近期在外部势力干预下，香港乱象不断，有香港主持人居然发表如此肮脏的言论。"
                }
            }
        ]
    }
}
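
A term query does not analyze its input; it looks up the given term as-is among the tokens stored at index time, so it only hits when the analyzer produced exactly that token. The toy inverted index below sketches this under simplifying assumptions: the helper names are hypothetical, and the token lists are copied from the _analyze responses for "我是程序员" shown earlier.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, tokens in docs.items():
        for token in tokens:
            index[token].add(doc_id)
    return index


def term_query(index, term):
    """Exact lookup: the term is not analyzed and must equal an indexed token."""
    return sorted(index.get(term, set()))


# Tokens as produced by the standard analyzer and by ik_smart, respectively.
standard_index = build_inverted_index({1: ["我", "是", "程", "序", "员"]})
ik_smart_index = build_inverted_index({1: ["我", "是", "程序员"]})

print(term_query(standard_index, "程序员"))  # []
print(term_query(ik_smart_index, "程序员"))  # [1]
```

By the same reasoning, the hit for "发表" above suggests that ik_smart emitted "发表" as a token when the title was indexed.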

query_string query