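In Elasticsearch you sometimes need prefix search, which is a form of approximate matching.
The prefix query returns documents whose specified field contains a term that starts with the given prefix.
For example: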
PUT test-index
PUT test-index/_doc/1
{
  "full_name": "wangwu"
}
PUT test-index/_doc/22
{
  "full_name": "li"
}
PUT test-index/_doc/111
{
  "full_name": "wusun"
}
PUT test-index/_doc/1111
{
  "full_name": "a"
}
GET test-index/_search
{
  "query": {
    "prefix": {
      "full_name": "wus"
    }
  }
}
The result:
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "test-index",
        "_type" : "_doc",
        "_id" : "111",
        "_score" : 1.0,
        "_source" : {
          "full_name" : "wusun"
        }
      }
    ]
  }
}
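How a prefix query is executed:
1. Scan the inverted index and find the first term that starts with wus
2. Collect the IDs of the documents associated with that term
3. Move on to the next term in the inverted index
4. If that term also starts with wus, go back to step 2 and repeat; otherwise the scan stops
If the index holds only a few documents, this approach works fine. With a large number of documents, however, a prefix query can become slow.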
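The steps above can be sketched in a few lines of Python. This is only a simplified model and not how Lucene actually stores or walks its term dictionary: the inverted index is assumed to be a plain dict from term to document IDs, and the names prefix_scan, postings and sorted_terms are made up for illustration.

```python
from bisect import bisect_left


def prefix_scan(sorted_terms, postings, prefix):
    """Walk a sorted term list and collect doc IDs for terms starting with prefix."""
    doc_ids = set()
    # Step 1: jump to the first term that is >= prefix in sort order.
    i = bisect_left(sorted_terms, prefix)
    # Step 4: keep going only while the current term still starts with prefix.
    while i < len(sorted_terms) and sorted_terms[i].startswith(prefix):
        # Step 2: collect the document IDs attached to this term.
        doc_ids.update(postings[sorted_terms[i]])
        # Step 3: move on to the next term.
        i += 1
    return sorted(doc_ids)


# Toy inverted index built from the four example documents (term -> doc IDs).
postings = {"wangwu": ["1"], "li": ["22"], "wusun": ["111"], "a": ["1111"]}
sorted_terms = sorted(postings)

print(prefix_scan(sorted_terms, postings, "wus"))
print(prefix_scan(sorted_terms, postings, "sun"))
```

The cost of the loop grows with the number of terms that share the prefix, which is why short prefixes over large term dictionaries are the expensive case.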
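To address this, newer versions of ES added the index_prefixes option, which essentially trades space for time
(the prefixes of the fields you care about are indexed ahead of time).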
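index_prefixes basics:
The index_prefixes parameter enables indexing of term prefixes to speed up prefix searches. It accepts the following optional settings:
min_chars: the minimum prefix length to index (inclusive). Must be greater than 0; defaults to 2.
max_chars: the maximum prefix length to index (inclusive). Must be less than 20; defaults to 5.
You can think of index_prefixes as an extra index layered on top of the index: the prefixes of each term get their own inverted-index entries. This speeds up prefix searches at the cost of additional disk space, so it is fundamentally a space-for-time trade.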
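To make the space-for-time trade concrete, here is a minimal Python sketch of the idea. It is a toy model under stated assumptions, not Elasticsearch's implementation: build_prefix_index and lookup are hypothetical helper names, and the index is a plain dict. Every prefix whose length lies between min_chars and max_chars is stored up front, so a prefix lookup becomes a single dictionary access.

```python
from collections import defaultdict


def build_prefix_index(postings, min_chars=2, max_chars=5):
    """Precompute prefix -> doc IDs for prefix lengths in [min_chars, max_chars]."""
    prefix_index = defaultdict(set)
    for term, doc_ids in postings.items():
        longest = min(len(term), max_chars)
        # Terms shorter than min_chars contribute no prefixes at all.
        for n in range(min_chars, longest + 1):
            prefix_index[term[:n]].update(doc_ids)
    return prefix_index


def lookup(prefix_index, prefix, min_chars=2, max_chars=5):
    """Return matching doc IDs, or None if the prefix length was never indexed."""
    if not (min_chars <= len(prefix) <= max_chars):
        return None  # the precomputed prefixes cannot answer this query
    return sorted(prefix_index.get(prefix, ()))


# Toy term -> doc IDs data from the example documents.
postings = {"wangwu": ["1"], "li": ["22"], "wusun": ["111"], "a": ["1111"]}
prefix_index = build_prefix_index(postings, min_chars=3, max_chars=10)

print(sorted(prefix_index))                                       # every stored prefix
print(lookup(prefix_index, "wus", min_chars=3, max_chars=10))
print(lookup(prefix_index, "wu", min_chars=3, max_chars=10))
```

In this toy model two of the four terms already produce seven extra entries, which illustrates where the additional storage goes.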
Consider the following example:
# Create the index; here min_chars is raised from its default to 3 and max_chars is set to 10
PUT my-index-000001
{
  "mappings": {
    "properties": {
      "full_name": {
        "type": "text",
        "index_prefixes": {
          "min_chars": 3,
          "max_chars": 10
        }
      }
    }
  }
}
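The index_prefixes parameter in the mapping above instructs Elasticsearch to create a subfield named "._index_prefix", which is used to run fast prefix queries.
When highlighting, add the "._index_prefix" subfield to the matched_fields parameter
so that the main field is highlighted based on the matches found in the prefix field.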
PUT my-index-000001/_doc/1
{
  "full_name": "wangwu"
}
PUT my-index-000001/_doc/22
{
  "full_name": "li"
}
PUT my-index-000001/_doc/111
{
  "full_name": "wusun"
}
PUT my-index-000001/_doc/1111
{
  "full_name": "a"
}
GET my-index-000001/_search
{
  "query": {
    "prefix": {
      "full_name": {
        "value": "wus"
      }
    }
  },
  "highlight": {
    "fields": {
      "full_name": {
        "matched_fields": ["full_name._index_prefix"]
      }
    }
  }
}
The query result:
{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "my-index-000001",
        "_type" : "_doc",
        "_id" : "111",
        "_score" : 1.0,
        "_source" : {
          "full_name" : "wusun"
        }
      }
    ]
  }
}
# Try a query string that is not a prefix of any term
GET my-index-000001/_search
{
  "query": {
    "prefix": {
      "full_name": {
        "value": "sun"
      }
    }
  },
  "highlight": {
    "fields": {
      "full_name": {
        "matched_fields": ["full_name._index_prefix"]
      }
    }
  }
}
The result is empty:
{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}
# Now query with only the first 2 characters; this request returns an error
GET my-index-000001/_search
{
  "query": {
    "prefix": {
      "full_name": {
        "value": "wu"
      }
    }
  },
  "highlight": {
    "fields": {
      "full_name": {
        "matched_fields": ["full_name._index_prefix"]
      }
    }
  }
}
The query fails with the following error:
{
  "error" : {
    "root_cause" : [
      {
        "type" : "query_shard_exception",
        "reason" : "failed to create query: Cannot invoke \"Object.hashCode()\" because \"this.rewriteMethod\" is null",
        "index_uuid" : "CwAqAEaHRRqH1_92Egyrlw",
        "index" : "my-index-000001"
      }
    ],
    "type" : "search_phase_execution_exception",
    "reason" : "all shards failed",
    "phase" : "query",
    "grouped" : true,
    "failed_shards" : [
      {
        "shard" : 0,
        "index" : "my-index-000001",
        "node" : "CMjzwiULR0GBf50Wn7-FiQ",
        "reason" : {
          "type" : "query_shard_exception",
          "reason" : "failed to create query: Cannot invoke \"Object.hashCode()\" because \"this.rewriteMethod\" is null",
          "index_uuid" : "CwAqAEaHRRqH1_92Egyrlw",
          "index" : "my-index-000001",
          "caused_by" : {
            "type" : "null_pointer_exception",
            "reason" : "Cannot invoke \"Object.hashCode()\" because \"this.rewriteMethod\" is null"
          }
        }
      }
    ]
  },
  "status" : 400
}
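As shown, the query string here (2 characters) is shorter than the min_chars of 3 defined in the mapping, so it cannot be answered from the prefixes indexed by index_prefixes; on the cluster used here the request failed with the error above.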
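Original-content notice: this article is published on the Tencent Cloud Developer Community with the author's authorization and may not be reproduced without permission.
In case of infringement, contact cloudcommunity@tencent.com to request removal.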