wufengtinghai

拼音搜索在中文搜索环境中是经常使用的一种功能,用户只需要输入关键词的拼音全拼或者拼音首字母,搜索引擎就可以搜索出相关结果。在国内,中文输入法基本上都是基于汉语拼音的,这种在符合用户输入习惯的条件下缩短用户输入时间的功能是非常受欢迎的;

一、安装拼音搜索插件

下载对应版本的elasticsearch-analysis-pinyin插件;

https://github.com/medcl/elasticsearch-analysis-pinyin/releases/download/v7.9.2/elasticsearch-analysis-pinyin-7.9.2.zip

在elasticsearch安装目录下的的plugin目录新建analysis-pinyin目录,并解压下载的安装包;

重启elasticsearch,可以看到已经正常加载拼音插件

[2022-01-13T20:37:25,368][INFO ][o.e.p.PluginsService     ] [mango] loaded plugin [analysis-pinyin]

二、使用拼音插件

试一下分词效果,可以看到除了每个词的全频,还有每个字的首字母缩写;

POST _analyze
{
  "analyzer": "pinyin",
  "text": "我爱你,中国"
}

{
  "tokens" : [
    {
      "token" : "wo",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "wanzg",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "ai",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "ni",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "zhong",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "guo",
      "start_offset" : 0,
      "end_offset" : 0,
      "type" : "word",
      "position" : 4
    }
  ]
}

自定义pinyin filter,并创建mapping;

PUT /milk
{
  "settings": {
    "analysis": {
      "filter": {
        "pinyin_filter":{
          "type":"pinyin",
          "keep_separate_first_letter" : false,
          "keep_full_pinyin" : true,
          "keep_original" : true,
          "limit_first_letter_length" : 16,
          "lowercase" : true,
          "remove_duplicated_term" : true
}
      },
      "analyzer": {
        "ik_pinyin_analyzer":{
          "tokenizer":"ik_max_word",
           "filter":["pinyin_filter"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "brand":{
        "type": "text",
        "analyzer": "ik_pinyin_analyzer"
      },
      "series":{
        "type": "text",
        "analyzer": "ik_pinyin_analyzer"
      },
      "price":{
        "type": "float"
      }
    }
  }
}

批量索引文档;

POST _bulk
{"index":{"_index":"milk", "_id":1}}}
{"brand":"蒙牛", "series":"特仑苏", "price":60}
{"index":{"_index":"milk", "_id":2}}}
{"brand":"蒙牛", "series":"真果粒", "price":40}
{"index":{"_index":"milk", "_id":3}}}
{"brand":"华山牧", "series":"华山牧", "price":49.90}
{"index":{"_index":"milk", "_id":4}}}
{"brand":"伊利", "series":"安慕希", "price":49.90}
{"index":{"_index":"milk", "_id":5}}}
{"brand":"伊利", "series":"金典", "price":49.90}

搜索tls,可以看到已经匹配到对应的记录;

POST milk/_search
{
  
  "query": {
    "match_phrase_prefix": {
      "series": "tl"
    }
  }
}

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 6.691126,
    "hits" : [
      {
        "_index" : "milk",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 6.691126,
        "_source" : {
          "brand" : "蒙牛",
          "series" : "特仑苏",
          "price" : 60
        }
      }
    ]
  }
}


可以看到查询直接使用MultiPhraseQuery来实现对两个位置字符的定位;

POST milk/_search
{
  
  "query": {
    "match_phrase_prefix": {
      "series": "tl"
    }
  },
  "highlight": {
    "fields": {
      "series": {}
    }
  }
}

{
  "profile" : {
    "shards" : [
      {
        "id" : "[OoNXoregTmKQAFotUgOeaA][milk][0]",
        "searches" : [
          {
            "query" : [
              {
                "type" : "MultiPhraseQuery",
                "description" : "series:\"(t tl) (li l lun)\"",
                "time_in_nanos" : 177400,
                "breakdown" : {
                  "set_min_competitive_score_count" : 0,
                  "match_count" : 1,
                  "shallow_advance_count" : 0,
                  "set_min_competitive_score" : 0,
                  "next_doc" : 5400,
                  "match" : 12800,
                  "next_doc_count" : 1,
                  "score_count" : 1,
                  "compute_max_score_count" : 0,
                  "compute_max_score" : 0,
                  "advance" : 9200,
                  "advance_count" : 1,
                  "score" : 3900,
                  "build_scorer_count" : 2,
                  "create_weight" : 40800,
                  "shallow_advance" : 0,
                  "create_weight_count" : 1,
                  "build_scorer" : 105300
                }
              }
            ],
            "rewrite_time" : 45700,
            "collector" : [
              {
                "name" : "SimpleTopScoreDocCollector",
                "reason" : "search_top_hits",
                "time_in_nanos" : 14500
              }
            ]
          }
        ],
        "aggregations" : [ ]
      }
    ]
  }
}

查询也可以返回高亮信息

POST milk/_search
{
  
  "query": {
    "match_phrase_prefix": {
      "series": "tl"
    }
  },
  "highlight": {
    "fields": {
      "series": {}
    }
  }
}



{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 6.691126,
    "hits" : [
      {
        "_index" : "milk",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 6.691126,
        "_source" : {
          "brand" : "蒙牛",
          "series" : "特仑苏",
          "price" : 60
        },
        "highlight" : {
          "series" : [
            "<em>特</em><em>仑</em>苏"
          ]
        }
      }
    ]
  }
}


相关文章: