Elasticsearch - River and nGrams: Answers

[Question title]: Elasticsearch - River and nGrams
[Posted]: 2023-03-08 20:08:01
[Question description]:

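I am using ES with the River plugin because my data lives in CouchDB, and I am trying to query using nGrams. I have basically everything I need working, except that the query stops working properly when someone types a space. That is because ES tokenizes the query on whitespace, splitting it into separate terms.

Here is what I need to do:

Match partial text from the query string:

Query: "Hello Wor" Response: "Hello World, Hello Word" / excluding "Hello, World, Word"
Sort the results by criteria that I specify;
Be case-insensitive.

This is what I did, following this question: How to search for a part of a word with ElasticSearch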

curl -X PUT 'localhost:9200/_river/myDB/_meta' -d '
{
    "type" : "couchdb",
    "couchdb" : {
        "host" : "localhost",
        "port" : 5984,
        "db" : "myDB",
        "filter" : null
    },
    "index" : {
        "index" : "myDB",
        "type" : "myDB",
        "bulk_size" : "100",
        "bulk_timeout" : "10ms",
        "analysis" : {
            "index_analyzer" : {
                "my_index_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "mynGram"]
                }
            },
            "search_analyzer" : {
                "my_search_analyzer" : {
                    "type" : "custom",
                    "tokenizer" : "standard",
                    "filter" : ["standard", "lowercase", "mynGram"]
                }
            },
            "filter" : {
                "mynGram" : {
                    "type" : "nGram",
                    "min_gram" : 2,
                    "max_gram" : 50
                }
            }
        }
    }
}
'

Then I add a mapping for sorting:

curl -s -XGET 'localhost:9200/myDB/myDB/_mapping' 
{
"sorting": {
       "Title": {
            "fields": {
                "Title": {
                     "type": "string"
                  }, 
                "untouched": {
                    "include_in_all": false, 
                    "index": "not_analyzed", 
                    "type": "string"
                    }
               }, 
              "type": "multi_field"
         },
        "Year": {
              "fields": {
                   "Year": {
                       "type": "string"
                       }, 
                       "untouched": {
                           "include_in_all": false, 
                           "index": "not_analyzed", 
                           "type": "string"
                         }
                     }, 
                    "type": "multi_field"
        }
     }
    }
   }'

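I have included all the information I used, just for completeness. I thought this setup should work, but whenever I try to get results, the space is still used to split my query. For example: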

  http://localhost:9200/myDB/myDB/_search?q=Title:(Hello%20Wor)&pretty=true

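This returns anything that contains "Hello" and "Wor" (I do not normally use the parentheses, but I have seen them in examples, and the results look very similar either way).

Any help is really appreciated, since this is bugging me a lot.

Update: In the end, I realized that I do not need nGrams. A normal index will do; just replace the spaces in the query with " AND ".

Example:

 Query: "Hello World"  --->  Replaced as "(*Hello AND World*)"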
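
The replacement described in the update could be sketched in Python roughly as follows. This is only an illustration, not the asker's actual code: the helper name `build_and_query` is hypothetical, and it assumes the input contains no characters that are special in Lucene query syntax, since no escaping is done.

```python
def build_and_query(user_input: str) -> str:
    """Join whitespace-separated terms with AND and wrap them in wildcards.

    Hypothetical helper mirroring the "(*Hello AND World*)" form above.
    Lucene special characters in the input are NOT escaped.
    """
    terms = user_input.split()  # split on any run of whitespace
    if not terms:
        return ""
    return "(*" + " AND ".join(terms) + "*)"


print(build_and_query("Hello World"))  # prints (*Hello AND World*)
```

The operator has to be the uppercase `AND`; in query string syntax a mixed-case `And` would be treated as an ordinary search term.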

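[Comments]:

Did you try q=Title:(+Hello +Wor)?
I found that q=Title:( * Hello AND Wor * ) works.
The problem with nGrams is that you are tokenizing on whitespace. I guess you could use the keyword tokenizer instead of the standard one.
Be careful with wildcards. They will slow down your queries! nGrams are IMHO the better option.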
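
The tokenizer point raised in the comments can be illustrated with a simplified pure-Python simulation. It is not Elasticsearch's analyzer: the function names are hypothetical, the "standard-like" variant merely splits on whitespace, and both variants lowercase the text and use the min_gram 2 / max_gram 50 settings from the question.

```python
def ngrams(token, min_gram=2, max_gram=50):
    """Return all substrings of token with length in [min_gram, max_gram]."""
    grams = set()
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams


def analyze_standard_like(text):
    """Split on whitespace first, then n-gram each token.

    No resulting gram can span a space.
    """
    grams = set()
    for token in text.lower().split():
        grams |= ngrams(token)
    return grams


def analyze_keyword_like(text):
    """Treat the whole input as a single token, so grams may contain spaces."""
    return ngrams(text.lower())


indexed = "Hello World"
query = "hello wor"
print(query in analyze_standard_like(indexed))  # False
print(query in analyze_keyword_like(indexed))   # True
```

Under these assumptions, the phrase fragment only exists as a single indexed gram when the text is not split on whitespace before the n-gram filter runs.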

Tags: database lucene couchdb elasticsearch n-gram

[Solution 1]:

I don't have an Elasticsearch setup at hand right now, but maybe this excerpt from the docs helps?

http://www.elasticsearch.org/guide/reference/query-dsl/match-query.html

Types of Match Queries

boolean

The default match query is of type boolean. It means that the text provided is analyzed and the analysis process constructs a boolean query from the provided text. The operator flag can be set to or or and to control the boolean clauses (defaults to or).

The analyzer can be set to control which analyzer will perform the analysis process on the text. It defaults to the field's explicit mapping definition, or the default search analyzer.

fuzziness can be set to a value (depending on the relevant type; for string types it should be a value between 0.0 and 1.0) to construct fuzzy queries for each term analyzed. The prefix_length and max_expansions can be set in this case to control the fuzzy process. If the fuzzy option is set, the query will use constant_score_rewrite as its rewrite method; the rewrite parameter allows control over how the query gets rewritten.

Here is an example when providing additional parameters (note the slight change in structure, message is the field name):

{
    "match" : {
        "message" : {
            "query" : "this is a test",
            "operator" : "and"
        }
    }
}
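
Applied to the question, the same structure could be built like this. It is a sketch only: the field name `Title` and the search text come from the question, `build_match_body` is a hypothetical helper, and whether the `match` query is available depends on the Elasticsearch version in use.

```python
import json


def build_match_body(field, text, operator="and"):
    """Build a search request body containing a boolean match query.

    Hypothetical helper; it only assembles the JSON and sends nothing.
    """
    if operator not in ("and", "or"):
        raise ValueError("operator must be 'and' or 'or'")
    return {"query": {"match": {field: {"query": text, "operator": operator}}}}


body = build_match_body("Title", "Hello Wor")
print(json.dumps(body, indent=2))
```

With the `and` operator every analyzed term has to match. Whether `Wor` matches documents containing `World` still depends on how the field is analyzed at index time.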

[Comments]: