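Filename search with ElasticSearch: answers

【Question title】: Filename search with ElasticSearch
【Posted】: 2012-03-14 08:22:14
【Question】:

I want to use ElasticSearch to search filenames (not the files' contents). So I need to find a part of the filename (exact matching, no fuzzy search).

Example:
I have files with the following names: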

My_first_file_created_at_2012.01.13.doc
My_second_file_created_at_2012.01.13.pdf
Another file.txt
And_again_another_file.docx
foo.bar.txt

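Now I want to search for 2012.01.13 to get the first two files.
A search for file or ile should return all filenames except the last one.

How can I accomplish that with ElasticSearch?

This is what I have tested, but it always returns zero results: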

curl -X DELETE localhost:9200/files
curl -X PUT    localhost:9200/files -d '
{
  "settings" : {
    "index" : {
      "analysis" : {
        "analyzer" : {
          "filename_analyzer" : {
            "type" : "custom",
            "tokenizer" : "lowercase",
            "filter"    : ["filename_stop", "filename_ngram"]
          }
        },
        "filter" : {
          "filename_stop" : {
            "type" : "stop",
            "stopwords" : ["doc", "pdf", "docx"]
          },
          "filename_ngram" : {
            "type" : "nGram",
            "min_gram" : 3,
            "max_gram" : 255
          }
        }
      }
    }
  },

  "mappings": {
    "files": {
      "properties": {
        "filename": {
          "type": "string",
          "analyzer": "filename_analyzer"
        }
      }
    }
  }
}
'

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
curl -X POST "http://localhost:9200/files/_refresh"


FILES='
http://localhost:9200/files/_search?q=filename:2012.01.13
'

for file in ${FILES}
do
  echo; echo; echo ">>> ${file}"
  curl "${file}&pretty=true"
done

【Comments on the question】:

Tags: lucene elasticsearch n-gram

【Answer 1】:

I believe this is because of the tokenizer being used.

http://www.elasticsearch.org/guide/reference/index-modules/analysis/lowercase-tokenizer.html

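The lowercase tokenizer splits on word boundaries, so 2012.01.13 will be indexed as "2012", "01" and "13". Searching for the string "2012.01.13" will obviously not match.

One option would be to apply the same tokenizing to the search. Then a search for "2012.01.13" is broken into the same tokens as in the index, and it will match. This is also convenient because you no longer need to lowercase your searches in code.

The second option would be to use an n-gram tokenizer instead of the filter. This means word boundaries are ignored (and you will get the "_" as well), but you may run into case mismatches, which is presumably why you added the lowercase tokenizer in the first place.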

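【Comments】:

First option: I thought my filename_analyzer was already used at both index and search time, because I do not explicitly set index_analyzer/search_analyzer. Second option: I tried it this way, but the search only returns results when I surround the keyword with "*", e.g. "*2012*". Also, "*doc*" returns two doc files, but "*.doc*" returns only the docx file. Any ideas?

【Answer 2】:

I have no experience with ES, but in Solr you would need to specify the field type as text. Your field is of type string instead of text. String fields are not analyzed; they are stored and indexed verbatim. Give it a try and see if it works.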

"properties": {
        "filename": {
          "type": "string",
          "analyzer": "filename_analyzer"
        }
}

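【Comments】:

ES only uses the string type, and string fields are analyzed by default. If you want them stored verbatim, you have to add {"index":"not_analyzed"} to the mapping.

【Answer 3】:

There are various problems with what you pasted:

1) Incorrect mapping

When creating the index, you specify: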

"mappings": {
    "files": {

But your type is actually file, not files. If you had checked the mapping, you would have seen that immediately:

curl -XGET 'http://127.0.0.1:9200/files/_mapping?pretty=1' 

# {
#    "files" : {
#       "files" : {
#          "properties" : {
#             "filename" : {
#                "type" : "string",
#                "analyzer" : "filename_analyzer"
#             }
#          }
#       },
#       "file" : {
#          "properties" : {
#             "filename" : {
#                "type" : "string"
#             }
#          }
#       }
#    }
# }

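2) Incorrect analyzer definition

You have specified the lowercase tokenizer, but it removes anything that isn't a letter (see docs), so your numbers are removed completely.

You can check this with the analyze API: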

curl -XGET 'http://127.0.0.1:9200/_analyze?pretty=1&text=My_file_2012.01.13.doc&tokenizer=lowercase' 

# {
#    "tokens" : [
#       {
#          "end_offset" : 2,
#          "position" : 1,
#          "start_offset" : 0,
#          "type" : "word",
#          "token" : "my"
#       },
#       {
#          "end_offset" : 7,
#          "position" : 2,
#          "start_offset" : 3,
#          "type" : "word",
#          "token" : "file"
#       },
#       {
#          "end_offset" : 22,
#          "position" : 3,
#          "start_offset" : 19,
#          "type" : "word",
#          "token" : "doc"
#       }
#    ]
# }
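
The effect shown above can be reproduced without a running cluster. The following is a minimal sketch, assuming an ASCII-only approximation of the lowercase tokenizer (the real tokenizer works on Unicode letters); the function name `lowercase_tokenize` is hypothetical and not part of any Elasticsearch API.

```python
import re


def lowercase_tokenize(text):
    """Approximate the lowercase tokenizer: keep runs of letters, lowercased.

    Assumption: ASCII letters only. Digits and punctuation act as separators
    and are dropped, which is why the date disappears from the token stream.
    """
    return [match.group(0).lower() for match in re.finditer(r"[A-Za-z]+", text)]


print(lowercase_tokenize("My_file_2012.01.13.doc"))  # ['my', 'file', 'doc']
print(lowercase_tokenize("2012.01.13"))              # []
```

A query for 2012.01.13 therefore produces no tokens at all under this tokenizer, which is consistent with the zero results reported in the question.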

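3) Ngrams at search time

You include your ngram token filter in both the index analyzer and the search analyzer. That is fine for the index analyzer, because you want the ngrams to be indexed. But when searching, you want to search on the full string, not on each ngram.

For instance, if you index "abcd" with ngrams of length 1 to 4, you end up with these tokens:

a b c d ab bc cd abc bcd abcd

But if you search for "dcba" (which shouldn't match) and you also analyze your search terms with ngrams, then you are actually searching for:

d c b a dc cb ba dcb cba dcba

So a, b, c and d will match!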
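
To make the overlap concrete, here is a small sketch that generates character ngrams in plain Python and intersects the two token sets. The helper `ngrams` is a hypothetical stand-in for the nGram token filter and ignores positions and other analyzer details.

```python
def ngrams(text, min_gram, max_gram):
    """Return all character ngrams of text with length min_gram..max_gram."""
    return [
        text[start:start + size]
        for size in range(min_gram, max_gram + 1)
        for start in range(len(text) - size + 1)
    ]


indexed = set(ngrams("abcd", 1, 4))   # tokens stored in the index
searched = set(ngrams("dcba", 1, 4))  # tokens produced from the query

# Only the single characters are shared, yet that is enough for an
# "any token" query to report a match.
print(sorted(indexed & searched))  # ['a', 'b', 'c', 'd']
```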

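Solution

First, you need to choose the right analyzer. Your users will probably search for words, numbers or dates, but they probably won't expect ile to match file. Instead, it will probably be more useful to use edge ngrams, which anchor the ngram to the start (or end) of each word.

Also, why exclude docx etc.? Surely a user might well want to search on the file type?

So let's break each filename up into smaller tokens by removing anything that isn't a letter or a number (using the pattern tokenizer):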

My_first_file_2012.01.13.doc
=> my first file 2012 01 13 doc

Then for the index analyzer, we also apply edge ngrams to each of those tokens:

my     => m my
first  => f fi fir firs first
file   => f fi fil file
2012   => 2 20 201 2012
01     => 0 01
13     => 1 13
doc    => d do doc
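
The two steps above (split on non-alphanumerics, then expand each token into front-anchored ngrams) can be sketched in plain Python. This is an approximation under stated assumptions: Python's `re` has no `\p{L}`, so the split pattern `[\W_]+` is used as a rough equivalent, and the helper names `filename_tokens` and `edge_ngrams` are hypothetical. The defaults 1 and 20 are the min_gram and max_gram values used for the answer's edge_ngram filter.

```python
import re


def filename_tokens(name):
    """Split on runs of non-letter, non-digit characters and lowercase."""
    # [\W_]+ approximates the pattern [^\p{L}\d]+ from the answer.
    return [token.lower() for token in re.split(r"[\W_]+", name) if token]


def edge_ngrams(token, min_gram=1, max_gram=20):
    """Return the prefixes of token with length min_gram..max_gram."""
    longest = min(len(token), max_gram)
    return [token[:size] for size in range(min_gram, longest + 1)]


for token in filename_tokens("My_first_file_2012.01.13.doc"):
    print(token, "=>", " ".join(edge_ngrams(token)))
```

Because every ngram starts at the beginning of a token, a query for fil can match file, while ile cannot.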

We create the index as follows:

curl -XPUT 'http://127.0.0.1:9200/files/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "filename_search" : {
               "tokenizer" : "filename",
               "filter" : ["lowercase"]
            },
            "filename_index" : {
               "tokenizer" : "filename",
               "filter" : ["lowercase","edge_ngram"]
            }
         },
         "tokenizer" : {
            "filename" : {
               "pattern" : "[^\\p{L}\\d]+",
               "type" : "pattern"
            }
         },
         "filter" : {
            "edge_ngram" : {
               "side" : "front",
               "max_gram" : 20,
               "min_gram" : 1,
               "type" : "edgeNGram"
            }
         }
      }
   },
   "mappings" : {
      "file" : {
         "properties" : {
            "filename" : {
               "type" : "string",
               "search_analyzer" : "filename_search",
               "index_analyzer" : "filename_index"
            }
         }
      }
   }
}
'

Now, test that our analyzers are working correctly:

filename_search:

curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_search' 
[results snipped]
"token" : "my"
"token" : "first"
"token" : "file"
"token" : "2012"
"token" : "01"
"token" : "13"
"token" : "doc"

filename_index:

curl -XGET 'http://127.0.0.1:9200/files/_analyze?pretty=1&text=My_first_file_2012.01.13.doc&analyzer=filename_index' 
"token" : "m"
"token" : "my"
"token" : "f"
"token" : "fi"
"token" : "fir"
"token" : "firs"
"token" : "first"
"token" : "f"
"token" : "fi"
"token" : "fil"
"token" : "file"
"token" : "2"
"token" : "20"
"token" : "201"
"token" : "2012"
"token" : "0"
"token" : "01"
"token" : "1"
"token" : "13"
"token" : "d"
"token" : "do"
"token" : "doc"

OK, that seems to be working correctly. So let's add some docs:

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_first_file_created_at_2012.01.13.doc" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_second_file_created_at_2012.01.13.pdf" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "Another file.txt" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "And_again_another_file.docx" }'
curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "foo.bar.txt" }'
curl -X POST "http://localhost:9200/files/_refresh"

And try a search:

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
{
   "query" : {
      "text" : {
         "filename" : "2012.01"
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.06780553,
#             "_index" : "files",
#             "_id" : "PsDvfFCkT4yvJnlguxJrrQ",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.06780553,
#             "_index" : "files",
#             "_id" : "ER5RmyhATg-Eu92XNGRu-w",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.06780553,
#       "total" : 2
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 4
# }
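
The result above can be mimicked offline with a toy model: index every filename as the set of edge ngrams of its tokens, analyze the query into whole tokens only, and report a hit when any query token is found in a file's indexed terms. This is only a sketch of the matching idea, assuming the five filenames from the question; it ignores scoring, and all helper names are hypothetical.

```python
import re

FILES = [
    "My_first_file_created_at_2012.01.13.doc",
    "My_second_file_created_at_2012.01.13.pdf",
    "Another file.txt",
    "And_again_another_file.docx",
    "foo.bar.txt",
]


def tokens(text):
    """Search-side analysis: split on non-alphanumerics and lowercase."""
    return [token.lower() for token in re.split(r"[\W_]+", text) if token]


def index_terms(name, max_gram=20):
    """Index-side analysis: every front-anchored prefix of every token."""
    return {
        token[:size]
        for token in tokens(name)
        for size in range(1, min(len(token), max_gram) + 1)
    }


def search(query):
    """Return the filenames whose indexed terms contain any query token."""
    return [name for name in FILES if any(t in index_terms(name) for t in tokens(query))]


print(search("2012.01"))  # the two files dated 2012.01.13
print(search("ile"))      # [] because edge ngrams only cover prefixes
```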

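Success!

#### UPDATE ####

I realised that a search for 2012.01 would match both 2012.01.12 and 2012.12.01, so I tried changing the query to use a text phrase query instead. However, this didn't work. It turns out that the edge ngram filter increments the position count for each ngram (while I would have expected each ngram to have the same position as the start of the word).

The issue mentioned in point (3) above is only a problem when using a query_string, field or text query, which tries to match ANY token. A text_phrase query, however, tries to match ALL of the tokens, in the correct order.

To demonstrate the issue, index another doc with a different date: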

curl -X POST "http://localhost:9200/files/file" -d '{ "filename" : "My_third_file_created_at_2012.12.01.doc" }'
curl -X POST "http://localhost:9200/files/_refresh"

And run the same search as above:

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
{
   "query" : {
      "text" : {
         "filename" : {
            "query" : "2012.01"
         }
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_third_file_created_at_2012.12.01.doc"
#             },
#             "_score" : 0.22097087,
#             "_index" : "files",
#             "_id" : "xmC51lIhTnWplOHADWJzaQ",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.13137488,
#             "_index" : "files",
#             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.13137488,
#             "_index" : "files",
#             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.22097087,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 5
# }

The first result has the date 2012.12.01, which isn't the best match for 2012.01. So to match only that exact phrase, we can do:

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
{
   "query" : {
      "text_phrase" : {
         "filename" : {
            "query" : "2012.01",
            "analyzer" : "filename_index"
         }
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.55737644,
#             "_index" : "files",
#             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.55737644,
#             "_index" : "files",
#             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.55737644,
#       "total" : 2
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 7
# }
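
The difference between matching ANY token and matching ALL tokens in order can be sketched as follows. This is a deliberate simplification: it compares whole tokens only and does not model the edge ngram positions discussed in the update, so it illustrates the idea of phrase matching rather than what Elasticsearch does internally. The helper names are hypothetical.

```python
import re


def filename_tokens(text):
    """Split on non-alphanumerics and lowercase."""
    return [token.lower() for token in re.split(r"[\W_]+", text) if token]


def matches_any(query, filename):
    """True if at least one query token occurs in the filename."""
    return bool(set(filename_tokens(query)) & set(filename_tokens(filename)))


def matches_phrase(query, filename):
    """True if the query tokens occur consecutively and in order."""
    wanted = filename_tokens(query)
    found = filename_tokens(filename)
    if not wanted:
        return False
    return any(
        found[start:start + len(wanted)] == wanted
        for start in range(len(found) - len(wanted) + 1)
    )


first = "My_first_file_created_at_2012.01.13.doc"
third = "My_third_file_created_at_2012.12.01.doc"

print(matches_any("2012.01", third))     # True: both tokens occur somewhere
print(matches_phrase("2012.01", third))  # False: 2012 is followed by 12
print(matches_phrase("2012.01", first))  # True
```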

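Or, if you still want to match all 3 files (because the user might remember some of the words in the filename, but in the wrong order), you can run both queries but boost the importance of the filename whose tokens are in the correct order: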

curl -XGET 'http://127.0.0.1:9200/files/file/_search?pretty=1'  -d '
{
   "query" : {
      "bool" : {
         "should" : [
            {
               "text_phrase" : {
                  "filename" : {
                     "boost" : 2,
                     "query" : "2012.01",
                     "analyzer" : "filename_index"
                  }
               }
            },
            {
               "text" : {
                  "filename" : "2012.01"
               }
            }
         ]
      }
   }
}
'

# [Fri Feb 24 16:31:02 2012] Response:
# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "filename" : "My_first_file_created_at_2012.01.13.doc"
#             },
#             "_score" : 0.56892186,
#             "_index" : "files",
#             "_id" : "ZUezxDgQTsuAaCTVL9IJgg",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_second_file_created_at_2012.01.13.pdf"
#             },
#             "_score" : 0.56892186,
#             "_index" : "files",
#             "_id" : "XwLNnSlwSeyYtA2y64WuVw",
#             "_type" : "file"
#          },
#          {
#             "_source" : {
#                "filename" : "My_third_file_created_at_2012.12.01.doc"
#             },
#             "_score" : 0.012931341,
#             "_index" : "files",
#             "_id" : "xmC51lIhTnWplOHADWJzaQ",
#             "_type" : "file"
#          }
#       ],
#       "max_score" : 0.56892186,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 4
# }

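【Comments】:

Wow, this is more than just a solution. This is the tutorial I was looking for :D THX
Thank you very much for the great answer! I just noticed that the "text*" queries have been deprecated in recent versions of elasticsearch and should be renamed to "match" and "match_phrase".
Thanks a lot. It has been very useful so far (too bad the links are broken). I'm still a bit confused by some parts (e.g. I know the pattern is a regular expression, but it's not clear what p{L} is). I'm using it with a match query, and the problem I see is that it seems to work when I search only in the filename field, but not when I use _all :(. Any ideas?
@DrTech: Thanks for the great answer... I'm having some problems when searching. My Sense plugin gives an error: "type": "query_parsing_exception", "reason": "No query registered for [text]". Has anyone run into the same error?
@ASN: Change "text" to "match" and it should work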
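
For readers adapting the queries above, the renaming suggested in the comments can be applied mechanically to a query body. The sketch below is a hypothetical helper, not part of any Elasticsearch client. It assumes that no field in the query is itself named text or text_phrase, since it renames every matching dictionary key.

```python
import json

# Renames taken from the comments above: text -> match, text_phrase -> match_phrase.
RENAMES = {"text": "match", "text_phrase": "match_phrase"}


def modernize(node):
    """Recursively rename legacy query keys in a parsed JSON query body."""
    if isinstance(node, dict):
        return {RENAMES.get(key, key): modernize(value) for key, value in node.items()}
    if isinstance(node, list):
        return [modernize(item) for item in node]
    return node


legacy = {
    "query": {
        "text_phrase": {
            "filename": {"query": "2012.01", "analyzer": "filename_index"}
        }
    }
}

print(json.dumps(modernize(legacy), indent=3))
```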