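【Question title】: How to search for part of a string with Azure Cognitive Search
【Posted】: 2020-10-15 14:11:32
【Question】:

I'm new to Azure Cognitive Search and have successfully configured my index for autocomplete (thanks to this article, using partial search).

But now I have another use case: many files stored in an Azure Blob container, each with metadata.

One of the metadata fields (on each file) is called partnumbers, and its value is a comma-separated string of product SKUs (such as "123456,78901,102938,09876"). I have set up my index so that this information is stored as an Edm.String, as follows: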

{
  "name": "my-index",
  "fields": [
    {
      "name": "partnumbers",
      "type": "Edm.String",
      "facetable": true,
      "filterable": true,
      "key": false,
      "retrievable": true,
      "searchable": true,
      "sortable": true,
      "analyzer": null,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "synonymMaps": [],
      "fields": []
    },
    {
      "name": "metadata_storage_name",
      "type": "Edm.String",
      "facetable": true,
      "filterable": true,
      "key": false,
      "retrievable": false,
      "searchable": true,
      "sortable": true,
      "analyzer": null,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "synonymMaps": [],
      "fields": []
    },
    {
      "name": "metadata_storage_content_type",
      "type": "Edm.String",
      "facetable": true,
      "filterable": true,
      "key": false,
      "retrievable": false,
      "searchable": true,
      "sortable": true,
      "analyzer": null,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "synonymMaps": [],
      "fields": []
    },
    {
      "name": "metadata_storage_last_modified",
      "type": "Edm.String",
      "facetable": true,
      "filterable": true,
      "key": false,
      "retrievable": false,
      "searchable": true,
      "sortable": true,
      "analyzer": null,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "synonymMaps": [],
      "fields": []
    },
    {
      "name": "metadata_storage_path",
      "type": "Edm.String",
      "facetable": true,
      "filterable": true,
      "key": false,
      "retrievable": false,
      "searchable": true,
      "sortable": true,
      "analyzer": null,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "synonymMaps": [],
      "fields": []
    },
    {
      "name": "metadata_storage_size",
      "type": "Edm.Int64",
      "facetable": true,
      "filterable": true,
      "retrievable": false,
      "sortable": true,
      "analyzer": null,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "synonymMaps": [],
      "fields": []
    },
    {
      "name": "key",
      "type": "Edm.String",
      "facetable": true,
      "filterable": true,
      "key": true,
      "retrievable": true,
      "searchable": true,
      "sortable": true,
      "analyzer": null,
      "indexAnalyzer": null,
      "searchAnalyzer": null,
      "synonymMaps": [],
      "fields": []
    },
    {
      "name": "partialPartnumbers",
      "type": "Edm.String",
      "facetable": false,
      "filterable": false,
      "key": false,
      "retrievable": false,
      "searchable": true,
      "sortable": false,
      "analyzer": null,
      "indexAnalyzer": "prefixCmAnalyzer",
      "searchAnalyzer": "standardCmAnalyzer",
      "synonymMaps": [],
      "fields": []
    }
  ],
  "suggesters": [
    {
      "name": "my-index_suggester",
      "searchMode": "analyzingInfixMatching",
      "sourceFields": [
        "partnumbers"
      ]
    }
  ],
  "scoringProfiles": [
    {
      "name": "exactFirst",
      "functions": [],
      "functionAggregation": null,
      "text": {
        "weights": {
          "partnumbers": 2,
          "partialPartnumbers": 1
        }
      }
    }
  ],
  "defaultScoringProfile": "exactFirst",
  "corsOptions": null,
  "analyzers": [
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "standardCmAnalyzer",
      "tokenizer": "standard_v2",
      "tokenFilters": [
        "lowercase",
        "asciifolding"
      ],
      "charFilters": []
    },
    {
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "name": "prefixCmAnalyzer",
      "tokenizer": "standard_v2",
      "tokenFilters": [
        "lowercase",
        "asciifolding",
        "edgeNGramCmTokenFilter"
      ],
      "charFilters": []
    }
  ],
  "charFilters": [],
  "tokenFilters": [
    {
      "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
      "name": "edgeNGramCmTokenFilter",
      "minGram": 2,
      "maxGram": 20,
      "side": "front"
    }
  ],
  "tokenizers": [],
  "@odata.etag": "\"0x8D8184F367A74XX\""
}

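Now I'm struggling to find a way (a specific syntax? an analyzer? a tokenizer?) to find all files whose partnumbers metadata field contains a given SKU, so that I can retrieve every document related to one product. I want to pass the SKU "102938" to Azure Search and have it return all files whose partnumbers metadata field contains this SKU (possibly alongside other SKUs).

But I'm having a hard time finding examples on Google, and the documentation seems, for now, a bit beyond me. I'm not sure I correctly understand what analyzers, tokenizers, and so on are or how they work. This is my first deep dive into the "search" world.

So I'd really appreciate any help from the community. I'd love to read beginner-friendly articles that explain it all, or tutorials, or anything else that can help me move forward!

Thanks in advance.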

【Comments】:

    Tags: azure search azure-blob-storage azure-cognitive-search


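    【Solution 1】:

    OK, I just tried something that works: I set the pattern analyzer on the partnumbers field, and when I tested it with the Analyze Text API, it did split my SKUs into separate tokens. After that I could search for a single SKU and got back all the files I wanted! Here is my index JSON definition: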

    {
      "name": "my-index",
      "fields": [
        {
          "name": "partnumbers",
          "type": "Edm.String",
          "facetable": true,
          "filterable": true,
          "key": false,
          "retrievable": true,
          "searchable": true,
          "sortable": true,
          "analyzer": "pattern",
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "synonymMaps": [],
          "fields": []
        },
        {
          "name": "metadata_storage_name",
          "type": "Edm.String",
          "facetable": true,
          "filterable": true,
          "key": false,
          "retrievable": true,
          "searchable": true,
          "sortable": true,
          "analyzer": null,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "synonymMaps": [],
          "fields": []
        },
        {
          "name": "metadata_storage_content_type",
          "type": "Edm.String",
          "facetable": true,
          "filterable": true,
          "key": false,
          "retrievable": true,
          "searchable": true,
          "sortable": true,
          "analyzer": null,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "synonymMaps": [],
          "fields": []
        },
        {
          "name": "metadata_storage_last_modified",
          "type": "Edm.String",
          "facetable": true,
          "filterable": true,
          "key": false,
          "retrievable": true,
          "searchable": true,
          "sortable": true,
          "analyzer": null,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "synonymMaps": [],
          "fields": []
        },
        {
          "name": "metadata_storage_path",
          "type": "Edm.String",
          "facetable": true,
          "filterable": true,
          "key": false,
          "retrievable": true,
          "searchable": true,
          "sortable": true,
          "analyzer": null,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "synonymMaps": [],
          "fields": []
        },
        {
          "name": "metadata_storage_size",
          "type": "Edm.Int64",
          "facetable": true,
          "filterable": true,
          "retrievable": true,
          "sortable": true,
          "analyzer": null,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "synonymMaps": [],
          "fields": []
        },
        {
          "name": "key",
          "type": "Edm.String",
          "facetable": true,
          "filterable": true,
          "key": true,
          "retrievable": true,
          "searchable": true,
          "sortable": true,
          "analyzer": null,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "synonymMaps": [],
          "fields": []
        },
        {
          "name": "name",
          "type": "Edm.String",
          "facetable": true,
          "filterable": true,
          "key": false,
          "retrievable": true,
          "searchable": true,
          "sortable": true,
          "analyzer": null,
          "indexAnalyzer": null,
          "searchAnalyzer": null,
          "synonymMaps": [],
          "fields": []
        },
        {
          "name": "partialPartnumbers",
          "type": "Edm.String",
          "facetable": false,
          "filterable": false,
          "key": false,
          "retrievable": false,
          "searchable": true,
          "sortable": false,
          "analyzer": null,
          "indexAnalyzer": "prefixCmAnalyzer",
          "searchAnalyzer": "standardCmAnalyzer",
          "synonymMaps": [],
          "fields": []
        },
        {
          "name": "partialName",
          "type": "Edm.String",
          "facetable": false,
          "filterable": false,
          "key": false,
          "retrievable": false,
          "searchable": true,
          "sortable": false,
          "analyzer": null,
          "indexAnalyzer": "prefixCmAnalyzer",
          "searchAnalyzer": "standardCmAnalyzer",
          "synonymMaps": [],
          "fields": []
        }
      ],
      "suggesters": [
        {
          "name": "conformity-certificates-index_suggester",
          "searchMode": "analyzingInfixMatching",
          "sourceFields": [
            "name"
          ]
        }
      ],
      "scoringProfiles": [
        {
          "name": "exactFirst",
          "functions": [],
          "functionAggregation": null,
          "text": {
            "weights": {
              "partnumbers": 4,
              "partialPartnumbers": 3,
              "name": 2,
              "partialName": 1
            }
          }
        }
      ],
      "defaultScoringProfile": "exactFirst",
      "corsOptions": null,
      "analyzers": [
        {
          "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
          "name": "standardCmAnalyzer",
          "tokenizer": "standard_v2",
          "tokenFilters": [
            "lowercase",
            "asciifolding"
          ],
          "charFilters": []
        },
        {
          "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
          "name": "prefixCmAnalyzer",
          "tokenizer": "standard_v2",
          "tokenFilters": [
            "lowercase",
            "asciifolding",
            "edgeNGramCmTokenFilter"
          ],
          "charFilters": []
        }
      ],
      "charFilters": [],
      "tokenFilters": [
        {
          "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
          "name": "edgeNGramCmTokenFilter",
          "minGram": 2,
          "maxGram": 20,
          "side": "front"
        }
      ],
      "tokenizers": [],
      "@odata.etag": "\"0x8D818EC80CXXXX\""
    }
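
A rough way to see why this works: the built-in pattern analyzer splits text with a regular expression (as far as I know the default pattern is `\W+`, with lowercasing enabled), so the commas become token boundaries. The standard tokenizer, by contrast, follows Unicode word-break rules, which generally do not break on a comma between digits, so the whole SKU list probably ended up as a single token. The sketch below only imitates that splitting locally with Python's `re` module; `pattern_tokens` is a hypothetical helper for illustration, not part of any Azure SDK.

```python
import re


def pattern_tokens(text, pattern=r"\W+", lowercase=True):
    """Imitate a regex-splitting analyzer: split on the pattern, drop empty tokens."""
    if lowercase:
        text = text.lower()
    return [tok for tok in re.split(pattern, text) if tok]


# The metadata value from the question, split into one token per SKU.
partnumbers = "123456,78901,102938,09876"
tokens = pattern_tokens(partnumbers)
print(tokens)
print("102938" in tokens)
```

Because each SKU becomes its own term, a plain search for one SKU matches whole tokens only and never a fragment of a longer number.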
    

    【Comments】:

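      【Solution 2】:

      You can search for your part number with a regular filter.

      $filter=search.in(partnumbers, '102938', ',')

      You can find more examples in the documentation here: https://docs.microsoft.com/en-us/azure/search/search-query-odata-filter

      Don't use wildcards or regular expressions for this use case. Your example has part numbers of different lengths, so a wildcard search for 102938* would also unintentionally match 1029381, 10293810, 102938123, and so on.

      Your data already lists the part numbers explicitly and exactly. You can query that list.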
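
One caveat worth checking before relying on this: as I read the OData filter reference, `search.in` tests whether the field value equals one of the values in the given list; it does not split the field itself. It therefore fits most naturally when the part numbers are stored as a `Collection(Edm.String)` and the filter is written as `partnumbers/any(p: search.in(p, '102938', ','))`. The sketch below is a plain-Python model of that behavior under this assumption; `search_in` is a hypothetical stand-in, not the service's implementation.

```python
import re


def search_in(field_value, value_list, delimiters=", "):
    """Model of search.in: True when field_value equals one of the listed values."""
    # Every character in `delimiters` separates entries in value_list.
    splitter = "[" + re.escape(delimiters) + "]"
    candidates = [v for v in re.split(splitter, value_list) if v]
    return field_value in candidates


single_string = "123456,78901,102938,09876"   # one Edm.String value, as in the question
as_collection = single_string.split(",")      # the same data modeled as a collection

# Whole-value comparison against the single string field.
print(search_in(single_string, "102938", ","))
# Per-element comparison, as an any() lambda would do over a collection.
print(any(search_in(p, "102938", ",") for p in as_collection))
```

In this model the first check fails and the second succeeds, which suggests testing the filter against the real index before choosing between a single string field and a collection.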

      【Comments】:

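        【Solution 3】:

        This should be achievable with regular expression and wildcard search.

        It can be applied to any searchable field, as long as the query is sent with the full Lucene query parser.

        "...The full Lucene query language, enabled by setting queryType=full, extends the default simple query language by adding support for more operators and query types such as wildcard, fuzzy, regex, and field-scoped queries. For example, a regular expression sent in simple query syntax would be interpreted as a query string and not as an expression. The example requests in this article use the full Lucene query language."

        fieldName:searchExpression

        For example: queryType=full&searchFields=partnumbers&$select=partnumbers&search=partnumbers:102938*

        https://docs.microsoft.com/en-us/azure/search/query-lucene-syntax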
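
As a minimal sketch of how such a request could be assembled, the snippet below builds the GET URL with Python's standard library. The service name `my-service`, the `api-version` value, and the helper `build_query_url` are assumptions for illustration only. The code does not send anything, and a real call would also need an `api-key` header.

```python
from urllib.parse import urlencode


def build_query_url(service, index, term, api_version="2020-06-30"):
    """Build a GET search URL that uses the full Lucene syntax for a prefix search."""
    params = {
        "api-version": api_version,          # assumed version, adjust to your service
        "queryType": "full",                 # needed for fielded and wildcard queries
        "searchFields": "partnumbers",
        "$select": "partnumbers",
        "search": f"partnumbers:{term}*",    # field-scoped prefix query
    }
    # urlencode percent-encodes reserved characters such as ':' and '*'.
    return f"https://{service}.search.windows.net/indexes/{index}/docs?{urlencode(params)}"


url = build_query_url("my-service", "my-index", "102938")
print(url)
```

Keep in mind that the trailing `*` makes this a prefix match, so longer part numbers that start with the same digits would be returned as well.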

        【Comments】:
