Azure Search - Cannot merge (with skill) data obtained from the KeyPhraseExtractionSkill: answers

【Question title】: Azure Search - Cannot merge (with skill) data obtained from the KeyPhraseExtractionSkill
【Posted】: 2021-10-09 14:08:16
【Question】:

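I am creating an indexer that takes documents, runs the KeyPhraseExtractionSkill on them, and writes the output back to the index.

For many documents this works out of the box. For records longer than 50,000 characters, however, it does not work. Fair enough; the documentation states this limit clearly.

The documentation suggests using the Text Split skill. So I used the Text Split skill to split the original document into pages and passed every page to the KeyPhraseExtractionSkill. The results then need to be merged back together, because we end up with one array of strings per page. Unfortunately, the Merge skill does not seem to accept an array of arrays, only a single array.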
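To make the shape problem concrete, here is a minimal Python sketch of the pipeline described above. It is only an illustration: `split_into_pages` is a naive fixed-width splitter, `fake_key_phrases` is a hypothetical stand-in for the KeyPhraseExtractionSkill, and `is_string_collection` is an assumed helper that mimics the "flat collection of strings" check; none of these are Azure APIs.

```python
def split_into_pages(text: str, max_len: int) -> list[str]:
    """Naive fixed-width split (illustration only, not the real Text Split skill)."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def fake_key_phrases(page: str) -> list[str]:
    """Hypothetical stand-in for key phrase extraction: unique words longer than 5 chars."""
    seen: list[str] = []
    for word in page.split():
        if len(word) > 5 and word not in seen:
            seen.append(word)
    return seen


def is_string_collection(value) -> bool:
    """True only for a flat list whose items are all strings."""
    return isinstance(value, list) and all(isinstance(v, str) for v in value)


text = "indexer skillset enrichment " * 3
pages = split_into_pages(text, 28)

# One list of phrases per page: a list of lists, not a flat string collection.
per_page = [fake_key_phrases(p) for p in pages]

# Flattening by one level gives the flat shape that a string-collection input expects.
flat = [phrase for phrases in per_page for phrase in phrases]

print(is_string_collection(per_page))  # nested shape
print(is_string_collection(flat))      # flat shape
```

Running the extraction once per page is what produces the nested shape; the mismatch goes away only once the per-page lists are flattened into one list.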

https://i.imgur.com/dBD4qgb.png

This is the error reported by Azure:

Required skill input was not of the expected type 'StringCollection'. Name: 'itemsToInsert', Source: '/document/content/pages/*/keyPhrases'. Expression language parsing issues:

What I ultimately want to achieve is to run the KeyPhraseExtractionSkill on text longer than 50,000 characters and add the result back to the index.

Skillset JSON:

{
  "@odata.context": "https://-----------.search.windows.net/$metadata#skillsets/$entity",
  "@odata.etag": "\"0x8D957466A2C1E47\"",
  "name": "devalbertcollectionfilesskillset2",
  "description": null,
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "SplitSkill",
      "description": null,
      "context": "/document/content",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 1000,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
      "name": "EntityRecognitionSkill",
      "description": null,
      "context": "/document/content/pages/*",
      "categories": [
        "person",
        "quantity",
        "organization",
        "url",
        "email",
        "location",
        "datetime"
      ],
      "defaultLanguageCode": "en",
      "minimumPrecision": null,
      "includeTypelessEntities": null,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "persons",
          "targetName": "people"
        },
        {
          "name": "organizations",
          "targetName": "organizations"
        },
        {
          "name": "entities",
          "targetName": "entities"
        },
        {
          "name": "locations",
          "targetName": "locations"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
      "name": "KeyPhraseExtractionSkill",
      "description": null,
      "context": "/document/content/pages/*",
      "defaultLanguageCode": "en",
      "maxKeyPhraseCount": null,
      "modelVersion": null,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "keyPhrases",
          "targetName": "keyPhrases"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "name": "Merge Skill - keyPhrases",
      "description": null,
      "context": "/document",
      "insertPreTag": " ",
      "insertPostTag": " ",
      "inputs": [
        {
          "name": "itemsToInsert",
          "source": "/document/content/pages/*/keyPhrases"
        }
      ],
      "outputs": [
        {
          "name": "mergedText",
          "targetName": "keyPhrases"
        }
      ]
    }
  ],
  "cognitiveServices": {
    "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
    "key": "------",
    "description": "/subscriptions/13abe1c6-d700-4f8f-916a-8d3bc17bb41e/resourceGroups/mde-dev-rg/providers/Microsoft.CognitiveServices/accounts/mde-dev-cognitive"
  },
  "knowledgeStore": null,
  "encryptionKey": null
}

Please let me know if there is anything else that I can add to improve the question. Thanks!


  [1]: https://i.stack.imgur.com/GNf7F.png

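【Comments】:

Might want to remove your Cognitive Services key ;) As for a solution, the simplest approach is to do two merges: one that merges each page's array of key phrases, and then another that merges all the pages once each page has a single key phrase text (because they have been merged).
See stackoverflow.com/questions/61491809/….
Hi @JenniferMarsman-MSFT, thanks for your comment. I actually started from that question and used it as a reference. In my skill (shown in the JSON above) I do use it: I am passing in keyPhrases and expecting it to be merged into keyPhrases. But the skill does not accept this, because it does not seem to like an array of arrays (Required skill input was not of the expected type 'StringCollection').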

Tags: azure-cognitive-search azure-search-.net-sdk

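【Solution 1】:

You do not have to merge the key phrase outputs in order to insert them into the index.

Assuming your index already has a field named mykeyphrases of type Collection(Edm.String), add this indexer output field mapping to populate it with the key phrase outputs: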

"outputFieldMappings": [
  ...

  {
    "sourceFieldName": "/document/content/pages/*/keyPhrases/*",
    "targetFieldName": "mykeyphrases"
  },

  ...
]

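The /* at the end of sourceFieldName is essential for flattening the arrays of strings. The same path also works as a skill input if you want to pass the array of strings to another skill for further enrichment.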
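The effect of the trailing /* can be pictured with a small toy model in Python. This is a sketch under assumptions, not how Azure evaluates paths internally: the enrichment tree is modeled as plain nested dicts and lists, and `resolve` is a hypothetical helper in which each `*` segment fans out over the elements of a list.

```python
from typing import Any


def resolve(tree: Any, path: str) -> list[Any]:
    """Toy path resolver: each '*' segment fans out over list elements."""
    current = [tree]
    for segment in path.strip("/").split("/"):
        next_nodes: list[Any] = []
        for node in current:
            if segment == "*":
                next_nodes.extend(node)        # one entry per list element
            else:
                next_nodes.append(node[segment])
        current = next_nodes
    return current


# Simplified stand-in for an enriched document with two pages.
tree = {
    "document": {
        "content": {
            "pages": [
                {"keyPhrases": ["search index", "skillset"]},
                {"keyPhrases": ["key phrase"]},
            ]
        }
    }
}

nested = resolve(tree, "/document/content/pages/*/keyPhrases")
flat = resolve(tree, "/document/content/pages/*/keyPhrases/*")

print(nested)  # [['search index', 'skillset'], ['key phrase']]
print(flat)    # ['search index', 'skillset', 'key phrase']
```

In this model, the path without the final /* stops at the per-page arrays, which matches the array-of-arrays shape from the question, while the extra /* steps into each array and yields a single flat list of strings.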

【Comments】:

Thank you! This is exactly what I needed.