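【Posted】:2021-10-09 14:08:16
【Question】:
I'm creating an indexer that takes documents, runs the KeyPhraseExtractionSkill, and writes its output back to the index.
For many documents this works out of the box. But for records longer than 50,000 characters it does not. OK, no problem; the documentation states this clearly.
The documentation suggests using the Text Split skill. What I did was use the Text Split skill to split the original document into pages and pass all the pages to the KeyPhraseExtractionSkill. We then need to merge the results back together, because we end up with an array of string arrays (one per page). Unfortunately, the Merge skill doesn't seem to accept an array of arrays, only a single array.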
https://i.imgur.com/dBD4qgb.png
This is the error Azure reports:
Required skill input was not of the expected type 'StringCollection'. Name: 'itemsToInsert', Source: '/document/content/pages/*/keyPhrases'. Expression language parsing issues:
What I ultimately want to achieve is to run the KeyPhraseExtractionSkill on text longer than 50,000 characters and add the result back to the index.
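To make the shape mismatch concrete, here is a minimal Python sketch that simulates the pipeline outside Azure. The `split_pages` and `fake_key_phrases` helpers are hypothetical stand-ins, not the real skills, and the sample text is made up. The only point is that running a per-page extractor yields a list of lists, while a step that expects a flat string collection needs one more flattening pass.

```python
from typing import List


def split_pages(text: str, max_len: int) -> List[str]:
    """Naive stand-in for a text split: fixed-size chunks of at most max_len characters."""
    return [text[i:i + max_len] for i in range(0, len(text), max_len)]


def fake_key_phrases(page: str) -> List[str]:
    """Hypothetical extractor: treats every capitalised word as a 'key phrase'."""
    return [w.strip(".,") for w in page.split() if w[:1].isupper()]


def flatten_unique(nested: List[List[str]]) -> List[str]:
    """Flatten a list of lists into one list, dropping duplicates but keeping order."""
    seen = set()
    flat = []
    for phrases in nested:
        for phrase in phrases:
            if phrase not in seen:
                seen.add(phrase)
                flat.append(phrase)
    return flat


text = "Azure Search indexes documents. Skills enrich Azure content."
pages = split_pages(text, max_len=32)
per_page = [fake_key_phrases(p) for p in pages]  # list of lists, one per page
flat = flatten_unique(per_page)                  # single flat list of strings
print(per_page)
print(flat)
```

In this toy model `per_page` has the same nested shape as `/document/content/pages/*/keyPhrases`, and `flat` is the shape a string-collection input would expect.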
Skillset JSON:
```
{
  "@odata.context": "https://-----------.search.windows.net/$metadata#skillsets/$entity",
  "@odata.etag": "\"0x8D957466A2C1E47\"",
  "name": "devalbertcollectionfilesskillset2",
  "description": null,
  "skills": [
    {
      "@odata.type": "#Microsoft.Skills.Text.SplitSkill",
      "name": "SplitSkill",
      "description": null,
      "context": "/document/content",
      "defaultLanguageCode": "en",
      "textSplitMode": "pages",
      "maximumPageLength": 1000,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content"
        }
      ],
      "outputs": [
        {
          "name": "textItems",
          "targetName": "pages"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.EntityRecognitionSkill",
      "name": "EntityRecognitionSkill",
      "description": null,
      "context": "/document/content/pages/*",
      "categories": [
        "person",
        "quantity",
        "organization",
        "url",
        "email",
        "location",
        "datetime"
      ],
      "defaultLanguageCode": "en",
      "minimumPrecision": null,
      "includeTypelessEntities": null,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "persons",
          "targetName": "people"
        },
        {
          "name": "organizations",
          "targetName": "organizations"
        },
        {
          "name": "entities",
          "targetName": "entities"
        },
        {
          "name": "locations",
          "targetName": "locations"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.KeyPhraseExtractionSkill",
      "name": "KeyPhraseExtractionSkill",
      "description": null,
      "context": "/document/content/pages/*",
      "defaultLanguageCode": "en",
      "maxKeyPhraseCount": null,
      "modelVersion": null,
      "inputs": [
        {
          "name": "text",
          "source": "/document/content/pages/*"
        }
      ],
      "outputs": [
        {
          "name": "keyPhrases",
          "targetName": "keyPhrases"
        }
      ]
    },
    {
      "@odata.type": "#Microsoft.Skills.Text.MergeSkill",
      "name": "Merge Skill - keyPhrases",
      "description": null,
      "context": "/document",
      "insertPreTag": " ",
      "insertPostTag": " ",
      "inputs": [
        {
          "name": "itemsToInsert",
          "source": "/document/content/pages/*/keyPhrases"
        }
      ],
      "outputs": [
        {
          "name": "mergedText",
          "targetName": "keyPhrases"
        }
      ]
    }
  ],
  "cognitiveServices": {
    "@odata.type": "#Microsoft.Azure.Search.CognitiveServicesByKey",
    "key": "------",
    "description": "/subscriptions/13abe1c6-d700-4f8f-916a-8d3bc17bb41e/resourceGroups/mde-dev-rg/providers/Microsoft.CognitiveServices/accounts/mde-dev-cognitive"
  },
  "knowledgeStore": null,
  "encryptionKey": null
}
```
Please let me know if there is anything else that I can add to improve the question. Thanks!
[1]: https://i.stack.imgur.com/GNf7F.png
【Comments】:
-
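You might want to remove your Cognitive Services key ;) As for a solution, the simplest approach is to do two merges: one for each array of key phrases on each page, and then another that merges all the pages once, since each page then has a single key-phrase text (because its phrases were already merged).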
-
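Hi @JenniferMarsman-MSFT, thanks for your comment. In fact, I started from that question and used it as a reference. In my skill (shown in the JSON above) I do use it: I'm passing keyPhrases and expecting it to be merged into KeyPhrases. But the skill doesn't accept that, because it doesn't seem to like an array of arrays (Required skill input was not of the expected type 'StringCollection').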
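The two-stage merge suggested in the first comment can be modelled in plain Python as a rough sketch. This only illustrates the idea of collapsing each page's array to one string before merging across pages; the function names and sample data are hypothetical, and it makes no claim about how the Merge skill itself must be configured.

```python
from typing import List


def merge_page(phrases: List[str], sep: str = " ") -> str:
    """Stage 1: join one page's key phrases into a single string."""
    return sep.join(phrases)


def merge_document(per_page: List[List[str]], sep: str = " ") -> str:
    """Stage 2: join the per-page strings (now a flat list) into one string."""
    page_texts = [merge_page(p, sep) for p in per_page]
    return sep.join(t for t in page_texts if t)  # skip pages with no phrases


per_page = [["azure search", "indexer"], [], ["key phrases"]]
merged = merge_document(per_page)
print(merged)
```

After stage 1 every page contributes exactly one string, so stage 2 only ever sees a flat collection of strings rather than an array of arrays.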
Tags: azure-cognitive-search azure-search-.net-sdk