Azure Search: Indexing plain text inside ZIP archive - Answers

【Question title】: Azure Search: Indexing plain text inside ZIP archive
【Posted】: 2023-01-05 07:11:09
【Question description】:

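I'm trying to use Azure Search to index plain text files inside several zipped archives hosted on Azure Files, but I'm running into various problems, and the documentation says very little about indexing the contents of ZIP archives.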

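The zip files themselves get indexed, but I can't "crack open" the archives to index the text files inside them; the content field tries to hold the entire zip file. Does the content field need to be changed to a "complex type"?
As a test I removed the "content" index field, and now I hit a limit: "The document is '27789211' bytes, which exceeds the maximum size '16777216' bytes for document extraction for your current service tier. To ignore this error and continue indexing storage metadata for oversized blobs, set the 'indexStorageMetadataOnlyForOversizedDocuments' configuration parameter to true." The Azure Search SKU is Basic.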

Index:


    {
      "name" : "zipindex",
      "fields": [
          { "name": "ID", "type": "Edm.String", "key": true, "searchable": false },
          { "name": "metadata_storage_name", "type": "Edm.String", "searchable": false, "filterable": true, "sortable": true  },
          { "name": "metadata_storage_path", "type": "Edm.String", "searchable": false, "filterable": true, "sortable": true },
          { "name": "metadata_storage_size", "type": "Edm.Int64", "searchable": false, "filterable": true, "sortable": true  },
          { "name": "metadata_storage_content_type", "type": "Edm.String", "searchable": true, "filterable": true, "sortable": true }     
      ]
    }

Indexer:


    {
      "name" : "zipindexer",
      "dataSourceName" : "datasource",
      "targetIndexName" : "zipindex",
      "parameters": {
         "batchSize": null,
         "maxFailedItems": null,
         "maxFailedItemsPerBatch": null,
         "base64EncodeKeys": null,
         "configuration": {
            "indexedFileNameExtensions" : ".zip,.txt,.ini,.vzg",
            "excludedFileNameExtensions" : ".png,.jpeg,.dat,.img"
        }
      },
      "schedule" : { },
      "fieldMappings" : [ ]
    }

【Comments】:

Tags: azure azure-cognitive-search

【Solution 1】:

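It looks to me like you are actually exceeding the limit on field length in the index. If that is the case, there is not much you can do other than split the large text files into smaller ones.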
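One way to do that splitting is to unpack the archives yourself before indexing and write each text file out as several smaller pieces. The following is only a minimal Python sketch under some assumptions: the text files are UTF-8, the pieces would afterwards be uploaded as separate files for the indexer to pick up, and the names `split_utf8` and `extract_text_chunks` as well as the 4 MiB default are made up for illustration (the default is simply a value below the 16777216-byte limit quoted in the error). None of this is part of the Azure Search API.

```python
import io
import zipfile


def split_utf8(data: bytes, max_bytes: int) -> list[bytes]:
    """Split UTF-8 bytes into pieces of at most max_bytes each,
    without cutting through a multi-byte character."""
    if max_bytes < 4:
        raise ValueError("max_bytes must be at least 4 (longest UTF-8 character)")
    chunks = []
    start = 0
    while start < len(data):
        end = min(start + max_bytes, len(data))
        # Continuation bytes look like 10xxxxxx; back off until the cut
        # lands on the first byte of a character.
        while end < len(data) and end > start and (data[end] & 0xC0) == 0x80:
            end -= 1
        if end <= start:
            # Input was not valid UTF-8; fall back to a hard cut.
            end = min(start + max_bytes, len(data))
        chunks.append(data[start:end])
        start = end
    return chunks


def extract_text_chunks(
    zip_bytes: bytes,
    extensions: tuple[str, ...] = (".txt", ".ini"),  # assumed text extensions
    max_bytes: int = 4 * 1024 * 1024,  # illustrative value, not an Azure limit
) -> dict[str, bytes]:
    """Return {"<entry name>.partNNNN": chunk} for every matching ZIP entry."""
    result = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for info in archive.infolist():
            if info.is_dir() or not info.filename.lower().endswith(extensions):
                continue
            data = archive.read(info)
            for i, chunk in enumerate(split_utf8(data, max_bytes)):
                result[f"{info.filename}.part{i:04d}"] = chunk
    return result


if __name__ == "__main__":
    # Build a small in-memory archive to show the behaviour.
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as z:
        z.writestr("notes.txt", "some plain text\n" * 5)
        z.writestr("image.png", b"\x89PNG")
    for name, chunk in extract_text_chunks(buf.getvalue(), max_bytes=32).items():
        print(name, len(chunk))
```

Each returned piece could then be written back to the file share under its own name, so that no single file exceeds the extraction limit for the service tier.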

【Comments】: