如何将特殊格式的数据导入 Elasticsearch？答案

【问题标题】：How to import special-format data into Elasticsearch?如何将特殊格式的数据导入 Elasticsearch？
【发布时间】：2021-01-11 03:01:41
【问题描述】：

我收到了一个 15 GB 的 .txt 文件，格式如下：

{
  "_score": 1.0,
  "_index": "newsvit",
  "_source": {
    "content": " \u0641\u0647\u06cc\u0645\u0647 \u062d\u0633\u0646\u200c\u0645\u06cc\u0631\u06cc:  ",
    "title": "\u06a9\u0627\u0631\u0647\u0627\u06cc \u0642\u0627\u0644\u06cc\u0628\u0627\u0641 ",
    "lead": "\u062c\u0627\u0645\u0639\u0647&nbsp;&gt;&nbsp;\u0634\u0647\u0631\u06cc - 
    \u0645\u06cc\u0632\u06af\u0631\u062f\u06cc \u062f\u0631\u0628\u0627\u0631\u0647 .",
    "agency": "13",
    "date_created": 1494518193,
    "url": "http://www.khabaronline.ir/(X(1)S(bud4wg3ebzbxv51mj45iwjtp))/detail/663749/society/urban",
    "image": "uploads/2017/05/11/1589793661.jpg",
    "category": "15"
  },
  "_type": "news",
  "_id": "2981643"
}
{
  "_score": 1.0,
  "_index": "newsvit",
  "_source": {
    "content": "\u0645/\u0630",
    "title": "\u0645\u0639\u0646\u0648\u06cc\u062a \u062f\u0631 \u0639\u0635\u0631 ",
    "lead": "\u0645\u062f\u06cc\u0631 \u0645\u0624\u0633\u0633\u0647 \u0639\u0644\u0645\u06cc \u0648 \u067e\u0698\u0648\u0647\u0634\u06cc \u0627\u0628\u0646\u200c\u0633\u06cc\u0646\u0627 \u062f\u0631 .",
    "agency": "1",
    "date_created": 1494521817,
    "url": "http://www.farsnews.com/13960221001386",
    "image": "uploads/2017/05/11/1713799235.jpg",
    "category": "20"
  },
  "_type": "news",
  "_id": "2981951"
}
....

我想将它导入elasticsearch。我尝试过 BulkAPI，但由于它只接受特定样式的 JSON，我无法将整个 15 GB 文件转换为 Bulk 格式。我也尝试过 logstash，但像 content 这样的字段将无法搜索和查询。

将此文件导入elasticsearch最有效的方法是什么？

【问题讨论】：

标签： json elasticsearch

【解决方案1】：

首先，这似乎是来自名为 news 的 ElasticSearch 索引的类似 JSON 的导出。因此，如果您仍然可以访问/可以访问该索引，则它是从远程集群到您自己的 possible to reindex。

话虽如此，我建议编写一个脚本（例如在 python 中，也可以是 bash）将其转换为 json。您的文件已经几乎像 json - 它只是缺少包装括号 [{}, {}, {}] 和分隔各个对象的逗号。

您可以在任何文本文件quite easily 中添加[ 和]。或者，也有文本编辑器可以神奇地打开如此大的文件——对于 Mac，例如 HexFiend。可能还有一个 windows 替代方案。

完成后，您可以编写一个脚本来通过正则表达式分割文本文件，例如/^\}$/gm -- testable here。拆分后，您可以使用逗号字符 , 和

将其重新加入

另存为.json，然后使用批量@data-binary 选项
或使用批量 DSL API -- example again in python

开始使用 python 加载器脚本：

import json
import re

# load the large text file 
text_str = '...'


my_json_list = map(json.loads,
             # ^^^ iterate & convert to a python dict all objects
                   [
                       # that are contained in the split list
                       lineitem for lineitem in re.split(
                           # by replacing '\n'
                           "\n",
                           # that we introduced after replacing the
                           # current object separators, i.e. '}{' with '\n'
                           re.sub(r'}{', '}\n{',
                                  # after getting rid of the original
                                  # line-breaks
                                  re.sub(r'\n', '', text_str)))
                   ]
            )

【讨论】：

我应该如何通过正则表达式分割文件？我尝试将文件转换为字符串，然后使用 re.split 但它没有完成。你能设置一个代码示例吗？
事实证明，仅用}\n{ 分割并不容易，因为您的文本文件确实包含一些在 json 中实际上不允许的其他换行符。以上是我正在谈论的re.split 示例。希望对您有所帮助。