Is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server?

【Question title】: Is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server?
【Posted】: 2014-01-05 23:07:51
【Question description】:

Is there any way to import a JSON file (containing 100 documents) into an Elasticsearch server? I want to import one large JSON file into the ES server.

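【Question comments】:

I know about the bulk API, but I don't want to use it because it requires manually editing fields and schemas. I want to upload the JSON file in one go.
I used the bulk API, but it requires manual editing. I want to import my JSON as is. Thanks for your reply anyway.
I found stream2es (for streaming input) and FSRiver; to some extent these work for me.

Tags: json elasticsearch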

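【Solution 1】:

There is no import as such, but you can index the documents using the ES API.

You can use the index API to load each line (with some code that reads the file and makes the curl calls), or the bulk API to load them all at once, assuming your data file can be formatted to work with it.

Read more here: ES API

If you are comfortable with shell, a simple script like this (untested) could do the trick: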

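# Read the file line by line; each line must be one complete JSON document.
while IFS= read -r line
do
  # POST lets Elasticsearch generate the ID; Elasticsearch 6.0 and later also require the Content-Type header.
  curl -XPOST 'http://localhost:9200/<indexname>/<typeofdoc>/' -H 'Content-Type: application/json' -d "$line"
done <myfile.json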
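
For readers who prefer Python over shell, the same one-request-per-line idea might look roughly like the sketch below. It uses only the standard library; the `base_url`, the `index` name, and the `/_doc` path are assumptions for illustration (older clusters use `/<indexname>/<typeofdoc>/` as in the loop above), and the function names are hypothetical.

```python
import json
import urllib.request


def build_index_requests(lines, base_url="http://localhost:9200", index="myindex"):
    """Yield one POST request per non-empty line of JSON.

    base_url, index and the /_doc path are assumptions for illustration.
    """
    url = f"{base_url}/{index}/_doc"
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        json.loads(line)  # fail early on a line that is not valid JSON
        yield urllib.request.Request(
            url,
            data=line.encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )


def index_file(path, **kwargs):
    """Send every line of the file at `path`, one request per document."""
    with open(path, encoding="utf-8") as handle:
        for request in build_index_requests(handle, **kwargs):
            with urllib.request.urlopen(request) as response:
                response.read()
```

Calling `index_file("myfile.json", index="myindex")` would then send the documents one at a time. Building the requests separately from sending them makes it possible to inspect them without a running server.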

Personally, I would probably use Python, with either pyes or the Elasticsearch client.

pyes on github
elastic search python client

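Stream2es is also very useful for quickly loading data into ES, and there may be a way to simply stream a file in. (I have not tested it with a file, but I have used it to load Wikipedia docs for ES performance testing.)

【Comments】:

This is the simplest approach.
Thanks for the valuable answer, mconlin.
A couple of things you should fix: the correct method is POST, since the PUT endpoint expects you to specify an ID (reference); '$line' is sent as the literal text $line, so it should be "$line". Otherwise thanks, this is exactly what I was looking for.
Edits made. Thanks for the code review of the untested snippet above.

【Solution 2】:

You should use the Bulk API. Note that you need to add a header line before each JSON document.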

$ cat requests
{ "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
{ "field1" : "value1" }
$ curl -s -XPOST localhost:9200/_bulk --data-binary @requests; echo
{"took":7,"items":[{"create":{"_index":"test","_type":"type1","_id":"1","_version":1,"ok":true}}]}
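
As a rough illustration of the header-line format, the sketch below builds a bulk body from a list of already-parsed documents. The function name `to_bulk_body` is hypothetical; when `index` is omitted the header is left empty, which assumes the request goes to an index-specific `_bulk` endpoint.

```python
import json


def to_bulk_body(docs, index=None):
    """Return an NDJSON bulk body: one action line, then one source line, per doc."""
    # An empty action object relies on the index being named in the URL.
    action = {"index": {"_index": index} if index else {}}
    lines = []
    for doc in docs:
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    # Every line, including the last one, must end with a newline.
    return "".join(line + "\n" for line in lines)
```

The returned string can be written to a file such as `requests` and sent with `curl --data-binary`, as in the example above.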

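【Comments】:

A header line? Could you explain that part?
This is a header: { "index" : { "_index" : "test", "_type" : "type1", "_id" : "1" } }
No. If you use the index/type/_bulk endpoint, you can also omit _index and _type.
If we remove _id, _index and _type, the object will be empty, like this: { "index" : { } }. Is that OK?
FYI (in case anyone runs into this), yes, it works with an empty index header, as @The Red Pea wrote.

【Solution 3】:

I'm sure someone wants this, so I'll make it easy to find.

FYI - this uses Node.js (essentially as a batch script) on the same server as a brand-new ES instance. Running it on 2 files of 4000 items each took only about 12 seconds on my shared virtual server. YMMV.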

var elasticsearch = require('elasticsearch'),
    fs = require('fs'),
    pubs = JSON.parse(fs.readFileSync(__dirname + '/pubs.json')), // name of my first file to parse
    forms = JSON.parse(fs.readFileSync(__dirname + '/forms.json')); // and the second set
var client = new elasticsearch.Client({  // default is fine for me, change as you see fit
  host: 'localhost:9200',
  log: 'trace'
});

for (var i = 0; i < pubs.length; i++ ) {
  client.create({
    index: "epubs", // name your index
    type: "pub", // describe the data that's getting created
    id: i, // increment ID every iteration - I already sorted mine but not a requirement
    body: pubs[i] // *** THIS ASSUMES YOUR DATA FILE IS FORMATTED LIKE SO: [{prop: val, prop2: val2}, {prop:...}, {prop:...}] - I converted mine from a CSV so pubs[i] is the current object {prop:..., prop2:...}
  }, function(error, response) {
    if (error) {
      console.error(error);
      return;
    }
    else {
    console.log(response);  //  I don't recommend this but I like having my console flooded with stuff.  It looks cool.  Like I'm compiling a kernel really fast.
    }
  });
}

for (var a = 0; a < forms.length; a++ ) {  // Same stuff here, just slight changes in type and variables
  client.create({
    index: "epubs",
    type: "form",
    id: a,
    body: forms[a]
  }, function(error, response) {
    if (error) {
      console.error(error);
      return;
    }
    else {
    console.log(response);
    }
  });
}

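Hopefully this helps someone other than just me. It's not rocket science, but it might save someone 10 minutes.

Cheers

【Comments】:

There's something here I don't understand. Won't this result in pubs.length + forms.length separate operations, rather than just one, which is the whole point of _bulk? I found this thread, where @keety's answer uses client.bulk() to insert everything in a single operation, which makes more sense IMO.
@JeremyThille That is indeed the better approach. At the time I wrote this, either I hadn't gotten that far in the docs or it wasn't an option yet, and this worked for my very specific use case. These days I don't use the JS client at all; I call /_bulk directly with all the data combined.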
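
The "call /_bulk with the data combined" approach mentioned in the comments could be sketched as follows. This is an illustration under assumptions, not the commenters' code: `bulk_batches` and its `batch_size` default are hypothetical, mapping types are left out, and each document simply gets its array position as `_id`, mirroring the loop counter in the answer.

```python
import json


def bulk_batches(docs, index, batch_size=1000):
    """Yield NDJSON bulk bodies holding at most batch_size documents each.

    batch_size is an arbitrary default chosen for illustration.
    """
    lines = []
    for position, doc in enumerate(docs):
        # Action line first, then the document itself.
        lines.append(json.dumps({"index": {"_index": index, "_id": str(position)}}))
        lines.append(json.dumps(doc))
        if len(lines) == 2 * batch_size:
            yield "\n".join(lines) + "\n"
            lines = []
    if lines:
        # Flush the final, possibly smaller, batch.
        yield "\n".join(lines) + "\n"
```

With 8000 documents and the default batch size, this would produce 8 request bodies instead of 8000 separate create calls.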

【Solution 4】:

Stream2es is the easiest way, IMO.

For example, assuming a file "some.json" contains a list of JSON documents, one per line:

curl -O download.elasticsearch.org/stream2es/stream2es; chmod +x stream2es
cat some.json | ./stream2es stdin --target "http://localhost:9200/my_index/my_type"

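【Comments】:

Is the second line correct? I had to use the stdin command, like this: cat some.json | ./stream2es stdin --target http://localhost:9200/myindex/mytype

【Solution 5】:

As dadoonet already mentioned, the bulk API is probably the way to go. To convert your file for the bulk protocol, you can use jq.

Assuming the file contains only the documents themselves: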

$ echo '{"foo":"bar"}{"baz":"qux"}' | 
jq -c '
{ index: { _index: "myindex", _type: "mytype" } },
. '

{"index":{"_index":"myindex","_type":"mytype"}}
{"foo":"bar"}
{"index":{"_index":"myindex","_type":"mytype"}}
{"baz":"qux"}

If the file contains the documents in a top-level list, they have to be unwrapped first:

$ echo '[{"foo":"bar"},{"baz":"qux"}]' | 
jq -c '
.[] |
{ index: { _index: "myindex", _type: "mytype" } },
. '

{"index":{"_index":"myindex","_type":"mytype"}}
{"foo":"bar"}
{"index":{"_index":"myindex","_type":"mytype"}}
{"baz":"qux"}

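jq's -c flag makes sure that each document ends up on a single line.

If you want to pipe straight to curl, you need to use --data-binary @-, not just -d, otherwise curl will strip the newlines again.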
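
Where jq is not available, the same transformation can be approximated in Python. The sketch below accepts both input shapes shown above (concatenated documents or one top-level list); the function names and the default `index` are hypothetical, and it leaves out `_type`, which newer Elasticsearch versions have removed.

```python
import json


def iter_documents(text):
    """Yield documents from text holding concatenated JSON values
    or a single top-level list of documents."""
    decoder = json.JSONDecoder()
    position = 0
    while position < len(text):
        # raw_decode does not skip leading whitespace, so do it by hand.
        while position < len(text) and text[position].isspace():
            position += 1
        if position == len(text):
            break
        value, position = decoder.raw_decode(text, position)
        if isinstance(value, list):
            yield from value  # unwrap a top-level list, like jq's .[]
        else:
            yield value


def to_bulk_lines(text, index="myindex"):
    """Return bulk-format lines: an action line followed by each document."""
    compact = {"separators": (",", ":")}  # one compact line per value, like jq -c
    action = json.dumps({"index": {"_index": index}}, **compact)
    lines = []
    for doc in iter_documents(text):
        lines.append(action)
        lines.append(json.dumps(doc, **compact))
    return lines
```

Joining the result with newlines, plus a trailing newline, gives a body suitable for `curl --data-binary`.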

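【Comments】:

This answer was very helpful. Based on your explanation alone I was able to figure out how to make this work. If I could upvote twice, I would!
Thanks for the tip about using --data-binary. It answered my question perfectly.
It's a pity Elasticsearch doesn't offer first-class support for importing huge JSON files OOTB (jq is not feasible for Windows users, and it's a bit hacky).

【Solution 6】:

You can use esbulk, a fast and simple bulk indexer: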

$ esbulk -index myindex file.ldj

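Here is an asciicast showing it loading Project Gutenberg data into Elasticsearch in about 11 seconds.

Disclaimer: I'm the author.

【Comments】:

【Solution 7】:

You can use the Elasticsearch Gatherer plugin.

The Gatherer plugin for Elasticsearch is a framework for scalable data fetching and indexing. Content adapters are implemented in Gatherer zip archives, which are a special kind of plugin that can be distributed across Elasticsearch nodes. They can receive job requests and execute them in local queues. Job states are kept in a special index.

This plugin is under development.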

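Milestone 1 - deploy gatherer zips to nodes

Milestone 2 - job specification and execution

Milestone 3 - port JDBC River to JDBC Gatherer

Milestone 4 - distribute gatherer jobs by load/queue length/node name, cron jobs

Milestone 5 - more gatherers, more content adapters

Reference: https://github.com/jprante/elasticsearch-gatherer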

【Comments】:

【Solution 8】:

One way is to create a bash script that performs a bulk insert:

curl -XPOST 'http://127.0.0.1:9200/myindexname/type/_bulk?pretty=true' --data-binary @myjsonfile.json

After running the insert, run this command to get the count:

curl http://127.0.0.1:9200/myindexname/type/_count
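
Besides comparing counts, the JSON returned by the `_bulk` call itself can be inspected for per-document failures. The sketch below assumes a response shaped like the ones I have seen from the bulk API, with an `items` list whose entries carry a `status` and, on failure, an `error`; the function name is hypothetical, and older responses without `status` are treated as successes.

```python
def summarize_bulk_response(response):
    """Return (succeeded, failures) for a parsed _bulk response body.

    failures is a list of (document id, error) pairs.
    """
    succeeded = 0
    failures = []
    for item in response.get("items", []):
        # Each item has a single key naming the action (index, create, ...).
        result = next(iter(item.values()))
        if "error" in result or result.get("status", 200) >= 300:
            failures.append((result.get("_id"), result.get("error")))
        else:
            succeeded += 1
    return succeeded, failures
```

The first number should match the increase reported by `_count` once the index has been refreshed.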

【Comments】:

【Solution 9】:

jq is a lightweight and flexible command-line JSON processor.

Usage:

cat file.json | jq -c '.[] | {"index": {"_index": "bookmarks", "_type": "bookmark", "_id": .id}}, .' | curl -XPOST localhost:9200/_bulk --data-binary @-

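We take the file file.json and first pipe its contents to jq with the -c flag to construct compact output. Here is the key point: we take advantage of the fact that jq can construct not just one but multiple objects per line of input. For each line, we create the control JSON that Elasticsearch needs (with the ID from our original object) and a second line that is just our original JSON object (.).

At this point our JSON is formatted the way Elasticsearch's bulk API expects, so we just pipe it to curl, which POSTs it to Elasticsearch!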
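
A rough Python counterpart of the jq-plus-curl pipeline is sketched below. It takes the `_id` from each document's own `id` field and prepares the POST request without sending it. The function name, the default `index`, and the URL are assumptions for illustration, and the `application/x-ndjson` content type is the one I understand recent versions to accept for bulk bodies.

```python
import json
import urllib.request


def build_bulk_request(docs, index="bookmarks", id_field="id",
                       url="http://localhost:9200/_bulk"):
    """Build a POST request that pairs each document with an action line
    carrying the document's own id."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": doc[id_field]}}))
        lines.append(json.dumps(doc))
    # Newline-terminate every line, including the last one.
    body = "".join(line + "\n" for line in lines)
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/x-ndjson"},
        method="POST",
    )
```

Loading file.json with `json.load` and passing the resulting list to `build_bulk_request`, then handing the request to `urllib.request.urlopen`, would send everything in a single call.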

Credit goes to Kevin Marsh

【Comments】:

Thanks a lot for this. It's a great answer.