How to map/index nested Twitter data (JSON) for Elasticsearch

【Question title】: How to map/index nested Twitter data (json) for Elasticsearch
【Posted】: 2017-09-19 08:07:23
【Question description】:

I have collected a large dataset from Twitter. The file (twitter.json) contains lines like this:

    [ {"created_at":"Sun Apr 16 00:00:10 +0000 2017","id":853397569807958016,"id_str":"853397569807958016","text":"\u3042\u3048\u3066\u8a00\u3046\u3051\u3069\u3001\u6642\u9593\u3060\u3088\uff01\u4f55\u304b\u3059\u308b\u3053\u3068\u3001\u3042\u3063\u305f\u3093\u3058\u3083\u306a\u3044\uff1f(\u30a8\u30b3\u30ed)","source":"\u003ca href=\"http:\/\/makebot.sh\" rel=\"nofollow\"\u003e\u3077\u3088\u3077\u3088\u30c9\u30e9\u30deCDbot\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2230278991,"id_str":"2230278991","name":"\u3077\u3088\u3077\u3088\u304a\u307e\u3051\u30dc\u30a4\u30b9bot","screen_name":"puyo_cd_bot","location":null,"url":"http:\/\/twpf.jp\/puyo_cd_bot","description":"\u53ea\u4eca\u591a\u5fd9\u306e\u305f\u3081\u66f4\u65b0\u304c\u505c\u6ede\u3057\u3066\u3044\u307e\u3059\u3001\u3054\u4e86\u627f\u304f\u3060\u3055\u3044\u3002\u3077\u3088\u3077\u3088\u30c9\u30e9\u30decd\u306e\u304a\u307e\u3051\u30dc\u30a4\u30b9\u306e\u5b9a\u671f\u3064\u3076\u3084\u304d\u3001\u4e00\u90e8\u30ea\u30d7\u30e9\u30a4\u3067\u306e\u53cd\u5fdc\u3092\u8003\u3048\u3066\u3044\u307e\u3059\u3002\u975e\u516c\u5f0f\u3002\u767b\u9332\u6e08\u307f\u30ad\u30e3\u30e9\u306a\u3069\u8a73\u3057\u304f\u306f\u3064\u3044\u3077\u308d\u306b\u3066","protected":false,"verified":false,"followers_count":181,"friends_count":115,"listed_count":3,"favourites_count":0,"statuses_count":44139,"created_at":"Wed Dec 04 17:43:08 +0000 
2013","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"ja","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/378800000837231449\/beca2ed4c8ce917b37dcbe188d0f9e31_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/378800000837231449\/beca2ed4c8ce917b37dcbe188d0f9e31_normal.jpeg","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"ja","timestamp_ms":"1492300810659"}
    , { ... }
    ...  
    , { ... } 
    ]

To give you a better visual, the 1st tweet line looks like this after validation:

   {"created_at": "Sun Apr 16 00:00:10 +0000 2017",
    "id": 853397569807958016,
    "id_str": "853397569807958016",
    "text": "\u3042\u3048\u3066\u8a00\u3046\u3051\u3069\u3001\u6642\u9593\u3060\u3088\uff01\u4f55\u304b\u3059\u308b\u3053\u3068\u3001\u3042\u3063\u305f\u3093\u3058\u3083\u306a\u3044\uff1f(\u30a8\u30b3\u30ed)",
    "source": "\u003ca href=\"http:\/\/makebot.sh\" rel=\"nofollow\"\u003e\u3077\u3088\u3077\u3088\u30c9\u30e9\u30deCDbot\u003c\/a\u003e",
    "truncated": false,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 2230278991,
        "id_str": "2230278991",
        "name": "\u3077\u3088\u3077\u3088\u304a\u307e\u3051\u30dc\u30a4\u30b9bot",
        "screen_name": "puyo_cd_bot",
        "location": null,
        "url": "http:\/\/twpf.jp\/puyo_cd_bot",
        "description": "\u53ea\u4eca\u591a\u5fd9\u306e\u305f\u3081\u66f4\u65b0\u304c\u505c\u6ede\u3057\u3066\u3044\u307e\u3059\u3001\u3054\u4e86\u627f\u304f\u3060\u3055\u3044\u3002\u3077\u3088\u3077\u3088\u30c9\u30e9\u30decd\u306e\u304a\u307e\u3051\u30dc\u30a4\u30b9\u306e\u5b9a\u671f\u3064\u3076\u3084\u304d\u3001\u4e00\u90e8\u30ea\u30d7\u30e9\u30a4\u3067\u306e\u53cd\u5fdc\u3092\u8003\u3048\u3066\u3044\u307e\u3059\u3002\u975e\u516c\u5f0f\u3002\u767b\u9332\u6e08\u307f\u30ad\u30e3\u30e9\u306a\u3069\u8a73\u3057\u304f\u306f\u3064\u3044\u3077\u308d\u306b\u3066",
        "protected": false,
        "verified": false,
        "followers_count": 181,
        "friends_count": 115,
        "listed_count": 3,
        "favourites_count": 0,
        "statuses_count": 44139,
        "created_at": "Wed Dec 04 17:43:08 +0000 2013",
        "utc_offset": null,
        "time_zone": null,
        "geo_enabled": false,
        "lang": "ja",
        "contributors_enabled": false,
        "is_translator": false,
        "profile_background_color": "C0DEED",
        "profile_background_image_url": "http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
        "profile_background_image_url_https": "https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
        "profile_background_tile": false,
        "profile_link_color": "1DA1F2",
        "profile_sidebar_border_color": "C0DEED",
        "profile_sidebar_fill_color": "DDEEF6",
        "profile_text_color": "333333",
        "profile_use_background_image": true,
        "profile_image_url": "http:\/\/pbs.twimg.com\/profile_images\/378800000837231449\/beca2ed4c8ce917b37dcbe188d0f9e31_normal.jpeg",
        "profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/378800000837231449\/beca2ed4c8ce917b37dcbe188d0f9e31_normal.jpeg",
        "default_profile": true,
        "default_profile_image": false,
        "following": null,
        "follow_request_sent": null,
        "notifications": null
    },
    "geo": null,
    "coordinates": null,
    "place": null,
    "contributors": null,
    "is_quote_status": false,
    "retweet_count": 0,
    "favorite_count": 0,
    "entities": {
        "hashtags": [],
        "urls": [],
        "user_mentions": [],
        "symbols": []
    },
    "favorited": false,
    "retweeted": false,
    "filter_level": "low",
    "lang": "ja",
    "timestamp_ms": "1492300810659"
   }

Problem:

I tried to import this .json file into Elasticsearch with the following command:

curl -XPOST 'http://localhost:9200/twitter/tweet/1' --data-binary "@/Users/jz/Documents/elasticsearch-5.3.0/twitter.json"

But it gave me this error:

**{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}**

I also tried the following, but it still failed:

curl --header "Content-Type:application/json"  -XPOST 'http://localhost:9200/twitter/tweet/1' --data-binary "@/Users/jz/Documents/elasticsearch-5.3.0/twitter.json"

The error message is:

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"not_x_content_exception","reason":"Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"}},"status":400}

I also tried changing it to -d, but it failed again:

curl -XPOST 'http://localhost:9200/twitter/tweet/1' -d "@/Users/jz/Documents/elasticsearch-5.3.0/twitter.json"

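The error message is the same as the one I got with --data-binary.

Update:

Since curl was too much trouble, I decided to use the Python elasticsearch library. After connecting to localhost successfully, I used something like this to index the sample data: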

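import json
from elasticsearch import Elasticsearch

es = Elasticsearch([{"host": "localhost", "port": 9200}])
with open('sample0.json') as json_data:
    json_docs = json.load(json_data)  # expects the file to hold a JSON array of documents
    for json_doc in json_docs:
        my_id = json_doc.pop('_id', None)  # use the document's own _id if it has one
        es.index(index='testdata', doc_type='generated', id=my_id, body=json.dumps(json_doc))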

Error:

C:\Anaconda2\lib\json\decoder.pyc in raw_decode(self, s, idx)
    378         """
    379         try:
--> 380             obj, end = self.scan_once(s, idx)
    381         except StopIteration:
    382             raise ValueError("No JSON object could be decoded")

ValueError: Expecting , delimiter: line 1 column 2241 (char 2240)

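Can anyone give me some guidance? Thanks!

【Question comments】:

Could you provide the content of the JSON file around line 1, char 2240?

Tags: javascript json parsing object elasticsearch

【Solution 1】:

You can supply a header to indicate that the request body is JSON: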

curl --header "Content-Type:application/json" ...

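You can also use "-d" instead of "--data-binary", but the header is still required: both options send "application/x-www-form-urlencoded" by default, and "-d" additionally strips newlines from the file.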
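For readers who would rather stay in Python than curl, the same idea (an explicit Content-Type header on the POST) can be sketched with the standard library. This is a minimal sketch, assuming Python 3; the helper name build_index_request is hypothetical, and the URL is the one from the question. The request is only built here, not sent.

```python
import json
import urllib.request


def build_index_request(url, document):
    """Build (but do not send) a POST request carrying `document` as JSON."""
    body = json.dumps(document).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        # Without this header the body is treated as form-encoded data.
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative document; the URL is taken from the question.
req = build_index_request(
    "http://localhost:9200/twitter/tweet/1",
    {"text": "hello", "lang": "en"},
)
print(req.get_method(), req.full_url)
print(req.header_items())
# To actually send it (needs a running Elasticsearch):
# urllib.request.urlopen(req)
```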

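As explained here: https://stackoverflow.com/a/35213617/5520709, wrap your array in {"root":[...]} so that the request body is a JSON object rather than a bare array.

Note that this will index your entire JSON as a single document, which is probably not what you want. If you want to index one document per tweet, you may want to use the bulk API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html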
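The wrapping step can be done by hand, but a small script avoids editing a large file manually. The following is a minimal sketch, assuming Python 3 and a file whose top level is a JSON array; the function name wrap_array is hypothetical, and "root" is simply the key suggested above.

```python
import json


def wrap_array(json_text, key="root"):
    """Parse a top-level JSON array and return it wrapped in a JSON object."""
    docs = json.loads(json_text)
    if not isinstance(docs, list):
        raise ValueError("expected a top-level JSON array")
    # ensure_ascii=False keeps non-ASCII tweet text readable in the output.
    return json.dumps({key: docs}, ensure_ascii=False)


sample = '[{"id": 1, "lang": "ja"}, {"id": 2, "lang": "en"}]'
print(wrap_array(sample))
```

As the answer points out, the result is still one large document, so this mainly helps to confirm that the request itself is accepted.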

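【Comments】:

Hi Damien, thanks for your answer, but neither worked in my terminal. I posted the error messages and updated my question above. Please take a look :)
OK, for this new error (not_x_content_exception), you can find a solution here: stackoverflow.com/a/35213617/5520709 (wrap your array in {root:[...]})
Thanks, Damien. I just tried it and it still doesn't work: {"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"json_parse_exception","reason":"Unexpected character ('r' (code 114)): was expecting double-quote to start field name\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@72bbeb6b; line: 1, column: 3]"}},"status":400}
OK, try {"root":[...]}
(Note that this will index your entire JSON as a single document, which is probably not what you want. If you want to index one document per tweet, you may want to use the bulk API.)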

【Solution 2】:

You can turn it into an array by joining the lines with a comma (,) and enclosing the result in square brackets ([]), like this:

'[' + s.join(',') + ']'

If you need to validate them individually, pass s[i] to the JSON.parse function instead of sTemp.
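Since the question later switches to Python, here is a rough Python counterpart of the two steps above: checking each line on its own, then joining the lines into an array. It is a minimal sketch, assuming Python 3 and that the input is a list of strings with one JSON object per line; the function names are hypothetical.

```python
import json


def find_invalid_lines(lines):
    """Return (line_number, error_message) for each non-empty line that is not valid JSON."""
    problems = []
    for number, line in enumerate(lines, start=1):
        stripped = line.strip()
        if not stripped:
            continue  # blank lines are skipped, not reported
        try:
            json.loads(stripped)
        except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
            problems.append((number, str(exc)))
    return problems


def lines_to_array(lines):
    """Join the non-empty lines with commas and wrap them in square brackets."""
    return "[" + ",".join(l.strip() for l in lines if l.strip()) + "]"


sample = ['{"id": 1}', '{"id": 2', '{"id": 3}']
for number, message in find_invalid_lines(sample):
    print("line", number, "is not valid JSON:", message)
```

Running the check first narrows a parse failure down to the specific line, which is easier to act on than a character offset into one very long line.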

Update

If you need to build a string to pass to Elasticsearch, you should convert your list of JSON objects into a file like the following:

{"index":{"_index":"my_index","_type":"tweet","_id":null}}
{"created_at":"Sun Apr 16 00:00:10 +0000 2017","id":1, ... }
{"index":{"_index":"my_index","_type":"tweet","_id":null}}
{"created_at":"Sun Apr 16 00:00:10 +0000 2017","id":2, ... }

Then pass its content to Elasticsearch, so that every document line is preceded by an action line telling Elasticsearch to index that document:

{"index":{"_index":"my_index","_type":"tweet","_id":null}}

See https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
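The preprocessing described above can be sketched in a few lines of Python. This is a minimal sketch, assuming Python 3 and that the tweets have already been parsed into a list of dicts; the function name to_bulk_body is hypothetical, and the index and type names are the ones used in the answer. The "_id" key is omitted so that Elasticsearch generates IDs, and "_type" may need to be dropped on newer Elasticsearch versions.

```python
import json


def to_bulk_body(docs, index="my_index", doc_type="tweet"):
    """Build a bulk request body: one action line, then one document line, per document."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_type": doc_type}}
        lines.append(json.dumps(action))
        # json.dumps emits no raw newlines, so each document stays on one line.
        lines.append(json.dumps(doc, ensure_ascii=False))
    # The bulk API expects the body to end with a newline.
    return "\n".join(lines) + "\n"


tweets = [
    {"created_at": "Sun Apr 16 00:00:10 +0000 2017", "id": 1},
    {"created_at": "Sun Apr 16 00:00:10 +0000 2017", "id": 2},
]
print(to_bulk_body(tweets), end="")
```

The resulting text can be written to a file and posted to the _bulk endpoint with curl using --data-binary, which preserves the newlines, and a Content-Type of application/x-ndjson.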

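【Comments】:

Hi @Random, thanks for your prompt answer! Very helpful. I have now changed my text to something like [ {...}, {...}] and it worked. However, another problem has come up. Please see my updated question!
It looks like you have several separate questions here. If you need to put a mapping, look here first: elastic.co/guide/en/elasticsearch/reference/current/…; if you need to insert data, look at the bulk API. Otherwise, describe in more detail what the current problem is.
Thanks for your comment. Yes, the problem is with the tweets. Each line is a nested JSON. Since curl on the command line gives so many problems, I wonder whether I could use the elasticsearch library in Python instead? Please see my updated post above. Thanks again!
curl will do. But you need to preprocess the file to get the right input for Elasticsearch. See the updated answer.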