【Question title】: How to map/index nested Twitter data (json) for Elasticsearch
【Posted】: 2017-09-19 08:07:23
【Question description】:

I collected a large dataset from Twitter. The file (twitter.json) contains lines like the following:

    [ {"created_at":"Sun Apr 16 00:00:10 +0000 2017","id":853397569807958016,"id_str":"853397569807958016","text":"\u3042\u3048\u3066\u8a00\u3046\u3051\u3069\u3001\u6642\u9593\u3060\u3088\uff01\u4f55\u304b\u3059\u308b\u3053\u3068\u3001\u3042\u3063\u305f\u3093\u3058\u3083\u306a\u3044\uff1f(\u30a8\u30b3\u30ed)","source":"\u003ca href=\"http:\/\/makebot.sh\" rel=\"nofollow\"\u003e\u3077\u3088\u3077\u3088\u30c9\u30e9\u30deCDbot\u003c\/a\u003e","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2230278991,"id_str":"2230278991","name":"\u3077\u3088\u3077\u3088\u304a\u307e\u3051\u30dc\u30a4\u30b9bot","screen_name":"puyo_cd_bot","location":null,"url":"http:\/\/twpf.jp\/puyo_cd_bot","description":"\u53ea\u4eca\u591a\u5fd9\u306e\u305f\u3081\u66f4\u65b0\u304c\u505c\u6ede\u3057\u3066\u3044\u307e\u3059\u3001\u3054\u4e86\u627f\u304f\u3060\u3055\u3044\u3002\u3077\u3088\u3077\u3088\u30c9\u30e9\u30decd\u306e\u304a\u307e\u3051\u30dc\u30a4\u30b9\u306e\u5b9a\u671f\u3064\u3076\u3084\u304d\u3001\u4e00\u90e8\u30ea\u30d7\u30e9\u30a4\u3067\u306e\u53cd\u5fdc\u3092\u8003\u3048\u3066\u3044\u307e\u3059\u3002\u975e\u516c\u5f0f\u3002\u767b\u9332\u6e08\u307f\u30ad\u30e3\u30e9\u306a\u3069\u8a73\u3057\u304f\u306f\u3064\u3044\u3077\u308d\u306b\u3066","protected":false,"verified":false,"followers_count":181,"friends_count":115,"listed_count":3,"favourites_count":0,"statuses_count":44139,"created_at":"Wed Dec 04 17:43:08 +0000 
2013","utc_offset":null,"time_zone":null,"geo_enabled":false,"lang":"ja","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_image_url_https":"https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png","profile_background_tile":false,"profile_link_color":"1DA1F2","profile_sidebar_border_color":"C0DEED","profile_sidebar_fill_color":"DDEEF6","profile_text_color":"333333","profile_use_background_image":true,"profile_image_url":"http:\/\/pbs.twimg.com\/profile_images\/378800000837231449\/beca2ed4c8ce917b37dcbe188d0f9e31_normal.jpeg","profile_image_url_https":"https:\/\/pbs.twimg.com\/profile_images\/378800000837231449\/beca2ed4c8ce917b37dcbe188d0f9e31_normal.jpeg","default_profile":true,"default_profile_image":false,"following":null,"follow_request_sent":null,"notifications":null},"geo":null,"coordinates":null,"place":null,"contributors":null,"is_quote_status":false,"retweet_count":0,"favorite_count":0,"entities":{"hashtags":[],"urls":[],"user_mentions":[],"symbols":[]},"favorited":false,"retweeted":false,"filter_level":"low","lang":"ja","timestamp_ms":"1492300810659"}
    , { ... }
    ...  
    , { ... } 
    ]

To give you a better visual, the 1st tweet line looks like this after validation:

   {"created_at": "Sun Apr 16 00:00:10 +0000 2017",
    "id": 853397569807958016,
    "id_str": "853397569807958016",
    "text": "\u3042\u3048\u3066\u8a00\u3046\u3051\u3069\u3001\u6642\u9593\u3060\u3088\uff01\u4f55\u304b\u3059\u308b\u3053\u3068\u3001\u3042\u3063\u305f\u3093\u3058\u3083\u306a\u3044\uff1f(\u30a8\u30b3\u30ed)",
    "source": "\u003ca href=\"http:\/\/makebot.sh\" rel=\"nofollow\"\u003e\u3077\u3088\u3077\u3088\u30c9\u30e9\u30deCDbot\u003c\/a\u003e",
    "truncated": false,
    "in_reply_to_status_id": null,
    "in_reply_to_status_id_str": null,
    "in_reply_to_user_id": null,
    "in_reply_to_user_id_str": null,
    "in_reply_to_screen_name": null,
    "user": {
        "id": 2230278991,
        "id_str": "2230278991",
        "name": "\u3077\u3088\u3077\u3088\u304a\u307e\u3051\u30dc\u30a4\u30b9bot",
        "screen_name": "puyo_cd_bot",
        "location": null,
        "url": "http:\/\/twpf.jp\/puyo_cd_bot",
        "description": "\u53ea\u4eca\u591a\u5fd9\u306e\u305f\u3081\u66f4\u65b0\u304c\u505c\u6ede\u3057\u3066\u3044\u307e\u3059\u3001\u3054\u4e86\u627f\u304f\u3060\u3055\u3044\u3002\u3077\u3088\u3077\u3088\u30c9\u30e9\u30decd\u306e\u304a\u307e\u3051\u30dc\u30a4\u30b9\u306e\u5b9a\u671f\u3064\u3076\u3084\u304d\u3001\u4e00\u90e8\u30ea\u30d7\u30e9\u30a4\u3067\u306e\u53cd\u5fdc\u3092\u8003\u3048\u3066\u3044\u307e\u3059\u3002\u975e\u516c\u5f0f\u3002\u767b\u9332\u6e08\u307f\u30ad\u30e3\u30e9\u306a\u3069\u8a73\u3057\u304f\u306f\u3064\u3044\u3077\u308d\u306b\u3066",
        "protected": false,
        "verified": false,
        "followers_count": 181,
        "friends_count": 115,
        "listed_count": 3,
        "favourites_count": 0,
        "statuses_count": 44139,
        "created_at": "Wed Dec 04 17:43:08 +0000 2013",
        "utc_offset": null,
        "time_zone": null,
        "geo_enabled": false,
        "lang": "ja",
        "contributors_enabled": false,
        "is_translator": false,
        "profile_background_color": "C0DEED",
        "profile_background_image_url": "http:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
        "profile_background_image_url_https": "https:\/\/abs.twimg.com\/images\/themes\/theme1\/bg.png",
        "profile_background_tile": false,
        "profile_link_color": "1DA1F2",
        "profile_sidebar_border_color": "C0DEED",
        "profile_sidebar_fill_color": "DDEEF6",
        "profile_text_color": "333333",
        "profile_use_background_image": true,
        "profile_image_url": "http:\/\/pbs.twimg.com\/profile_images\/378800000837231449\/beca2ed4c8ce917b37dcbe188d0f9e31_normal.jpeg",
        "profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/378800000837231449\/beca2ed4c8ce917b37dcbe188d0f9e31_normal.jpeg",
        "default_profile": true,
        "default_profile_image": false,
        "following": null,
        "follow_request_sent": null,
        "notifications": null
    },
    "geo": null,
    "coordinates": null,
    "place": null,
    "contributors": null,
    "is_quote_status": false,
    "retweet_count": 0,
    "favorite_count": 0,
    "entities": {
        "hashtags": [],
        "urls": [],
        "user_mentions": [],
        "symbols": []
    },
    "favorited": false,
    "retweeted": false,
    "filter_level": "low",
    "lang": "ja",
    "timestamp_ms": "1492300810659"
   }

Problem:

I tried to import this .json file into Elasticsearch with the following command:

curl -XPOST 'http://localhost:9200/twitter/tweet/1' --data-binary "@/Users/jz/Documents/elasticsearch-5.3.0/twitter.json" 

But it gave me this error:

**{"error":"Content-Type header [application/x-www-form-urlencoded] is not supported","status":406}**

I also tried the following command, which sets the Content-Type header, but it still failed:

curl --header "Content-Type:application/json"  -XPOST 'http://localhost:9200/twitter/tweet/1' --data-binary "@/Users/jz/Documents/elasticsearch-5.3.0/twitter.json" 

The error message is:

{"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"not_x_content_exception","reason":"Compressor detection can only be called on some xcontent bytes or compressed xcontent bytes"}},"status":400}

I also tried changing it to -d, but that failed again:

curl -XPOST 'http://localhost:9200/twitter/tweet/1' -d "@/Users/jz/Documents/elasticsearch-5.3.0/twitter.json" 

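The error message is the same as the one I got with --data-binary.

Update:

Since curl was too much trouble, I decided to use Python's elasticsearch library. After connecting to localhost successfully, I used something like this to index the sample data: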

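import json
from elasticsearch import Elasticsearch

# Connect to the local Elasticsearch node
es = Elasticsearch([{"host": "localhost", "port": 9200}])
with open('sample0.json') as json_data:
    json_docs = json.load(json_data)  # the file must contain one valid JSON array
    for json_doc in json_docs:
        # Use the document's own _id if present; otherwise let Elasticsearch assign one
        my_id = json_doc.pop('_id', None)
        es.index(index='testdata', doc_type='generated', id=my_id, body=json.dumps(json_doc))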

Error:

C:\Anaconda2\lib\json\decoder.pyc in raw_decode(self, s, idx)
    378
    379         try:
--> 380             obj, end = self.scan_once(s, idx)
    381         except StopIteration:
    382             raise ValueError("No JSON object could be decoded")

ValueError: Expecting , delimiter: line 1 column 2241 (char 2240)

Can anyone give me some guidance? Thanks!

【Comments】:

  • Could you provide the contents of the JSON file around line 1, char 2240?
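
One way to answer that comment is to let the parser report where it stopped and print the text around that offset. The sketch below assumes Python 3.5 or later, where `json.JSONDecodeError` exposes `pos`, `lineno` and `colno` (the Python 2 traceback in the question only carries the message). The helper name `show_json_error_context` and the `radius` parameter are made up for illustration.

```python
import json


def show_json_error_context(text, radius=40):
    """Return None if `text` is valid JSON, otherwise a dict describing
    where parsing failed and the characters surrounding that position."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        start = max(err.pos - radius, 0)
        end = err.pos + radius
        return {
            "message": err.msg,
            "line": err.lineno,
            "column": err.colno,
            "context": text[start:end],  # text around the failure offset
        }
    return None


# Two objects with no comma between them, similar in spirit to the
# "Expecting , delimiter" error reported in the question.
sample = '[{"id": 1} {"id": 2}]'
report = show_json_error_context(sample)
print(report)
```

For the real file, the same function could be called on the contents of `twitter.json` or `sample0.json`; the printed context should show which characters surround the reported column.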

Tags: javascript json parsing object elasticsearch


【Solution 1】:

You can supply a header to indicate that the request body is JSON:

curl --header "Content-Type:application/json" ...

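Alternatively, you can use "-d" instead of "--data-binary".

As explained here: https://stackoverflow.com/a/35213617/5520709, wrap your array as {"root":[...]} to get a valid JSON object.

Note that this indexes your entire JSON as a single document, which is probably not what you want. If you want to index one document per tweet, you may want to use the bulk API: https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html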
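
To make the trade-off concrete, here is a minimal Python sketch of the wrapping step. It assumes the tweets have already been loaded as a list of dicts; the function name `wrap_array_as_object` is hypothetical, and `root` is simply the key suggested in the linked answer.

```python
import json


def wrap_array_as_object(tweets, key="root"):
    """Serialize a list of tweets as one JSON object: {"root": [...]}.

    A document body sent to the index API has to be a JSON object, so a
    bare top-level array is rejected; wrapping it yields ONE document
    that contains every tweet.
    """
    return json.dumps({key: tweets})


# Tiny stand-in data; real tweets carry many more fields.
tweets = [{"id": 1, "text": "a"}, {"id": 2, "text": "b"}]
single_document = wrap_array_as_object(tweets)
print(single_document)
```

Because everything ends up in a single document, searching or aggregating per tweet becomes awkward, which is why the bulk API is usually the better fit for this dataset.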

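【Comments】:

  • Hi Damien, thanks for your answer, but neither option worked in my terminal. I have posted the error messages and updated my question above. Please take a look :)
  • OK, for this new error (not_x_content_exception) you can find the solution here: stackoverflow.com/a/35213617/5520709 (wrap your array as {root:[...]})
  • Thanks, Damien. I just tried it and it still doesn't work: {"error":{"root_cause":[{"type":"mapper_parsing_exception","reason":"failed to parse"}],"type":"mapper_parsing_exception","reason":"failed to parse","caused_by":{"type":"json_parse_exception","reason":"Unexpected character ('r' (code 114)): was expecting double-quote to start field name\n at [Source: org.elasticsearch.common.bytes.BytesReference$MarkSupportingStreamInputWrapper@72bbeb6b; line: 1, column: 3]"}},"status":400}
  • OK, try {"root":[...]} (the key must be in double quotes)
  • (Note that this indexes your entire JSON as a single document, which is probably not what you want. If you want to index one document per tweet, you may want to use the bulk API.)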
【Solution 2】:

You can turn the lines into an array by joining them with commas (,) and enclosing the result in square brackets ([]), like this:

'[' + s.join(',') + ']'

If you need to validate the lines individually, pass s[i] to the JSON.parse function rather than sTemp.
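
The same idea can be sketched in Python, the language the question later switches to. This assumes each element of `lines` holds exactly one JSON object as text; the helper names `lines_to_array` and `find_invalid_lines` are made up for illustration.

```python
import json


def lines_to_array(lines):
    """Join one-object-per-line JSON text into a single JSON array string."""
    return "[" + ",".join(lines) + "]"


def find_invalid_lines(lines):
    """Return the indexes of the lines that fail to parse on their own."""
    bad = []
    for i, line in enumerate(lines):
        try:
            json.loads(line)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            bad.append(i)
    return bad


lines = ['{"id": 1}', '{"id": 2}', '{"id": 3']  # the last line is truncated
print(find_invalid_lines(lines))
print(json.loads(lines_to_array(lines[:2])))
```

Checking each line separately narrows a parse failure down to one tweet instead of one offset in a very long string.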

Update

If you need to build a payload to send to ElasticSearch, convert your list of JSON objects into a file like this:

{"index":{"_index":"my_index","_type":"tweet","_id":null}}
{"created_at":"Sun Apr 16 00:00:10 +0000 2017","id":1, ... }
{"index":{"_index":"my_index","_type":"tweet","_id":null}}
{"created_at":"Sun Apr 16 00:00:10 +0000 2017","id":2, ... }

Then pass its contents to ElasticSearch. Each document line must be preceded by an action line that tells ElasticSearch to index that document:

{"index":{"_index":"my_index","_type":"tweet","_id":null}}

See https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
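
The preprocessing step could be sketched in Python as follows. This is only an illustration under a few assumptions: the tweets are already loaded as a list of dicts, the function name `build_bulk_body` is hypothetical, and `_id` is left out of the action line so that ElasticSearch assigns an ID itself (the answer's sample writes `"_id":null` instead). The `_type` field matches the 5.x setup in the question; newer releases deprecate and later drop mapping types.

```python
import json


def build_bulk_body(tweets, index="my_index", doc_type="tweet"):
    """Build a newline-delimited bulk payload: one action line followed by
    one document line per tweet, ending with the required final newline."""
    lines = []
    for tweet in tweets:
        # Action line; no "_id" here, so an ID is generated on the server side.
        action = {"index": {"_index": index, "_type": doc_type}}
        lines.append(json.dumps(action))
        # json.dumps never emits raw newlines, so each tweet stays on one line.
        lines.append(json.dumps(tweet))
    return "\n".join(lines) + "\n"


tweets = [{"id": 1, "lang": "ja"}, {"id": 2, "lang": "ja"}]
body = build_bulk_body(tweets)
print(body)
```

The result can be written to a file and posted to the `_bulk` endpoint with curl. `--data-binary` matters there because `-d` strips newlines, and the request should carry `Content-Type: application/x-ndjson`.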

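【Comments】:

  • Hi @Random, thanks for the prompt answer! Very helpful. I have now changed my text to something like [ {...}, {...}] and it worked. However, another problem came up. Please see my updated question!
  • It looks like you have several separate problems here. If you need to put a mapping, look here first: elastic.co/guide/en/elasticsearch/reference/current/…; if you need to insert data, look at the bulk API. Otherwise, please describe the current problem in more detail.
  • Thanks for your comment. Yes, the problem is with the tweets. Each line is a nested JSON object. Since command-line curl is giving me so many problems, I wonder whether I can use the elasticsearch library in Python instead? Please see my updated post above. Thanks again!
  • curl works fine for this, but you need to preprocess the file to produce the input ElasticSearch expects. See the updated answer.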