【Question title】:Reading JSON bytes literal in string format
【Posted】:2021-07-31 15:24:02
【Question description】:

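I am reading a previously generated jsonl file. However, I somehow saved the encoded bytes literal, so each line shows up as 'b'{"foo": "Don\\u2019t", "bar": "bar"}', whose type is str, rather than b'{"foo": "Don\\u2019t", "bar": "bar"}', which is a bytes literal. How can I load it as the dictionary {"foo": "Don't", "bar": "bar"}?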

Edit: I did exactly what @snakecharmerb suggested, but when loading {"created_at": "Fri Jan 24 03:22:44 +0000 2020", "id": 1220547456024416256, "id_str": "1220547456024416256", "full_text": "@Aviation_Intel Don\\u2019t forget the Wuhan virus \\ud83e\\udda0", "truncated": false, "display_text_range": [16, 46], "entities": {"hashtags": [], "symbols": [], "user_mentions": [{"screen_name": "Aviation_Intel", "name": "Tyler Rogoway", "id": 613212190, "id_str": "613212190", "indices": [0, 15]}], "urls": []}, "source": "<a href=\\"http://twitter.com/download/iphone\\" rel=\\"nofollow\\">Twitter for iPhone</a>", "in_reply_to_status_id": 1220388498873573376, "in_reply_to_status_id_str": "1220388498873573376", "in_reply_to_user_id": 613212190, "in_reply_to_user_id_str": "613212190", "in_reply_to_screen_name": "Aviation_Intel", "user": {"id": 2415399587, "id_str": "2415399587", "name": "Te Sheng Lin \\ud83c\\uddfa\\ud83c\\uddf8", "screen_name": "teshen8lin", "location": "United States", "description": "Opinions Are My Own. You should not treat any opinion expressed by me as a specific inducement to make a particular investment or follow a particular strategy.", "url": null, "entities": {"description": {"urls": []}}, "protected": false, "followers_count": 209, "friends_count": 688, "listed_count": 5, "created_at": "Sun Mar 16 14:58:14 +0000 2014", "favourites_count": 5191, "utc_offset": null, "time_zone": null, "geo_enabled": true, "verified": false, "statuses_count": 14372, "lang": null, "contributors_enabled": false, "is_translator": false, "is_translation_enabled": false, "profile_background_color": "C0DEED", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_tile": false, "profile_image_url": "http://pbs.twimg.com/profile_images/1364327906210627585/6k3p7DI5_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/1364327906210627585/6k3p7DI5_normal.jpg", "profile_banner_url": "https://pbs.twimg.com/profile_banners/2415399587/1614119635", "profile_image_extensions_alt_text": null, "profile_banner_extensions_alt_text": null, "profile_link_color": "1DA1F2", "profile_sidebar_border_color": "C0DEED", "profile_sidebar_fill_color": "DDEEF6", "profile_text_color": "333333", "profile_use_background_image": true, "has_extended_profile": false, "default_profile": true, "default_profile_image": false, "following": false, "follow_request_sent": false, "notifications": false, "translator_type": "none"}, "geo": null, "coordinates": null, "place": {"id": "fa9e955670752b3c", "url": "https://api.twitter.com/1.1/geo/id/fa9e955670752b3c.json", "place_type": "city", "name": "Tenafly", "full_name": "Tenafly, NJ", "country_code": "US", "country": "United States", "contained_within": [], "bounding_box": {"type": "Polygon", "coordinates": [[[-73.9845718, 40.899734], [-73.927398, 40.899734], [-73.927398, 40.937822], [-73.9845718, 40.937822]]]}, "attributes": {}}, "contributors": null, "is_quote_status": false, "retweet_count": 0, "favorite_count": 0, "favorited": false, "retweeted": false, "lang": "en"}, json.loads seems to fail, reporting JSONDecodeError: Expecting ',' delimiter: line 1 column 462 (char 461)

【Discussion】:

    Tags: python json unicode encode


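    【Solution 1】:

    For a one-off case, the data can be fixed by removing the leading b' and the trailing ' and changing the double backslashes (as in the fragment below) to single backslashes: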

    "<a href=\\"http://twitter.com/download/iphone\\" rel=\\"nofollow\\">
    
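The manual fix described above could be sketched roughly as follows. The sample line and the helper name `manual_fix` are made up for illustration, and the approach assumes the bytes contained only ASCII characters and no single quotes; anything else would show up as extra escapes that this sketch does not handle.

```python
import json

# Hypothetical sample line: a str holding the repr of a bytes object.
bad_line = r"""b'{"foo": "Don\\u2019t", "src": "<a href=\\"x\\">"}'"""


def manual_fix(line: str) -> str:
    """Strip the b'...' wrapper and collapse each doubled backslash."""
    if not (line.startswith("b'") and line.endswith("'")):
        raise ValueError("line is not a stringified bytes literal")
    inner = line[2:-1]
    # str.replace scans left to right without overlapping, so every
    # pair of backslashes becomes a single backslash.
    return inner.replace("\\\\", "\\")


# Stripping the wrapper alone is not enough: \\" ends the JSON string early.
try:
    json.loads(bad_line[2:-1])
    naive_failed = False
except json.JSONDecodeError:
    naive_failed = True

data = json.loads(manual_fix(bad_line))
print(naive_failed, data)
```

Leaving the doubled backslashes in place is what produces an "Expecting ',' delimiter" style error on fields such as the href attribute.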

    This can be done programmatically by using ast.literal_eval to convert the stringified bytes into a bytes instance:

    import ast, json
    
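    # bad_data: one line of the file, a str containing the repr of a bytes object
    bs = ast.literal_eval(bad_data)   # str -> bytes
    json_data = bs.decode('utf-8')    # bytes -> str (UTF-8 is an assumption)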
    data = json.loads(json_data)
    

    Note that the encoding of the bytes is assumed; it might not be UTF-8.
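
To apply the same idea to every line of a file, something along these lines might work. The function name `load_stringified_jsonl` and the in-memory sample are illustrative only, and the encoding is a parameter because UTF-8 is just a guess.

```python
import ast
import io
import json


def load_stringified_jsonl(lines, encoding="utf-8"):
    """Yield one dict per line; each line is the str() of a bytes object."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        value = ast.literal_eval(line)  # safely evaluates the b'...' literal
        if not isinstance(value, bytes):
            raise ValueError(f"line {lineno}: expected a bytes literal")
        yield json.loads(value.decode(encoding))


# Stand-in for an open file; a real call would pass open(path, encoding=...).
sample = io.StringIO(
    r"""b'{"foo": "Don\\u2019t", "bar": "bar"}'
b'{"foo": "x", "bar": "y"}'
"""
)
records = list(load_stringified_jsonl(sample))
print(records)
```

Because ast.literal_eval only evaluates literals, it is a safer choice here than eval for data of unknown origin.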

    Both of these approaches are workarounds; the real solution is to fix the upstream program that generates this malformed data.
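
The upstream program is not shown in the question, so the following is only a guess at how such lines usually arise (calling str on bytes before writing) and at what a corrected writer might do instead. All variable names are hypothetical.

```python
import json

record = {"foo": "Don\u2019t", "bar": "bar"}
encoded = json.dumps(record).encode("utf-8")  # bytes, e.g. from an API client

# Likely cause of the malformed file: str() of bytes yields its repr.
broken_line = str(encoded)

# Corrected version: decode the bytes so the line is plain JSON text.
correct_line = encoded.decode("utf-8")

print(broken_line)
print(correct_line)
```

Writing `correct_line` followed by a newline gives a file that json.loads can read line by line with no workaround.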

    Note that you can invoke Python with the -b or -bb flag to emit a warning or raise an exception, respectively, when str is called on a bytes instance.

    $ python -b -c 'str(b"")'
    <string>:1: BytesWarning: str() on a bytes instance
    $ python -bb -c 'str(b"")'
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
    BytesWarning: str() on a bytes instance
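
The same check can be scripted, for example in a test suite. This sketch assumes the current interpreter is available through `sys.executable`; the helper name `run_with_flag` is made up.

```python
import subprocess
import sys


def run_with_flag(flag):
    """Run str(b"") in a child interpreter started with the given flag."""
    return subprocess.run(
        [sys.executable, flag, "-c", 'str(b"")'],
        capture_output=True,
        text=True,
    )


warned = run_with_flag("-b")    # warning on stderr, normal exit
raised = run_with_flag("-bb")   # BytesWarning raised as an exception
print(warned.returncode, raised.returncode)
```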
    

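    【Comments】:

    • This is extremely useful! Thank you!
    【Solution 2】:

    Edit: As another answer notes, you can load the bytes data directly with json.loads(), as described in the docs.

    If I understand correctly, you should decode the bytes data to a string and then load it with json.loads(). You will then have a dictionary. Something like this: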

    import json
    
    json_string = b'{"foo": "Don\\u2019t", "bar": "bar"}'.decode("utf-8")
    dictionary = json.loads(json_string)
    

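    【Comments】:

      【Solution 3】:

      Unless I am missing something, or there is something you have not mentioned in the question, you can feed the bytes directly to json.loads to build the dictionary: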

      import json
      
      jstring = b'{"foo": "Don\\u2019t", "bar": "bar"}'
      print(json.loads(jstring))
      

      Output

      {'foo': 'Don’t', 'bar': 'bar'}
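
The confusion in this thread seems to come from the difference between a real bytes object and a str that merely looks like one. A small sketch of the contrast, with illustrative variable names:

```python
import json

as_bytes = b'{"foo": "Don\\u2019t", "bar": "bar"}'
as_text = str(as_bytes)  # what a line of the question's file looks like

parsed = json.loads(as_bytes)  # bytes input is accepted directly

try:
    json.loads(as_text)  # starts with b' so it is not valid JSON
    text_failed = False
except json.JSONDecodeError:
    text_failed = True

print(parsed, text_failed)
```

The bytes input parses, while the stringified form is rejected, which is why it has to be converted back to bytes first.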
      

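      【Comments】:

      • Thanks for your answer, but it seems I did not state my problem clearly. Please see my edited question. Thanks!
      • I have read your edited question, but it still does not make sense. You have a bytes object and you want it as a dictionary. That is exactly what json.loads does.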