Reading a JSON bytes literal in string format: answers

【Question title】: Reading JSON bytes literal in string format
【Posted】: 2021-07-31 15:24:02
【Question description】:

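I am reading a JSONL file that I generated earlier. However, I somehow saved the encoded bytes literal, so it shows up as 'b'{"foo": "Don\\u2019t", "bar": "bar"}', whose type is str, rather than b'{"foo": "Don\\u2019t", "bar": "bar"}', which is a bytes literal. How can I load it as the dictionary {"foo": "Don't", "bar": "bar"}?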

Edit: I did exactly what @snakecharmerb did, but when loading {"created_at": "Fri Jan 24 03:22:44 +0000 2020", "id": 1220547456024416256, "id_str": "1220547456024416256", "full_text": "@Aviation_Intel Don\\u2019t forget the Wuhan virus \\ud83e\\udda0", "truncated": false, "display_text_range": [16, 46], "entities": {"hashtags": [], "symbols": [], "user_mentions": [{"screen_name": "Aviation_Intel", "name": "Tyler Rogoway", "id": 613212190, "id_str": "613212190", "indices": [0, 15]}], "urls": []}, "source": "<a href=\\"http://twitter.com/download/iphone\\" rel=\\"nofollow\\">Twitter for iPhone</a>", "in_reply_to_status_id": 1220388498873573376, "in_reply_to_status_id_str": "1220388498873573376", "in_reply_to_user_id": 613212190, "in_reply_to_user_id_str": "613212190", "in_reply_to_screen_name": "Aviation_Intel", "user": {"id": 2415399587, "id_str": "2415399587", "name": "Te Sheng Lin \\ud83c\\uddfa\\ud83c\\uddf8", "screen_name": "teshen8lin", "location": "United States", "description": "Opinions Are My Own. You should not treat any opinion expressed by me as a specific inducement to make a particular investment or follow a particular strategy.", "url": null, "entities": {"description": {"urls": []}}, "protected": false, "followers_count": 209, "friends_count": 688, "listed_count": 5, "created_at": "Sun Mar 16 14:58:14 +0000 2014", "favourites_count": 5191, "utc_offset": null, "time_zone": null, "geo_enabled": true, "verified": false, "statuses_count": 14372, "lang": null, "contributors_enabled": false, "is_translator": false, "is_translation_enabled": false, "profile_background_color": "C0DEED", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "profile_background_tile": false, "profile_image_url": "http://pbs.twimg.com/profile_images/1364327906210627585/6k3p7DI5_normal.jpg", "profile_image_url_https": "https://pbs.twimg.com/profile_images/1364327906210627585/6k3p7DI5_normal.jpg", "profile_banner_url": "https://pbs.twimg.com/profile_banners/2415399587/1614119635", "profile_image_extensions_alt_text": null, "profile_banner_extensions_alt_text": null, "profile_link_color": "1DA1F2", "profile_sidebar_border_color": "C0DEED", "profile_sidebar_fill_color": "DDEEF6", "profile_text_color": "333333", "profile_use_background_image": true, "has_extended_profile": false, "default_profile": true, "default_profile_image": false, "following": false, "follow_request_sent": false, "notifications": false, "translator_type": "none"}, "geo": null, "coordinates": null, "place": {"id": "fa9e955670752b3c", "url": "https://api.twitter.com/1.1/geo/id/fa9e955670752b3c.json", "place_type": "city", "name": "Tenafly", "full_name": "Tenafly, NJ", "country_code": "US", "country": "United States", "contained_within": [], "bounding_box": {"type": "Polygon", "coordinates": [[[-73.9845718, 40.899734], [-73.927398, 40.899734], [-73.927398, 40.937822], [-73.9845718, 40.937822]]]}, "attributes": {}}, "contributors": null, "is_quote_status": false, "retweet_count": 0, "favorite_count": 0, "favorited": false, "retweeted": false, "lang": "en"}, json.loads seems to fail with JSONDecodeError: Expecting ',' delimiter: line 1 column 462 (char 461)

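【Question discussion】:

Tags: python json unicode encode

【Solution 1】:

For a one-off case, the data can be fixed by removing the leading b' and the trailing ' and changing the double backslashes to single backslashes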

"<a href=\\"http://twitter.com/download/iphone\\" rel=\\"nofollow\\">

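A minimal sketch of that manual repair, assuming the bad line was produced by calling str() on a bytes object. The sample value and the key "q" are made up for illustration:

```python
import json

# Hypothetical sample: a str that looks like a bytes literal, as in the question.
bad_data = str(b'{"foo": "Don\\u2019t", "q": "say \\"hi\\""}')

# Drop the leading b' and the trailing ' ...
inner = bad_data[2:-1]
# ... then turn every doubled backslash into a single one.
fixed = inner.replace("\\\\", "\\")

data = json.loads(fixed)
print(data)
```

This textual repair is fragile. An apostrophe in the data appears as \' in the stringified form, which is not a valid JSON escape, so such a line would still fail to parse.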
The same repair can be done programmatically by using ast.literal_eval to convert the stringified bytes into a bytes instance:

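import ast
import json

# bad_data => a single line of the JSONL file (a str that looks like b'...')
bs = ast.literal_eval(bad_data)   # stringified bytes -> bytes
json_data = bs.decode('utf-8')    # assumes the bytes are UTF-8
data = json.loads(json_data)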

Note that the encoding of the bytes is an assumption; it may not be UTF-8.
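To apply this to a whole file, the conversion can be wrapped in a small generator. This is only a sketch: the function name load_stringified_jsonl and the sample lines are hypothetical, and UTF-8 is assumed unless another encoding is passed in.

```python
import ast
import json


def load_stringified_jsonl(lines, encoding="utf-8"):
    """Yield one parsed object per non-empty line of stringified-bytes JSON."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        raw = ast.literal_eval(line)  # the str "b'...'" becomes a bytes object
        if not isinstance(raw, bytes):
            raise ValueError("expected a stringified bytes literal")
        yield json.loads(raw.decode(encoding))


# Made-up sample lines standing in for the contents of the JSONL file.
sample_lines = [
    str(b'{"foo": "Don\\u2019t", "bar": "bar"}') + "\n",
    str(b'{"foo": "it\'s", "bar": "baz"}') + "\n",
]
records = list(load_stringified_jsonl(sample_lines))
print(records)
```

Because the function only iterates over its argument, an open text file can be passed in place of the list.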

Both of these approaches are workarounds. The real solution is to fix the upstream program that generates this malformed data.
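The upstream code is not shown in the question, so the following is a guess at how such lines could arise and how the writer could avoid them. The record is made up.

```python
import json

record = {"foo": "Don\u2019t", "bar": "bar"}  # hypothetical record

encoded = json.dumps(record).encode("utf-8")

# Suspected bug (an assumption about the upstream code): str() on a bytes
# object returns its repr, including the b'' wrapper and doubled backslashes.
bad_line = str(encoded)

# Decoding the bytes (or writing the json.dumps() result directly)
# yields plain JSON text instead.
good_line = encoded.decode("utf-8")

print(bad_line)
print(good_line)
```

Writing good_line followed by a newline for each record gives a file that json.loads can read line by line without any repair step.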

Note that you can invoke Python with the -b or -bb flag to emit a warning or raise an exception, respectively, whenever str is called on a bytes instance.

$ python -b -c 'str(b"")'
<string>:1: BytesWarning: str() on a bytes instance
$ python -bb -c 'str(b"")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
BytesWarning: str() on a bytes instance

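【Discussion】:

This is really helpful! Thanks!

【Solution 2】:

Edit: As noted in another answer, you can load the bytes data directly with json.loads(), as described in the docs.

If I understand correctly, you should decode the bytes data to a string and then load it with json.loads(). You will then have a dictionary. Something like this: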

import json

json_string = b'{"foo": "Don\\u2019t", "bar": "bar"}'.decode("utf-8")
dictionary = json.loads(json_string)

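【Discussion】:

【Solution 3】:

Unless I am missing something, or there is something you have not stated in the question, you can feed the bytes to json.loads to build the dictionary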

import json

jstring = b'{"foo": "Don\\u2019t", "bar": "bar"}'
print(json.loads(jstring))

Output

{'foo': 'Don’t', 'bar': 'bar'}
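For contrast, a small sketch of the distinction the question hinges on: json.loads accepts a real bytes object, but not the str that results from stringifying one. The variable names are illustrative.

```python
import json

as_bytes = b'{"foo": "Don\\u2019t", "bar": "bar"}'  # a real bytes object
as_text = str(as_bytes)                             # the str "b'{...}'" from the question

# Works: json.loads accepts str, bytes, or bytearray input.
parsed = json.loads(as_bytes)

try:
    json.loads(as_text)  # the text starts with b' and is not valid JSON
    text_parsed = True
except json.JSONDecodeError:
    text_parsed = False

print(parsed, text_parsed)
```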

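【Discussion】:

Thanks for your answer, but it seems I did not state my question clearly. Please see my edited question. Thanks!
I have read your edited question, but it still does not make sense. You have a bytes object and you want it as a dictionary. That is exactly what json.loads does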