Read top-level JSON dictionary incrementally using Python ijson: answers

【Question title】: Read top-level JSON dictionary incrementally using Python ijson
【Posted】: 2021-05-08 21:39:02
【Question】:

I have the following data in my JSON file:

{
    "first": {
        "name": "James",
        "age": 30
    },
    "second": {
        "name": "Max",
        "age": 30
    },
    "third": {
        "name": "Norah",
        "age": 30
    },
    "fourth": {
        "name": "Sam",
        "age": 30
    }
}

I want to print the top-level keys and their objects as follows:

import json
import ijson

fname = "data.json"

with open(fname) as f:
    raw_data = f.read()

data = json.loads(raw_data)

for k in data.keys():
    print k, data[k]

Output:

second {u'age': 30, u'name': u'Max'}
fourth {u'age': 30, u'name': u'Sam'}
third {u'age': 30, u'name': u'Norah'}
first {u'age': 30, u'name': u'James'}

So far, so good. However, to do the same for a huge file I would have to read it all into memory, which is very slow and requires a lot of memory.

I want to use an incremental JSON parser (ijson in this case) to achieve what I described earlier.

The code below is taken from: No access to top level elements with ijson?

with open(fname) as f:
    json_obj = ijson.items(f,'').next()  # '' loads everything as only one object.
    for (key, value) in json_obj.items():
        print key + " -> " + str(value)

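This is not suitable either, because it also reads the whole file into memory. It is not truly incremental.

How can I incrementally parse the top-level keys and their corresponding objects of a JSON file in Python?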

【Question comments】:

Tags: python json ijson

【Solution 1】:

Since version 2.6, ijson comes with a kvitems function that achieves exactly this.

【Comments】:

【Solution 2】:

Answer taken from a GitHub issue [file name changed]

import ijson
from ijson.common import ObjectBuilder


def objects(file):
    key = '-'
    for prefix, event, value in ijson.parse(file):
        if prefix == '' and event == 'map_key':  # found new object at the root
            key = value  # mark the key value
            builder = ObjectBuilder()
        elif prefix.startswith(key):  # while at this key, build the object
            builder.event(event, value)
            # Only the end_map whose prefix equals the key closes the top-level
            # value; nested objects end with a longer prefix such as 'first.x'
            if prefix == key and event == 'end_map':
                yield key, builder.value


with open('data.json', 'rb') as f:  # binary mode; closed once the loop finishes
    for key, value in objects(f):
        print(key, value)

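【Comments】:

【Solution 3】:

Since JSON files are essentially text files, consider stripping out the top level as a string. Basically, use the read-file iterable approach: concatenate each line onto a string, then break out of the loop once the string contains the double brace }} that signals the end of the top level. Of course, the double-brace check must first strip spaces and line breaks.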

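import json
import re

toplevelstring = ''

with open('data.json') as f:
    for line in f:
        # Stop once the text read so far, ignoring all whitespace, contains
        # '}}', which marks the end of the first top-level object
        if '}}' not in re.sub(r'\s+', '', toplevelstring):
            toplevelstring = toplevelstring + line
        else:
            break

data = json.loads(toplevelstring)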

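Now, if your larger JSON is wrapped in square brackets or other braces, still run the routine above, but add the line below to slice off the first character, [, and the last two characters, the comma and line break that follow the top level's closing brace: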

[{
    "first": {
        "name": "James",
        "age": 30
    },
    "second": {
        "name": "Max",
        "age": 30
    },
    "third": {
        "name": "Norah",
        "age": 30
    },
    "fourth": {
        "name": "Sam",
        "age": 30
    }
},
{
    "data1": {
        "id": "AAA",
        "type": 55
    },
    "data2": {
        "id": "BBB",
        "type": 1601
    },
    "data3": {
        "id": "CCC",
        "type": 817
    }
}]

...

toplevelstring = toplevelstring[1:-2]
data = json.loads(toplevelstring)

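【Comments】:

The incoming file may not be in this format. Also, the whole idea is not to load the entire file into memory at once.
Please show the other formats. You can add if conditions for every possible combination of square brackets and braces. Also, see the accepted answer in the link. The iterable approach uses buffered I/O and memory management, so you don't have to worry about large files. In addition, the loop breaks once the top level has been parsed, so it never reads through to the last line.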