How to find unique values in a large JSON file? Answers

【Question title】: How to find unique values in a large JSON file?
【Posted】: 2014-01-04 08:19:01
【Question】:

I have two JSON files, data_large (150.1 MB) and data_small (7.5 KB). The content of each file has the form [{"score": 68},{"score": 78}]. I need to find the list of unique scores in each file.

When working with data_small, I did the following and was able to view its contents in 0.1 secs.

import json

with open('data_small') as f:
    content = json.load(f)

print(content)  # I'll be applying the logic to find the unique values later.

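But when working with data_large, I did the following and my system hung and became so slow that I had to force it to shut down to get it back to normal speed. It took about 2 mins to print the contents.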

import json

with open('data_large') as f:
    content = json.load(f)

print(content)  # I'll be applying the logic to find the unique values later.

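How can I make my program more efficient when working with large data sets?

【Comments】:

For large json files, see: stackoverflow.com/questions/10382253/… That answer suggests ijson.
@vinod - Can't I use a Python built-in library?
The built-in json library loads the whole file at once. If you need to iterate over it, you will have to parse the json file manually or simply install a library like ijson.
@python-coder Just comment out the print statement and run your program with data_large.
@thefourtheye - I commented out the print statement, but I had to force my system to shut down again. God, you'll break my system.

Tags: python json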

【Solution 1】:

Since your json file is not that large and you can afford to load it into RAM all at once, you can get all the unique values like this:

import json

with open('data_large') as f:
    content = json.load(f)

# Do not print content: writing that much text to stdout is very slow.

# Collect the unique values.
values = set()
for item in content:
    values.add(item['score'])

# The loop above uses less memory than the line below, which first
# builds a list of every value and only then reduces it to unique values.
values = set([i['score'] for i in content])

# Saving the results to a file is faster than printing them.
# Text mode ('w') works on both Python 2 and Python 3.
with open('results.json', 'w') as fid:
    # json cannot serialize sets, hence the conversion to a list.
    json.dump(list(values), fid)

If you need to process even larger files, look for other libraries that can parse a json file iteratively.
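As a rough illustration of iterative parsing without a third-party library, the sketch below uses json.JSONDecoder.raw_decode from the standard library. The helper name iter_array_items, the chunk_size parameter and the sample data are assumptions made for this example and are not part of the original answer. It assumes Python 3 and a top-level array whose elements are objects; a bare number split across a chunk boundary could be decoded early with the wrong value.

```python
import io
import json


def iter_array_items(fileobj, chunk_size=65536):
    """Yield the elements of a top-level JSON array one at a time.

    Hypothetical helper: assumes the elements are JSON objects.
    """
    decoder = json.JSONDecoder()
    buf = fileobj.read(chunk_size).lstrip()
    if not buf.startswith('['):
        raise ValueError('expected a top-level JSON array')
    pos = 1  # position just after the opening '['
    while True:
        # Skip whitespace and the commas that separate elements.
        while pos < len(buf) and buf[pos] in ' \t\r\n,':
            pos += 1
        if pos < len(buf) and buf[pos] == ']':
            return  # end of the array
        try:
            obj, next_pos = decoder.raw_decode(buf, pos)
        except ValueError:
            # The element is incomplete (or the buffer is used up):
            # drop what was already consumed and read more text.
            more = fileobj.read(chunk_size)
            if not more:
                raise ValueError('unexpected end of JSON input')
            buf = buf[pos:] + more
            pos = 0
            continue
        yield obj
        pos = next_pos


# Small in-memory stand-in for a large file (sample data is made up).
sample = io.StringIO('[{"score": 68}, {"score": 78},\n {"score": 68}]')
scores = {item['score'] for item in iter_array_items(sample, chunk_size=8)}
print(sorted(scores))
```

For a real file, pass the object returned by open('data_large') instead of the StringIO sample. Only one chunk plus the set of unique scores is held in memory at a time.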

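【Comments】:

Using a generator expression in the second approach avoids creating a temporary array (actually a list) of all the values. Just use values = set(i['score'] for i in content).
Thanks. I didn't know that.
It took 201 secs to print the unique values. Although content = ijson.items(f, 'item') loads quickly, print set(i['score'] for i in content) actually takes a long time. Can this be made more efficient?
If there are many values to print, it will always take quite a while... It is better to dump the results back into a file.
@python-coder: Did you try it with set([i['score'] for i in content])? Although that creates a temporary list, it may be faster, because a generator expression trades execution time for memory usage. On the other hand, it may not matter much, since the bottleneck is most likely the printing of all the characters, however they are generated. So maki725's suggestion of writing them to a file would be the fastest way to output the results, provided that is what you ultimately want to achieve.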
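The variants discussed in these comments can be put side by side in a short sketch. The sample data and the output location are made up for the example, and the code targets Python 3; which variant is faster on a given machine is not claimed here.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the parsed file contents.
content = [{"score": n % 100} for n in range(100000)]

# List comprehension: builds a full temporary list, then the set.
via_list = set([item['score'] for item in content])

# Generator expression: feeds values to set() one at a time,
# so no temporary list is created.
via_generator = set(item['score'] for item in content)

# Set comprehension: same result, written more directly.
via_set_comp = {item['score'] for item in content}

# Write the result to a file instead of printing it. Sets are not
# JSON-serializable, so convert to a (sorted) list first.
out_path = os.path.join(tempfile.mkdtemp(), 'results.json')
with open(out_path, 'w') as fid:
    json.dump(sorted(via_generator), fid)
```

All three expressions produce the same set; they differ only in whether an intermediate list exists while the set is being built.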

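【Solution 2】:

If you want to iterate over the JSON file in smaller chunks to conserve RAM, I suggest the approach below, based on your comment that you do not want to use ijson for this. It only works because your sample input data is so simple and consists of an array of dictionaries with one key and one value each. More complex data would complicate things, and at that point I would use an actual JSON streaming library.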

import json

bytes_to_read = 10000
unique_scores = set()

# Binary mode: Python 3 only allows relative seeks on binary files.
with open('tmp.txt', 'rb') as f:
    chunk = f.read(bytes_to_read)
    while chunk:
        # Find indices of dictionaries in chunk.
        if b'{' not in chunk:
            break
        opening = chunk.index(b'{')
        ending = chunk.rindex(b'}')

        # Load JSON and collect scores.
        score_dicts = json.loads('[' + chunk[opening:ending + 1].decode('utf-8') + ']')
        for s in score_dicts:
            # Each dict has exactly one value; works on Python 2 and 3.
            unique_scores.update(s.values())

        # Rewind to just after the last complete dict, then read the next chunk.
        f.seek(-(len(chunk) - ending) + 1, 1)
        chunk = f.read(bytes_to_read)

print(unique_scores)

【Comments】:

Well, I tried it, but it still takes a long time to print the unique values. f = open('data_large') content = ijson.items(f, 'item') print set(i['score'] for i in content)