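Best way to combine JSON files into one: answers

【Question title】: Best way to combine json files into one
【Posted】: 2019-12-19 00:18:36
【Question description】:

I have JSON files with the same names in different folders, in the folder structure shown below: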

folder1/
    file1.json
    file2.json
    file3.json
folder2/
    file1.json
    file2.json
    file3.json
    file4.json
folder3/
    file1.json
    file2.json
    file3.json
    file4.json
    file5.json
....

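What is the best way to combine the JSON files available across all folders to create a single JSON file? The keys in file1.json are unique across all the folders in which it exists.

So far I have come up with the approaches below, but they feel slow, as each JSON file is around 5 MB.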

import json
from pathlib import Path

output_dir = Path(location_of_output_folder)
output_dir.mkdir(parents=True, exist_ok=True)

# find all the folders
root_dir = Path(root_location_for_folders)
folders = [fld for fld in root_dir.iterdir() if fld.is_dir()]

# find all the unique file names
all_filenames = []
for fld in folders:
    for f in fld.glob('*.json'):
        all_filenames.append(f.name)


## Approach 1
# Join file that possibly exists across all the folders by creating empty list
for f in list(set(all_filenames)):
    f_data = []

    for fld in folders:
        if (fld / f).is_file():
           with open(fld /f, 'r') as fp:
               f_data.append(json.load(fp))

    with open(output_dir / f, 'w') as fp:
        json.dump(f_data, fp, indent=4)


## Approach 2
# Join file that possibly exists across all the folders by creating empty dict
for f in list(set(all_filenames)):
    f_data = {}

    for fld in folders:
        if (fld / f).is_file():
           with open(fld /f, 'r') as fp:
               f_data.update(json.load(fp))

    with open(output_dir / f, 'w') as fp:
        json.dump(f_data, fp, indent=4)

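Is there a better (faster) way? I am only concerned about time and am not interested in a pythonic solution.

Thanks

Update #1: Files with the same file name should be merged. Sorry if I was unclear. Each file has a few keys (l1, l2, l3, l4) that are similar across all files.

Example

a. Structure of file1.json in folder1

a. Structure of file2.json in folder2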

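【Comments】:

I don't think there is anything better than loading all the files and then dumping the result. Trying to do it by merging the files directly won't work, because it doesn't respect JSON syntax.
Can you share some sample JSON? "I am only concerned about time and am not interested in a pythonic solution": huh, why is that?
Wait, do you want to merge the files that have the same name? That is not clear from your post.
@AMC I updated the question to include more information.
Sorry, but it is still unclear: 'file1.json' in 'folder1' has keys 'k1, k2..' and 'file2.json' in 'folder2' has different keys. OK, but does 'file1.json' in 'folder2' and the other folders have the same keys as the first 'file1.json'? If so, what does "merge" mean: should all the values be concatenated for each corresponding key?

Tags: python json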

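【Solution 1】:

You don't need to parse the input JSON files; just read them as text files, which is much faster (basically one system call per file). Then combine them into one global JSON list by adding [ at the beginning, ] at the end, and a , after the content of every file except the last. Granted, the lines will not be indented for the level-0 list, but who cares? Here is a skeleton implementation: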

infiles = [...] # the whole list of input JSON files
outfile = 'out.json'

with open(outfile,'w') as o:
    o.write('[')
    for infile in infiles[:-1]: # loop over all files except the last one
        with open(infile,'r') as i:
            o.write(i.read().strip() + ',\n')
    with open(infiles[-1], 'r') as i: # special treatment for the last file
        o.write(i.read().strip() + ']\n')

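Note that this implementation holds the input files in RAM one at a time, so, unlike the other approaches, it can easily handle a very long list of files.

One last point: if you really want all the inner lines to be indented, you can simply read each file line by line (using the file's readline() method) and prefix each line with 4 spaces before writing it to the output file. But you will lose some performance...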
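The indentation idea can be sketched as follows. This is only an illustration: the function name `concat_indented` and its `prefix` parameter are made up here, and it splits the text with `splitlines()` instead of calling `readline()` in a loop.

```python
def concat_indented(infiles, outfile, prefix="    "):
    """Write the contents of infiles as one JSON list, indenting every inner line.

    Hypothetical helper: the files are still treated as plain text, never parsed.
    """
    with open(outfile, "w") as out:
        out.write("[\n")
        for idx, infile in enumerate(infiles):
            with open(infile, "r") as src:
                lines = src.read().strip().splitlines()
            # Prefix each line of this file with the indentation string
            out.write("\n".join(prefix + line for line in lines))
            # Comma after every file except the last one
            out.write(",\n" if idx < len(infiles) - 1 else "\n")
        out.write("]\n")
```

Because a JSON string cannot contain a raw newline, prefixing whole lines does not alter any value, provided each input file is valid JSON.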

Edit: a slightly modified version with more code factorization

infiles = [...] # the whole list of input JSON files
outfile = 'out.json'
end, n = (']\n', ',\n'), len(infiles)

with open(outfile, 'w') as o:
  o.write('[')
  for infile in infiles:
    n -= 1
    with open(infile, 'r') as i:
      o.write(i.read().strip() + end[n>0]) # select correct end separator
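
Since the question's update asks for files with the same name to be merged, the same text-only approach could be applied per file name. The following is a rough sketch under assumptions: the function name `concat_by_name` is hypothetical, the folders are assumed to sit directly under `root_dir`, and `output_dir` is assumed not to be one of those folders.

```python
from collections import defaultdict
from pathlib import Path

def concat_by_name(root_dir, output_dir):
    """For each file name found in root_dir/<folder>/, write one JSON list
    containing the raw text of every file with that name (no parsing)."""
    root_dir, output_dir = Path(root_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Group paths by file name in a single pass over the folders
    groups = defaultdict(list)
    for path in sorted(root_dir.glob("*/*.json")):
        groups[path.name].append(path)

    for name, paths in groups.items():
        with open(output_dir / name, "w") as out:
            out.write("[")
            for idx, path in enumerate(paths):
                if idx:
                    out.write(",\n")  # separator between consecutive files
                out.write(path.read_text().strip())
            out.write("]\n")
```

Each output file is a list with one element per source folder, as in the question's Approach 1; it is not a merged dictionary as in Approach 2.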

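【Comments】:

Are json.load and json.dump slower than i.read and o.write?
@RTM: I ran some quick benchmarks: for complex JSON structures, i.read() can be 2 to 2.5 times faster than json.load().
Further benchmarking: for a very complex (27k lines) JSON file, the speedup can even reach a factor of 5.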
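
A comparison of this kind could be reproduced with a small timing helper like the one below. This is only a sketch, not the benchmark the commenter ran: the function name `time_read_vs_parse` and the `repeats` parameter are made up here, and the numbers will depend on the file, the disk cache, and the Python version.

```python
import json
import time

def time_read_vs_parse(path, repeats=20):
    """Return (seconds_raw_read, seconds_json_load), each summed over `repeats` runs."""
    # Time reading the file as plain text
    start = time.perf_counter()
    for _ in range(repeats):
        with open(path, "r") as fp:
            fp.read()
    raw_seconds = time.perf_counter() - start

    # Time reading and parsing the same file with json.load
    start = time.perf_counter()
    for _ in range(repeats):
        with open(path, "r") as fp:
            json.load(fp)
    parse_seconds = time.perf_counter() - start

    return raw_seconds, parse_seconds
```

Calling it on one of the real 5 MB files and dividing the second value by the first gives the ratio for that particular file.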

【Solution 2】:

This is the simplest code I can think of:

from glob import glob
from os import makedirs, path
import json

# Directories
input_dir = "in"
output_file = "out/out.json"

# Get the list of JSON files one folder level below input_dir
# (without recursive=True, "**" matches a single directory level, like "*")
files = glob(path.join(input_dir, "**", "*.json"))

# Merged data
data = {}

# Merge all files; on duplicate keys, later files overwrite earlier ones
for file in files:
    with open(file) as fp:
        data.update(json.load(fp))

# Create output directory
makedirs(path.dirname(output_file), exist_ok=True)

# Dump data
with open(output_file, "w") as fp:
    json.dump(data, fp)

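【Comments】:

Couldn't dict(data, **json.load(open(file))) be replaced with data.update()? Also, why aren't you using a context manager for the files?
@Richie Sorry that the question was not clearer. I want to combine only the contents of file1.json across all folders, then the contents of file2.json across all folders, and so on.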
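
Following that clarification, the dictionary merge could be done per file name. The sketch below is an assumption-laden illustration, not the answerer's code: the function name `merge_by_name` is hypothetical, every file is assumed to hold a JSON object at the top level, and the folders are assumed to sit directly under `root_dir` with `output_dir` located elsewhere.

```python
import json
from collections import defaultdict
from pathlib import Path

def merge_by_name(root_dir, output_dir):
    """Merge the top-level objects of all same-named JSON files into one file per name."""
    root_dir, output_dir = Path(root_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    # Group paths by file name in a single pass over the folders
    groups = defaultdict(list)
    for path in sorted(root_dir.glob("*/*.json")):
        groups[path.name].append(path)

    for name, paths in groups.items():
        merged = {}
        for path in paths:
            with open(path) as fp:
                # On duplicate keys, the file from the later folder wins
                merged.update(json.load(fp))
        with open(output_dir / name, "w") as fp:
            json.dump(merged, fp)  # no indent, to keep the output compact
```

Grouping the paths first avoids checking every folder for every file name, which the question's two approaches do with is_file().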

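【Solution 3】:

Edit: I know this solution no longer meets the requirements; I will update it as soon as possible.

Leaving aside for now whether it matters, this is what I came up with.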

import glob
import json

file_names = glob.glob('../resources/json_files/*.json')

json_list = []

for curr_f_name in file_names:
    with open(curr_f_name) as curr_f_obj:
        json_list.append(json.load(curr_f_obj))

with open('../out/json_merge_out.json', 'w') as out_file:
    json.dump(json_list, out_file, indent=4)

The directory contains the following JSON files:

example_1.json:

{
    "fruit": "Apple",
    "size": "Large",
    "color": "Red"
}

example_2.json:

{
    "quiz": {
        "sport": {
            "q1": {
                "question": "Which one is correct team name in NBA?",
                "options": [
                    "New York Bulls",
                    "Los Angeles Kings",
                    "Golden State Warriros",
                    "Huston Rocket"
                ],
                "answer": "Huston Rocket"
            }
        },
        "maths": {
            "q1": {
                "question": "5 + 7 = ?",
                "options": [
                    "10",
                    "11",
                    "12",
                    "13"
                ],
                "answer": "12"
            },
            "q2": {
                "question": "12 - 8 = ?",
                "options": [
                    "1",
                    "2",
                    "3",
                    "4"
                ],
                "answer": "4"
            }
        }
    }
}

Contents of the output file, json_merge_out.json:

[
    {
        "quiz": {
            "sport": {
                "q1": {
                    "question": "Which one is correct team name in NBA?",
                    "options": [
                        "New York Bulls",
                        "Los Angeles Kings",
                        "Golden State Warriros",
                        "Huston Rocket"
                    ],
                    "answer": "Huston Rocket"
                }
            },
            "maths": {
                "q1": {
                    "question": "5 + 7 = ?",
                    "options": [
                        "10",
                        "11",
                        "12",
                        "13"
                    ],
                    "answer": "12"
                },
                "q2": {
                    "question": "12 - 8 = ?",
                    "options": [
                        "1",
                        "2",
                        "3",
                        "4"
                    ],
                    "answer": "4"
                }
            }
        }
    },
    {
        "fruit": "Apple",
        "size": "Large",
        "color": "Red"
    }
]

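【Comments】:

【Solution 4】:

If you are really concerned about time, use C++ or C directly. As @Barmar said in the comments, I don't think they would optimize your setup much, since you need to open all the files anyway.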

【Comments】: