Trying to load multiple json files and merge into one pandas dataframe: answers

【Question title】: Trying to load multiple json files and merge into one pandas dataframe
【Posted】: 2019-01-22 19:57:13
【Question description】:

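I am trying to load multiple json files from a directory in my Google Drive into a single pandas dataframe.

I have tried quite a few solutions, but none of them has worked.

This is what I have tried so far: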

path_to_json = '/path/'
json_files = [pos_json for pos_json in os.listdir(path_to_json) if pos_json.endswith('.json')]
jsons_data = pd.DataFrame(columns=['participants','messages','active','threadtype','thread path'])
for index, js in enumerate(json_files):
    with open(os.path.join(path_to_json, js)) as json_file:
        json_text = json.load(json_file)
        participants = json_text['participants']
        messages = json_text['messages']
        active = json_text['is_still_participant']
        threadtype = json_text['thread_type']
        threadpath = json_text['thread_path']
        jsons_data.loc[index]=[participants,messages,active,threadtype,threadpath]
jsons_data

This is the full traceback of the error I get:

---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
<ipython-input-30-8385abf6a3a7> in <module>()
      1 for index, js in enumerate(json_files):
      2     with open(os.path.join(path_to_json, js)) as json_file:
----> 3         json_text = json.load(json_file)
      4         participants = json_text['participants']
      5         messages = json_text['messages']

/usr/lib/python3.6/json/__init__.py in load(fp, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    297         cls=cls, object_hook=object_hook,
    298         parse_float=parse_float, parse_int=parse_int,
--> 299         parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)
    300 
    301 

/usr/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    352             parse_int is None and parse_float is None and
    353             parse_constant is None and object_pairs_hook is None and not kw):
--> 354         return _default_decoder.decode(s)
    355     if cls is None:
    356         cls = JSONDecoder

/usr/lib/python3.6/json/decoder.py in decode(self, s, _w)
    337 
    338         """
--> 339         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    340         end = _w(s, end).end()
    341         if end != len(s):

/usr/lib/python3.6/json/decoder.py in raw_decode(self, s, idx)
    355             obj, end = self.scan_once(s, idx)
    356         except StopIteration as err:
--> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
    358         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

I have added samples of the json files I am trying to read from

Link to Jsons

Sample json:

{
participants: [
{
name: "Test 1"
},
{
name: "Person"
}
],
messages: [
{
sender_name: "Person",
timestamp_ms: 1485467319139,
content: "Hie",
type: "Generic"
}
],
title: "Test 1",
is_still_participant: true,
thread_type: "Regular",
thread_path: "inbox/xyz"
}
#second example
{
participants: [
{
name: "Clearance"
},
{
name: "Person"
}
],
messages: [
{
sender_name: "Emmanuel Sibanda",
timestamp_ms: 1212242073308,
content: "Dear",
share: {
link: "http://www.example.com/"
},
type: "Share"
}
],
title: "Clearance",
is_still_participant: true,
thread_type: "Regular",
thread_path: "inbox/Clearance"
}

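【Question discussion】:

Instead of adding a link to the JSON files, you could edit your question and show a sample of the JSON from the files.
@amanb Added samples of 2 JSON files
These are malformed JSONs. The keys are not strings, and some of the values are not strings either, which may eventually cause errors. You should find a way to validate them with a JSON validator.
Could you share a sample of the output dataframe you would like to see? What shape do you expect? Some kind of tabular representation of one of the sample JSONs would be ideal!

Tags: python json pandas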

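【Solution 1】:

Working with the JSON files you provided, converting them to dataframes, and merging them posed a few challenges. First, the keys of the JSON are not strings. Second, the arrays in the resulting "valid" JSON have different lengths, so it cannot be converted to a dataframe directly. Third, you did not specify the shape of the dataframe you want.

Still, this is an important question, because malformed JSON is more common than "valid" JSON, and although several SO answers show how to fix such JSON strings, every malformed-JSON problem is unique.

I have broken the problem down into the following parts:

Convert the malformed JSON in the files to valid JSON
Flatten the dict in the valid JSON files to prepare for dataframe conversion
Create dataframes from the files and merge them into one dataframe

Note: for this answer, I copied the sample JSON strings you provided into two files, "test.json" and "test1.json", and saved them in a "Test" folder.

Part 1: Convert the malformed JSON in the files to valid JSON:

The two sample JSON strings you provided are not valid JSON, because their keys are not quoted strings. So if you load such a file and try to parse its contents, an error is raised.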

with open('./Test/test.json') as f:
    data = json.load(f)
print(data)
#Error:
JSONDecodeError: Expecting property name enclosed in double quotes: line 2 column 1 (char 2)
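
Before repairing anything, it can help to find out which files fail to parse and with what message. The following is a minimal sketch, assuming all files sit in one folder; the helper name `find_unparseable` is hypothetical and not part of the original answer.

```python
import json
import os


def find_unparseable(folder):
    """Return {filename: error message} for every .json file that json.load rejects."""
    failures = {}
    for name in sorted(os.listdir(folder)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(folder, name), encoding="utf-8") as f:
            try:
                json.load(f)
            except json.JSONDecodeError as err:
                # Keep the parser's own message so different failure causes can be told apart
                failures[name] = str(err)
    return failures
```

An empty file reports `Expecting value: line 1 column 1 (char 0)`, the same message as in the question, while a file with unquoted keys reports `Expecting property name enclosed in double quotes`.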

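The only way I found to get around this was:

Rename all the JSON files to txt files, so that their contents are handled as plain text
Run a regular expression over the JSON strings in the text files and add double quotes ("") around the keys
Save the files as JSON again

The three steps above are carried out by two functions I wrote. The first renames the files to txt files and returns a list of filenames. The second takes that list of filenames, fixes the JSON keys with a regular expression, and saves the files as JSON again.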

import json
import os
import re 
import pandas as pd

#rename to txt files and return list of filenames
def rename_to_text_files():
    all_new_filenames = []
    for filename in os.listdir('./Test'):
        if filename.endswith(".json"):
            # splitext keeps names with extra dots intact (e.g. "a.b.json" -> "a.b.txt")
            new_filename = os.path.splitext(filename)[0] + '.txt'
            os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
            all_new_filenames.append(new_filename)
        else:
            all_new_filenames.append(filename)
    return all_new_filenames     

#fix JSON string and save as a JSON file again, returns a list of valid JSON filenames
def fix_dict_rename_to_json_files(files):
    json_validated_files = []  
    for index, filename in enumerate(files):
        filepath = os.path.join('./Test',filename)
        with open(filepath,'r+') as f:
            data = f.read()            
            # raw strings keep \w and \1 as regex escapes; quotes the key of each "key: value" line
            dict_converted = re.sub(r"(\w+):(.+)", r'"\1":\2', data)
            f.seek(0)
            f.write(dict_converted)
            f.truncate()
    #rename            
        new_filename = filename[:-4] + '.json'  
        os.rename(os.path.join('./Test', filename), os.path.join('./Test', new_filename))
        json_validated_files.append(new_filename)        
    print("All files converted to valid JSON!")        
    return json_validated_files
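
A variant of the same idea leaves the files on disk untouched: read the text, quote the keys in memory, and parse the result. This is only a sketch, under the assumption that every key sits at the start of its own line, as in the samples. The names `quote_keys` and `load_loose_json` are illustrative, and the pattern is anchored to line starts, which differs from the unanchored pattern used in the functions above.

```python
import json
import re

# Matches an unquoted key at the start of a line, e.g. `  sender_name:`
KEY_AT_LINE_START = re.compile(r'^(\s*)(\w+)\s*:', flags=re.MULTILINE)


def quote_keys(text):
    """Wrap bare keys in double quotes; lines whose key is already quoted are left alone."""
    return KEY_AT_LINE_START.sub(r'\1"\2":', text)


def load_loose_json(path):
    """Read a file with unquoted keys and return the parsed dict."""
    with open(path, encoding="utf-8") as f:
        return json.loads(quote_keys(f.read()))


# Shortened version of the first sample from the question
sample = '''{
participants: [
{
name: "Test 1"
}
],
thread_path: "inbox/xyz"
}'''
data = json.loads(quote_keys(sample))
print(data["participants"][0]["name"])  # prints: Test 1
```

Because an already quoted key does not match the pattern, running `quote_keys` a second time on its own output changes nothing, so a folder that mixes repaired and unrepaired files in this layout can be processed in one pass.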

So now I have two JSON files containing valid JSON. But they are not yet ready for dataframe conversion. To explain this better, consider the valid JSON from "test.json":

#test.json
{
"participants": [
{
"name": "Test 1"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1485467319139,
"content": "Hie",
"type": "Generic"
}
],
"title": "Test 1",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/xyz"
}

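If I read this json into a dataframe, I still get an error, because the arrays under the keys have different lengths. You can check: the value of the "messages" key is an array of length 1, while the value of "participants" is an array of length 2: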

df = pd.read_json('./Test/test.json')
print(df)
#Error
ValueError: arrays must all be same length

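In the next part, we solve this by flattening the dict in the JSON.

Part 2: Flatten the dict for dataframe conversion:

Since you did not specify the shape you expect for the dataframe, I extracted the values as best I could and flattened the dict with the following function. This assumes that the keys shown in the sample JSON do not change across the JSON files: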

#accepts a dictionary, flattens as required and returns the dictionary with updated key/value pairs
def flatten(d):
    values = []
    d['participants_name'] = d.pop('participants')
    for i in d['participants_name']:
        values.append(i['name'])
    for i in d['messages']:
        d['messages_sender_name'] = i['sender_name']
        d['messages_timestamp_ms'] = str(i['timestamp_ms'])
        d['messages_content'] = i['content']
        d['messages_type'] = i['type']
        if "share" in i:
            d['messages_share_link'] = i["share"]["link"]
    d["is_still_participant"] = str(d["is_still_participant"])
    d.pop('messages')
    d.update(participants_name=values)                    
    return d
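
Note that `flatten` overwrites the `messages_*` keys on every pass through the loop, so for a thread with several messages only the last one survives. If one row per message is wanted instead, `pandas.json_normalize` (available in pandas 1.0 and later) is one possible alternative. The sketch below assumes that shape is the desired one; it is not the approach used in this answer, and the second message in the input is made up purely for illustration.

```python
import pandas as pd

# Second sample from the question, plus one invented message for illustration
thread = {
    "participants": [{"name": "Clearance"}, {"name": "Person"}],
    "messages": [
        {"sender_name": "Emmanuel Sibanda", "timestamp_ms": 1212242073308,
         "content": "Dear", "share": {"link": "http://www.example.com/"},
         "type": "Share"},
        {"sender_name": "Person", "timestamp_ms": 1212242073309,
         "content": "Hi", "type": "Generic"},  # hypothetical extra message
    ],
    "title": "Clearance",
    "is_still_participant": True,
    "thread_type": "Regular",
    "thread_path": "inbox/Clearance",
}

# One row per entry in "messages"; thread-level fields are repeated on each row
messages = pd.json_normalize(
    thread,
    record_path="messages",
    meta=["title", "thread_type", "thread_path", "is_still_participant"],
)
# Participants are joined into one string so they do not multiply the rows
messages["participants"] = ", ".join(p["name"] for p in thread["participants"])
print(messages.shape[0])  # 2
```

Nested dicts inside a message, such as `share`, should come out as dotted column names like `share.link`, with NaN for messages that lack them.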

This time, let's consider the second sample JSON string, which also has a "share" key with a URL. The valid JSON string is as follows:

#test1.json
{
"participants": [
{
"name": "Clearance"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Emmanuel Sibanda",
"timestamp_ms": 1212242073308,
"content": "Dear",
"share": {
"link": "http://www.example.com/"
},
"type": "Share"
}
],
"title": "Clearance",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clearance"
}

When we flatten this dict with the function above, we get a dict that can easily be fed to the DataFrame function (discussed later):

with open('./Test/test1.json') as f:
    data = json.load(f)

print(flatten(data))
#Output:
{'title': 'Clearance',
 'is_still_participant': 'True',
 'thread_type': 'Regular',
 'thread_path': 'inbox/Clearance',
 'participants_name': ['Clearance', 'Person'],
 'messages_sender_name': 'Emmanuel Sibanda',
 'messages_timestamp_ms': '1212242073308',
 'messages_content': 'Dear',
 'messages_type': 'Share',
 'messages_share_link': 'http://www.example.com/'}

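Part 3: Create dataframes and merge them into one:

Now that we have a function that flattens a dict, we can call it inside a final function that does the following:

Open the JSON files one by one and load each one into memory as a dict with json.load().
Call the flatten function on each dict.
Convert the flattened dict to a dataframe.
Append each dataframe to a list that starts out empty.
Merge all the dataframes with pd.concat(), passing the list of dataframes as the argument.

The code that performs these tasks: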

#accepts a list of valid json filenames, creates dataframes from flattened dicts in the JSON files, merges the dataframes and returns the merged dataframe.

def create_merge_dataframes(list_of_valid_json_files):
    df_list = []
    for index, js in enumerate(list_of_valid_json_files):
        with open(os.path.join('./Test', js)) as json_file:  
            data = json.load(json_file)
            flattened_json_data = flatten(data)    
            df = pd.DataFrame(flattened_json_data)
            df_list.append(df)
    merged_df = pd.concat(df_list,sort=False, ignore_index=True)
    return merged_df
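
The reason the merged frame ends up with one row per participant is that `pd.DataFrame` repeats scalar values to the length of the one list in the dict, and `pd.concat` fills columns missing from a file with NaN. A small sketch, using two hand-written dicts that stand in for the output of `flatten` (shortened to a few keys):

```python
import pandas as pd

# Stand-ins for flatten() output; only a few of the keys are kept
flat_a = {"title": "Test 1", "participants_name": ["Test 1", "Person"],
          "messages_content": "Hie"}
flat_b = {"title": "Clearance", "participants_name": ["Clearance", "Person"],
          "messages_content": "Dear",
          "messages_share_link": "http://www.example.com/"}

# Scalars are broadcast against the two-element list, giving two rows per dict
frames = [pd.DataFrame(d) for d in (flat_a, flat_b)]

# Stack the frames; flat_a has no share link, so its rows get NaN in that column
merged = pd.concat(frames, sort=False, ignore_index=True)
print(merged.shape)  # (4, 4)
```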

Let's test the whole thing. We start with the functions from Part 1 and finish with Part 3 to get the merged dataframe.

#rename invalid JSON files to text
files = rename_to_text_files()

#fix JSON strings and save as JSON files again. We pass the "files" variable above as an arg for this function
json_validated_files = fix_dict_rename_to_json_files(files)

#flatten and receive merged dataframes
df = create_merge_dataframes(json_validated_files)
print(df)

The final dataframe:

        title is_still_participant thread_type      thread_path  \
0     Test 1                 True     Regular        inbox/xyz
1     Test 1                 True     Regular        inbox/xyz
2  Clearance                 True     Regular  inbox/Clearance
3  Clearance                 True     Regular  inbox/Clearance

  participants_name messages_sender_name messages_timestamp_ms  \
0            Test 1               Person         1485467319139
1            Person               Person         1485467319139
2         Clearance     Emmanuel Sibanda         1212242073308
3            Person     Emmanuel Sibanda         1212242073308

  messages_content messages_type      messages_share_link
0              Hie       Generic                      NaN
1              Hie       Generic                      NaN
2             Dear         Share  http://www.example.com/
3             Dear         Share  http://www.example.com/

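You can change the order of the columns as you like.

Notes:

The code has no exception handling and assumes that the keys of the dicts are the same as shown in your samples.
The shape and columns of the dataframe are also assumed.
You can put all the functions in one Python script, and wherever "./Test" is used as the JSON folder path, substitute your own path. The folder should initially contain only the malformed JSON files.
The script can be modularized further by putting the functions into a class.
It can also be optimized further by using hashable data types such as tuples, and sped up with the threading and asyncio libraries. However, for a folder of 1000 files this code should run well enough and should not take long.
Some errors may occur while converting the malformed JSON files to valid ones, because the contents of all the JSON files are unknown.

The code discussed provides a workflow for what you need, and I hope it helps you and anyone else who runs into a similar problem.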

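【Discussion】:

【Solution 2】:

I checked your json files and found that document1.json, document2.json and document3.json share the same problem: the property names are not enclosed in double quotes.

For example, document1.json should be corrected to: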

{
"participants": [
{
"name": "Clothing"
},
{
"name": "Person"
}
],
"messages": [
{
"sender_name": "Person",
"timestamp_ms": 1210107456233,
"content": "Good day",
"type": "Generic"
}
],
"title": "Clothing",
"is_still_participant": true,
"thread_type": "Regular",
"thread_path": "inbox/Clothing"
}

Edit: you can add double quotes to the keys of a json file with the following line, where s is the file's contents as a string:

re.sub(r'([^\s^"]+):(.+)', r'"\1":\2', s)
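
To avoid editing roughly a thousand files by hand, the substitution can be applied in a loop. The following is a sketch, assuming the files use the one-key-per-line layout of the samples. The function name `fix_folder` and the choice to write results into a separate output folder (so the originals stay untouched) are assumptions of this example, not part of the answer.

```python
import json
import re
from pathlib import Path

# Same pattern as the one-liner above, written as a raw string
KEY_PATTERN = re.compile(r'([^\s^"]+):(.+)')


def fix_folder(src, dst):
    """Quote the keys of every .json file in src and write valid JSON to dst.

    Returns the names of files that still fail to parse after the substitution.
    """
    out_dir = Path(dst)
    out_dir.mkdir(parents=True, exist_ok=True)
    still_broken = []
    for path in sorted(Path(src).glob("*.json")):
        text = path.read_text(encoding="utf-8")
        try:
            parsed = json.loads(text)  # already valid: no substitution needed
        except json.JSONDecodeError:
            try:
                parsed = json.loads(KEY_PATTERN.sub(r'"\1":\2', text))
            except json.JSONDecodeError:
                still_broken.append(path.name)
                continue
        (out_dir / path.name).write_text(json.dumps(parsed, indent=2), encoding="utf-8")
    return still_broken
```

Files that already parse are copied through without the substitution, because on valid JSON the pattern would also match the `http:` inside a URL value and corrupt it. Anything listed in the return value needs a closer look by hand.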

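【Discussion】:

I see... the problem is that I extracted about 1000 json files from my Facebook. Do I have to make these changes manually?
The regex string was helpful!
Hi @Emm, if this or any answer has solved your problem, please consider accepting it.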