How to convert nested JSON data to a Pandas dataframe? (Answers)

【Question title】: How to convert nested JSON data to a Pandas dataframe?
【Posted】: 2021-01-02 03:46:52
【Question】:

This is the JSON data I currently have. I need to load it into a Pandas dataframe for further use.

{
  "lab1": {
    "co2": [
      9.559335530495726
    ],
    "occupancy": [
      4
    ],
    "temperature": [
      21.033629524242304
    ],
    "time": "2020-09-15T16:15:35.565629"
  }
}
{
  "class1": {
    "co2": [
      24.168445969175817
    ],
    "occupancy": [
      15
    ],
    "temperature": [
      26.176607611778156
    ],
    "time": "2020-09-15T16:15:36.027525"
  }
}
{
  "office": {
    "co2": [
      6.633787232630541
    ],
    "occupancy": [
      1
    ],
    "temperature": [
      27.727982558797844
    ],
    "time": "2020-09-15T16:15:36.608386"
  }
}

I tried json_normalize, but I don't understand how to normalize my JSON data.

with open('data.json','r') as f:
    data = json.loads(f.read())
    # Normalizing data
    data1 = pd.json_normalize(data, record_path =['Results'])
    # Saving to CSV format 
    multiple_level_data.to_csv('multiplelevel_normalized_data.csv', index=False)

When I run this code, I get the following error:

JSONDecodeError                           Traceback (most recent call last)
      1 with open('data.json','r') as f:
----> 2     data = json.loads(f.read())

JSONDecodeError: Extra data: line 14 column 2 (char 240)

【Comments】:

Please add a minimal snippet of your attempt to use json_normalize, along with some details about how it fails to meet your expectations (do you get an error? which one? etc.)
with open('data.json','r') as f: data = json.loads(f.read()) # Normalizing data data1 = pd.json_normalize(data, record_path =['Results']) # Saving to CSV format multiple_level_data.to_csv('multiplelevel_normalized_data.csv', index=False) I used this code and got the following error: JSONDecodeError Traceback (most recent call last) in 1 with open('data.json','r') as f: ----> 2 data = json.loads(f.read()) JSONDecodeError: Extra data: line 14 column 2 (char 240)

Tags: json python-3.x pandas dataframe nested

【Solution 1】:

You can use pandas read_json.

First, use a regular expression to remove all "[" and "]" characters from the data. Then save the result as a JSON file.

import pandas as pd
pd.read_json (r'Path where you saved the JSON file/filename.json')
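
Note that the file in the question holds several JSON objects written back to back, which is what triggers the "Extra data" error in json.loads, and read_json would likely fail on such a file as well. Below is a minimal sketch of one way around that, assuming the objects are separated only by optional whitespace. It relies on json.JSONDecoder.raw_decode, which parses one value and reports the index where it stopped. The helper name iter_json_objects is made up for illustration, and the sample text reuses two packets from the question.

```python
import json

def iter_json_objects(text):
    """Yield each top-level JSON value from a string of concatenated values."""
    decoder = json.JSONDecoder()
    pos = 0
    n = len(text)
    while pos < n:
        # raw_decode does not skip leading whitespace, so skip it by hand.
        while pos < n and text[pos].isspace():
            pos += 1
        if pos >= n:
            break
        # Returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

# Two packets from the question, written back to back (not a valid single JSON document).
sample = '''
{"lab1": {"co2": [9.559335530495726], "occupancy": [4],
          "temperature": [21.033629524242304],
          "time": "2020-09-15T16:15:35.565629"}}
{"class1": {"co2": [24.168445969175817], "occupancy": [15],
            "temperature": [26.176607611778156],
            "time": "2020-09-15T16:15:36.027525"}}
'''

packets = list(iter_json_objects(sample))
print(len(packets))
```

The resulting list of dictionaries has the same shape as the `data` list used in the next solution, so it could be passed to the same flattening code.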

【Comments】:

【Solution 2】:

Here is an approach that does not use regular expressions.

import pandas as pd

data = [
    {'lab1': {'co2': [9.559335530495726],
              'occupancy': [4],
              'temperature': [21.033629524242304],
              'time': '2020-09-15T16:15:35.565629'}},
    {'class1': {'co2': [24.168445969175817],
                'occupancy': [15],
                'temperature': [26.176607611778156],
                'time': '2020-09-15T16:15:36.027525'}},
    {'office': {'co2': [6.633787232630541],
                'occupancy': [1],
                'temperature': [27.727982558797844],
                'time': '2020-09-15T16:15:36.608386'}}
]

Now iterate over the list of dictionaries, using explode() to flatten the lists.

df = list()
for d in data:
    for key, values in d.items():
        t = (pd.json_normalize(values)
              .explode('co2')
              .explode('occupancy')
              .explode('temperature')
              .assign(location=key)
             )
        df.append(t)

df = pd.concat(df)
print(df)

       co2 occupancy temperature                        time location
0  9.55934         4     21.0336  2020-09-15T16:15:35.565629     lab1
0  24.1684        15     26.1766  2020-09-15T16:15:36.027525   class1
0  6.63379         1      27.728  2020-09-15T16:15:36.608386   office

The original question did not state an expected result, but this data frame supports many kinds of further analysis.
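
As an alternative to chaining explode(), the same table can be built by flattening each packet in plain Python first. This is only a sketch under one assumption: every list in a packet holds exactly one reading, as in the sample data. The helper name flatten_packet is made up for illustration.

```python
import pandas as pd

data = [
    {'lab1': {'co2': [9.559335530495726],
              'occupancy': [4],
              'temperature': [21.033629524242304],
              'time': '2020-09-15T16:15:35.565629'}},
    {'class1': {'co2': [24.168445969175817],
                'occupancy': [15],
                'temperature': [26.176607611778156],
                'time': '2020-09-15T16:15:36.027525'}},
]

def flatten_packet(packet):
    """Turn one {location: readings} packet into a list of flat row dicts."""
    rows = []
    for location, readings in packet.items():
        row = {'location': location}
        for field, value in readings.items():
            # Assumption: each list holds a single reading, so take its first item.
            row[field] = value[0] if isinstance(value, list) else value
        rows.append(row)
    return rows

rows = [row for packet in data for row in flatten_packet(packet)]
flat_df = pd.DataFrame(rows)
print(flat_df)
```

Building one DataFrame from a list of row dictionaries avoids creating and concatenating a small frame per packet, which may matter when there are many packets.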

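【Comments】:

Thanks @jsmart for the quick response. This works well for the given number of packets, but what if I want to read a JSON file containing 1000 packets? I tried json.loads(), but I understand it cannot load data made up of multiple JSON objects (which is my data format). How can I solve this?
IIUC, you can write Python code to "flatten" a large number of packets and convert the result to a data frame. Could you post an example with more packets?
I don't think we can upload .json files in comments. As before, I posted 3 packets: one for lab1, one for class1, and one for office. Likewise, for these three locations there will be multiple packets with different values but in the same format as the ones I provided. If you give me your email, I can send you the .json file for a better understanding. @jsmart
Could you post a gist? That makes it available to the SO community: gist.github.com
gist.github.com/Fareed99/45b8a39a7e4493243ec973fa73f2b92b Here is the gist, @jsmart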

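【Solution 3】:

@Fareed Ahmad posted a larger data set.

First, we create two functions: 1) one that converts the Gist file into a sequence of packets; 2) one that converts a packet into a data frame: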

import json
import pandas as pd
import requests

def resp_to_packets(resp_text):
    ''' Convert file to list of packets.'''
    packet = ''
    for line in resp_text.split('\n'):
        if line.startswith('}{'):
            packet += '}'
            yield json.loads(packet)
            packet = '{'
        else:
            packet += line + '\n'
    yield json.loads(packet)

    
def packet_to_df(packet):
    ''' Convert packet to data frame.'''
    df = list()
    for key, values in packet.items():
        t = (pd.json_normalize(values)
              .explode('co2')
              .explode('occupancy')
              .explode('temperature')
              .assign(location=key)
             )
        df.append(t)
    return pd.concat(df, ignore_index=True)

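# 'datetime64[ns]' carries an explicit unit; recent pandas versions
# reject the unit-less 'datetime64' in astype().
dtypes = {'co2': float, 'occupancy': int, 'temperature': float, 
          'time': 'datetime64[ns]', 'location': str}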

Second, run the pipeline, which includes concatenating the data frames and converting the types:

url = 'https://gist.githubusercontent.com/Fareed99/45b8a39a7e4493243ec973fa73f2b92b/raw/4b2eb95c24c95d40733f06d13dbf0356c4520e99/data.json'
r = requests.get(url)
assert r.ok

packets = resp_to_packets(r.text)
dfs     = (packet_to_df(packet) for packet in packets)
df      = pd.concat(dfs, ignore_index=True).astype(dtype=dtypes)

print(df.tail())

           co2  occupancy  temperature                       time location
476  45.237285         15    27.364173 2020-09-15 20:37:29.252201   class1
477   5.565177          4    21.033565 2020-09-15 20:37:29.667347     lab1
478  10.799228          1    21.014435 2020-09-15 20:37:30.689885     lab1
479  36.989700         20    27.059197 2020-09-15 20:37:33.467733   class1
480   1.836340          2    23.021893 2020-09-15 20:37:35.853943   office
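
The type-conversion step can also be tried without network access. Below is a small sketch on an in-memory frame built from two rows of the question's data, assuming the columns start out as generic objects, as they do after explode(). It uses pd.to_datetime for the time column rather than a dtype string in astype(); the variable names raw and typed are made up for illustration.

```python
import pandas as pd

# Stand-in for the concatenated frame before type conversion (all object columns).
raw = pd.DataFrame({
    'co2': [9.559335530495726, 24.168445969175817],
    'occupancy': [4, 15],
    'temperature': [21.033629524242304, 26.176607611778156],
    'time': ['2020-09-15T16:15:35.565629', '2020-09-15T16:15:36.027525'],
    'location': ['lab1', 'class1'],
}, dtype=object)

# Convert the numeric columns with astype, and parse the ISO timestamps separately.
typed = raw.astype({'co2': float, 'occupancy': int, 'temperature': float})
typed['time'] = pd.to_datetime(typed['time'])

print(typed.dtypes)
```

With a real datetime column, time-based operations such as sorting by time or resampling become available on the result.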

【Comments】: