【Question Title】:Pandas: Retrieving nested data from JSON File
【Posted】:2026-02-05 20:50:01
【Question】:

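I am parsing nested JSON data from here. Some of the bill files are associated with more than one committee_id, and I need all of the committees associated with each file. I'm not sure, but I think this means writing a new row for each committee_id. My code is as follows: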

import os.path
import csv
import json

path = '/home/jayaramdas/anaconda3/Thesis/govtrack/bills109/hr'
dirs = os.listdir(path)
outputfile = open('df/h109_s_b', 'w', newline='')                            
outputwriter = csv.writer(outputfile)

for dir in dirs:
    with open(path + "/" + dir + "/data.json", "r") as f:
        data = json.load(f)

        a = data['introduced_at']
        b = data['bill_id']
        c = data['sponsor']['thomas_id']
        d = data['sponsor']['state']
        e = data['sponsor']['name']
        f = data['sponsor']['type']
        i = data['subjects_top_term']   
        j = data['official_title']               

        if data['committees']:
            g = data['committees'][0]['committee_id']
        else:
            g = "None"                      
    outputwriter.writerow([a, b, c, d, e, f, g, i, j])
outputfile.close()       

The problem I am having is that my code collects only the first committee_id listed. For example, the file hr145 looks like this:

 "committees": [
{
  "activity": [
    "referral", 
    "in committee"
  ], 
  "committee": "House Transportation and Infrastructure", 
  "committee_id": "HSPW"
}, 
{
  "activity": [
    "referral"
  ], 
  "committee": "House Transportation and Infrastructure", 
  "committee_id": "HSPW", 
  "subcommittee": "Subcommittee on Economic Development, Public Buildings and Emergency Management", 
  "subcommittee_id": "13"
}, 
{
  "activity": [
    "referral", 
    "in committee"
  ], 
  "committee": "House Financial Services", 
  "committee_id": "HSBA"
}, 
{
  "activity": [
    "referral"
  ], 
  "committee": "House Financial Services", 
  "committee_id": "HSBA", 
  "subcommittee": "Subcommittee on Domestic and International Monetary Policy, Trade, and Technology", 
  "subcommittee_id": "19"
}

This is where it gets a bit tricky: when a bill goes through a subcommittee, I also want the subcommittee_id to be associated with its committee_id:

bill_id     committee   subcommittee   introduced_at   thomas_id   state   name
hr145-109   HSPW        na             "2005-01-4"     73          NY      "McHugh, John M."
hr145-109   HSPW        13             "2005-01-4"     73          NY      "McHugh, John M."
hr145-109   HSBA        na             "2005-01-4"     73          NY      "McHugh, John M."
hr145-109   HSBA        19             "2005-01-4"     73          NY      "McHugh, John M."

Any ideas?

【Comments】:

    Tags: python json pandas dataframe


    【Solution 1】:

    You can do it this way:

    In [111]: with open(fn) as f:
       .....:     data = ujson.load(f)
       .....:
    
    In [112]: committees = pd.json_normalize(data, 'committees')
    
    In [113]: committees
    Out[113]:
                 activity                                committee committee_id                            subcommittee subcommittee_id
    0          [referral]                House Energy and Commerce         HSIF                                     NaN             NaN
    1          [referral]                House Energy and Commerce         HSIF  Subcommittee on Energy and Air Quality              03
    2          [referral]        House Education and the Workforce         HSED                                     NaN             NaN
    3          [referral]                 House Financial Services         HSBA                                     NaN             NaN
    4          [referral]                        House Agriculture         HSAG                                     NaN             NaN
    5  [referral, markup]                          House Resources         HSII                                     NaN             NaN
    6          [referral]                            House Science         HSSY                                     NaN             NaN
    7          [referral]                     House Ways and Means         HSWM                                     NaN             NaN
    8          [referral]  House Transportation and Infrastructure         HSPW                                     NaN             NaN
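
To get the layout the question asks for (one row per committee entry, with the bill and sponsor fields repeated), `json_normalize` can also carry top-level fields along through its `meta` argument. The following is a minimal sketch, assuming a recent pandas where the function is exposed as `pd.json_normalize`. The `data` dict is a trimmed, hypothetical stand-in for one `data.json`, with values copied from the question's hr145 example; the value types are assumed.

```python
import pandas as pd

# Hypothetical, trimmed stand-in for one data.json (values taken from the question).
data = {
    "bill_id": "hr145-109",
    "introduced_at": "2005-01-4",
    "sponsor": {"thomas_id": "73", "state": "NY", "name": "McHugh, John M."},
    "committees": [
        {"activity": ["referral", "in committee"], "committee_id": "HSPW"},
        {"activity": ["referral"], "committee_id": "HSPW", "subcommittee_id": "13"},
        {"activity": ["referral", "in committee"], "committee_id": "HSBA"},
        {"activity": ["referral"], "committee_id": "HSBA", "subcommittee_id": "19"},
    ],
}

# record_path: the list to expand into rows.
# meta: top-level (or nested) fields repeated on every row.
rows = pd.json_normalize(
    data,
    record_path="committees",
    meta=[
        "bill_id",
        "introduced_at",
        ["sponsor", "thomas_id"],
        ["sponsor", "state"],
        ["sponsor", "name"],
    ],
)

wanted = [
    "bill_id", "committee_id", "subcommittee_id", "introduced_at",
    "sponsor.thomas_id", "sponsor.state", "sponsor.name",
]
# reindex also creates subcommittee_id when no entry in the file has one.
result = rows.reindex(columns=wanted)
result["subcommittee_id"] = result["subcommittee_id"].fillna("na")

print(result.to_string(index=False))
```

Nested `meta` paths such as `["sponsor", "state"]` come out as dotted column names (`sponsor.state`), which can be renamed afterwards if shorter headers are preferred.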
    

    Update:

    If you want to put all the data into a single DF, you can do this:

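    import os
    from io import StringIO
    
    import ujson
    import pandas as pd
    
    start_path = '/home/jayaramdas/anaconda3/Thesis/govtrack/bills109/hr'
    
    def get_merged_json(start_path):
        # Load every *.json file found under start_path into a list of dicts
        return [ujson.load(open(os.path.join(p, f)))
                for p, _, files in os.walk(start_path)
                for f in files
                if f.endswith('.json')
               ]
    
    data = get_merged_json(start_path)
    # Wrap the JSON text in StringIO: recent pandas versions deprecate passing a literal JSON string
    df = pd.read_json(StringIO(ujson.dumps(data)))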
    

    PS: this puts all of the committees into a single column as JSON data.
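
One possible way to unpack that single `committees` column afterwards is `DataFrame.explode` followed by `pd.json_normalize`. This is only a sketch under assumptions: the small `df` below is a hypothetical stand-in for the merged frame (the second bill id is made up to show the empty-list case), and `ignore_index` requires pandas 1.1 or newer.

```python
import pandas as pd

# Hypothetical stand-in for the merged frame: one row per bill,
# with the raw committees list kept in a single column.
df = pd.DataFrame({
    "bill_id": ["hr145-109", "hr999-109"],  # second id is made up
    "committees": [
        [{"committee_id": "HSPW"},
         {"committee_id": "HSPW", "subcommittee_id": "13"}],
        [],  # a bill with no committees
    ],
})

# One row per committee entry; an empty list becomes a single NaN row.
exploded = df.explode("committees", ignore_index=True)

# Expand each dict into columns, replacing NaN placeholders with empty dicts.
details = pd.json_normalize(
    [c if isinstance(c, dict) else {} for c in exploded["committees"]]
)

# Both frames share the same 0..n-1 index, so a column-wise concat lines up.
flat = pd.concat([exploded.drop(columns="committees"), details], axis=1)
print(flat)
```

Bills without committees survive as one row with missing committee fields, which can then be filled with a placeholder such as "na".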

    【Discussion】:

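    • Thanks again, MaxU! I have a quick question: what should fn point to? Wait, I think I've got it: fn = filename.
    • @MichaelPerdue, yes, it should be the full or relative path to your file, including its name
    • I have applied your code, with one exception: I replaced ujson with json, because I was getting NameError: name 'ujson' is not defined. However, it returns only one row. As fn, I am using (path + "/" + dir + "/data.json", "r"). I can probably play around with it to get it working, but do you know what is going on?
    • @MichaelPerdue, the number of rows will vary depending on the number of elements in the committees list in each file
    • @MichaelPerdue, I have updated my answer, please check. I would also open a new question about how to expand a JSON column into multiple columns, as this can be tricky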
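
As the comments point out, the row count depends on the length of each file's `committees` list, and the standard-library `json` module can stand in for `ujson`. Staying close to the question's own `csv` approach, a hedged sketch that writes one row per committee entry could look like the following. The helper names `committee_rows` and `write_bills` are made up for illustration, the directory layout (`<path>/<bill>/data.json`) is taken from the question, and the demo bill is a trimmed, hypothetical sample.

```python
import csv
import json
import os
import tempfile


def committee_rows(data):
    """Yield one output row per entry in data['committees']."""
    sponsor = data.get("sponsor") or {}
    tail = [data.get("introduced_at"), sponsor.get("thomas_id"),
            sponsor.get("state"), sponsor.get("name")]
    # A bill with no committees still produces one row, filled with "na".
    for c in data.get("committees") or [{}]:
        yield [data.get("bill_id"),
               c.get("committee_id", "na"),
               c.get("subcommittee_id", "na")] + tail


def write_bills(path, out_file):
    """Read <path>/<bill>/data.json for every bill and write all rows as CSV."""
    count = 0
    with open(out_file, "w", newline="") as out:
        writer = csv.writer(out)
        for name in sorted(os.listdir(path)):
            json_path = os.path.join(path, name, "data.json")
            if not os.path.isfile(json_path):
                continue  # skip anything that is not a bill directory
            with open(json_path) as fh:
                for row in committee_rows(json.load(fh)):
                    writer.writerow(row)
                    count += 1
    return count


# Demo on a throwaway directory with one hypothetical, trimmed bill file.
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "hr145"))
sample = {
    "bill_id": "hr145-109",
    "introduced_at": "2005-01-4",
    "sponsor": {"thomas_id": "73", "state": "NY", "name": "McHugh, John M."},
    "committees": [
        {"committee_id": "HSPW"},
        {"committee_id": "HSPW", "subcommittee_id": "13"},
        {"committee_id": "HSBA"},
        {"committee_id": "HSBA", "subcommittee_id": "19"},
    ],
}
with open(os.path.join(demo_dir, "hr145", "data.json"), "w") as fh:
    json.dump(sample, fh)

out_csv = os.path.join(tempfile.mkdtemp(), "bills.csv")
n_rows = write_bills(demo_dir, out_csv)
print(n_rows)
```

Iterating over the whole `committees` list, rather than indexing `[0]`, is what removes the "only the first committee_id" limitation of the original loop.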