【Question title】: Unnesting / normalizing JSON in Python
【Posted】: 2020-01-09 22:58:01
【Question description】:

I am trying to unnest the congressional data here: https://theunitedstates.io/congress-legislators/legislators-historical.json

Example structure:

    {
      "id": {
        "bioguide": "B000226",
        "govtrack": 401222,
        "icpsr": 507,
        "wikipedia": "Richard Bassett (politician)",
        "wikidata": "Q518823",
        "google_entity_id": "kg:/m/02pz46"
      },
      "name": {
        "first": "Richard",
        "last": "Bassett"
      },
      "bio": {
        "birthday": "1745-04-02",
        "gender": "M"
      },
      "terms": [
        {
          "type": "sen",
          "start": "1789-03-04",
          "end": "1793-03-03",
          "state": "DE",
          "class": 2,
          "party": "Anti-Administration"
        }
      ]
    }

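If I just use json_normalize(data), "terms" does not get unnested.

If I try to unnest terms specifically, e.g. json_normalize(data, 'terms', 'name'), then anything else I include (here, name) stays in dict format, with {u'last': u'Bassett', u'first': u'Richard'} as the row entry.

Full current code, if you want to run it: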

import json
import urllib  # Python 2; on Python 3, urlopen lives in urllib.request
import pandas as pd
from pandas.io.json import json_normalize

# load data
url = "https://theunitedstates.io/congress-legislators/legislators-historical.json"
json_url = urllib.urlopen(url)
data = json.loads(json_url.read())

# parse: one row per term, but 'name' is carried over as a raw dict
congress_names = json_normalize(data, record_path='terms', meta='name')

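【Question comments】:

  • Can you explain or show what kind of output you want to get?
  • Looking at the JSON, I think you will need to build several separate DataFrames. It will be long, and each one has its own schema that you need to infer.

Tags: python json pandas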


【Solution 1】:

I think the code below should work. There may be a better way to normalize, but I am not aware of one.

import requests
import pandas as pd

# pd.json_normalize needs pandas >= 1.0; on older versions use
# `from pandas.io.json import json_normalize` instead.
url = 'https://theunitedstates.io/congress-legislators/legislators-historical.json'
resp = requests.get(url)
raw_dict = resp.json()

frames = []
for record in raw_dict:
    # One row per term; the whole 'name' dict rides along as a single column
    df1 = pd.json_normalize(record, record_path=['terms'], meta=['name'])
    # Expand the 'name' dict into separate columns (first, last, ...)
    df1 = pd.concat([df1, df1['name'].apply(pd.Series)], axis=1)
    frames.append(df1)

# Concatenate once at the end instead of growing df inside the loop
df = pd.concat(frames, axis=0, ignore_index=True, sort=True)
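To see what the per-record step does without downloading anything, here is a minimal sketch run on a trimmed copy of the sample record from the question. It assumes pandas >= 1.0 (for `pd.json_normalize`), and unlike the loop above it also drops the original dict column once it has been expanded; the variable names are only illustrative.

```python
import pandas as pd

# Trimmed copy of the sample record shown in the question.
record = {
    "id": {"bioguide": "B000226", "govtrack": 401222},
    "name": {"first": "Richard", "last": "Bassett"},
    "bio": {"birthday": "1745-04-02", "gender": "M"},
    "terms": [
        {
            "type": "sen",
            "start": "1789-03-04",
            "end": "1793-03-03",
            "state": "DE",
            "class": 2,
            "party": "Anti-Administration",
        }
    ],
}

# One row per term; the whole "name" dict is carried along in a single cell.
terms = pd.json_normalize(record, record_path=["terms"], meta=["name"])

# Turn each dict cell into its own columns ("first", "last").
name_cols = terms["name"].apply(pd.Series)

# Replace the dict column with the expanded columns.
flat = pd.concat([terms.drop(columns="name"), name_cols], axis=1)
print(flat.columns.tolist())
```

The same pattern would have to be repeated for `id` and `bio` if those are wanted as columns too, which is why passing explicit nested paths in `meta` tends to be less work.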

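【Comments】:

    【Solution 2】:

    When you specify terms as the record_path, you need to pass the paths of the remaining columns as a list for meta. Build that list with a list comprehension, as follows: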

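    import pandas as pd  # pd.json_normalize needs pandas >= 1.0

    # Every [outer, inner] key pair of the first record, except the 'terms' list
    l_meta = [[k, k1] for k in data[0] if k != 'terms' for k1 in data[0][k]]
    # errors='ignore' fills NaN where a record lacks one of these keys
    congress_names = pd.json_normalize(data, 'terms', l_meta, errors='ignore')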
    
    Out[1105]:
      type       start         end state  class                party  district  \
    0  sen  1789-03-04  1793-03-03    DE    2.0  Anti-Administration       NaN
    1  rep  1789-03-04  1791-03-03    VA    NaN                  NaN       9.0
    
      id.bioguide id.govtrack id.icpsr                    id.wikipedia  \
    0     B000226      401222      507    Richard Bassett (politician)
    1     B000546      401521      786  Theodorick Bland (congressman)
    
      id.wikidata id.google_entity_id  name.first name.last bio.birthday  \
    0     Q518823        kg:/m/02pz46     Richard   Bassett   1745-04-02
    1    Q1749152        kg:/m/033mf4  Theodorick     Bland   1742-03-21
    
      bio.gender
    0          M
    1          M
    

    Note: I only picked the first 2 elements/objects from data for this test. I also assume that the first element (data[0]) contains all the columns.
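The assumption that data[0] contains every column does not hold even for these two records: `id.house_history` exists only on the second one, so it is missing from the output above. A possible way around this, sketched below, is to collect the key pairs from all records rather than from data[0] only. The helper `collect_meta_paths` is a hypothetical name, not part of pandas; the two records are trimmed to the values shown in the outputs of this answer, and the sketch assumes pandas >= 1.0.

```python
import pandas as pd

# Two records trimmed to values that appear in this thread.
data = [
    {
        "id": {"bioguide": "B000226", "govtrack": 401222},
        "name": {"first": "Richard", "last": "Bassett"},
        "bio": {"birthday": "1745-04-02", "gender": "M"},
        "terms": [
            {"type": "sen", "start": "1789-03-04", "end": "1793-03-03",
             "state": "DE", "class": 2, "party": "Anti-Administration"}
        ],
    },
    {
        "id": {"bioguide": "B000546", "govtrack": 401521, "house_history": 9479},
        "name": {"first": "Theodorick", "last": "Bland"},
        "bio": {"birthday": "1742-03-21", "gender": "M"},
        "terms": [
            {"type": "rep", "start": "1789-03-04", "end": "1791-03-03",
             "state": "VA", "district": 9}
        ],
    },
]


def collect_meta_paths(records, record_key="terms"):
    """Hypothetical helper: every [outer, inner] key pair seen in any record.

    Only dict-valued top-level keys are considered; order is first-seen.
    """
    seen = {}
    for record in records:
        for outer, value in record.items():
            if outer == record_key or not isinstance(value, dict):
                continue
            for inner in value:
                seen.setdefault((outer, inner), None)
    return [list(path) for path in seen]


l_meta = collect_meta_paths(data)
# errors="ignore" fills NaN where a record lacks one of the collected paths.
congress_names = pd.json_normalize(data, "terms", l_meta, errors="ignore")
print(congress_names.columns.tolist())
```

Top-level keys whose values are not dicts are skipped in this sketch; if the real file has such keys, they would need to be added to `meta` separately.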


    Method 2:

    Normalize the whole of data into the main congress_names frame. Then slice out only the terms column, normalize it into a new df1, and join it back.

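    import pandas as pd  # pd.json_normalize needs pandas >= 1.0

    # Flatten id/name/bio; 'terms' stays as a column of lists
    congress_names = pd.json_normalize(data)
    # .str[0] keeps only the FIRST term of each legislator
    df1 = pd.json_normalize(congress_names['terms'].str[0].tolist())
    congress_names = congress_names.join(df1).drop(columns='terms')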
    
    Out[1130]:
      id.bioguide  id.govtrack  id.icpsr                    id.wikipedia  \
    0     B000226       401222       507    Richard Bassett (politician)
    1     B000546       401521       786  Theodorick Bland (congressman)
    
      id.wikidata id.google_entity_id  name.first name.last bio.birthday  \
    0     Q518823        kg:/m/02pz46     Richard   Bassett   1745-04-02
    1    Q1749152        kg:/m/033mf4  Theodorick     Bland   1742-03-21
    
      bio.gender  id.house_history type       start         end state  class  \
    0          M               NaN  sen  1789-03-04  1793-03-03    DE    2.0
    1          M            9479.0  rep  1789-03-04  1791-03-03    VA    NaN
    
                     party  district
    0  Anti-Administration       NaN
    1                  NaN       9.0
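Method 2 as written keeps only the first term of each legislator, because of `.str[0]`. If every term is wanted, one option is `DataFrame.explode`, sketched below. This assumes pandas >= 1.0 and that every record has at least one term (an empty list would explode to NaN and break the second normalize). The second record is entirely made up for illustration.

```python
import pandas as pd

data = [
    {
        "id": {"bioguide": "B000226"},
        "name": {"first": "Richard", "last": "Bassett"},
        "terms": [{"type": "sen", "start": "1789-03-04", "state": "DE"}],
    },
    {
        # Hypothetical record with two terms, invented for this example.
        "id": {"bioguide": "X000000"},
        "name": {"first": "Jane", "last": "Doe"},
        "terms": [
            {"type": "rep", "start": "1801-03-04", "state": "VA"},
            {"type": "rep", "start": "1803-03-04", "state": "VA"},
        ],
    },
]

# Flatten the nested dicts; "terms" remains a column holding lists of dicts.
flat = pd.json_normalize(data)

# One row per term; the other columns are repeated for each term.
exploded = flat.explode("terms").reset_index(drop=True)

# Turn each term dict into columns, aligned by position with `exploded`.
terms = pd.json_normalize(exploded["terms"].tolist())

result = exploded.drop(columns="terms").join(terms)
print(result)
```

Resetting the index before the join matters here: `explode` repeats the original index for each term, so without it the join would match rows by the wrong labels.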
    

    【Comments】:
