【Question title】: pandas.io.json.json_normalize with very nested json
【Posted】: 2018-04-24 20:48:37
【Question】:

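I have been trying to normalize a very nested JSON file that I will analyze later. What I am struggling with is how to go more than one level deep when normalizing.

I went through the pandas.io.json.json_normalize documentation, since it does exactly what I want.

I have been able to normalize part of it and now understand how dictionaries work, but I am still not there.

With the code below I can only get the first level.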

import json
import pandas as pd
from pandas.io.json import json_normalize

with open('authors_sample.json') as f:
    d = json.load(f)

raw = json_normalize(d['hits']['hits'])

authors = json_normalize(data = d['hits']['hits'], 
                         record_path = '_source', 
                         meta = ['_id', ['_source', 'journal'], ['_source', 'title'], 
                                 ['_source', 'normalized_venue_name']
                                 ])

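I am trying to "dig" into the 'authors' dictionary with the code below, but record_path = ['_source', 'authors'] throws TypeError: string indices must be integers. As far as I understand json_normalize, the logic should be fine, but I still don't quite understand how to dive into JSON that mixes dicts and lists.

I even went through this simple example.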

authors = json_normalize(data = d['hits']['hits'], 
                         record_path = ['_source', 'authors'], 
                         meta = ['_id', ['_source', 'journal'], ['_source', 'title'], 
                                 ['_source', 'normalized_venue_name']
                                 ])

Below is a chunk of the JSON file (5 records).

{u'_shards': {u'failed': 0, u'successful': 5, u'total': 5},
 u'hits': {u'hits': [{u'_id': u'7CB3F2AD',
    u'_index': u'scibase_listings',
    u'_score': 1.0,
    u'_source': {u'authors': None,
     u'deleted': 0,
     u'description': None,
     u'doi': u'',
     u'is_valid': 1,
     u'issue': None,
     u'journal': u'Physical Review Letters',
     u'link': None,
     u'meta_description': None,
     u'meta_keywords': None,
     u'normalized_venue_name': u'phys rev lett',
     u'pages': None,
     u'parent_keywords': [u'Chromatography',
      u'Quantum mechanics',
      u'Particle physics',
      u'Quantum field theory',
      u'Analytical chemistry',
      u'Quantum chromodynamics',
      u'Physics',
      u'Mass spectrometry',
      u'Chemistry'],
     u'pub_date': u'1987-03-02 00:00:00',
     u'pubtype': None,
     u'rating_avg_weighted': 0,
     u'rating_clarity': 0.0,
     u'rating_clarity_weighted': 0.0,
     u'rating_innovation': 0.0,
     u'rating_innovation_weighted': 0.0,
     u'rating_num_weighted': 0,
     u'rating_reproducability': 0,
     u'rating_reproducibility_weighted': 0.0,
     u'rating_versatility': 0.0,
     u'rating_versatility_weighted': 0.0,
     u'review_count': 0,
     u'tag': [u'mass spectra', u'elementary particles', u'bound states'],
     u'title': u'Evidence for a new meson: A quasinuclear NN-bar bound state',
     u'userAvg': 0.0,
     u'user_id': None,
     u'venue_name': u'Physical Review Letters',
     u'views_count': 0,
     u'volume': None},
    u'_type': u'listing'},
   {u'_id': u'7AF8EBC3',
    u'_index': u'scibase_listings',
    u'_score': 1.0,
    u'_source': {u'authors': [{u'affiliations': [u'Punjabi University'],
       u'author_id': u'780E3459',
       u'author_name': u'munish puri'},
      {u'affiliations': [u'Punjabi University'],
       u'author_id': u'48D92C79',
       u'author_name': u'rajesh dhaliwal'},
      {u'affiliations': [u'Punjabi University'],
       u'author_id': u'7D9BD37C',
       u'author_name': u'r s singh'}],
     u'deleted': 0,
     u'description': None,
     u'doi': u'',
     u'is_valid': 1,
     u'issue': None,
     u'journal': u'Journal of Industrial Microbiology & Biotechnology',
     u'link': None,
     u'meta_description': None,
     u'meta_keywords': None,
     u'normalized_venue_name': u'j ind microbiol biotechnol',
     u'pages': None,
     u'parent_keywords': [u'Nuclear medicine',
      u'Psychology',
      u'Hydrology',
      u'Chromatography',
      u'X-ray crystallography',
      u'Nuclear fusion',
      u'Medicine',
      u'Fluid dynamics',
      u'Thermodynamics',
      u'Physics',
      u'Gas chromatography',
      u'Radiobiology',
      u'Engineering',
      u'Organic chemistry',
      u'High-performance liquid chromatography',
      u'Chemistry',
      u'Organic synthesis',
      u'Psychotherapist'],
     u'pub_date': u'2008-04-04 00:00:00',
     u'pubtype': None,
     u'rating_avg_weighted': 0,
     u'rating_clarity': 0.0,
     u'rating_clarity_weighted': 0.0,
     u'rating_innovation': 0.0,
     u'rating_innovation_weighted': 0.0,
     u'rating_num_weighted': 0,
     u'rating_reproducability': 0,
     u'rating_reproducibility_weighted': 0.0,
     u'rating_versatility': 0.0,
     u'rating_versatility_weighted': 0.0,
     u'review_count': 0,
     u'tag': [u'flow rate',
      u'operant conditioning',
      u'packed bed reactor',
      u'immobilized enzyme',
      u'specific activity'],
     u'title': u'Development of a stable continuous flow immobilized enzyme reactor for the hydrolysis of inulin',
     u'userAvg': 0.0,
     u'user_id': None,
     u'venue_name': u'Journal of Industrial Microbiology & Biotechnology',
     u'views_count': 0,
     u'volume': None},
    u'_type': u'listing'},
   {u'_id': u'7521A721',
    u'_index': u'scibase_listings',
    u'_score': 1.0,
    u'_source': {u'authors': [{u'author_id': u'7FF872BC',
       u'author_name': u'barbara eileen ryan'}],
     u'deleted': 0,
     u'description': None,
     u'doi': u'',
     u'is_valid': 1,
     u'issue': None,
     u'journal': u'The American Historical Review',
     u'link': None,
     u'meta_description': None,
     u'meta_keywords': None,
     u'normalized_venue_name': u'american historical review',
     u'pages': None,
     u'parent_keywords': [u'Social science',
      u'Politics',
      u'Sociology',
      u'Law'],
     u'pub_date': u'1992-01-01 00:00:00',
     u'pubtype': None,
     u'rating_avg_weighted': 0,
     u'rating_clarity': 0.0,
     u'rating_clarity_weighted': 0.0,
     u'rating_innovation': 0.0,
     u'rating_innovation_weighted': 0.0,
     u'rating_num_weighted': 0,
     u'rating_reproducability': 0,
     u'rating_reproducibility_weighted': 0.0,
     u'rating_versatility': 0.0,
     u'rating_versatility_weighted': 0.0,
     u'review_count': 0,
     u'tag': [u'social movements'],
     u'title': u"Feminism and the women's movement : dynamics of change in social movement ideology, and activism",
     u'userAvg': 0.0,
     u'user_id': None,
     u'venue_name': u'The American Historical Review',
     u'views_count': 0,
     u'volume': None},
    u'_type': u'listing'},
   {u'_id': u'7DAEB9A4',
    u'_index': u'scibase_listings',
    u'_score': 1.0,
    u'_source': {u'authors': [{u'author_id': u'0299B8E9',
       u'author_name': u'fraser j harbutt'}],
     u'deleted': 0,
     u'description': None,
     u'doi': u'',
     u'is_valid': 1,
     u'issue': None,
     u'journal': u'The American Historical Review',
     u'link': None,
     u'meta_description': None,
     u'meta_keywords': None,
     u'normalized_venue_name': u'american historical review',
     u'pages': None,
     u'parent_keywords': [u'Superconductivity',
      u'Nuclear fusion',
      u'Geology',
      u'Chemistry',
      u'Metallurgy'],
     u'pub_date': u'1988-01-01 00:00:00',
     u'pubtype': None,
     u'rating_avg_weighted': 0,
     u'rating_clarity': 0.0,
     u'rating_clarity_weighted': 0.0,
     u'rating_innovation': 0.0,
     u'rating_innovation_weighted': 0.0,
     u'rating_num_weighted': 0,
     u'rating_reproducability': 0,
     u'rating_reproducibility_weighted': 0.0,
     u'rating_versatility': 0.0,
     u'rating_versatility_weighted': 0.0,
     u'review_count': 0,
     u'tag': [u'iron'],
     u'title': u'The iron curtain : Churchill, America, and the origins of the Cold War',
     u'userAvg': 0.0,
     u'user_id': None,
     u'venue_name': u'The American Historical Review',
     u'views_count': 0,
     u'volume': None},
    u'_type': u'listing'},
   {u'_id': u'7B3236C5',
    u'_index': u'scibase_listings',
    u'_score': 1.0,
    u'_source': {u'authors': [{u'author_id': u'7DAB7B72',
       u'author_name': u'richard m freeland'}],
     u'deleted': 0,
     u'description': None,
     u'doi': u'',
     u'is_valid': 1,
     u'issue': None,
     u'journal': u'The American Historical Review',
     u'link': None,
     u'meta_description': None,
     u'meta_keywords': None,
     u'normalized_venue_name': u'american historical review',
     u'pages': None,
     u'parent_keywords': [u'Political Science', u'Economics'],
     u'pub_date': u'1985-01-01 00:00:00',
     u'pubtype': None,
     u'rating_avg_weighted': 0,
     u'rating_clarity': 0.0,
     u'rating_clarity_weighted': 0.0,
     u'rating_innovation': 0.0,
     u'rating_innovation_weighted': 0.0,
     u'rating_num_weighted': 0,
     u'rating_reproducability': 0,
     u'rating_reproducibility_weighted': 0.0,
     u'rating_versatility': 0.0,
     u'rating_versatility_weighted': 0.0,
     u'review_count': 0,
     u'tag': [u'foreign policy'],
     u'title': u'The Truman Doctrine and the origins of McCarthyism : foreign policy, domestic politics, and internal security, 1946-1948',
     u'userAvg': 0.0,
     u'user_id': None,
     u'venue_name': u'The American Historical Review',
     u'views_count': 0,
     u'volume': None},
    u'_type': u'listing'}],
  u'max_score': 1.0,
  u'total': 36429433},
 u'timed_out': False,
 u'took': 170}

【Comments】:

    Tags: python json python-3.x pandas normalize


    【Answer 1】:

    In the pandas example (below), what do the brackets mean? Is there a logic to follow with the []. [...]

    result = json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
    

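    Each string or list of strings in the ['state', 'shortname', ['info', 'governor']] value is a path to an element to include, in addition to the selected rows. The second json_normalize() argument (record_path, set to 'counties' in the documentation example) tells the function how to select the elements from the input data structure that make up the rows of the output, and the meta paths add further metadata that is included with each of those rows. Think of these as table joins in a database, if you will.

    The input for the US States documentation example has two dictionaries in a list, and both of these dictionaries have a counties key that references another list of dictionaries: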

    >>> from pprint import pprint
    >>> data = [{'state': 'Florida',
    ...          'shortname': 'FL',
    ...          'info': {'governor': 'Rick Scott'},
    ...          'counties': [{'name': 'Dade', 'population': 12345},
    ...                       {'name': 'Broward', 'population': 40000},
    ...                       {'name': 'Palm Beach', 'population': 60000}]},
    ...         {'state': 'Ohio',
    ...          'shortname': 'OH',
    ...          'info': {'governor': 'John Kasich'},
    ...          'counties': [{'name': 'Summit', 'population': 1234},
    ...                       {'name': 'Cuyahoga', 'population': 1337}]}]
    >>> pprint(data[0]['counties'])
    [{'name': 'Dade', 'population': 12345},
     {'name': 'Broward', 'population': 40000},
     {'name': 'Palm Beach', 'population': 60000}]
    >>> pprint(data[1]['counties'])
    [{'name': 'Summit', 'population': 1234},
     {'name': 'Cuyahoga', 'population': 1337}]
    

    Between them, there are 5 rows of data to use in the output:

    >>> json_normalize(data, 'counties')
             name  population
    0        Dade       12345
    1     Broward       40000
    2  Palm Beach       60000
    3      Summit        1234
    4    Cuyahoga        1337
    

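    The meta argument then names some elements that live next to those counties lists, and those are merged in separately. The values of those meta elements in the first dictionary, data[0], are ('Florida', 'FL', 'Rick Scott'), and for data[1] they are ('Ohio', 'OH', 'John Kasich'), so you see those values attached to the counties rows that came from the same top-level dictionary, repeated 3 and 2 times respectively: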

    >>> data[0]['state'], data[0]['shortname'], data[0]['info']['governor']
    ('Florida', 'FL', 'Rick Scott')
    >>> data[1]['state'], data[1]['shortname'], data[1]['info']['governor']
    ('Ohio', 'OH', 'John Kasich')
    >>> json_normalize(data, 'counties', ['state', 'shortname', ['info', 'governor']])
             name  population    state shortname info.governor
    0        Dade       12345  Florida        FL    Rick Scott
    1     Broward       40000  Florida        FL    Rick Scott
    2  Palm Beach       60000  Florida        FL    Rick Scott
    3      Summit        1234     Ohio        OH   John Kasich
    4    Cuyahoga        1337     Ohio        OH   John Kasich
    

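    So, if you pass in a list for the meta argument, each element in the list is a separate path, and each of those separate paths identifies data to add to the rows in the output.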
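To make the row-plus-metadata idea concrete, here is a minimal pure-Python sketch of the same behaviour on the US States data. The helpers pull and normalize_rows are hypothetical names made up for this illustration; they are not part of pandas and only approximate what json_normalize does for a single-level record_path.

```python
def pull(obj, path):
    """Follow a single key, or a list of keys, into nested dictionaries."""
    if isinstance(path, str):
        path = [path]
    for key in path:
        obj = obj[key]
    return obj


def normalize_rows(data, record_key, meta):
    """Hypothetical stand-in for json_normalize(data, record_key, meta)."""
    rows = []
    for parent in data:
        # Resolve every meta path once per parent; nested paths get dotted names.
        meta_values = {
            (path if isinstance(path, str) else ".".join(path)): pull(parent, path)
            for path in meta
        }
        # One output row per nested record, with the parent's meta values repeated.
        for record in parent[record_key]:
            rows.append({**record, **meta_values})
    return rows


data = [
    {"state": "Florida", "shortname": "FL", "info": {"governor": "Rick Scott"},
     "counties": [{"name": "Dade", "population": 12345},
                  {"name": "Broward", "population": 40000},
                  {"name": "Palm Beach", "population": 60000}]},
    {"state": "Ohio", "shortname": "OH", "info": {"governor": "John Kasich"},
     "counties": [{"name": "Summit", "population": 1234},
                  {"name": "Cuyahoga", "population": 1337}]},
]

rows = normalize_rows(data, "counties", ["state", "shortname", ["info", "governor"]])
for row in rows:
    print(row)
```

Each county becomes one row, and the three meta values of its parent dictionary are copied onto it, which is the "join" described above.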

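    In your example JSON, there are only a few nested lists that can be elevated with the record_path argument, the way 'counties' was in the documentation example. The only one in that data structure that holds dictionaries is the nested 'authors' key; you have to extract each ['_source', 'authors'] path, after which you can add other keys from the parent object to augment those rows.

    The meta argument then pulls in the _id key from the outermost objects, followed by the nested ['_source', 'title'] and ['_source', 'journal'] paths.

    The record_path argument takes the authors lists as the starting point; these look like: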

    >>> d['hits']['hits'][0]['_source']['authors']   # this value is None, and is skipped
    >>> d['hits']['hits'][1]['_source']['authors']
    [{'affiliations': ['Punjabi University'],
      'author_id': '780E3459',
      'author_name': 'munish puri'},
     {'affiliations': ['Punjabi University'],
      'author_id': '48D92C79',
      'author_name': 'rajesh dhaliwal'},
     {'affiliations': ['Punjabi University'],
      'author_id': '7D9BD37C',
      'author_name': 'r s singh'}]
    >>> d['hits']['hits'][2]['_source']['authors']
    [{'author_id': '7FF872BC',
      'author_name': 'barbara eileen ryan'}]
    >>> # etc.
    

    and so gives you the following rows:

    >>> json_normalize(d['hits']['hits'], ['_source', 'authors'])
               affiliations author_id          author_name
    0  [Punjabi University]  780E3459          munish puri
    1  [Punjabi University]  48D92C79      rajesh dhaliwal
    2  [Punjabi University]  7D9BD37C            r s singh
    3                   NaN  7FF872BC  barbara eileen ryan
    4                   NaN  0299B8E9     fraser j harbutt
    5                   NaN  7DAB7B72   richard m freeland
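The row selection can also be written out by hand, which may help when the path has more than one level. The sketch below is only an illustration under assumptions: author_rows is a hypothetical helper, and hits is a hand-trimmed stand-in for d['hits']['hits'] that keeps just the keys needed here.

```python
def author_rows(hits, record_path=("_source", "authors")):
    """Collect the dictionaries found at record_path in every hit."""
    rows = []
    for hit in hits:
        node = hit
        # Walk down one key at a time: hit -> hit['_source'] -> ['authors'].
        for key in record_path:
            node = node[key]
        if node is None:
            # No authors recorded for this hit, so it contributes no rows.
            continue
        rows.extend(node)
    return rows


# Hand-trimmed stand-in for d['hits']['hits'] (first three records only).
hits = [
    {"_id": "7CB3F2AD", "_source": {"authors": None}},
    {"_id": "7AF8EBC3", "_source": {"authors": [
        {"affiliations": ["Punjabi University"], "author_id": "780E3459", "author_name": "munish puri"},
        {"affiliations": ["Punjabi University"], "author_id": "48D92C79", "author_name": "rajesh dhaliwal"},
        {"affiliations": ["Punjabi University"], "author_id": "7D9BD37C", "author_name": "r s singh"},
    ]}},
    {"_id": "7521A721", "_source": {"authors": [
        {"author_id": "7FF872BC", "author_name": "barbara eileen ryan"},
    ]}},
]

rows = author_rows(hits)
names = [row["author_name"] for row in rows]
print(names)
```

Rows that lack a key, such as affiliations for the last author, are what end up as NaN once the rows are placed in a DataFrame.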
    

    We can then use the meta argument (the third positional argument) to add more columns such as _id, _source.title and _source.journal, using ['_id', ['_source', 'journal'], ['_source', 'title']]:

    >>> json_normalize(
    ...     d['hits']['hits'],
    ...     ['_source', 'authors'],
    ...     ['_id', ['_source', 'journal'], ['_source', 'title']]
    ... )
               affiliations author_id          author_name       _id   \
    0  [Punjabi University]  780E3459          munish puri  7AF8EBC3  
    1  [Punjabi University]  48D92C79      rajesh dhaliwal  7AF8EBC3
    2  [Punjabi University]  7D9BD37C            r s singh  7AF8EBC3
    3                   NaN  7FF872BC  barbara eileen ryan  7521A721
    4                   NaN  0299B8E9     fraser j harbutt  7DAEB9A4
    5                   NaN  7DAB7B72   richard m freeland  7B3236C5
    
                                         _source.journal  \
    0  Journal of Industrial Microbiology & Biotechno...
    1  Journal of Industrial Microbiology & Biotechno...
    2  Journal of Industrial Microbiology & Biotechno...
    3                     The American Historical Review
    4                     The American Historical Review
    5                     The American Historical Review
    
                                           _source.title
    0  Development of a stable continuous flow immobi...
    1  Development of a stable continuous flow immobi...
    2  Development of a stable continuous flow immobi...
    3  Feminism and the women's movement : dynamics o...
    4  The iron curtain : Churchill, America, and the...
    5  The Truman Doctrine and the origins of McCarth...
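For anyone who wants to try the full call without the original file, here is a small self-contained sketch. It rests on a few assumptions that are not part of the answer above: it uses pandas.json_normalize, the top-level name available in pandas 1.0 and later; hits is a hand-trimmed stand-in for d['hits']['hits']; and hits whose authors value is None are filtered out first, in case the installed pandas version does not skip them on its own.

```python
import pandas as pd

# Hand-trimmed stand-in for d['hits']['hits']; only the keys used below are kept.
hits = [
    {"_id": "7CB3F2AD",
     "_source": {"authors": None,
                 "journal": "Physical Review Letters",
                 "title": "Evidence for a new meson: A quasinuclear NN-bar bound state"}},
    {"_id": "7AF8EBC3",
     "_source": {"authors": [
                     {"affiliations": ["Punjabi University"], "author_id": "780E3459",
                      "author_name": "munish puri"},
                     {"affiliations": ["Punjabi University"], "author_id": "48D92C79",
                      "author_name": "rajesh dhaliwal"},
                     {"affiliations": ["Punjabi University"], "author_id": "7D9BD37C",
                      "author_name": "r s singh"}],
                 "journal": "Journal of Industrial Microbiology & Biotechnology",
                 "title": "Development of a stable continuous flow immobilized enzyme "
                          "reactor for the hydrolysis of inulin"}},
    {"_id": "7521A721",
     "_source": {"authors": [
                     {"author_id": "7FF872BC", "author_name": "barbara eileen ryan"}],
                 "journal": "The American Historical Review",
                 "title": "Feminism and the women's movement : dynamics of change in "
                          "social movement ideology, and activism"}},
]

# Defensive filter: keep only hits that actually carry a list of authors.
with_authors = [hit for hit in hits if hit["_source"]["authors"]]

authors = pd.json_normalize(
    with_authors,
    record_path=["_source", "authors"],   # one row per author dictionary
    meta=["_id", ["_source", "journal"], ["_source", "title"]],  # repeated per row
)
print(authors[["author_name", "_id", "_source.journal"]])
```

The nested meta paths show up as dotted column names (_source.journal, _source.title), matching the output shown above.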
    

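    【Comments】:

    • This answer is a bit hard for me to follow. I wish the language were clearer.
    • @RodrikTheReader: I have updated my answer; hopefully it is clearer now than before.
    • I wish you had added the data to your answer to make it easier to follow. Still, this is the best explanation of json_normalize. Link to the data example mentioned above: pandas.pydata.org/pandas-docs/stable/reference/api/…
    • @iRestMyCaseYourHonor: the data used to be part of the question; unfortunately the question has changed that much since. I have now added the reference and the dictionary to my answer.
    【Answer 2】:

    You can also take a look at the flatten_json library, which does not require you to write out column hierarchies the way json_normalize does: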

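    import pandas as pd
    from flatten_json import flatten

    # d is the parsed JSON from the question (the result of json.load)
    data = d['hits']['hits']
    # flatten each hit into a single-level dict, joining nested keys with '.'
    dict_flattened = (flatten(record, '.') for record in data)
    df = pd.DataFrame(dict_flattened)
    print(df)
    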
    

    https://github.com/amirziai/flatten
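To show roughly what such flattening produces, here is a pure-Python sketch. It is not flatten_json's implementation: flatten_record is a hypothetical helper written for this illustration, and the key naming (list positions embedded in the key) is only meant to mirror the general idea.

```python
def flatten_record(value, prefix="", sep="."):
    """Hypothetical helper: collapse nested dicts and lists into one flat dict."""
    flat = {}
    if isinstance(value, dict):
        items = value.items()
    elif isinstance(value, list):
        # List elements are keyed by their position: 0, 1, 2, ...
        items = enumerate(value)
    else:
        # Leaf value: store it under the key path built so far.
        flat[prefix] = value
        return flat
    for key, child in items:
        child_prefix = f"{prefix}{sep}{key}" if prefix else str(key)
        flat.update(flatten_record(child, child_prefix, sep))
    return flat


# Hand-trimmed stand-in for one element of d['hits']['hits'].
hit = {
    "_id": "7521A721",
    "_source": {
        "authors": [{"author_id": "7FF872BC", "author_name": "barbara eileen ryan"}],
        "tag": ["social movements"],
    },
}

flat = flatten_record(hit)
for key, value in flat.items():
    print(key, "=", value)
```

Note the difference in shape: flattening yields one row per hit, with every author spread across columns such as _source.authors.0.author_name, whereas json_normalize with a record_path yields one row per author.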

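    【Comments】:

    • Thank you for this solution. It helped me cut down a lot of the code I had written with the pandas normalize function.
    • Thank you! I did not know about the flatten_json library before. This will greatly reduce my code.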