Digging down json file: answers

【Question title】: Digging down json file
【Posted】: 2018-04-29 09:17:50
【Question description】:

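I have been trying to normalize a deeply nested json file in many ways (and by going through many questions on stackoverflow). I tried .apply(pd.Series), but it does not work well for dictionaries with many levels.

I am currently trying json_normalize, which has already given some results. I think I understand how the function works; my problem is that I do not know how to navigate the dictionaries.

So far I have been able to dig down 2 levels.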

import json
import pandas as pd
from pandas.io.json import json_normalize
raw = json.load(open('authors.json'))
raw2 = json_normalize(raw['hits']['hits'])

That gives me what I need (at least for the first level), but I do not know how to go deeper.

I have tried:

raw2 = json_normalize(raw['hits']['hits'][0])
raw2 = json_normalize(raw['hits']['hits']['_source.authors'])
TypeError: string indices must be integers
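The second attempt fails because raw['hits']['hits'] is a list, and a list cannot be indexed with a string. Note also that '_source.authors' is a column name produced by json_normalize, not a key in the file. A minimal sketch of checking the type at each level before indexing (the inline sample is a trimmed, hypothetical stand-in for authors.json, and the exact TypeError text depends on what is being indexed):

```python
import json

# Trimmed, hypothetical stand-in for authors.json with the same nesting.
raw = json.loads("""
{"hits": {"hits": [
  {"_id": "7CB3F2AD",
   "_source": {"authors": [{"author_id": "166468F4",
                            "author_name": "a bowdoin van riper"}]}}
]}}
""")

hits = raw["hits"]["hits"]
print(type(hits).__name__)      # a list, so index it with an integer
print(type(hits[0]).__name__)   # a dict, so index it with a key

# Alternate keys and positions exactly as the file nests them.
authors = hits[0]["_source"]["authors"]
print(authors[0]["author_name"])

# Indexing the list with a string is what raises the TypeError.
try:
    hits["_source.authors"]
except TypeError as exc:
    error_message = str(exc)
    print("TypeError:", error_message)
```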

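And more besides, but randomly trying things without understanding them is not the right approach. I think my questions are:

How do I know how to include the next level ({} vs [] in the json)?
Is there any visual way to represent this?

Strangely, this topic is not covered much online. Day after day I work with more and more json data.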

_id _index  _score  _source.authors _source.deleted _source.description _source.doi _source.is_valid    _source.issue   _source.journal ... _source.rating_versatility_weighted _source.review_count    _source.tag _source.title   _source.userAvg _source.user_id _source.venue_name  _source.views_count _source.volume  _type   
0   7CB3F2AD    scibase_listings    1   None    0   None        1   None    Physical Review Letters ... 0   0   [mass spectra, elementary particles, bound sta...   Evidence for a new meson: A quasinuclear NN-ba...   0   None    Physical Review Letters 0   None    listing
1   7AF8EBC3    scibase_listings    1   [{'affiliations': ['Punjabi University'], 'aut...   0   None        1   None    Journal of Industrial Microbiology & Biotechno...   ... 0   0   [flow rate, operant conditioning, packed bed r...   Development of a stable continuous flow immobi...   0   None    Journal of Industrial Microbiology & Biotechno...   0   None    listing
2   7521A721    scibase_listings    1   [{'author_id': '7FF872BC', 'author_name': 'bar...   0   None        1   None    The American Historical Review  ... 0   0   [social movements]  Feminism and the women's movement : dynamics o...   0   None    The American Historical Review  0   None    listing

Here is a chunk of the file (this is level 3; levels 1 and 2 are hits and hits).

{
"_shards": {
    "failed": 0,
    "successful": 5,
    "total": 5
},
"hits": {
    "hits": [{
            "_id": "7CB3F2AD",
            "_index": "scibase_listings",
            "_type": "listing",
            "_score": 1,
            "_source": {
                "userAvg": 0,
                "meta_keywords": null,
                "views_count": 0,
                "rating_reproducability": 0,
                "rating_versatility": 0,
                "rating_innovation": 0,
                "tag": null,
                "rating_reproducibility_weighted": 0,
                "meta_description": null,
                "review_count": 0,
                "rating_avg_weighted": 0,
                "venue_name": "The American Historical Review",
                "rating_num_weighted": 0,
                "is_valid": 1,
                "normalized_venue_name": "american historical review",
                "rating_clarity": 0,
                "description": null,
                "deleted": 0,
                "journal": "The American Historical Review",
                "volume": null,
                "link": null,
                "authors": [{
                        "author_id": "166468F4",
                        "author_name": "a bowdoin van riper"
                    },
                    {
                        "author_id": "81070854",
                        "author_name": "jeffrey h schwartz"
                    }
                ],
                "user_id": null,
                "pub_date": "1994-01-01 00:00:00",
                "pages": null,
                "doi": "",
                "issue": null,
                "rating_versatility_weighted": 0,
                "pubtype": null,
                "title": "Men Among the Mammoths: Victorian Science and the Discovery of Human Prehistory",
                "rating_clarity_weighted": 0,
                "rating_innovation_weighted": 0
            }
        },
        {
            "_index": "scibase_listings",
            "_type": "listing",
            "_id": "7538108B",
            "_score": 1,
            "_source": {
                "userAvg": 0,
                "meta_keywords": null,
                "views_count": 0,
                "rating_reproducability": 0,
                "rating_versatility": 0,
                "rating_innovation": 0,
                "tag": null,
                "rating_reproducibility_weighted": 0,
                "meta_description": null,
                "review_count": 0,
                "rating_avg_weighted": 0,
                "venue_name": "The American Historical Review",
                "rating_num_weighted": 0,
                "is_valid": 1,
                "normalized_venue_name": "american historical review",
                "rating_clarity": 0,
                "description": null,
                "deleted": 0,
                "journal": "The American Historical Review",
                "volume": null,
                "link": null,
                "authors": [{
                    "affiliations": [
                        "Pennsylvania State University"
                    ],
                    "author_id": "7E15BDFA",
                    "author_name": "roger l geiger"
                }],
                "user_id": null,
                "pub_date": "2013-06-01 00:00:00",
                "pages": null,
                "doi": "10.1093/ahr/118.3.896a",
                "issue": null,
                "rating_versatility_weighted": 0,
                "pubtype": null,
                "title": "Elizabeth Popp Berman. Creating the Market University: How Academic Science Became an Economic Engine.",
                "rating_clarity_weighted": 0,
                "rating_innovation_weighted": 0
            }
        }
    ]
}

}

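【Question discussion】:

Would you mind providing a valid JSON string/file that can be parsed? Try copying it from your question and passing it to json.loads(json_string). Another useful resource is jsonlint.com, which validates JSON files online.
That one doesn't work? It is a chunk of the original.
I have changed the invalid json into valid json.

Tags: json python-3.x pandas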

【Solution 1】:

You can try this:

json_normalize(raw['hits'],'hits','_source','authors','affiliations')
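Judging from the json_normalize signature, the extra positional strings in that call would be read as meta, meta_prefix and record_prefix rather than as deeper levels. A hedged sketch of an alternative is to pass the path to the nested list as record_path and the parent fields to keep as meta. It assumes pandas 1.0 or later (where the function is available as pd.json_normalize), and the inline data is a trimmed copy of the chunk in the question:

```python
import pandas as pd

# Trimmed copy of the two hits from the question.
raw = {"hits": {"hits": [
    {"_id": "7CB3F2AD",
     "_source": {"journal": "The American Historical Review",
                 "authors": [
                     {"author_id": "166468F4", "author_name": "a bowdoin van riper"},
                     {"author_id": "81070854", "author_name": "jeffrey h schwartz"}]}},
    {"_id": "7538108B",
     "_source": {"journal": "The American Historical Review",
                 "authors": [
                     {"affiliations": ["Pennsylvania State University"],
                      "author_id": "7E15BDFA", "author_name": "roger l geiger"}]}},
]}}

# One row per author; record_path walks _source -> authors inside each hit,
# and meta copies the listed parent fields onto every author row.
authors = pd.json_normalize(
    raw["hits"]["hits"],
    record_path=["_source", "authors"],
    meta=["_id", ["_source", "journal"]],
)

# affiliations is still a list (or missing) per author; explode gives one row each.
flat = authors.explode("affiliations")
print(flat)
```

Hits whose _source has no authors key at all would likely need to be filtered out or given an empty list first; that case is not handled here.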

【Discussion】:

【Solution 2】:

I think I figured out how to "dig" through the json. It depends on whether the next level is a list or a dictionary.
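One way to see which levels are lists and which are dictionaries is to print an indented outline of the structure. The helper below is a hypothetical sketch (outline is not a pandas or json function); it marks dictionaries with {} and lists with [], and only descends into the first item of each list on the assumption that the items share a shape:

```python
def outline(node, name="root", depth=0):
    """Return indented lines describing the nesting of a parsed JSON value."""
    pad = "  " * depth
    if isinstance(node, dict):
        lines = [f"{pad}{name} {{}}"]
        for key, value in node.items():
            lines += outline(value, key, depth + 1)
    elif isinstance(node, list):
        lines = [f"{pad}{name} [] ({len(node)} items)"]
        if node:
            # Assumption: the first item is representative of the rest.
            lines += outline(node[0], "[0]", depth + 1)
    else:
        lines = [f"{pad}{name}: {type(node).__name__}"]
    return lines


# Trimmed sample with the same nesting as the file in the question.
sample = {"hits": {"hits": [
    {"_id": "7538108B",
     "_source": {"authors": [
         {"affiliations": ["Pennsylvania State University"],
          "author_id": "7E15BDFA"}]}}
]}}

structure = outline(sample)
print("\n".join(structure))
```

Every line ending in {} is indexed with a key and every line ending in [] with an integer, which gives the order of the brackets in an expression such as the one further down.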

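In my case I was able to dig all the way down to the end, as shown below. I still need to work out how to handle the full lists (probably with a loop) so that I get all the values and not just [0] or [1].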

raw['hits']['hits'][1]['_source']['authors'][0]['affiliations']
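The loop mentioned above could look roughly like this: one for per list level, with plain key lookups for the dictionaries in between. This is a sketch on a trimmed copy of the data in the question; the row layout (id, author name, affiliation) is an arbitrary choice for illustration:

```python
# Trimmed copy of the two hits from the question.
raw = {"hits": {"hits": [
    {"_id": "7CB3F2AD",
     "_source": {"authors": [
         {"author_id": "166468F4", "author_name": "a bowdoin van riper"},
         {"author_id": "81070854", "author_name": "jeffrey h schwartz"}]}},
    {"_id": "7538108B",
     "_source": {"authors": [
         {"affiliations": ["Pennsylvania State University"],
          "author_id": "7E15BDFA", "author_name": "roger l geiger"}]}},
]}}

rows = []
for hit in raw["hits"]["hits"]:                       # list of hits
    # "or []" covers hits where authors is missing or null
    for author in hit["_source"].get("authors") or []:
        # authors without affiliations still get one row, with None
        for affiliation in author.get("affiliations") or [None]:
            rows.append((hit["_id"], author["author_name"], affiliation))

for row in rows:
    print(row)
```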

【Discussion】: