【Question title】: Digging down json file
【Posted】: 2018-04-29 09:17:50
【Question】:

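I have been trying to normalize a deeply nested JSON file in several ways (and by going through many Stack Overflow questions). I tried .apply(pd.Series), but it does not cope well with dictionaries nested many levels deep.

I am currently trying json_normalize, which has already given some results. I think I understand how the function works; my problem is that I do not know how to navigate the nested dictionaries.

So far I have been able to dig down 2 levels.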

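import json
import pandas as pd
from pandas.io.json import json_normalize  # on pandas >= 1.0, use pd.json_normalize instead

# Load the file, closing it afterwards, and flatten the list of hits (one row per hit)
with open('authors.json') as f:
    raw = json.load(f)
raw2 = json_normalize(raw['hits']['hits'])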

It gives me what I need (at least for the first level), but I do not know how to go deeper.

I have tried:

raw2 = json_normalize(raw['hits']['hits'][0])
raw2 = json_normalize(raw['hits']['hits']['_source.authors'])
TypeError: string indices must be integers

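And many more, but trying things at random without understanding them is not the right approach. I guess my questions are:

  • How do I know how to include the next level ({} and [] in the JSON)?
  • Is there any visual way to represent this?

It is strange that this topic is not covered more online. Day after day, I work with JSON data more and more.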

_id _index  _score  _source.authors _source.deleted _source.description _source.doi _source.is_valid    _source.issue   _source.journal ... _source.rating_versatility_weighted _source.review_count    _source.tag _source.title   _source.userAvg _source.user_id _source.venue_name  _source.views_count _source.volume  _type   
0   7CB3F2AD    scibase_listings    1   None    0   None        1   None    Physical Review Letters ... 0   0   [mass spectra, elementary particles, bound sta...   Evidence for a new meson: A quasinuclear NN-ba...   0   None    Physical Review Letters 0   None    listing
1   7AF8EBC3    scibase_listings    1   [{'affiliations': ['Punjabi University'], 'aut...   0   None        1   None    Journal of Industrial Microbiology & Biotechno...   ... 0   0   [flow rate, operant conditioning, packed bed r...   Development of a stable continuous flow immobi...   0   None    Journal of Industrial Microbiology & Biotechno...   0   None    listing
2   7521A721    scibase_listings    1   [{'author_id': '7FF872BC', 'author_name': 'bar...   0   None        1   None    The American Historical Review  ... 0   0   [social movements]  Feminism and the women's movement : dynamics o...   0   None    The American Historical Review  0   None    listing

Here is a chunk of the file (this is level 3; levels 1 and 2 are hits and hits).

{
"_shards": {
    "failed": 0,
    "successful": 5,
    "total": 5
},
"hits": {
    "hits": [{
            "_id": "7CB3F2AD",
            "_index": "scibase_listings",
            "_type": "listing",
            "_score": 1,
            "_source": {
                "userAvg": 0,
                "meta_keywords": null,
                "views_count": 0,
                "rating_reproducability": 0,
                "rating_versatility": 0,
                "rating_innovation": 0,
                "tag": null,
                "rating_reproducibility_weighted": 0,
                "meta_description": null,
                "review_count": 0,
                "rating_avg_weighted": 0,
                "venue_name": "The American Historical Review",
                "rating_num_weighted": 0,
                "is_valid": 1,
                "normalized_venue_name": "american historical review",
                "rating_clarity": 0,
                "description": null,
                "deleted": 0,
                "journal": "The American Historical Review",
                "volume": null,
                "link": null,
                "authors": [{
                        "author_id": "166468F4",
                        "author_name": "a bowdoin van riper"
                    },
                    {
                        "author_id": "81070854",
                        "author_name": "jeffrey h schwartz"
                    }
                ],
                "user_id": null,
                "pub_date": "1994-01-01 00:00:00",
                "pages": null,
                "doi": "",
                "issue": null,
                "rating_versatility_weighted": 0,
                "pubtype": null,
                "title": "Men Among the Mammoths: Victorian Science and the Discovery of Human Prehistory",
                "rating_clarity_weighted": 0,
                "rating_innovation_weighted": 0
            }
        },
        {
            "_index": "scibase_listings",
            "_type": "listing",
            "_id": "7538108B",
            "_score": 1,
            "_source": {
                "userAvg": 0,
                "meta_keywords": null,
                "views_count": 0,
                "rating_reproducability": 0,
                "rating_versatility": 0,
                "rating_innovation": 0,
                "tag": null,
                "rating_reproducibility_weighted": 0,
                "meta_description": null,
                "review_count": 0,
                "rating_avg_weighted": 0,
                "venue_name": "The American Historical Review",
                "rating_num_weighted": 0,
                "is_valid": 1,
                "normalized_venue_name": "american historical review",
                "rating_clarity": 0,
                "description": null,
                "deleted": 0,
                "journal": "The American Historical Review",
                "volume": null,
                "link": null,
                "authors": [{
                    "affiliations": [
                        "Pennsylvania State University"
                    ],
                    "author_id": "7E15BDFA",
                    "author_name": "roger l geiger"
                }],
                "user_id": null,
                "pub_date": "2013-06-01 00:00:00",
                "pages": null,
                "doi": "10.1093/ahr/118.3.896a",
                "issue": null,
                "rating_versatility_weighted": 0,
                "pubtype": null,
                "title": "Elizabeth Popp Berman. Creating the Market University: How Academic Science Became an Economic Engine.",
                "rating_clarity_weighted": 0,
                "rating_innovation_weighted": 0
            }
        }
    ]
}

}

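【Comments】:

  • Would you mind providing a valid JSON string/file that can be parsed? Try copying it from your question and passing it to json.loads(json_string). Another useful resource is jsonlint.com, which validates JSON online.
  • Doesn't that one work? It is a chunk of the original.
  • I have changed the invalid JSON to valid JSON.

Tags: json python-3.x pandas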


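【Solution 1】:

You can try this. It uses record_path to descend into the authors list of each hit (one row per author) and meta to keep each hit's _id next to its authors. Hits whose authors field is null may need to be filtered out first, depending on the pandas version.

json_normalize(raw['hits']['hits'], record_path=['_source', 'authors'], meta=['_id'])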
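As a hedged illustration of how the record_path and meta arguments of json_normalize can reach the nested authors list, here is a minimal sketch. It assumes pandas 1.0 or later (where the function is exposed as pd.json_normalize) and uses a hypothetical, shortened version of the question's data (titles abbreviated, most fields dropped).

```python
import pandas as pd

# Hypothetical miniature of the question's file: two hits, each with an authors list.
raw = {
    "hits": {
        "hits": [
            {"_id": "7CB3F2AD",
             "_source": {"title": "Men Among the Mammoths",
                         "authors": [
                             {"author_id": "166468F4", "author_name": "a bowdoin van riper"},
                             {"author_id": "81070854", "author_name": "jeffrey h schwartz"}]}},
            {"_id": "7538108B",
             "_source": {"title": "Creating the Market University",
                         "authors": [
                             {"affiliations": ["Pennsylvania State University"],
                              "author_id": "7E15BDFA", "author_name": "roger l geiger"}]}},
        ]
    }
}

# record_path: the chain of keys leading to the list that should become rows.
# meta: fields from the enclosing levels to repeat on every row.
authors = pd.json_normalize(
    raw["hits"]["hits"],
    record_path=["_source", "authors"],
    meta=["_id", ["_source", "title"]],
)
print(authors)
```

Each author becomes one row, and the hit-level _id and _source.title are repeated alongside it. The affiliations column still holds lists (or NaN where an author has none), so a further step would be needed to flatten that level.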


【Solution 2】:

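I think I figured out how to "dig" through the JSON. How you take the next step depends on whether the next level is a list or a dictionary: a list is indexed by position (for example [0]), a dictionary by key (for example ['_source']).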
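One way to see which levels are lists and which are dictionaries is to print an indented outline of the parsed object. The helper below is only a sketch: outline is a hypothetical name, not a pandas or json function, and it inspects only the first element of each list on the assumption that the items share a shape.

```python
def outline(node, name="root", depth=0, lines=None):
    """Collect one line per key, marking dicts with {} and lists with []."""
    if lines is None:
        lines = []
    pad = "  " * depth
    if isinstance(node, dict):
        lines.append(f"{pad}{name} {{}}")
        for key, value in node.items():
            outline(value, key, depth + 1, lines)
    elif isinstance(node, list):
        lines.append(f"{pad}{name} [] ({len(node)} items)")
        if node:
            # Only the first element is inspected (assumes uniform items).
            outline(node[0], "[0]", depth + 1, lines)
    else:
        lines.append(f"{pad}{name}: {type(node).__name__}")
    return lines


# Hypothetical, heavily shortened version of the question's data.
sample = {
    "hits": {
        "hits": [
            {"_id": "7CB3F2AD",
             "_source": {"authors": [{"author_id": "166468F4"}]}}
        ]
    }
}
tree = outline(sample)
print("\n".join(tree))
```

Reading the outline top to bottom gives the access path: a {} line means "use a key next", a [] line means "use an index or a loop next".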

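In my case I was able to dig all the way down to the end with the expression below. I still need to work out how to handle the full lists (probably with a loop) so that I get all the values, not just those at [0] and [1].

raw['hits']['hits'][1]['_source']['authors'][0]['affiliations']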
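The loop mentioned above could look roughly like this: every list level becomes a for loop and every dictionary level a key lookup. This is a sketch under assumptions, not a definitive implementation. collect_affiliations is a hypothetical helper name, the sample data is a shortened version of the question's file, and the third hit is invented to show how a null authors field is skipped.

```python
def collect_affiliations(raw):
    """Return (hit_id, author_name, affiliation) for every affiliation found."""
    rows = []
    for hit in raw["hits"]["hits"]:                 # list level: iterate
        source = hit.get("_source") or {}           # dict level: look up by key
        for author in source.get("authors") or []:  # authors may be missing or null
            for affiliation in author.get("affiliations") or []:
                rows.append((hit["_id"], author.get("author_name"), affiliation))
    return rows


raw = {
    "hits": {
        "hits": [
            {"_id": "7CB3F2AD", "_source": {"authors": [
                {"author_id": "166468F4", "author_name": "a bowdoin van riper"}]}},
            {"_id": "7538108B", "_source": {"authors": [
                {"affiliations": ["Pennsylvania State University"],
                 "author_id": "7E15BDFA", "author_name": "roger l geiger"}]}},
            # Hypothetical hit with a null authors field.
            {"_id": "EXAMPLE00", "_source": {"authors": None}},
        ]
    }
}
rows = collect_affiliations(raw)
print(rows)
```

Authors without an affiliations key contribute no rows here; whether to keep them with an empty value instead is a design choice that depends on what the resulting table is for.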
    
