【Question title】: How to access data and handle missing data in dictionaries within a DataFrame
【Posted】: 2022-08-05 09:43:43
【Question description】:

Given the following df:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "Col1": [1, 2, 3],
        "Person": [
            {
                "ID": 10001,
                "Data": {
                    "Address": {
                        "Street": "1234 Street A",
                        "City": "Houston",
                        "State": "Texas",
                        "Zip": "77002",
                    }
                },
                "Age": 30,
                "Income": 50000,
            },
            {
                "ID": 10002,
                "Data": {
                    "Address": {
                        "Street": "7892 Street A",
                        "City": "Greenville",
                        "State": "Maine",
                        "Zip": np.nan,
                    }
                },
                "Age": np.nan,
                "Income": 63000,
            },
            {"ID": 10003, "Data": {"Address": np.nan}, "Age": 56, "Income": 85000},
        ],
    },
)

Input dataframe:

   Col1                                             Person
0     1  {'ID': 10001, 'Data': {'Address': {'Street': '...
1     2  {'ID': 10002, 'Data': {'Address': {'Street': '...
2     3  {'ID': 10003, 'Data': {'Address': nan}, 'Age':...

My expected output dataframe is df[['Col1', 'Income', 'Age', 'Street', 'Zip']], where Income, Age, Street, and Zip come from within Person:

   Col1  Income   Age         Street    Zip
0     1   50000  30.0  1234 Street A  77002
1     2   63000   NaN  7892 Street A    nan
2     3   85000  56.0            NaN    nan

    Tags: python-3.x pandas dataframe


    【Solution 1】:

    Using list comprehensions, we can create most of these columns.

    df['Income'] = [x.get('Income') for x in df['Person']]
    df['Age'] = [x.get('Age') for x in df['Person']]
    df['Age']
    

    Output:

    0    30.0
    1     NaN
    2    56.0
    Name: Age, dtype: float64
    

    However, handling np.nan values inside nested dictionaries is a real pain. Let's look at getting data from a nested dictionary where one of the values is nan.

    df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
    

    We get an AttributeError, because np.nan is a float and has no .get method:

    ---------------------------------------------------------------------------
    AttributeError                            Traceback (most recent call last)
    <ipython-input-80-cc2f92bfe95d> in <module>
          1 #However, let's look at getting data rom a nested dictionary where one of the values is nan.
          2 
    ----> 3 df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
          4 
          5 #We get and AttributeError because NoneType object has no get method
    
    <ipython-input-80-cc2f92bfe95d> in <listcomp>(.0)
          1 #However, let's look at getting data rom a nested dictionary where one of the values is nan.
          2 
    ----> 3 df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
          4 
          5 #We get and AttributeError because NoneType object has no get method
    
    AttributeError: 'float' object has no attribute 'get'
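
The failure comes from chaining .get on a value that may be np.nan instead of a dict. One way to guard against it in plain Python is sketched below. The helper name `deep_get` and the two sample records are hypothetical and not part of the original answer; this is only a minimal illustration of checking the type at each level before descending.

```python
import numpy as np
import pandas as pd


def deep_get(obj, *keys, default=np.nan):
    """Hypothetical helper: walk nested dicts key by key.

    Returns `default` as soon as a non-dict value (such as np.nan)
    is reached before all keys have been consumed.
    """
    for key in keys:
        if not isinstance(obj, dict):
            return default
        obj = obj.get(key, default)
    return obj


# Two illustrative records: one complete, one with Address set to NaN.
people = pd.Series(
    [
        {"Data": {"Address": {"Street": "1234 Street A"}}},
        {"Data": {"Address": np.nan}},
    ]
)

streets = [deep_get(p, "Data", "Address", "Street") for p in people]
```

With the question's data, the same call could be used as `[deep_get(x, 'Data', 'Address', 'Street') for x in df['Person']]`.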
    

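    Let's use the .str accessor with dictionary keys to fetch this data.
    There is little documentation in pandas showing how to use .str.get or .str[] to fetch values from dictionary objects in a dataframe column / pandas Series.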

    df['Street'] = df['Person'].str['Data'].str['Address'].str['Street']
    

    Output:

    0    1234 Street A
    1    7892 Street A
    2              NaN
    Name: Street, dtype: object
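
As a small illustration of why this works, the sketch below applies both .str[] and .str.get to a toy Series of dictionaries. The Series `s` is made up for this example. Elements that are NaN are skipped by the accessor and stay missing, and a dictionary that lacks the key also yields a missing value rather than raising.

```python
import numpy as np
import pandas as pd

# Toy Series: a dict with the key, a NaN, and a dict without the key.
s = pd.Series([{"City": "Houston"}, np.nan, {"State": "Maine"}])

by_index = s.str["City"]     # indexing form
by_get = s.str.get("City")   # method form, same lookup

# Both forms return "Houston" for the first element and a missing
# value for the other two.
print(by_index.isna().tolist())
```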
    

    And likewise for Zip:

    df['Zip'] = df['Person'].str['Data'].str['Address'].str['Zip']
    

    This leaves us with the columns needed to build the desired dataframe df[['Col1', 'Income', 'Age', 'Street', 'Zip']] from the dictionaries.

    Output:

       Col1  Income   Age         Street    Zip
    0     1   50000  30.0  1234 Street A  77002
    1     2   63000   NaN  7892 Street A    NaN
    2     3   85000  56.0            NaN    NaN
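
A different route, not used in the answer above, is to flatten the dictionaries with pd.json_normalize and then pick the columns of interest. The sketch below assumes a shortened version of the question's data and relies on the default "." separator for nested keys; the variable names `flat` and `result` are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Col1": [1, 2, 3],
        "Person": [
            {"ID": 10001, "Age": 30, "Income": 50000,
             "Data": {"Address": {"Street": "1234 Street A", "Zip": "77002"}}},
            {"ID": 10002, "Age": np.nan, "Income": 63000,
             "Data": {"Address": {"Street": "7892 Street A", "Zip": np.nan}}},
            {"ID": 10003, "Age": 56, "Income": 85000,
             "Data": {"Address": np.nan}},
        ],
    }
)

# Flatten each dict; nested keys become dotted column names such as
# "Data.Address.Street". A NaN in place of a nested dict simply leaves
# the deeper columns missing for that row.
flat = pd.json_normalize(df["Person"].tolist())

result = pd.DataFrame(
    {
        "Col1": df["Col1"],
        "Income": flat["Income"],
        "Age": flat["Age"],
        "Street": flat["Data.Address.Street"],
        "Zip": flat["Data.Address.Zip"],
    }
)
```

This trades the explicit per-column lookups for one flattening step, which may be more convenient when many nested fields are needed.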
    
