How to access data and handle missing data in dictionaries within a pandas DataFrame?

【Question title】: How to access data and handle missing data in dictionaries within a DataFrame
【Posted】: 2022-08-05 09:43:43
【Question】:

Given the following df:

import pandas as pd
import numpy as np

df = pd.DataFrame(
    {
        "Col1": [1, 2, 3],
        "Person": [
            {
                "ID": 10001,
                "Data": {
                    "Address": {
                        "Street": "1234 Street A",
                        "City": "Houston",
                        "State": "Texas",
                        "Zip": "77002",
                    }
                },
                "Age": 30,
                "Income": 50000,
            },
            {
                "ID": 10002,
                "Data": {
                    "Address": {
                        "Street": "7892 Street A",
                        "City": "Greenville",
                        "State": "Maine",
                        "Zip": np.nan,
                    }
                },
                "Age": np.nan,
                "Income": 63000,
            },
            {"ID": 10003, "Data": {"Address": np.nan}, "Age": 56, "Income": 85000},
        ],
    },
)

Input DataFrame:

   Col1                                             Person
0     1  {'ID': 10001, 'Data': {'Address': {'Street': '...
1     2  {'ID': 10002, 'Data': {'Address': {'Street': '...
2     3  {'ID': 10003, 'Data': {'Address': nan}, 'Age':...

My expected output DataFrame is df[['Col1', 'Income', 'Age', 'Street', 'Zip']], where Income, Age, Street, and Zip come from inside Person:

   Col1  Income   Age         Street    Zip
0     1   50000  30.0  1234 Street A  77002
1     2   63000   NaN  7892 Street A    nan
2     3   85000  56.0            NaN    nan

Tags: python-3.x pandas dataframe

【Solution 1】:

Using list comprehensions, we can create most of these columns.

df['Income'] = [x.get('Income') for x in df['Person']]
df['Age'] = [x.get('Age') for x in df['Person']]
df['Age']

Output:

0    30.0
1     NaN
2    56.0
Name: Age, dtype: float64

However, handling np.nan values inside nested dictionaries is a real pain. Let's look at getting data from a nested dictionary where one of the values is nan.

df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]

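We get an AttributeError, because in the third row 'Address' holds np.nan (a float) rather than a dict, so .get is called on a float: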

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-80-cc2f92bfe95d> in <module>
      1 #However, let's look at getting data rom a nested dictionary where one of the values is nan.
      2 
----> 3 df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
      4 
      5 #We get and AttributeError because NoneType object has no get method

<ipython-input-80-cc2f92bfe95d> in <listcomp>(.0)
      1 #However, let's look at getting data rom a nested dictionary where one of the values is nan.
      2 
----> 3 df['Street'] = [x.get('Data').get('Address').get('Street') for x in df['Person']]
      4 
      5 #We get and AttributeError because NoneType object has no get method

AttributeError: 'float' object has no attribute 'get'
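One way to keep the list-comprehension style while tolerating the nan is a small helper that stops descending as soon as it meets a non-dict value. This is only a sketch: safe_get is a hypothetical name, not a pandas or standard-library function, and the two-row sample data is abbreviated from the question.

```python
import numpy as np


def safe_get(record, *keys, default=np.nan):
    """Walk nested dicts key by key; return `default` if any level is not a dict."""
    current = record
    for key in keys:
        if not isinstance(current, dict):
            # e.g. 'Address' is np.nan (a float), so we cannot descend further
            return default
        current = current.get(key, default)
    return current


# Abbreviated sample data (assumption: same shape as the 'Person' column)
people = [
    {"Data": {"Address": {"Street": "1234 Street A"}}},
    {"Data": {"Address": np.nan}},
]
streets = [safe_get(p, "Data", "Address", "Street") for p in people]
```

With the question's data this would be used as df['Street'] = [safe_get(x, 'Data', 'Address', 'Street') for x in df['Person']].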

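Let's use the .str accessor with dictionary keys to get this data instead. Elements that are NaN are passed through as NaN rather than raising.
There is very little documentation in pandas showing how to use .str.get or .str[] to fetch values from dictionary objects in a DataFrame column or pandas Series.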

df['Street'] = df['Person'].str['Data'].str['Address'].str['Street']

Output:

0    1234 Street A
1    7892 Street A
2              NaN
Name: Street, dtype: object

And similarly for Zip:

df['Zip'] = df['Person'].str['Data'].str['Address'].str['Zip']
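Since both chains repeat the same pattern, they could be folded into a small helper that applies .str.get once per key. This is a sketch under assumptions: extract_path is a hypothetical name, and the sample Series is abbreviated from the question's 'Person' column.

```python
from functools import reduce

import numpy as np
import pandas as pd


def extract_path(series, *keys):
    """Apply .str.get(key) once per key, drilling into nested dicts."""
    return reduce(lambda s, key: s.str.get(key), keys, series)


# Abbreviated sample data (assumption: same shape as df['Person'])
people = pd.Series(
    [
        {"Data": {"Address": {"Street": "1234 Street A"}}},
        {"Data": {"Address": np.nan}},
    ]
)
streets = extract_path(people, "Data", "Address", "Street")
```

Applied to the question's data, extract_path(df['Person'], 'Data', 'Address', 'Zip') is equivalent to the chained .str[] call above.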

That leaves us with all the columns needed to build the desired DataFrame, df[['Col1', 'Income', 'Age', 'Street', 'Zip']], from the dictionaries.

Output:

   Col1  Income   Age         Street    Zip
0     1   50000  30.0  1234 Street A  77002
1     2   63000   NaN  7892 Street A    NaN
2     3   85000  56.0            NaN    NaN
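As an alternative to the .str chains, pd.json_normalize can flatten the nested dictionaries in one call. This is a sketch, not part of the original answer: the names flat, wanted, and result are illustrative, and the data is abbreviated from the question's df (City and State omitted).

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "Col1": [1, 2, 3],
        "Person": [
            {"ID": 10001, "Age": 30, "Income": 50000,
             "Data": {"Address": {"Street": "1234 Street A", "Zip": "77002"}}},
            {"ID": 10002, "Age": np.nan, "Income": 63000,
             "Data": {"Address": {"Street": "7892 Street A", "Zip": np.nan}}},
            {"ID": 10003, "Age": 56, "Income": 85000,
             "Data": {"Address": np.nan}},
        ],
    }
)

# Flatten each dict; nested keys become dotted column names
# such as "Data.Address.Street".
flat = pd.json_normalize(df["Person"].tolist())
flat.index = df.index  # align with df in case its index is not 0..n-1

# Keep only the fields of interest and give them short names.
wanted = {
    "Income": "Income",
    "Age": "Age",
    "Data.Address.Street": "Street",
    "Data.Address.Zip": "Zip",
}
result = df[["Col1"]].join(flat[list(wanted)].rename(columns=wanted))
```

Rows whose 'Address' is nan simply end up with missing Street and Zip values, so no special-casing is needed.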

【Discussion】: