【Question title】: Fill missing values in one column based on another column
【Posted】: 2022-01-18 16:34:47
【Question】:

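I have two columns like this:

What I want to do is: if the 'age' value is between 30 and 39, fill the missing age_band value with 30. Likewise, if the 'age' value is between 80 and 89, fill the missing age_band value with 80. How can I do this in a pandas DataFrame?

I tried the following, but the loop just keeps running: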

for ages in data['age']:
    if 0<=ages<=9:
        data['age_band']= data['age_band'].fillna(0)
    elif 10<=ages<=19:
        data['age_band']= data['age_band'].fillna(10)
    elif 20<=ages<=29:
        data['age_band']= data['age_band'].fillna(20)
    elif 30<=ages<=39:
        data['age_band']= data['age_band'].fillna(30)
    elif 40<=ages<=49:
        data['age_band']= data['age_band'].fillna(40)
    elif 50<=ages<=59:
        data['age_band']= data['age_band'].fillna(50)
    elif 60<=ages<=69:
        data['age_band']= data['age_band'].fillna(60)
    elif 70<=ages<=79:
        data['age_band']= data['age_band'].fillna(70)
    elif 80<=ages<=89:
        data['age_band']= data['age_band'].fillna(80)
    elif 90<=ages<=99:
        data['age_band']= data['age_band'].fillna(90)
    elif 100<=ages<=109:
        data['age_band']= data['age_band'].fillna(100)

Please help me.

【Comments on the question】:

    Tags: python pandas dataframe data-cleaning missing-data


    【Solution 1】:

    Try this shortcut:

    data['age_band'] = data['age_band'].fillna(data['age'] // 10 * 10).astype(int)
    print(data)
    
    # Output
       age  age_band
    0   93        90
    1   46        40
    2   50        50
    3   56        50
    4   89        80
    5   19        10
    6   25        20
    7   17        10
    8   54        50
    9   42        40
    

    Setup:

    import pandas as pd
    import numpy as np
    
    np.random.seed(2022)
    data = pd.DataFrame({'age': np.random.randint(1, 111, 10), 'age_band': np.nan})
    print(data)
    
    # Output
       age  age_band
    0   93       NaN
    1   46       NaN
    2   50       NaN
    3   56       NaN
    4   89       NaN
    5   19       NaN
    6   25       NaN
    7   17       NaN
    8   54       NaN
    9   42       NaN
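
As a minimal sketch of why the shortcut above fills only the gaps (the sample values here are made up for illustration): `fillna` accepts a Series and aligns it on the index, so rows that already have an age_band are left untouched. The loop in the question instead passes a single scalar to `fillna`, which fills every missing row with the same band on the first matching iteration.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: some age_band values are present, some are missing.
data = pd.DataFrame({
    "age": [93, 46, 50, 7, 105],
    "age_band": [90, np.nan, 50, np.nan, np.nan],
})

# Floor-divide by 10, then multiply by 10, to get the lower edge of each decade.
computed_band = data["age"] // 10 * 10

# fillna with a Series aligns on the index, so only the NaN rows are replaced.
data["age_band"] = data["age_band"].fillna(computed_band).astype(int)

print(data)
```

Rows 0 and 2 keep their existing bands, and the three missing rows receive the band computed from their own age.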
    

    【Comments】:

    • Thanks, this works. I feel like a fool now.
    • You shouldn't! Glad to help. If this fits your needs, please consider accepting my answer :)
    【Solution 2】:

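    The answer above only works when the age bins all have the same width. You can try pd.cut, which also handles bins of unequal width.

    You can also pass labels to pd.cut(). The example below sorts ages into bands such as 0-9 and 10-19 and writes the result to the age_band column.

    bins lists the interval edges: 0-9 is one interval, 10-19 is the next, and so on. The corresponding labels are "0-9", "10-19", etc.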

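    # pd.cut needs one more bin edge than labels; the final inf edge backs the ">109" label
    bins = [0, 9, 19, 29, 39, 49, 59, 69, 79, 89, 99, 109, float("inf")]
    labels = ["0-9","10-19","20-29","30-39","40-49","50-59","60-69","70-79","80-89","90-99","100-109",">109"]
    # include_lowest=True puts age 0 in the first bin, which would otherwise be (0, 9]
    data['age_band'] = pd.cut(data['age'], bins=bins, labels=labels, include_lowest=True)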
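
Note that the pd.cut snippet above overwrites the whole age_band column with string labels. If the goal is to fill only the missing rows and keep numeric bands, one possible sketch is shown here. The bin edges, the numeric labels, and the sample values are all hypothetical and chosen only to show unequal-width bins.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with some age_band values already present.
data = pd.DataFrame({
    "age": [0, 17, 34, 65, 89],
    "age_band": [np.nan, 0, np.nan, 65, np.nan],
})

# Hypothetical unequal-width bands: [0, 18), [18, 65), [65, 120).
edges = [0, 18, 65, 120]
band_starts = [0, 18, 65]

# right=False closes each interval on the left, so age 65 falls in the 65 band.
computed = pd.cut(data["age"], bins=edges, labels=band_starts, right=False)

# pd.cut returns a categorical; convert it to float before using it as fill values.
data["age_band"] = data["age_band"].fillna(computed.astype(float)).astype(int)

print(data)
```

Using the lower edge of each bin as its label keeps the column numeric, which matches the 0, 10, 20, ... bands asked for in the question.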
    

    【Comments】:
