[Question title]: Python 3 Pandas: combining or merging columns with similar data
[Posted]: 2014-05-10 20:12:29
[Question description]:

I have a DataFrame, and I am trying to update the Sex column using the Gender column.

import pandas as pd
import numpy as np

df=pd.DataFrame({'Users': [ 'Al Gore', 'Ned Flonders', 'Kim jong un', 'Al Sharpton', 'Michele', 'Richard Johnson', 'Taylor Swift', 'Alf pig', 'Dick Johnson', 'Dana Jovy'],
                 'Gender': [np.nan,'Male','Male','Male',np.nan,np.nan, 'Female',np.nan,'Male','Female'],
                 'Sex': ['M',np.nan,np.nan,'M','F',np.nan, 'F',np.nan,np.nan,'F']})

Output:

>>> 
   Gender  Sex            Users
0     NaN    M          Al Gore
1    Male  NaN     Ned Flonders
2    Male  NaN      Kim jong un
3    Male    M      Al Sharpton
4     NaN    F          Michele
5     NaN  NaN  Richard Johnson
6  Female    F     Taylor Swift
7     NaN  NaN          Alf pig
8    Male  NaN     Dick Johnson
9  Female    F        Dana Jovy

[10 rows x 3 columns]

So if the Gender column says Male, the Sex column should show M.

Here is what I have tried so far:

df['Sex2']=(df.Gender.isin(['Male']).map({True:'M',False:''}) +
                df.Sex.isin(['M']).map({True:'M',False:''}) +
                df.Sex.isin(['F']).map({True:'F',False:''})+
                df.Gender.isin(['Female']).map({True:'F',False:''}))

print(df)

Output:

[10 rows x 3 columns]
   Gender  Sex            Users Sex2
0     NaN    M          Al Gore    M
1    Male  NaN     Ned Flonders    M
2    Male  NaN      Kim jong un    M
3    Male    M      Al Sharpton   MM
4     NaN    F          Michele    F
5     NaN  NaN  Richard Johnson     
6  Female    F     Taylor Swift   FF
7     NaN  NaN          Alf pig     
8    Male  NaN     Dick Johnson    M
9  Female    F        Dana Jovy   FF

[10 rows x 4 columns]

I almost have it, but this is probably not efficient.

Here is the output I want:

>>> 
   Gender  Sex            Users
0     NaN    M          Al Gore
1    Male    M     Ned Flonders
2    Male    M      Kim jong un
3    Male    M      Al Sharpton
4     NaN    F          Michele
5     NaN  NaN  Richard Johnson
6  Female    F     Taylor Swift
7     NaN  NaN          Alf pig
8    Male    M     Dick Johnson
9  Female    F        Dana Jovy

[10 rows x 3 columns]

Is it possible to do this with some merge or update function?

[Question comments]:

    Tags: python merge pandas


    [Solution 1]:

    Use map:

    In [14]:
    
    import pandas as pd
    import numpy as np
    
    df=pd.DataFrame({'Users': [ 'Al Gore', 'Ned Flonders', 'Kim jong un', 'Al Sharpton', 'Michele', 'Richard Johnson', 'Taylor Swift', 'Alf pig', 'Dick Johnson', 'Dana Jovy'],
                     'Gender': [np.nan,'Male','Male','Male',np.nan,np.nan, 'Female',np.nan,'Male','Female'],
                     'Sex': ['M',np.nan,np.nan,'M','F',np.nan, 'F',np.nan,np.nan,'F']})
    
    In [15]:
    
    df
    
    Out[15]:
    
       Gender  Sex            Users
    0     NaN    M          Al Gore
    1    Male  NaN     Ned Flonders
    2    Male  NaN      Kim jong un
    3    Male    M      Al Sharpton
    4     NaN    F          Michele
    5     NaN  NaN  Richard Johnson
    6  Female    F     Taylor Swift
    7     NaN  NaN          Alf pig
    8    Male  NaN     Dick Johnson
    9  Female    F        Dana Jovy
    
    [10 rows x 3 columns]
    
    In [16]:
    
    # create a sex dict
    sex_map = {'Male':'M', 'Female':'F'}
    # update only those where sex is NaN, apply map to gender to fill in values
    df.loc[df.Sex.isnull(),'Sex'] = df['Gender'].map(sex_map)
    df
    
    Out[16]:
    
       Gender  Sex            Users
    0     NaN    M          Al Gore
    1    Male    M     Ned Flonders
    2    Male    M      Kim jong un
    3    Male    M      Al Sharpton
    4     NaN    F          Michele
    5     NaN  NaN  Richard Johnson
    6  Female    F     Taylor Swift
    7     NaN  NaN          Alf pig
    8    Male    M     Dick Johnson
    9  Female    F        Dana Jovy
    
    [10 rows x 3 columns]
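
The masked assignment above can also be written with `fillna`, which only touches rows where Sex is missing. This is a minimal sketch of that alternative, not part of the original answer; it reuses the same data and the same `sex_map`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Users': ['Al Gore', 'Ned Flonders', 'Kim jong un', 'Al Sharpton',
              'Michele', 'Richard Johnson', 'Taylor Swift', 'Alf pig',
              'Dick Johnson', 'Dana Jovy'],
    'Gender': [np.nan, 'Male', 'Male', 'Male', np.nan, np.nan,
               'Female', np.nan, 'Male', 'Female'],
    'Sex': ['M', np.nan, np.nan, 'M', 'F', np.nan,
            'F', np.nan, np.nan, 'F'],
})

sex_map = {'Male': 'M', 'Female': 'F'}

# Translate Gender to the short codes, then use the result only where
# Sex is NaN. Existing Sex values are left untouched.
df['Sex'] = df['Sex'].fillna(df['Gender'].map(sex_map))
print(df)
```

Rows 5 and 7 stay NaN because neither column has a value for them.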
    

    Comparing performance:

    In [21]:
    %timeit df['Sex2']=(df.Gender.isin(['Male']).map({True:'M',False:''}) + df.Sex.isin(['M']).map({True:'M',False:''}) + df.Sex.isin(['F']).map({True:'F',False:''}) + df.Gender.isin(['Female']).map({True:'F',False:''}))
    
    100 loops, best of 3: 2.38 ms per loop
    
    In [24]:
    %timeit df.loc[df.Sex.isnull(),'Sex'] = df['Gender'].map(sex_map)
    
    1000 loops, best of 3: 1.21 ms per loop
    
    In [27]:
    # without the NaN mask, which is similar to what you are doing
    # (note: this overwrites every Sex value, including ones already set)
    %timeit df['Sex'] = df['Gender'].map(sex_map)
    
    1000 loops, best of 3: 531 µs per loop
    

    So on this small sample it is faster, and on a larger DataFrame it should be faster still, as it uses Cython.
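
The timings above come from a 10-row frame. One rough way to check the claim on something larger is sketched below, using the standard `timeit` module outside IPython. The synthetic data, the row count `n`, and the helper names `concat_approach` and `masked_map` are assumptions made for illustration; no timings are claimed here, since they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

sex_map = {'Male': 'M', 'Female': 'F'}

# Synthetic data: n is an arbitrary size chosen for illustration.
n = 100_000
rng = np.random.default_rng(0)
gender_options = np.array(['Male', 'Female', np.nan], dtype=object)
sex_options = np.array(['M', 'F', np.nan], dtype=object)
big = pd.DataFrame({
    'Gender': gender_options[rng.integers(0, 3, size=n)],
    'Sex': sex_options[rng.integers(0, 3, size=n)],
})


def concat_approach(df):
    """The question's approach: concatenate mapped boolean flags."""
    return (df.Gender.isin(['Male']).map({True: 'M', False: ''})
            + df.Sex.isin(['M']).map({True: 'M', False: ''})
            + df.Sex.isin(['F']).map({True: 'F', False: ''})
            + df.Gender.isin(['Female']).map({True: 'F', False: ''}))


def masked_map(df):
    """The answer's approach, run on a copy so every call starts fresh.

    The copy adds some overhead that the original %timeit did not include.
    """
    out = df.copy()
    out.loc[out.Sex.isnull(), 'Sex'] = out['Gender'].map(sex_map)
    return out['Sex']


t_concat = timeit.timeit(lambda: concat_approach(big), number=5) / 5
t_masked = timeit.timeit(lambda: masked_map(big), number=5) / 5
print(f'concat: {t_concat * 1e3:.2f} ms, masked map: {t_masked * 1e3:.2f} ms')
```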

    [Discussion]:

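    • Thanks Ed. Is there a way to make this case-insensitive?
    • You could use a function instead of a dict and convert to lowercase/uppercase first, or, as long as you don't expect too many variants, just add the different combinations to the dict.
    • @ccsv I added another example where we don't use the boolean mask and just set the Sex column. This is nearly 5 times faster, so I think if you can ensure the text strings are consistent, or add extra keys to the map if you are worried about mixed case, this would be the optimal method.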
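
The case-insensitive suggestion from the comments could look roughly like this: normalize Gender with the `.str` accessor, then map with lowercase keys. The sample values (including 'unknown') and the extra `strip()` call are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Gender': ['male', 'FEMALE', 'Male', np.nan, 'unknown'],
    'Sex': [np.nan, np.nan, 'M', 'F', np.nan],
})

# Keys are lowercase because Gender is lowercased before the lookup.
sex_map = {'male': 'M', 'female': 'F'}

# NaN passes through the .str methods unchanged; values that are not
# in sex_map (such as 'unknown') come out of map() as NaN.
normalized = df['Gender'].str.strip().str.lower()
df.loc[df['Sex'].isnull(), 'Sex'] = normalized.map(sex_map)
print(df)
```

Normalizing once keeps the dict small compared with listing every capitalization as a separate key.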