【Question title】: Change data in pandas column
【Posted】: 2025-12-10 17:00:01
【Question】:

I am having trouble changing the data in one column of my pandas DataFrame (roughly 51000x11 in size).

import pandas as pd
import numpy as np

df_answers.head(10)

set(df_answers['Gender'])

The 'Gender' column contains 29 distinct answers (plus NaN for missing values):

{'Female',
 'Female; Gender non-conforming',
 'Female; Gender non-conforming; Other',
 'Female; Other',
 'Female; Transgender',
 'Female; Transgender; Gender non-conforming',
 'Female; Transgender; Gender non-conforming; Other',
 'Female; Transgender; Other',
 'Gender non-conforming',
 'Gender non-conforming; Other',
 'Male',
 'Male; Female',
 'Male; Female; Gender non-conforming',
 'Male; Female; Gender non-conforming; Other',
 'Male; Female; Other',
 'Male; Female; Transgender',
 'Male; Female; Transgender; Gender non-conforming',
 'Male; Female; Transgender; Gender non-conforming; Other',
 'Male; Female; Transgender; Other',
 'Male; Gender non-conforming',
 'Male; Gender non-conforming; Other',
 'Male; Other',
 'Male; Transgender',
 'Male; Transgender; Gender non-conforming',
 'Male; Transgender; Other',
 'Other',
 'Transgender',
 'Transgender; Gender non-conforming',
 'Transgender; Other',
 nan}

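I want to clean up this mess: keep two options, 'Female' and 'Male', and replace everything else with 'Other'. Unfortunately, the function I wrote below does not work. I suspect something is wrong with .isin() or .loc[], but I am not sure.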

def change_gender_name():
    if (df_answers.loc[~df_answers['Gender'].isin(['Female', 'Male'])]):
        df_answers['Gender'] = df_answers['Gender'].str.replace('*', 'Other', regex=True, inplace=True)
    else:
        pass

change_gender_name()

ValueError: The truth value of a DataFrame is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

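Thank you for your time.

I am adding some extra information, because I think this is not a trivial task.

When a value in the column is exactly 'Female', 'Male' or 'Other' (with no additional words), I want to keep it as is; I want to change all 26 other kinds of values to the string 'Other'.

'Female', 'Male', 'Other': these are the final answers for this column.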

【Comments】:

  • df['new_gender'] = np.where((df['Gender'] != 'Male') | (df['Gender'] != 'Female'), 'Other', df['Gender'])
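  • Should I put it inside the function or run it as a standalone piece of code (without a function)?
  • You don't need a function. Just run it as its own line.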
  • df['New_Gender'] = df['Gender'].str.replace('^(?!.*Female)(?!.*Male).*', 'Other')?
  • @KrzysztofSobota You are right, I made a logic mistake: df['new_gender'] = np.where(df['Gender'].isin(['Male', 'Female']), df['Gender'], 'Other')
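
To make the logic mistake in the comments above concrete, here is a minimal sketch comparing the two np.where conditions. The five sample values are a hypothetical subset of the question's data, not the real df_answers.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; the real df_answers has about 51,000 rows.
df = pd.DataFrame({'Gender': ['Male', 'Female', 'Male; Other', 'Transgender', np.nan]})

# First attempt: every value differs from at least one of the two labels,
# so this condition is True for all rows and everything would become 'Other'.
always_true = (df['Gender'] != 'Male') | (df['Gender'] != 'Female')

# Corrected version: keep exact matches, replace everything else (NaN included).
df['new_gender'] = np.where(df['Gender'].isin(['Male', 'Female']), df['Gender'], 'Other')
print(df['new_gender'].tolist())
```

The first condition would need `&` instead of `|` to work; the isin form avoids that trap and reads closer to the intent.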

Tags: python pandas


【Solution 1】:

Here is an alternative:

import pandas as pd
import numpy as np

m = {'Female',
 'Female; Gender non-conforming',
 'Female; Gender non-conforming; Other',
 'Female; Other',
 'Female; Transgender',
 'Female; Transgender; Gender non-conforming',
 'Female; Transgender; Gender non-conforming; Other',
 'Female; Transgender; Other',
 'Gender non-conforming',
 'Gender non-conforming; Other',
 'Male',
 'Male; Female',
 'Male; Female; Gender non-conforming',
 'Male; Female; Gender non-conforming; Other',
 'Male; Female; Other',
 'Male; Female; Transgender',
 'Male; Female; Transgender; Gender non-conforming',
 'Male; Female; Transgender; Gender non-conforming; Other',
 'Male; Female; Transgender; Other',
 'Male; Gender non-conforming',
 'Male; Gender non-conforming; Other',
 'Male; Other',
 'Male; Transgender',
 'Male; Transgender; Gender non-conforming',
 'Male; Transgender; Other',
 'Other',
 'Transgender',
 'Transgender; Gender non-conforming',
 'Transgender; Other',
 np.nan}

df_answers = pd.DataFrame(m)
df_answers.columns=['Gender']

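The code above reproduces the problem. Here are a few functions. This one keeps the parts that mention 'Male' or 'Female' and turns every other part into 'Other':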

def change_words(y):
    if 'Male' in y or 'Female' in y:
        return y
    else:
        return 'Other'

Feel free to use this one instead:

def change_words_v2(y):
    if y in ['Male','Female']:
        return y
    else:
        return 'Other'

This function covers each specific case by splitting a combined answer on '; ' and mapping every part:

def simplify_gender(x):
    new_x = []
    for y in str(x).split('; '):
        new_x.append(change_words(y))
    return '; '.join(new_x)

Put together like this:

df_answers.applymap(lambda x: simplify_gender(x))
                               Gender
0                               Other
1          Male; Female; Other; Other
2          Male; Female; Other; Other
3                 Male; Female; Other
4         Female; Other; Other; Other
5                         Male; Other
6                         Male; Other
7                       Female; Other
8                        Other; Other
9                        Other; Other
10                Male; Female; Other
11                 Male; Other; Other
12                       Other; Other
13               Female; Other; Other
14                       Male; Female
15                              Other
16         Male; Female; Other; Other
17                              Other
18                        Male; Other
19                      Female; Other
20                 Male; Other; Other
21                 Male; Other; Other
22                              Other
23                             Female
24                      Female; Other
25                               Male
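
The output above still keeps combined answers such as 'Male; Female'. If the goal is the one stated in the question (only exact 'Female' or 'Male' survive), one possible sketch is to apply change_words_v2 to each whole cell with Series.map instead of splitting on '; '. The sample frame below is a hypothetical subset of the question's values.

```python
import numpy as np
import pandas as pd

def change_words_v2(y):
    # Keep only exact matches; combined answers and NaN fall through to 'Other'.
    if y in ['Male', 'Female']:
        return y
    return 'Other'

# Hypothetical sample values taken from the question's list.
df_answers = pd.DataFrame({'Gender': ['Male', 'Female', 'Male; Female', 'Other', np.nan]})

# Series.map calls the function once per cell, on the whole string.
df_answers['Gender'] = df_answers['Gender'].map(change_words_v2)
print(df_answers['Gender'].tolist())
```

This is an element-wise Python loop under the hood, so it is simple rather than fast; a boolean mask with isin does the same job in a vectorized way.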

【Comments】:

    【Solution 2】:

    I suggest this syntax:

     df[df['filtered_column']=='filter'] = 'inserted_value'
    

    In your case it would look something like:

    s = df_answers['Gender'].copy()  # work on an independent Series
    s[~s.isin(['Female','Male'])] = 'Other'  # s is already a Series, so no ['Gender'] lookup
    df_answers['Gender'] = s
    

    The syntax below, proposed by @PaulH in the comments, is actually a better solution. It is more readable:

    df.loc[~df['Gender'].isin(['Female','Male']), 'Gender'] = 'Other'
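
As a self-contained illustration of the .loc form, here is a minimal sketch on a hypothetical five-row frame. The 'Age' column is an assumption added only to show that other columns are left untouched.

```python
import numpy as np
import pandas as pd

# Hypothetical sample built from a few of the question's values.
df = pd.DataFrame({
    'Gender': ['Female', 'Male; Transgender', 'Male', np.nan, 'Other'],
    'Age': [31, 27, 45, 38, 22],  # assumed extra column
})

# Boolean mask: True for rows whose value is not exactly 'Female' or 'Male'.
# isin() returns False for NaN, so missing answers are selected as well.
mask = ~df['Gender'].isin(['Female', 'Male'])

# A single .loc call selects rows and column together, so the assignment
# writes to df itself rather than to a temporary copy.
df.loc[mask, 'Gender'] = 'Other'
print(df)
```

If missing answers should stay missing rather than become 'Other', the mask could be combined with `df['Gender'].notna()`.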
    

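    【Comments】:

    • Don't chain your indexers when assigning
    • :1: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame
    • @PaulH Good point, it triggers a SettingWithCopyWarning. The assignment can be done on a separate Series first. I am revising the answer.
    • I would use the .loc accessor: df.loc[~df['Gender'].isin(['Female','Male']), 'Gender'] = 'Other'
    • Besides better readability, does it improve performance?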
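
On the performance question, one way to find out is to time both forms on data of a similar size. The sketch below is only a measurement harness: the helper names with_loc and with_series and the repeated sample values are hypothetical, and no timings are claimed here. Both variants build the same isin mask, so the difference may well be small, but that is an assumption worth measuring on your own data.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical data of roughly the size mentioned in the question (51,000 rows).
values = ['Female', 'Male', 'Male; Other', 'Transgender', np.nan]
base = pd.DataFrame({'Gender': values * 10_200})

def with_loc(df):
    # Single .loc assignment on a copy, so base is never modified.
    out = df.copy()
    out.loc[~out['Gender'].isin(['Female', 'Male']), 'Gender'] = 'Other'
    return out

def with_series(df):
    # Detached Series, modified and then assigned back.
    s = df['Gender'].copy()
    s[~s.isin(['Female', 'Male'])] = 'Other'
    return df.assign(Gender=s)

# Check that both variants agree before comparing their speed.
same = with_loc(base).equals(with_series(base))

for fn in (with_loc, with_series):
    seconds = timeit.timeit(lambda: fn(base), number=20)
    print(f'{fn.__name__}: {seconds / 20:.6f} s per call')
```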