How to iterate over strings in a pandas DataFrame and remove unwanted words? (Answers)

【Question Title】: How to iterate over strings in pandas dataframe and remove unwanted words?
【Posted】: 2021-09-04 04:40:03
【Question】:

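I am trying to clean a pandas dataframe called names2. It has 599,865 rows, of which 549,317 are non-null in the column of interest, 'primary_profession'. Each row in that column holds a single string, a comma-separated list of strings, or NaN.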

Here is how I loaded the dataframe:

import pandas as pd

name_basics_imdb = pd.read_csv('imdb.name.basics.csv.gz')
names = name_basics_imdb
names2 = names.copy(deep=True)

(Note: I dropped some columns and rows and renamed one column. I am happy to provide more details if you need them.)

Here is the output of names2.info():

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599865 entries, 0 to 599864
Data columns (total 4 columns):
 #   Column              Non-Null Count   Dtype 
---  ------              --------------   ----- 
 0   nconst              599865 non-null  object
 1   primary_name        599865 non-null  object
 2   primary_profession  549317 non-null  object
 3   known_for_titles    569766 non-null  object
dtypes: object(4)
memory usage: 18.3+ MB

names2.head()

      nconst       primary_name                                primary_profession                         known_for_titles
0  nm0061671  Mary Ellen Bauder         miscellaneous,production_manager,producer  tt0837562,tt2398241,tt0844471,tt0118553
1  nm0061865       Joseph Bauer        composer,music_department,sound_department  tt0896534,tt6791238,tt0287072,tt1682940
2  nm0062070         Bruce Baum                        miscellaneous,actor,writer                                tt0363631
3  nm0062195       Axel Baumann  camera_department,cinematographer,art_department  tt0114371,tt2004304,tt1618448,tt1224387
4  nm0062798        Pete Baxter  production_designer,art_department,set_decorator              tt0452644,tt0452692,tt34580

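The goal is to go through each row's string, list of strings, or NaN and keep only the rows that contain writer, both writer and director, or just director. Any other profession can be removed. For example, row 2 has miscellaneous, actor, writer in the primary_profession column; miscellaneous and actor can be removed, leaving only writer in that row.

Any row that has neither writer nor director, or that contains NaN, can be dropped.

Here are some of my attempts: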

#inverse filtering

value_list = ['miscellaneous', 'production_manager', 'composer', 'music_department', 'sound_department',
      'miscellaneous', 'actor', '...', 'costume_department', 'costume_designer', 'actress', 'art_director', 'music_department' ]

#have to split the arrays first

inverse_bool_series = ~names2.primary_profession.isin(value_list)
names2_filtered = names2[inverse_bool_series]
names2_filtered

I also tried:

names2['primary_profession'] = names2['primary_profession'].str.split(",").str[:3]
names2['primary_profession']
(names2['primary_profession'][0])
type(names2['primary_profession'][0][0])

And then this:

for index, row in names2.iterrows():
    idx = list(len(range(names2.primary_profession)))
    for i in idx:
        print(row['primary_profession'][i])

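In summary, the goal is for the dataframe names2 to contain only rows whose profession is writer, writer and director, or director alone.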

【Comments】:

Tags: python arrays pandas string dataframe

【Solution 1】:

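unwanted_row_indices = []
for index, row in names2.iterrows():
    profession = row['primary_profession']
    # NaN is a float, so check the type before calling .lower()
    if isinstance(profession, str) and 'writer' in profession.lower():
        pass  # keep rows that mention 'writer'
    else:
        unwanted_row_indices.append(index)
# drop every row that was flagged above
names2 = names2.drop(unwanted_row_indices, axis=0)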

【Comments】:

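A loop may not be an efficient approach, especially if it is a large df.
Thanks for reaching out. I tried the code and it threw this error: AttributeError: 'str' object has no attribute 'contains'. Not sure if I did something wrong.
Thank you for raising this. My mistake; I have corrected the if statement. Please make the change and let me know if it works.
As far as solving the data-cleaning problem goes, you succeeded. However, after some investigation, I have to agree with @Ade_1 that looping is not the most efficient way to solve this in pandas. See this article I found, which explains it better than I can: towardsdatascience.com/… Thanks again for your answer.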
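
As a rough illustration of the loop-free alternative the comments point to, the row filter can be written as a single boolean mask with `Series.str.contains`. This is only a sketch: the sample frame and the names `pattern`, `mask`, and `kept` are made up for the example, and it only drops unwanted rows; it does not strip the other professions from the rows it keeps.

```python
import numpy as np
import pandas as pd

# Hypothetical sample that mirrors the question's column layout.
names2 = pd.DataFrame({
    "primary_name": ["A", "B", "C", "D"],
    "primary_profession": [
        "miscellaneous,actor,writer",
        "composer,music_department",
        np.nan,
        "art_director,producer",
    ],
})

# \b stops "director" from matching inside "art_director",
# because "_" counts as a word character in regular expressions.
pattern = r"\b(?:writer|director)\b"

# na=False treats NaN rows as non-matching, so they are dropped too.
mask = names2["primary_profession"].str.contains(pattern, na=False)
kept = names2[mask]
```

The mask is computed in one vectorized call instead of one Python-level iteration per row, which is usually what makes this style faster on frames with hundreds of thousands of rows.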

【Solution 2】:

In [96]: names2
Out[96]:
      nconst       primary_name                                         primary_profession                         known_for_titles
0  nm0061671  Mary Ellen Bauder  miscellaneous,production_manager,producer,director,writer  tt0837562,tt2398241,tt0844471,tt0118553
1  nm0061865       Joseph Bauer                 composer,music_department,sound_department  tt0896534,tt6791238,tt0287072,tt1682940
2  nm0062070         Bruce Baum                                 miscellaneous,actor,writer                                tt0363631
3  nm0062195       Axel Baumann           camera_department,cinematographer,art_department  tt0114371,tt2004304,tt1618448,tt1224387
4  nm0062798        Pete Baxter           production_designer,art_department,set_decorator              tt0452644,tt0452692,tt34580

In [97]: profs = names2['primary_profession'].str.split(',').explode()

In [98]: profs
Out[98]:
0          miscellaneous
0     production_manager
0               producer
0               director
0                 writer
1               composer
1       music_department
1       sound_department
2          miscellaneous
2                  actor
2                 writer
3      camera_department
3        cinematographer
3         art_department
4    production_designer
4         art_department
4          set_decorator
Name: primary_profession, dtype: object

In [99]: filtered_profs = profs[profs.isin(['writer', 'writer director', 'director'])]

In [100]: filtered_profs.groupby(filtered_profs.index).agg(','.join)
Out[100]:
0    director,writer
2             writer
Name: primary_profession, dtype: object

In [101]: names2.drop('primary_profession', axis=1).join(filtered_profs.groupby(filtered_profs.index).agg(','.join), how='inner')
Out[101]:
      nconst       primary_name                         known_for_titles primary_profession
0  nm0061671  Mary Ellen Bauder  tt0837562,tt2398241,tt0844471,tt0118553    director,writer
2  nm0062070         Bruce Baum                                tt0363631             writer
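
The interactive session above can be collected into one reusable function. The following is a minimal sketch under stated assumptions: the helper name `keep_professions`, its parameters, and the sample data are hypothetical and not part of the original answer, though the steps (split, explode, `isin`, group and re-join, inner join) follow the session.

```python
import numpy as np
import pandas as pd

def keep_professions(df, wanted, column="primary_profession"):
    """Keep only the wanted professions per row; drop rows with none left."""
    # One row per profession; NaN entries stay NaN and fail the isin test.
    profs = df[column].str.split(",").explode()
    kept = profs[profs.isin(wanted)]
    # Re-join the surviving professions for each original row label.
    joined = kept.groupby(level=0).agg(",".join)
    # The inner join drops rows that had no wanted profession at all.
    return df.drop(columns=column).join(joined, how="inner")

# Hypothetical sample data for illustration only.
sample = pd.DataFrame({
    "nconst": ["nm1", "nm2", "nm3", "nm4"],
    "primary_profession": [
        "miscellaneous,producer,director,writer",
        "composer,music_department",
        "miscellaneous,actor,writer",
        np.nan,
    ],
})
result = keep_professions(sample, ["writer", "director"])
```

Because `isin` compares whole tokens after the split, a value such as `art_director` is not mistaken for `director`.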

【Comments】:

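Sir, if I were there with you, I would shake your hand. Not only does this code work and come laid out perfectly, it also answers questions I had on several different levels. You helped a beginner programmer make progress tonight. Thank you, @Asish M.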

【Solution 3】:

# set the words you want to match
matched_words = ['writer', 'writer_director', 'director']

# drop rows that have NaN in column 'primary_profession'
names2 = names.dropna(axis='index', subset=['primary_profession'])

# extract all matched words; \b keeps 'director' from matching inside
# longer tokens such as 'art_director' ('_' counts as a word character)
names2_extractall = names2['primary_profession'].str.extractall(rf'\b({"|".join(matched_words)})\b')

# group by the original row index and join the matches with ','
mod_prof = names2_extractall.groupby(level=0).apply(lambda x: ",".join(x.iloc[:, 0]))

# assign the result to column 'primary_profession'
names2 = names2.assign(primary_profession=mod_prof)

# drop rows that had no match
names2 = names2.dropna(axis='index', subset=['primary_profession'])
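
One thing worth checking with a regex-based approach like this is substring matching: a bare alternative such as `director` also matches inside a longer token such as `art_director`. The small sketch below compares a pattern without word boundaries to one with them. The sample values and the names `loose` and `strict` are invented for the illustration.

```python
import pandas as pd

# Hypothetical sample values for illustration only.
professions = pd.Series(["art_director,producer", "director,writer", "actor"])

# Without word boundaries, "director" is found inside "art_director".
loose = professions.str.extractall(r"(writer|director)")

# With \b on both sides, only whole tokens match, since "_" is a word character.
strict = professions.str.extractall(r"\b(writer|director)\b")

# Level 0 of the result index holds the original row labels that matched.
loose_rows = sorted(set(loose.index.get_level_values(0)))
strict_rows = sorted(set(strict.index.get_level_values(0)))
```

Here the loose pattern reports a match for row 0 even though that person is an art director, while the strict pattern matches only row 1.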

【Comments】:

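Thanks for your answer. I tried your code above and it cleaned the primary_profession column nicely. There are some NaN values under the modified_profession column, though, and only writers appear, no directors. Again, I can't figure out why.
I don't know which words you need to match (matched_words = ['??','xx',...]); modify the list to suit your needs. I should have dropped NaN again, based on modified_profession. Thanks for pointing that out.
It's hard to say. When I run len(names2.primary_profession.unique()) it returns 8627. So rather than sift through all of those values, I'll try Ashish M's answer.
Let me know if there is a step I missed.
@NateBates Yes, it is slower than Ashish M.'s answer. The two timings were 1.59 ms ± 4.63 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each) and 2.83 ms ± 44.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each).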