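【Posted】: 2021-09-04 04:40:03
【Question】:
I am trying to clean a pandas DataFrame named names2. It has 599,865 rows, of which 549,317 are non-null in the relevant column, 'primary_profession'. Each row in that column holds a single string, a comma-separated list of strings, or NaN.
Here is how I load the DataFrame: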
import pandas as pd

name_basics_imdb = pd.read_csv('imdb.name.basics.csv.gz')
names = name_basics_imdb
names2 = names.copy(deep=True)
(Note: I dropped some columns and rows and renamed one column. If you need more details, I am happy to provide them.)
Here is the output of names2.info():
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 599865 entries, 0 to 599864
Data columns (total 4 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 nconst 599865 non-null object
1 primary_name 599865 non-null object
2 primary_profession 549317 non-null object
3 known_for_titles 569766 non-null object
dtypes: object(4)
memory usage: 18.3+ MB
names2.head()
   nconst     primary_name       primary_profession                                known_for_titles
0  nm0061671  Mary Ellen Bauder  miscellaneous,production_manager,producer         tt0837562,tt2398241,tt0844471,tt0118553
1  nm0061865  Joseph Bauer       composer,music_department,sound_department        tt0896534,tt6791238,tt0287072,tt1682940
2  nm0062070  Bruce Baum         miscellaneous,actor,writer                        tt0363631
3  nm0062195  Axel Baumann       camera_department,cinematographer,art_department  tt0114371,tt2004304,tt1618448,tt1224387
4  nm0062798  Pete Baxter        production_designer,art_department,set_decorator  tt0452644,tt0452692,tt34580
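The goal is to go through each row's value (a string, several comma-separated strings, or NaN) and keep only the rows that contain writer, both writer and director, or just director. Every other profession can be removed. For example, row 2 has miscellaneous,actor,writer in the primary_profession column: miscellaneous and actor can be removed, leaving only writer in that row.
Any row that contains neither writer nor director, or that contains NaN, can be dropped.
Here are some attempts I made: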
#inverse filtering
value_list = ['miscellaneous', 'production_manager', 'composer', 'music_department', 'sound_department',
'miscellaneous', 'actor', '...', 'costume_department', 'costume_designer', 'actress', 'art_director', 'music_department' ]
#have to split the arrays first
inverse_bool_series = ~names2.primary_profession.isin(value_list)
names2_filtered = names2[inverse_bool_series]
names2_filtered
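As a small illustration of why the comment above says the values have to be split first: `Series.isin` compares each cell as a whole, so a comma-separated cell never matches a single profession name. The series and the short exclusion list below are made-up stand-ins, not the real data.

```python
import pandas as pd

# Hypothetical stand-in for a few primary_profession cells.
s = pd.Series(["miscellaneous,actor,writer", "writer", "actor"])
unwanted = ["miscellaneous", "actor"]  # hypothetical short exclusion list

# isin tests the entire cell value against the list, not its comma-separated parts.
mask = ~s.isin(unwanted)
print(mask.tolist())  # [True, True, False]
```

Only the cell that is exactly "actor" is excluded; the combined cell passes through unchanged, with miscellaneous and actor still in it.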
I also tried:
names2['primary_profession'] = names2['primary_profession'].str.split(",").str[:3]
names2['primary_profession']
(names2['primary_profession'][0])
type(names2['primary_profession'][0][0])
And then this:
for index, row in names2.iterrows():
    idx = list(len(range(names2.primary_profession)))
    for i in idx:
        print(row['primary_profession'][i])
In summary, the goal is for the names2 DataFrame to contain only the professions writer, writer and director, or just director.
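One way the stated goal could be approached is sketched below, under some assumptions: the professions are comma-separated strings as shown in the head() output, and the sample frame, the `KEEP` set, and the `keep_wanted` helper are all hypothetical names introduced for illustration. It is a minimal sketch, not necessarily the fastest option for roughly 600,000 rows.

```python
import numpy as np
import pandas as pd

# Hypothetical sample mirroring the primary_profession column in the question.
names2 = pd.DataFrame({
    "primary_name": ["Mary Ellen Bauder", "Bruce Baum", "Person A", "Person B", "Person C"],
    "primary_profession": [
        "miscellaneous,production_manager,producer",
        "miscellaneous,actor,writer",
        "director,writer,actor",
        np.nan,
        "director",
    ],
})

KEEP = {"writer", "director"}  # professions to retain (assumed from the stated goal)

def keep_wanted(value):
    """Return the kept professions as a comma-separated string, or NaN if none remain."""
    if not isinstance(value, str):  # NaN is a float, not a str
        return np.nan
    kept = [p for p in value.split(",") if p in KEEP]
    return ",".join(kept) if kept else np.nan

filtered = names2.copy()
filtered["primary_profession"] = filtered["primary_profession"].apply(keep_wanted)
# Rows with no writer/director (or NaN to begin with) are now NaN and can be dropped.
filtered = filtered.dropna(subset=["primary_profession"]).reset_index(drop=True)
print(filtered)
```

Splitting each cell before testing membership is what lets the unwanted professions be removed from a row while the row itself is kept.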
【Comments】:
Tags: python arrays pandas string dataframe