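【Posted】: 2025-12-17 22:10:01
【Problem description】:
I have two DataFrames. One is a DataFrame of actors, each with a feature holding a list of identifiers for the movies they have worked on. The other is a list of movies, each with an identifier that appears in an actor's list if that actor was in the movie.
I tried iterating over the movies DataFrame, which does produce results, but it is too slow.
Iterating over each actor's movie list in the actors DataFrame seems like it would need fewer loop iterations, but I can't get the results to save.
Here is the actors DataFrame: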
print(actors[['primaryName', 'knownForTitles']].head())
         primaryName                           knownForTitles
0     Rowan Atkinson  tt0109831,tt0118689,tt0110357,tt0274166
1        Bill Paxton  tt0112384,tt0117998,tt0264616,tt0090605
2   Juliette Binoche  tt1219827,tt0108394,tt0116209,tt0241303
3   Linda Fiorentino  tt0110308,tt0119654,tt0088680,tt0120655
4  Richard Linklater  tt0243017,tt1065073,tt2209418,tt0405296
The movies DataFrame:
print(movies[['tconst', 'primaryTitle']].head())
      tconst                 primaryTitle
0  tt0001604            The Fatal Wedding
1  tt0002467          Romani, the Brigand
2  tt0003037   Fantomas: The Man in Black
3  tt0003593  Across America by Motor Car
4  tt0003830       Detective Craig's Coup
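As you can see, the movies['tconst'] identifiers are the same kind of identifiers that appear in the lists in the actors DataFrame.
My very slow iteration over the movies DataFrame is as follows: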
import logging

def add_cast(movie_df, actor_df):
    results = movie_df.copy()
    length = len(results)
    #create an empty feature
    results['cast'] = ""
    #iterate through the movie identifiers
    for index, value in results['tconst'].items():
        #create a new dataframe containing all the cast associated with the movie id
        cast = actor_df[actor_df['knownForTitles'].str.contains(value)]
        #check to see if the 'primaryName' list is empty
        if len(list(cast['primaryName'].values)) != 0:
            #set the new movie 'cast' feature equal to a list of the cast names
            results.loc[index]['cast'] = list(cast['primaryName'].values)
        #logging
        if index % 1000 == 0:
            logging.warning(f'Results location: {index} out of {length}')
        #delete cast df to free up memory
        del cast
    return results
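This produces some results, but not fast enough to be useful. One observation is that by building a new DataFrame of all the actors whose knownForTitles contains a given movie identifier, that list of names can be placed into a single feature of the movies DataFrame.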
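The observation above can be turned around: instead of scanning every actor once per movie, each actor's knownForTitles string can be split once, and the names then grouped by movie identifier. The following is only a minimal sketch of that idea, not a tested solution for the full dataset. The toy data and the helper name add_cast_by_explode are hypothetical, and it assumes a pandas version that provides DataFrame.explode (0.25 or later).

```python
import pandas as pd

# Hypothetical toy data; the names and identifiers are made up for illustration.
actors = pd.DataFrame({
    "primaryName": ["Actor A", "Actor B", "Actor C"],
    "knownForTitles": ["tt0000001,tt0000002", "tt0000002", r"\N"],
})
movies = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002", "tt0000003"],
    "primaryTitle": ["Movie One", "Movie Two", "Movie Three"],
})

def add_cast_by_explode(movie_df, actor_df):
    """Hypothetical helper: attach a list of actor names to each movie."""
    # Drop actors whose title list is the "\N" placeholder.
    pairs = actor_df.loc[actor_df["knownForTitles"] != r"\N",
                         ["primaryName", "knownForTitles"]]
    # Split the comma-separated string and emit one row per (actor, movie id).
    pairs = pairs.assign(tconst=pairs["knownForTitles"].str.split(",")).explode("tconst")
    pairs["tconst"] = pairs["tconst"].str.strip()
    # Collect the names for each movie id into a list.
    cast_by_movie = pairs.groupby("tconst")["primaryName"].agg(list)
    # Look up each movie's list; movies with no matching actor get NaN.
    results = movie_df.copy()
    results["cast"] = results["tconst"].map(cast_by_movie)
    return results

result = add_cast_by_explode(movies, actors)
print(result)
```

In this sketch the work is done by a handful of whole-column operations rather than one string search per movie, which is why it may scale better. Movies without any matching actor end up with NaN in 'cast' rather than an empty string.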
Here is my attempt at looping over the actors DataFrame instead, but I can't seem to attach the items to the movies DataFrame:
def actors_loop(movie_df, actor_df):
    results = movie_df.copy()
    length = len(actor_df)
    #create an empty feature
    results['cast'] = ""
    #iterate through all actors
    for index, value in actor_df['knownForTitles'].items():
        #skip empties
        if str(value) == r"\N":
            logging.warning(f'skipping: {index} with a value of {value}')
            continue
        #generate a list of movies that this actor has been in
        cinemetography = [x.strip() for x in value.split(',')]
        #iterate through every movie the actor has been in
        for movie in cinemetography:
            #pull out the movie info if it exists
            movie_info = results[results['tconst'] == movie]
            #continue if empty
            if len(movie_info) == 0:
                continue
            #set the cast variable equal to the actor name
            results[results['tconst'] == movie]['cast'] = (actor_df['primaryName'].loc[index])
            #delete the df to save space ?maybe
            del movie_info
        #logging
        if index % 1000 == 0:
            logging.warning(f'Results location: {index} out of {length}')
    return results
If I run the code above, I get a result very quickly, but the 'cast' field is still empty.
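A likely cause of the empty 'cast' field is the chained indexing in the assignment line. Selecting rows with a boolean mask returns a new DataFrame, so results[mask]['cast'] = ... writes to that temporary object rather than to results. The sketch below uses hypothetical toy data to contrast the chained form with a single .loc call. It illustrates the mechanism only and is not a confirmed diagnosis of the full script.

```python
import warnings
import pandas as pd

# Hypothetical toy data; the identifiers and names are made up for illustration.
movies = pd.DataFrame({
    "tconst": ["tt0000001", "tt0000002"],
    "primaryTitle": ["Movie One", "Movie Two"],
})
movies["cast"] = ""
mask = movies["tconst"] == "tt0000002"

# Chained form, as in actors_loop: movies[mask] is a new DataFrame, so the
# assignment lands on that temporary copy. pandas normally warns about this;
# the warning is silenced here only to keep the output clean.
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    movies[mask]["cast"] = "Actor B"
after_chained = movies["cast"].tolist()

# Single .loc call: row selection and column assignment happen in one step,
# so the value is written into movies itself.
movies.loc[mask, "cast"] = "Actor B"
after_loc = movies["cast"].tolist()

print(after_chained, after_loc)
```

Note that assigning a single name this way replaces whatever the cell held before, so collecting several actors per movie would still require accumulating the names in a list.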
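【Discussion】:
-
In the example, none of the movies' tconst values appear in the actors' knownForTitles
-
Yes, that's right. Matches are rare, but I ran several test cases that contained definite matches. Should I include that code?
-
No, just edit the input data so that they are in sync, and add the expected output too if possible. :) Btw, also check this: *.com/questions/56689519/… I think you have the same problem
Tags: python pandas list loops dataframe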