【Posted】:2020-01-19 11:30:56
【Question】:
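I am trying to predict the outcome of sports games, so I want to transform my dataframe in a way that lets me train a model on it. Currently I use a for loop to go through all played games, pick the two players of each game, and check how they performed in the x games before the actual game took place. I then take the mean of the statistics from those previous games for each player and concatenate the two sets of means. Finally, I add the real outcome of the actual game so that the model can be trained against it.
Now I have run into a speed problem: my current code takes about 9 minutes to process 20000 games (~200 variables). I have already managed to bring it down from 20 minutes to 9.
I started out by appending each game to a dataframe; later I changed this to appending each individual dataframe to a list and building one big dataframe from that list at the end. I also added if statements so that the loop continues when a player has not played at least x games.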
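To make the change described above concrete, here is a minimal sketch of the two accumulation patterns. The function names `build_incrementally` and `build_from_list` and the toy `rows` data are hypothetical and only serve as illustration; they are not part of my actual script.

```python
import pandas as pd

def build_incrementally(rows):
    # Slow pattern: every concat copies everything accumulated so far
    out = None
    for row in rows:
        frame = pd.DataFrame(row, index=[0])
        out = frame if out is None else pd.concat([out, frame], ignore_index=True)
    return out

def build_from_list(rows):
    # Faster pattern: collect the small frames, then concatenate once
    frames = [pd.DataFrame(row, index=[0]) for row in rows]
    return pd.concat(frames, ignore_index=True)

# Hypothetical toy input: one dict of averaged statistics per game
rows = [{"Home_kicks": i, "Away_kicks": i + 1} for i in range(100)]
slow = build_incrementally(rows)
fast = build_from_list(rows)
```

Both functions produce the same frame; the second one avoids re-copying the growing result on every iteration, which is presumably where part of the drop from 20 to 9 minutes came from.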
I expected this to run much faster than 9 minutes, and I think it can.
I hope you can help me out!
import pandas as pd
import numpy as np
import random
import string
letters = list(string.ascii_lowercase)
datelist = pd.date_range(start='1/1/2017', end='1/1/2019')
data = pd.DataFrame({'Date': np.random.choice(datelist, 5000),
                     'League': np.random.choice(['LeagueA', 'LeagueB'], 5000),
                     'Home_player': np.random.choice(letters, 5000),
                     'Away_player': np.random.choice(letters, 5000),
                     'Home_strikes': np.random.randint(1, 20, 5000),
                     'Home_kicks': np.random.randint(1, 20, 5000),
                     'Away_strikes': np.random.randint(1, 20, 5000),
                     'Away_kicks': np.random.randint(1, 20, 5000),
                     'Winner': np.random.randint(0, 2, 5000)})
leagues = list(data['League'].unique())
home_columns = [col for col in data if col.startswith('Home')]
away_columns = [col for col in data if col.startswith('Away')]
# Number of most recent games to average statistics over
total_games = 5
final_df = []
# Process each league separately
for league in leagues:
    league_data = data[data.League == league]
    league_data = league_data.sort_values(by='Date').reset_index(drop=True)
    # Keep only the first 500 games of the league
    league_data = league_data.head(500)
    for i in range(0, len(league_data)):
        # Drop the last i games so the current game is the final row
        if i < 1:
            league_copy = league_data.sort_values(by='Date').reset_index(drop=True)
        else:
            league_copy = league_data[:-i].reset_index(drop=True)
        # Loop back from the last game
        last_game = league_copy.iloc[-1:].reset_index(drop=True)
        # Take home and away player
        Home_player = last_game.loc[0, 'Home_player']  # Pick home player
        Away_player = last_game.loc[0, 'Away_player']  # Pick away player
        # Remove the last row so the current game is not picked
        df = league_copy[:-1]
        # Now check the statistics of the games before this game was played
        Home = df[df.Home_player == Home_player].tail(total_games)  # Previous home games of the home player
        # If the player did not play at least x games, skip this game
        if len(Home) < total_games:
            continue
        else:
            Home = Home[home_columns].reset_index(drop=True)  # Keep all columns that start with "Home"
        # Do the same for the away player
        Away = df[df.Away_player == Away_player].tail(total_games)  # Previous away games of the away player
        if len(Away) < total_games:
            continue
        else:
            Away = Away[away_columns].reset_index(drop=True)  # Keep all columns that start with "Away"
        # Now concat home and away player data
        Home_away = pd.concat([Home, Away], axis=1)
        Home_away.drop(['Away_player', 'Home_player'], inplace=True, axis=1)
        # Take the mean of all columns
        Home_away = pd.DataFrame(Home_away.mean().to_dict(), index=[0])
        # Now add the home and away player back to the dataframe
        Home_away['Home_player'] = Home_player
        Home_away['Away_player'] = Away_player
        winner = last_game.loc[0, 'Winner']
        date = last_game.loc[0, 'Date']
        Home_away['Winner'] = winner
        Home_away['Date'] = date
        final_df.append(Home_away)
final_df = pd.concat(final_df, axis=0)
final_df = final_df[['Date','Home_player','Away_player','Home_kicks','Away_kicks','Home_strikes','Away_strikes','Winner']]
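For comparison, the same per-player averages could probably be computed without the inner loop by grouping on league and player and taking a shifted rolling mean. The following is only a sketch under assumptions, not a verified drop-in replacement: the helper `rolling_features` and the toy frame are hypothetical, the 500-game cap per league is left out, and games sharing the same date may be ordered differently than in the loop above.

```python
import pandas as pd

def rolling_features(data, total_games=5):
    # Sort once so "previous games" means earlier rows within each league
    data = data.sort_values(["League", "Date"], kind="stable").reset_index(drop=True)
    home_stats = [c for c in data if c.startswith("Home") and c != "Home_player"]
    away_stats = [c for c in data if c.startswith("Away") and c != "Away_player"]

    def prior_mean(group_cols, stat_cols):
        # shift(1) excludes the current game; rolling(n).mean() is NaN
        # until the player has n earlier games in that role
        grouped = data.groupby(group_cols)[stat_cols]
        return grouped.transform(lambda s: s.shift(1).rolling(total_games).mean())

    meta = data[["Date", "League", "Home_player", "Away_player", "Winner"]]
    out = pd.concat(
        [meta,
         prior_mean(["League", "Home_player"], home_stats),
         prior_mean(["League", "Away_player"], away_stats)],
        axis=1,
    )
    # Drop games where either player lacks enough history
    return out.dropna(subset=home_stats + away_stats).reset_index(drop=True)

# Hypothetical toy data: one league, the same two players in four games
toy = pd.DataFrame({
    "Date": pd.date_range("2020-01-01", periods=4),
    "League": ["L"] * 4,
    "Home_player": ["a"] * 4,
    "Away_player": ["b"] * 4,
    "Home_kicks": [1, 2, 3, 4],
    "Away_kicks": [10, 20, 30, 40],
    "Winner": [0, 1, 0, 1],
})
res = rolling_features(toy, total_games=2)
```

As in the loop, home statistics only use a player's earlier home games and away statistics only use earlier away games. Whether this matches the loop's output row for row on real data would still need to be checked.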
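【Comments】:
- Always include a seed with random data so the values can be reproduced: np.random.seed(###)
- This question should be asked on CodeReview, since you need to optimize the entire script. StackOverflow is meant for coding errors or undesired results.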
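On the seeding suggestion in the comments, a minimal sketch of what it changes. The seed value 0 is an arbitrary placeholder chosen for illustration; the commenter did not name a specific value.

```python
import numpy as np

np.random.seed(0)  # arbitrary placeholder seed, any fixed integer works
first = np.random.randint(1, 20, 5)

np.random.seed(0)  # resetting the same seed replays the same draws
second = np.random.randint(1, 20, 5)
```

With the seed set before the sample data is generated, everyone running the example gets the same `data` frame and can compare results directly.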
Tags: python pandas performance optimization filtering