Merging results from model.predict() with original pandas DataFrame? Answers

【Question title】: Merging results from model.predict() with original pandas DataFrame?
【Posted】: 2017-04-05 09:05:34
【Question description】:

I'm trying to merge the results of a predict method back with the original data in a pandas.DataFrame object.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df['class'] = data.target

X = np.matrix(df.loc[:, [0, 1, 2, 3]])
y = np.array(df['class'])

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

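To merge these predictions back with the original df, I try this:

df['y_hats'] = y_hats

But this raises:

ValueError: Length of values does not match length of index

I know I could split the df into train_df and test_df and this problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test). How can I align these predicted values with the appropriate rows in my df, given that the y_hats array is zero-indexed and seemingly all information about which rows were included in X_test and y_test is lost? Or will I be relegated to splitting the dataframes into train and test first, and then building the feature matrices? I'd just like to fill the rows that were in train with np.nan values in the dataframe.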

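【Comments】:

I believe sklearn supports DataFrames and Series as arguments to train_test_split, so it should work if you pass a subsection of your df. Besides, what is returned carries the indices, so you can use iloc with these indices to get back into your df; see the docs: scikit-learn.org/stable/modules/generated/…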
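The suggestion in the comment above can be sketched as follows. This is a minimal illustration on the iris data, not the commenter's own code: it assumes the features are passed to train_test_split as DataFrame/Series objects so that the splits keep the original row labels, and it writes back with label-based .loc rather than iloc. The random_state=0 arguments are assumptions added only to make the run repeatable.

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
df = pd.DataFrame(iris.data)
df["class"] = iris.target

# Split pandas objects directly; each split keeps the original row labels.
features = df[[0, 1, 2, 3]]
X_train, X_test, y_train, y_test = train_test_split(
    features, df["class"], train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# X_test.index holds the original row labels, so .loc writes each
# prediction back to the row it came from; training rows stay NaN.
df.loc[X_test.index, "y_hats"] = model.predict(X_test)
```

Because the training rows are left as NaN, the new column is stored as floats rather than integers.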

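Tags: python pandas scikit-learn

【Solution 1】:

Your y_hats will only be as long as the test data (20%), because you predicted on X_test. Once your model is validated and you're happy with the test predictions (by checking the accuracy of the model's X_test predictions against the true y_test values), you should rerun predict on the full dataset (X). Add these two lines to the bottom: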

y_hats2 = model.predict(X)

df['y_hats'] = y_hats2

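Edit: Based on your comment, here is an updated version that returns the dataset with the predictions appended at the rows that were in the test set.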

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import numpy as np

data = load_iris()

# bear with me for the next few steps... I'm trying to walk you through
# how my data object landscape looks... i.e. how I get from raw data 
# to matrices with the actual data I have, not the iris dataset
# put feature matrix into columnar format in dataframe
df = pd.DataFrame(data = data.data)

# add outcome variable
df_class = pd.DataFrame(data = data.target)

# finally, split into train-test
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

# I've got my predictions now
y_hats = model.predict(X_test)

y_test['preds'] = y_hats

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)

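【Comments】:

This doesn't really solve my issue of only merging in the data from test. If you merge predictions for every row, how do you know which ones were in the original test matrix? As far as I can tell, I could run the lines you added but would have no idea whether some rows in X had already been seen by the model (which sort of nullifies the entire purpose of train-test).
@flyingmeatball Hi, I'm trying to do the same thing, but when you store y_hats as a variable it becomes a numpy array rather than a dataframe, and it needs to be converted to pandas before it can be merged. At that point, the index merge can't be done. I'm not sure what I'm missing?
y_test['preds'] = y_hats results in this error [ValueError: Wrong number of items passed 2, placement implies 1]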
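For the situation described in the question, where X and y are plain NumPy arrays built before the split, one possible way to keep track of the rows is to pass an extra array of row positions through train_test_split, which accepts any number of arrays and splits them consistently. The sketch below is an illustration under that assumption; row_pos, pos_train, pos_test and y_hats_full are hypothetical names, and random_state=0 is added only for repeatability.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
df = pd.DataFrame(iris.data)
df["class"] = iris.target

X = df[[0, 1, 2, 3]].to_numpy()   # stands in for a normalized feature matrix
y = df["class"].to_numpy()
row_pos = np.arange(len(df))      # position of every row in df

# The third array is split the same way as X and y, so pos_test
# records which rows of df ended up in the test set.
X_train, X_test, y_train, y_test, pos_train, pos_test = train_test_split(
    X, y, row_pos, train_size=0.8, random_state=0
)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Start from all-NaN and fill in only the test positions.
y_hats_full = np.full(len(df), np.nan)
y_hats_full[pos_test] = model.predict(X_test)
df["y_hats"] = y_hats_full
```

This relies on df having one row per row of X, in the same order, which is the case when X is built from df as in the question.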

【Solution 2】:

I had (almost) the same problem.

I solved it like this:

...
X_train, X_test, y_train, y_test = train_test_split(df,df_class, train_size = 0.8)

model = DecisionTreeClassifier()

model.fit(X_train, y_train)

y_hats = model.predict(X_test)

y_hats  = pd.DataFrame(y_hats)

df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index()["Columns_Name"]
df_out["Prediction"] = y_hats.reset_index()[0]


y_test['preds'] = y_hats

df_out = pd.merge(df,y_test[['preds']],how = 'left',left_index = True, right_index = True)
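The reset_index idea in this answer can be tried on a toy example. The sketch below uses made-up stand-ins for X_test, y_test and y_hats (the values and the "target" column name are hypothetical) to show how resetting the index lines the rows up by position while keeping the original labels in an "index" column.

```python
import pandas as pd

# Toy stand-ins for X_test / y_test after a shuffle: note the non-sequential index.
X_test = pd.DataFrame({"f0": [5.1, 6.2, 4.9]}, index=[42, 7, 19])
y_test = pd.DataFrame({"target": [0, 1, 0]}, index=[42, 7, 19])
y_hats = pd.DataFrame([0, 1, 1])   # predictions wrapped in a DataFrame, index 0..2

# reset_index() moves the old labels into an "index" column and
# gives every frame the same 0..n-1 index, so rows match by position.
df_out = X_test.reset_index()
df_out["Actual"] = y_test.reset_index(drop=True)["target"]
df_out["Prediction"] = y_hats[0]
```

If the labels are needed again later, for example to merge back into the full DataFrame, df_out.set_index("index") restores them.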

【Comments】:

【Solution 3】:

You can create a y_hats DataFrame that copies the index from X_test, then merge it with the original data.

y_hats_df = pd.DataFrame(data = y_hats, columns = ['y_hats'], index = X_test.index.copy())
df_out = pd.merge(df, y_hats_df, how = 'left', left_index = True, right_index = True)

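Note that the left join will include the training rows. Omitting the 'how' parameter (which defaults to an inner join) will return only the test rows.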
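The difference between the two joins can be seen on a small made-up frame. This is only an illustrative sketch: the four-row df and the two predictions are invented values, with the predictions indexed by the labels of the rows that would have been in the test set.

```python
import pandas as pd

df = pd.DataFrame({"feature": [1.0, 2.0, 3.0, 4.0]})          # index 0..3
y_hats_df = pd.DataFrame({"y_hats": [1, 0]}, index=[3, 1])    # test rows only

# Left join: every row of df is kept; rows without a prediction get NaN.
left = pd.merge(df, y_hats_df, how="left", left_index=True, right_index=True)

# Default (inner) join: only rows present in both frames survive.
inner = pd.merge(df, y_hats_df, left_index=True, right_index=True)
```

Here left has four rows with NaN at labels 0 and 2, while inner contains only the rows labelled 1 and 3.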

【Comments】:

【Solution 4】:

Try this:

y_hats2 = model.predict(X)
df['y_hats'] = y_hats2

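【Comments】:

Welcome to Stack Overflow. Thanks for providing an answer. I think your answer could be improved further using this article. Any chance you could add some more context?

【Solution 5】:

You can probably create a new dataframe and add the test data to it along with the predicted values: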

data['y_hats'] = y_hats
data.to_csv('data1.csv')

【Comments】:

data['y_hats'] = y_hats results in this error [ValueError: Wrong number of items passed 2, placement implies 1]

【Solution 6】:

predicted = m.predict(X_valid)
predicted_df = pd.DataFrame(data=predicted, columns=['y_hat'], 
                            index=X_valid.index.copy())
df_out = pd.merge(X_valid, predicted_df, how ='left', left_index=True, 
                 right_index=True)

【Comments】:

【Solution 7】:

This worked well for me. It maintains the index positions.

pred_prob = model.predict(X_test) # calculate prediction probabilities
pred_class  = np.where(pred_prob >0.5, "Yes", "No") #for binary(Yes/No) category
predictions = pd.DataFrame(pred_class, columns=['Prediction'])
my_new_df = pd.concat([my_old_df, predictions], axis =1)
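One caveat worth knowing: pd.concat with axis=1 aligns rows on index labels, so the snippet above lines up as intended only when my_old_df has the default 0..n-1 index. A hedged variant, assuming my_old_df holds exactly the rows that were predicted on and in the same order, reuses that frame's index for the predictions. The values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with a non-default index, standing in for the predicted-on rows.
my_old_df = pd.DataFrame({"feature": [0.2, 0.9, 0.4]}, index=[10, 11, 12])
pred_prob = np.array([0.2, 0.9, 0.4])     # stand-in for model.predict(X_test)
pred_class = np.where(pred_prob > 0.5, "Yes", "No")

# Reuse the frame's own index so concat lines rows up label by label.
predictions = pd.DataFrame(pred_class, columns=["Prediction"], index=my_old_df.index)
my_new_df = pd.concat([my_old_df, predictions], axis=1)
```

Without the index argument, predictions would be labelled 0..2 and the concat would produce six mostly empty rows instead of three.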

【Comments】:

【Solution 8】:

You can also use

y_hats = model.predict(X)

df['y_hats'] = y_hats.reset_index()['name of the target column']

【Comments】: