Machine learning classification: appending/attaching a prediction column to the original dataset with matching index numbers

【Question title】: Machine learning Classification: Appending/Attaching Prediction Column to original dataset with matching Index number
【Posted】: 2020-09-28 00:23:35
【Question description】:

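I am building a machine learning binary classification model, and I want to attach my prediction column to the original dataset df so that I can compare the raw predictions against the ground-truth target column with a matching index. My dilemma is that in ML the dataset is usually shuffled/randomized before being split into Train, Validation and Test sets. See below:

Original dataset df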

df

applicant_id, income, age, level_of_education, home_owner, gender, target
1001, 32400, 21, 0, 0, M, 0
1024, 76221, 46, 1, 1, F, 1
1706, 231000, 56, 3, 1, M, 1
1008, 38115, 48, 0, 1, M, 1 
.
.
.
.
9999, 47820, 37, 2, 0, F, 0

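After partitioning with train_test_split or createDataPartition, the order of the rows is shuffled and randomized to prevent overfitting. So it looks like this (note the order of the applicant_id column):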

Train_df(Combined: X_train, y_train)

applicant_id, income, age, level_of_education, home_owner, gender, target
1001, 32400, 21, 0, 0, M, 0
9999, 47820, 37, 2, 0, F, 0
.
.
.
.
1008, 38115, 48, 0, 1, M, 1 

test_df (Combined: X_test, y_test)

applicant_id, income, age, level_of_education, home_owner, gender, target
1024, 76221, 46, 1, 1, F, 1
1706, 231000, 56, 3, 1, M, 1

My desired output:

#Key thing: I want to be able to track/trace and compare the `target_label` with the `pred_label` in 
#the dataframe while maintaining the `original sequence/index of the applicant_id`. 

#Lastly, I would like to know which row/record went to `train`, `val` and `test`, as seen in `final_df`

Final_df

applicant_id, income, age, level_of_education, home_owner, gender, target, pred_label, split_class
1001, 32400, 21, 0, 0, M, 0, 0, Train
1024, 76221, 46, 1, 1, F, 1, 1, Test
1706, 231000, 56, 3, 1, M, 1, 0, Test
1008, 38115, 48, 0, 1, M, 1, 1, Val
.
.
.
9999, 47820, 37, 2, 0, F, 0, 0, Train

Here is my code

# libraries

import pandas as pd
from keras.models import Sequential
from keras.layers import Dense

# load the dataset
df = pd.read_csv('data.csv', delimiter=',')
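# split into input (X) and output (y) variables
# (.iloc is required for positional slicing; df[:, 0:7] raises an error on a DataFrame)
X = df.iloc[:, :-1]  # every column except the last
y = df.iloc[:, -1]   # last column: target

# Data partition/Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# define the keras model
model = Sequential()
# input_dim must equal the number of feature columns
# (string columns such as gender need numeric encoding before training)
model.add(Dense(12, input_dim=X_train.shape[1], activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))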

# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# fit the keras model on the dataset
history = model.fit(X_train, y_train, epochs=150, batch_size=10, verbose=0)

# make class predictions with the model
y_pred = model.predict(X_test)
predictions = (y_pred > 0.5)

df['pred_label'] = predictions  # this returns only for `test_set`; I want it for `train_set` and `val_set` as well so I can combine them in the `original_df`

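【Question discussion】:

If you predict on the same data the model was trained on, isn't that biased? Those predictions are not useful anyway; if you still want them, a simple model.predict on X_train will do.
@UjjwalAgrawal I am predicting on the test set, hence y_pred = model.predict(X_test). What I am really interested in is tracking the `index` of applicant_id in the test set.
Maybe How to fill NaN values by imputation, in the Titanic Age column? will help you.
Why not simply keep the applicant ID after the train/test split?

Tags: python pandas numpy keras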

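【Solution 1】:

You can create an "index" column before the train/test split. After the split, both the training and test sets carry that index column. Do not use the index column as a feature for training or testing. The predictions come back in the same order as the input rows, so you can attach them to the dataframe that holds the index column and no row loses track of where it came from. The short `index_df` snippet below sketches the logic. As for your k-fold setup, I don't know how you plan to combine the results. If you train the model only once, feeding the whole dataset to predict is enough.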
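A related detail may make this easier: when `train_test_split` is given pandas objects, the returned pieces keep their original index labels, so predictions can be aligned back to `df` by index. The following is only a minimal sketch under assumptions: the small frame is made-up data, and `fake_predict` is a hypothetical stand-in for `model.predict`, not part of Keras.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Tiny made-up dataset (illustrative values, not the question's CSV)
df = pd.DataFrame({
    "applicant_id": [1001, 1024, 1706, 1008, 2001, 2002, 2003, 2004, 2005, 9999],
    "income": [32400, 76221, 231000, 38115, 51000, 64000, 29000, 88000, 43000, 47820],
    "age": [21, 46, 56, 48, 33, 39, 25, 52, 30, 37],
    "target": [0, 1, 1, 1, 0, 1, 0, 1, 0, 0],
})

features = ["income", "age"]  # the id column is deliberately left out
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["target"], test_size=0.3, random_state=0
)


def fake_predict(X):
    """Hypothetical stand-in for model.predict: one score in [0, 1] per row."""
    return (X["income"].to_numpy() > 50000).astype(float)


# Wrap the raw predictions in a Series that carries X_test's index labels
pred_test = pd.Series((fake_predict(X_test) > 0.5).astype(int), index=X_test.index)

# Assignment aligns on the index: test rows get a prediction, the rest stay NaN
df["pred_label"] = pred_test
print(df)
```

Because the assignment aligns on index labels rather than on position, the shuffled order of `X_test` does not matter, and `df` keeps its original row order.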

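from sklearn.model_selection import train_test_split
# keep the original row index as a column so every row stays traceable after the split
X['index_df'] = X.index
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# train on the feature columns only; never feed the index column to the model
feature_cols = ['chosen_features_without_index']  # placeholder name from the original answer
history = model.fit(X_train[feature_cols], y_train, epochs=150, batch_size=10, verbose=0)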
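To get something like the `final_df` from the question, with a `pred_label` for every row and a `split_class` column, one possible approach is to split twice, label each row by its preserved index, and then predict on the full feature table in its original order. This is a sketch under stated assumptions: the data is synthetic, the 60/20/20 proportions are arbitrary, and scikit-learn's `LogisticRegression` stands in for the Keras network so the example stays small.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (made up for illustration)
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "applicant_id": np.arange(1001, 1001 + n),
    "income": rng.integers(20_000, 250_000, size=n),
    "age": rng.integers(18, 70, size=n),
})
df["target"] = (df["income"] > 80_000).astype(int)

features = ["income", "age"]  # applicant_id is excluded from training
X, y = df[features], df["target"]

# Two-step split: 60% train, then the remainder halved into val and test
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

# Record which split each original row landed in, keyed by the preserved index
df["split_class"] = "Train"
df.loc[X_val.index, "split_class"] = "Val"
df.loc[X_test.index, "split_class"] = "Test"

# Stand-in classifier; with Keras this would be model.fit(...) on X_train
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Predict on all rows in their original order so the output lines up with df
proba = model.predict_proba(X)[:, 1]
df["pred_label"] = (proba > 0.5).astype(int)

final_df = df[["applicant_id", *features, "target", "pred_label", "split_class"]]
print(final_df.head())
print(final_df["split_class"].value_counts())
```

With the Keras model from the question, `model.predict(X)` returns an array of shape `(n, 1)` for a single sigmoid output, so it would need flattening (for example with `.ravel()`) before being assigned as a column. Keep in mind that predictions on the `Train` rows come from data the model has already seen, so only the `Val` and `Test` rows give an honest picture of performance.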

【Discussion】: