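【Posted】: 2020-09-28 00:23:35
【Question】:
I am building a machine learning binary classification model, and I want to append my prediction column to the original dataset df so that I can compare the raw predictions against the ground-truth target column with a matching index. My dilemma is that, in ML, the dataset is usually shuffled/randomized before it is split into Train, Validation and Test sets. See below:
Original dataset df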
df
applicant_id, income, age, level_of_education, home_owner, gender, target
1001, 32400, 21, 0, 0, M, 0
1024, 76221, 46, 1, 1, F, 1
1706, 231000, 56, 3, 1, M, 1
1008, 38115, 48, 0, 1, M, 1
.
.
.
.
9999, 47820, 37, 2, 0, F, 0
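After partitioning with train_test_split or createDataPartition, the order of the rows is shuffled and randomized to prevent overfitting. So it looks like this (note the order of the applicant_id column):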
train_df (Combined: X_train, y_train)
applicant_id, income, age, level_of_education, home_owner, gender, target
1001, 32400, 21, 0, 0, M, 0
9999, 47820, 37, 2, 0, F, 0
.
.
.
.
1008, 38115, 48, 0, 1, M, 1
test_df (Combined: X_test, y_test)
applicant_id, income, age, level_of_education, home_owner, gender, target
1024, 76221, 46, 1, 1, F, 1
1706, 231000, 56, 3, 1, M, 1
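The shuffling described above can be reproduced on a toy frame. The sketch below is a minimal illustration, assuming the data is passed to scikit-learn's train_test_split as a pandas DataFrame: the rows come back in shuffled order, but each row keeps its original index label, so the original sequence can be recovered later. The column subset and the test_size value are chosen only for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame built from the five sample rows shown in the question
# (only three of the columns, to keep the sketch short).
df = pd.DataFrame({
    "applicant_id": [1001, 1024, 1706, 1008, 9999],
    "income": [32400, 76221, 231000, 38115, 47820],
    "target": [0, 1, 1, 1, 0],
})

# test_size=0.4 is an arbitrary choice for this five-row example.
train_df, test_df = train_test_split(df, test_size=0.4, random_state=0)

# The rows are shuffled, but the index labels travel with them.
print(train_df.index.tolist())
print(test_df.index.tolist())

# Concatenating the pieces and sorting by index restores the original order.
restored = pd.concat([train_df, test_df]).sort_index()
print(restored)
```

Because the index survives the split, anything computed per split (predictions, split labels) can be written back to the original frame by index label.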
Desired output:
#Key thing: I want to be able to track/trace and compare the `target_label` with the `pred_label` in
#the dataframe while maintaining the `original sequence/index of the applicant_id`.
#Lastly, I would like to know which row/record went to `train`, `val` and `test`, as seen in `final_df`
Final_df
applicant_id, income, age, level_of_education, home_owner, gender, target, pred_label, split_class
1001, 32400, 21, 0, 0, M, 0, 0, Train
1024, 76221, 46, 1, 1, F, 1, 1, Test
1706, 231000, 56, 3, 1, M, 1, 0, Test
1008, 38115, 48, 0, 1, M, 1, 1, Val
.
.
.
9999, 47820, 37, 2, 0, F, 0, 0, Train
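One way to obtain a split_class column like the one above is to split the index labels instead of the rows, then write a label back for each group. This is a sketch under assumptions: the toy data, the 60/20/20 proportions, and the two-step use of train_test_split are illustrative choices, not something stated in the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the real dataset.
df = pd.DataFrame({
    "applicant_id": list(range(1001, 1011)),
    "target": [0, 1, 1, 1, 0, 0, 1, 0, 1, 0],
})

# Split the index labels: first carve off the test set,
# then split the remainder into train and validation.
all_idx = df.index.to_numpy()
train_val_idx, test_idx = train_test_split(all_idx, test_size=0.2, random_state=0)
train_idx, val_idx = train_test_split(train_val_idx, test_size=0.25, random_state=0)

# Record which split each original row belongs to.
split_class = pd.Series(index=df.index, dtype="object")
for name, idx in [("Train", train_idx), ("Val", val_idx), ("Test", test_idx)]:
    split_class.loc[idx] = name

# df keeps its original row order; only a new column is added.
final_df = df.assign(split_class=split_class)
print(final_df)
```

The feature and target subsets for each split can then be selected with df.loc[train_idx], df.loc[val_idx] and df.loc[test_idx].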
Here is my code:
# libraries
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
# load the dataset
df = pd.read_csv('data.csv', delimiter=',')
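# split into input (X) and output (y) variables
# (a DataFrame cannot be sliced with df[:, 0:7]; select columns by name instead)
X = df.drop(columns=['applicant_id', 'target'])  # applicant_id is an identifier, not a feature
X['gender'] = (X['gender'] == 'M').astype(int)  # encode M/F as 1/0 so the network gets numeric input
y = df['target']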
# Data partition/Split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
# define the keras model
model = Sequential()
model.add(Dense(12, input_dim=X_train.shape[1], activation='relu'))  # input size must match the number of feature columns
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# compile the keras model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# fit the keras model on the dataset
history = model.fit(X_train, y_train, epochs=150, batch_size=10, verbose=0)
# make class predictions with the model
y_pred = model.predict(X_test)
predictions = (y_pred > 0.5).astype(int).ravel()  # 0/1 labels as a flat array
df['pred_label'] = predictions # this covers only the `test_set` (so its length does not match `df`); I want predictions for the `train_set` and `val_set` as well so I can combine them in the `original_df`
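The last line above cannot work as written because the predictions cover only the test rows. A minimal sketch of one way around this is to wrap each split's predictions in a Series that carries that split's index, and let pandas align them on assignment. To keep the example runnable without TensorFlow, it uses a scikit-learn LogisticRegression as a stand-in for the Keras model; that substitution, the synthetic data and the names pred_train and pred_test are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic, deterministic toy data (not the asker's data.csv).
n = 40
df = pd.DataFrame({
    "applicant_id": np.arange(1001, 1001 + n),
    "income": np.linspace(20_000, 250_000, n),
    "age": 18 + (np.arange(n) * 7) % 50,
})
df["target"] = (df["income"] > 120_000).astype(int)

X = df[["income", "age"]]
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Stand-in classifier (assumption): any fitted model with .predict would do.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

# Wrap each split's predictions in a Series that carries that split's index.
pred_train = pd.Series(model.predict(X_train), index=X_train.index)
pred_test = pd.Series(model.predict(X_test), index=X_test.index)

# Column assignment aligns on index, so each prediction lands on its own row
# and df keeps its original applicant_id order.
final_df = df.copy()
final_df["pred_label"] = pd.concat([pred_train, pred_test])
print(final_df.head())
```

With the Keras model from the question, the label arrays would instead come from (model.predict(X_train) > 0.5).astype(int).ravel() and the same expression for X_test; the index handling stays the same as long as X_train and X_test are still pandas objects.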
【Comments】:
-
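Wouldn't it be biased to predict on the same training data the model was trained on? Those predictions aren't useful anyway, but if you want them you can simply call model.predict on X_train.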
-
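@UjjwalAgrawal I am predicting on the test set, hence y_pred = model.predict(X_test). What I am actually more interested in is tracking the `index` of applicant_id in the test set.
-
Why wouldn't saving the applicant IDs after the train/test split work?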
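The suggestion in the last comment can be sketched as follows. train_test_split accepts any number of arrays of equal length and splits them consistently, so the applicant IDs can be passed alongside X and y even when the features are plain NumPy arrays with no pandas index. The toy values are taken from the sample rows in the question; the names ids, ids_train, ids_test and test_results are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame built from the five sample rows shown in the question.
df = pd.DataFrame({
    "applicant_id": [1001, 1024, 1706, 1008, 9999],
    "income": [32400, 76221, 231000, 38115, 47820],
    "age": [21, 46, 56, 48, 37],
    "target": [0, 1, 1, 1, 0],
})

# Plain NumPy arrays: no pandas index to rely on here.
X = df[["income", "age"]].to_numpy()
y = df["target"].to_numpy()
ids = df["applicant_id"].to_numpy()

# All three arrays are shuffled and split with the same permutation.
X_train, X_test, y_train, y_test, ids_train, ids_test = train_test_split(
    X, y, ids, test_size=0.4, random_state=0
)

# ids_test[i] is the applicant_id of the row X_test[i].
test_results = pd.DataFrame({"applicant_id": ids_test, "target": y_test})
print(test_results)
```

A pred_label column could then be added to test_results from the model's test predictions and merged back into the original frame on applicant_id.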