【Posted】: 2017-12-10 18:59:28
【Question】:
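I have just started using feature selection on my dataset and came across the SelectFromModel module, which automatically reduces the original n x m feature matrix to n x k, where k << m. However, k is not known in advance.
I would like to know how I should use it to train a model and then use the saved model to predict on new data. As is well known, training and test instances must be represented by feature vectors of the same dimension.
But this dimension depends on the data and cannot be controlled through SelectFromModel.
I wrote the following code: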
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X_train = ... # feature matrix
print("BEFORE FEATURE SELECTION, FEATURE MATRIX shape={}".format(X_train.shape))
# output of this line is 24771, 11680
select = SelectFromModel(LogisticRegression(class_weight='balanced',penalty="l1",C=0.01))
X_train = select.fit_transform(X_train, y_train)
print("AFTER FEATURE SELECTION, FEATURE MATRIX shape={}".format(X_train.shape))
# output of this line is 24771, 170
At test time, the pre-trained model is loaded, and the new data instances need to be represented with the same feature vectors:
X_test = ... # feature matrix
# the next line maps test set features to the feature vectors observed on training data, using corresponding vocabularies
X_test = map_test_to_train_featurevectors(X_test, X_train)
print("BEFORE FEATURE SELECTION, FEATURE MATRIX shape={}".format(X_test.shape))
# output of this line is 550, 11680, so test instances have the same vector dimension as training instances
select = SelectFromModel(LogisticRegression(class_weight='balanced',penalty="l1",C=0.01))
X_test = select.fit_transform(X_test, y_test)
print("AFTER FEATURE SELECTION, FEATURE MATRIX shape={}".format(X_test.shape))
# output of this line is 550, 5, but the pre-trained model will expect 170
best_estimator = util.load_classifier_model(model_file)
prediction_dev = best_estimator.predict_proba(X_test)
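The last line obviously raises the following error, because the feature matrices obtained after feature selection at training time and at test time have different dimensions: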
ValueError: X has 5 features per sample; expecting 169
Does this mean SelectFromModel cannot be used this way? Can it only be used for training and evaluation?
【Discussion】:
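One possible way to keep the dimensions consistent, sketched here under assumptions rather than given as a definitive fix: fit the selector once on the training data, save it together with the classifier, and at prediction time call only transform (never fit or fit_transform) on the new data. The set of k columns is then fixed by the training data and reused unchanged. The sketch below wraps both steps in a scikit-learn Pipeline so they are fitted and persisted as one object. The synthetic data, the step names "select" and "clf", the second LogisticRegression used as the final classifier, and the in-memory pickle round trip (standing in for the question's own save/load helper) are all illustrative assumptions. solver="liblinear" is spelled out because recent scikit-learn defaults do not support the L1 penalty.

```python
import pickle

import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Synthetic stand-in data (assumption): only the first 5 of 100 features carry signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(4000, 100))
y = (X[:, :5].sum(axis=1) + rng.normal(scale=0.5, size=4000) > 0).astype(int)
X_train, y_train = X[:3000], y[:3000]
X_test = X[3000:]

# Selection and classification live in one estimator, so they are fitted
# together on the training data and saved together.
model = Pipeline([
    ("select", SelectFromModel(
        LogisticRegression(class_weight="balanced", penalty="l1",
                           C=0.01, solver="liblinear")
    )),
    ("clf", LogisticRegression(class_weight="balanced", max_iter=1000)),
])
model.fit(X_train, y_train)

# k is determined here, once, by the training data.
k = int(model.named_steps["select"].get_support().sum())
print("features kept on training data:", k)

# Persist and reload the whole pipeline (in memory here; a file works the same way).
restored = pickle.loads(pickle.dumps(model))

# At prediction time the selector is only applied, not refitted, so the test
# matrix is reduced to the same k columns that were chosen during training.
X_test_selected = restored.named_steps["select"].transform(X_test)
print("test matrix after selection:", X_test_selected.shape)

# predict_proba on the pipeline runs the same transform internally.
proba = restored.predict_proba(X_test)
print("probabilities shape:", proba.shape)
```

If the selector and the classifier are kept as separate objects instead, the same idea applies: save the fitted selector next to the model and call its transform on X_test before predict_proba.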
Tags: python scikit-learn