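【Posted】: 2016-09-15 10:16:51
【Question】:
Kaggle has an introductory data science problem on the Titanic, where the goal is to predict a passenger's chance of survival from some information about them (such as sex, age, ticket class, etc.). I implemented a simple logistic regression model for this in Python using scikit-learn, and I am exploring the effect of adding higher-order terms, in particular for the "Age" variable. I used PolynomialFeatures following the instructions on the scikit-learn website: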
import pandas as pd
from sklearn import linear_model
import numpy as np
from sklearn import preprocessing
from sklearn.preprocessing import PolynomialFeatures
# Import titanic data
titanic = pd.read_csv("train.csv")
# Set the training set as 70% of the dataset and cross_validation set as remaining 30%
predictors = ["Age"]
training_set = titanic[predictors].iloc[range(0,int(titanic.shape[0]*0.7)),:]
cv_set = titanic[predictors].iloc[range(int(titanic.shape[0]*0.7),titanic.shape[0]),:]
training_actuals = titanic["Survived"].iloc[range(0,int(titanic.shape[0]*0.7))]
cv_actuals = titanic["Survived"].iloc[range(int(titanic.shape[0]*0.7),titanic.shape[0])]
# Create polynomial features
poly = PolynomialFeatures(degree=3)
training_set = poly.fit_transform(training_set)
# Reuse the transformer already fitted on the training set
cv_set = poly.transform(cv_set)
# Fit a logistic regression model, predict values for training and cross-validation sets
alg = linear_model.LogisticRegression()
alg.fit(training_set, training_actuals)
cv_predictions = alg.predict(cv_set)
training_predictions = alg.predict(training_set)
# Measure and print accuracy of prediction over both training and cross-validation sets
cv_accuracy = len(cv_predictions[cv_predictions == np.array(cv_actuals)])/float(len(cv_predictions))
print("Prediction accuracy on cross-validation set is %s%%" % (cv_accuracy * 100))
training_accuracy = len(training_predictions[training_predictions == np.array(training_actuals)])/float(len(training_predictions))
print("Prediction accuracy on training set is %s%%" % (training_accuracy * 100))
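When I add a squared feature for age (i.e. polynomial degree 2), the prediction accuracy on the training set improves by 1-2 percentage points. But when I set the degree to 3, as in the code above, the accuracy falls back to the same value as in the linear case (i.e. degree = 1). In theory, it should improve slightly or stay the same as for degree = 2. The same behavior occurs for all higher degrees. I am very new to scikit-learn and would appreciate any insight into what I am doing wrong.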
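To make the setup concrete, here is a minimal sketch of what `PolynomialFeatures(degree=3)` produces for a single column such as Age. The three ages are made-up values, not rows from train.csv. The cubic column comes out several orders of magnitude larger than the linear one, which is one thing worth checking when a model that is L2-regularized by default, like `LogisticRegression`, behaves unexpectedly at higher degrees. Whether that explains the accuracy drop described here is an assumption, not something the post confirms, and the `StandardScaler` step is shown only as one possible way to put the columns on a comparable scale.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Hypothetical ages standing in for the "Age" column of train.csv
ages = np.array([[2.0], [30.0], [80.0]])

# degree=3 on one column yields the columns [1, x, x^2, x^3]
poly = PolynomialFeatures(degree=3)
expanded = poly.fit_transform(ages)
print(expanded)

# Standardizing gives every column zero mean and unit variance.
# The bias column is left out here because it is constant.
poly_no_bias = PolynomialFeatures(degree=3, include_bias=False)
scaled = StandardScaler().fit_transform(poly_no_bias.fit_transform(ages))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

In the question's code, the same idea would mean fitting the scaler on the transformed training set and applying it unchanged to the cross-validation set before calling `alg.fit`.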
【Comments】:
- "In theory it should improve..." You mean that, intuitively, you think it should improve. It is important not to confuse the two :)
Tags: python scikit-learn