Improve a POC of linear regression with sklearn and pandas: answers

【Question title】: Improve a POC of linear regression with sklearn and pandas
【Posted】: 2018-07-30 02:39:05
【Question】:

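Basically, I am deploying a proof of concept with a linear regression model to check the coefficient of determination (R²) it reaches on a specific dataset. As with the more advanced models I built earlier, I preprocessed the dataset to make sure that every column required as input is numeric and normalized.

An overview of the dataset shows that all columns are numeric and correctly formatted. Predictors:

Target:

I ran describe to get more detail and verify the values again. (Predictors in red, target in yellow.)

Deploying the model: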

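# Imports needed by the snippet below
from sklearn import linear_model, metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split into training and test sets (test_size=0.80 holds out 80% of the rows for testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.80, random_state=33)

# Fit the scalers on the training data only, then scale the training set
scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(y_train.reshape(-1, 1))
X_train = scalerX.transform(X_train)
y_train = scalery.transform(y_train.reshape(-1, 1))

# Apply the same scalers to the test set
X_test = scalerX.transform(X_test)
y_test = scalery.transform(y_test.reshape(-1, 1))

# Create the linear regression model, fitted by stochastic gradient descent
# (loss='squared_loss' is spelled 'squared_error' in scikit-learn 1.0 and later)
clf_sgd = linear_model.SGDRegressor(loss='squared_loss', penalty=None, random_state=33)
#clf_sgd = LinearRegression()

# Fit the model and report R^2 on the training set
clf_sgd.fit(X_train, y_train.ravel())
print("Coefficient de determination:", clf_sgd.score(X_train, y_train))
# Model performance: R^2 on the test set
y_pred = clf_sgd.predict(X_test)
print("Coefficient de determination:{0:.3f}".format(metrics.r2_score(y_test, y_pred)))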

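Unfortunately, my results are very, very bad.

I look forward to hearing ideas on how to improve my model; I do not have much experience in this field. Thank you very much.

【Comments】:

Tags: python pandas machine-learning scikit-learn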

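【Answer 1】:

Two things can be improved:

1) You need to configure the hyperparameters of the linear model properly. scikit-learn's SGDRegressor is very sensitive to the values chosen for several parameters, the most important being alpha, penalty, loss and max_iter. Read up on a technique called cross-validation and use it to find reasonable values of these parameters for your data.

2) Except in very special cases, you do not actually need to scale the target variable y.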
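
The two suggestions above could be combined roughly as follows. This is only a minimal sketch, not a tuned solution: the data comes from make_regression as a synthetic stand-in (the asker's real X and y are not shown), and the grid values and the 20% test split are illustrative assumptions. Only the predictors are scaled, inside a pipeline so the scaler is refit on each cross-validation fold, and y is left in its original units. The loss parameter is left at its default here because the name of the squared-loss option differs between scikit-learn versions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption: replace with the real predictors and target)
X, y = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=33)

# Keep most of the rows for training; 20% is an illustrative choice
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=33
)

# Scale the predictors only; the target y stays unscaled
pipe = make_pipeline(StandardScaler(), SGDRegressor(random_state=33))

# Candidate hyperparameter values (illustrative, not recommendations)
param_grid = {
    "sgdregressor__alpha": [1e-5, 1e-4, 1e-3, 1e-2],
    "sgdregressor__penalty": ["l2", "l1", "elasticnet"],
    "sgdregressor__max_iter": [1000, 5000],
}

# 5-fold cross-validation over the grid, scored by R^2
search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X_train, y_train)

best_params = search.best_params_
cv_r2 = search.best_score_
test_r2 = search.score(X_test, y_test)  # R^2 of the refitted best pipeline

print("Best parameters:", best_params)
print("Cross-validated R^2: {0:.3f}".format(cv_r2))
print("Test R^2: {0:.3f}".format(test_r2))
```

Because the test set is touched only once, after the search has finished, the final score is not biased by the hyperparameter selection.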

【Comments】: