How to improve the accuracy of prediction in scikit-learn: answers

【Question title】: How to improve the accuracy of prediction in scikit-learn
【Posted】: 2019-06-07 21:31:06
【Question】:

I want to predict a parameter from 3 features and 1 target. Here is my input file (data.csv):

feature.1   feature.2   feature.3   target
    1           1          1        0.0625
    0.5         0.5        0.5      0.125
    0.25        0.25       0.25     0.25
    0.125       0.125      0.125    0.5
    0.0625      0.0625     0.0625   1

Here is my code:

import pandas as pd
from sklearn.model_selection import train_test_split
from collections import OrderedDict
from sklearn.linear_model import LinearRegression

features = pd.read_csv('data.csv')

features.head()
features_name = ['feature.1' , 'feature.2' , 'feature.3']
target_name = ['target']

X = features[features_name]
y = features[target_name]

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.1, random_state = 42)

linear_regression_model = LinearRegression()
linear_regression_model.fit(X_train,y_train)

#Here is where I want to predict the target value for these inputs for 3 features
new_data  = OrderedDict([('feature.1',0.375) ,('feature.2',0.375),('feature.3',0.375) ])

new_data = pd.Series(new_data).values.reshape(1,-1)
ss = linear_regression_model.predict(new_data)
print (ss)

Based on the trend, if I pass 0.375 as the input for all features, I expect to get a value of about 0.1875. However, the code predicts this:

[[0.44203368]]

This is incorrect. I don't know where the problem is. Does anyone know how I can fix it?

Thanks

【Comments】:

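All of your training data points happen to have all 3 features equal, which can cause collinearity problems. If this is always the case, you should drop all but one of the features. If not, you should include some data points in your training set that do not satisfy this condition...
This collinearity among the features causes problems with the assumptions of linear regression.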
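
The collinearity point raised in the comments can be checked numerically. The following is a minimal sketch, assuming the five rows of data.csv are typed in by hand (the arrays `x` and `X` are that assumption, not part of the original code). It shows that the feature matrix has rank 1, so two of the three columns add no information.

```python
import numpy as np

# The five feature rows from the question, entered by hand
# (assumption: data.csv contains exactly these values).
x = np.array([1, 0.5, 0.25, 0.125, 0.0625])
X = np.column_stack([x, x, x])  # feature.1, feature.2, feature.3 are identical

# A rank of 1 means every column is a multiple of a single column.
rank = np.linalg.matrix_rank(X)
print(rank)

# Keeping one column loses no information for this particular data set.
X_reduced = X[:, [0]]
print(X_reduced.shape)
```

Dropping the redundant columns removes the collinearity, but on its own it does not make the relationship between feature and target linear.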

Tags: python scikit-learn linear-regression prediction train-test-split

【Solution 1】:

Your data is not linear. Since the features are identical, I plotted only one dimension (the plot is not preserved here).
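
Without the plot, the non-linearity can still be seen from the numbers. This is a rough sketch, assuming the values from data.csv are entered by hand (the arrays `x` and `y` are that assumption): for a straight line the slope between neighbouring points would be constant, which is not the case here.

```python
import numpy as np

# Values from the question's data.csv, typed in by hand (assumption).
x = np.array([1, 0.5, 0.25, 0.125, 0.0625])   # any one of the three features
y = np.array([0.0625, 0.125, 0.25, 0.5, 1])   # target

# For linear data, the slope between neighbouring points would be constant.
slopes = np.diff(y) / np.diff(x)
print(slopes)

# Here the product x * y is constant instead, which suggests y = c / x.
products = x * y
print(products)
```

The slopes change by a factor of four from one pair of points to the next, while `x * y` stays constant, so a reciprocal relationship describes these five points better than a line does.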

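Approximating a non-linear function with a linear regression model gives poor results, as you have experienced. You can try modelling a better-fitting function and fitting its parameters with scipy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html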
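
As one possible illustration of this suggestion, the sketch below fits a power law with `scipy.optimize.curve_fit`. The model form `y = a * x**b`, the function name `power_law`, and the starting guess `p0` are assumptions made for this example; the answer does not say which function to fit. Only one feature is used because the three are identical.

```python
import numpy as np
from scipy.optimize import curve_fit

# Values from the question's data.csv, typed in by hand (assumption).
x = np.array([1, 0.5, 0.25, 0.125, 0.0625])
y = np.array([0.0625, 0.125, 0.25, 0.5, 1])

def power_law(x, a, b):
    # Hypothetical model chosen for illustration: y = a * x**b
    return a * np.power(x, b)

# p0 is a starting guess for (a, b); it is an assumption, not a required value.
params, _ = curve_fit(power_law, x, y, p0=(1.0, -1.0))
a, b = params

# Predict the target for the question's new input of 0.375.
prediction = power_law(0.375, a, b)
print(a, b, prediction)
```

On these five points the fit should land close to `a = 0.0625` and `b = -1`, which gives a prediction near 0.167 for an input of 0.375. That is lower than the 0.1875 the question expected, because 0.1875 comes from interpolating linearly between the two neighbouring rows.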
