Why is my MSE so high when the difference between test and prediction values are so close?

[Question title]: Why is my MSE so high when the difference between test and prediction values are so close?
[Posted]: 2021-02-13 06:23:09
[Question]:

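In Python, I built a small multiple linear regression model that explains an area's house prices from other variables (all of which are percentages multiplied by 100), such as the percentage of people in an area with a bachelor's degree or the percentage of the population who work from home. I have already done this in R and it worked well, but I am new to ML in Python. Below I show the output of y_pred = regressor.predict(X_test) and the MSE I get. I have included a sample of my data, where avgincome, PctSingleDetached, and PctDrivetoWork are X and AvgHousingPrice is Y.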

import numpy as np  # needed below for np.nan and np.set_printoptions
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.impute import SimpleImputer

sample data:

      avgincome     PctSingleDetached   PctDrivetoWork    AvgHousingPrice 
0      44388.0          61.528497       81.151832          448954   
1      40650.0          54.372197       77.882798          349758  
2      43350.0          68.393782       79.553265          428740

X = hamiltondata.iloc[:, :-1].values
Y = hamiltondata.iloc[:, -1].values
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # replaces each missing value with its column mean

# fit() computes the column means from the feature matrix; every column here is numeric
imputer.fit(X[:, :])
X[:, :] = imputer.transform(X[:, :])

# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OneHotEncoder
# ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')
# X = np.array(ct.fit_transform(X))

print(X)
print(Y)


## Splitting into training and testing ##
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.2, random_state = 0)

### Feature Scaling ###

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()  # standardizes each feature: z = (x - mean) / std
X_train[:, 0:] = sc.fit_transform(X_train[:, 0:])
# fit() learns the mean and std from the training data; transform() applies them. fit_transform() does both.

X_test[:, 0:] = sc.transform(X_test[:, 0:])  # reuse the training statistics on the test set

print(X_train)
print(X_test)

## Training ## 
from sklearn.linear_model import LinearRegression 

regressor = LinearRegression()  # ordinary least squares on all features; it does not select variables
regressor.fit(X_train, Y_train)

### Predicting Test Set results ###

y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)  # display floats with 2 digits after the decimal point
print(np.concatenate((y_pred.reshape(len(y_pred), 1), Y_test.reshape(len(Y_test), 1)), axis=1))  # columns: predicted, actual

from sklearn.metrics import mean_squared_error 
mse = mean_squared_error(Y_test, y_pred)
print(mse)

OUTPUT: 
[[489066.76 300334.  ]
 [227458.2  200352.  ]
 [928249.59 946729.  ]
 [339032.27 350116.  ]
 [689668.21 600322.  ]
 [489179.58 577936.  ]]
...
...


MSE = 2375985640.8102403

[Question comments]:

Tags: python pandas numpy machine-learning scikit-learn

[Solution 1]:

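You can compute the MSE yourself to check whether something is wrong. The results look consistent to me. In any case, I built a simple my_mse function and ran it on your sample output to check the value sklearn returns.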

from sklearn.metrics import mean_squared_error 

list_ = [[489066.76, 300334.], 
[227458.2,  200352.  ],
[928249.59, 946729.  ],
[339032.27, 350116.  ],
[689668.21, 600322.  ],
[489179.58, 577936.  ]]

y_pred = [row[0] for row in list_]  # first column of the printed output: predictions
y_true = [row[1] for row in list_]  # second column: actual test values

mse = mean_squared_error(y_true, y_pred)
print(mse)
# 8779930962.14985

def my_mse(y_true, y_pred):
  diff = 0
  for couple in zip(y_true, y_pred):
    diff+=pow(couple[0]-couple[1], 2)
  return diff/len(y_true)

print(my_mse(y_true, y_pred))
# 8779930962.14985

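Keep in mind that MSE is the mean squared error: each error is squared before being averaged, so it is expressed in squared price units. With house prices in the hundreds of thousands, errors in the tens of thousands yield an MSE in the billions.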
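To make the scale concrete, one option is to take the square root of the MSE (the RMSE), which is in the same units as the house prices. The following is a minimal sketch that uses only the six rows printed in the question; the names `rows`, `rmse`, and `mae` are illustrative and do not come from the original post.

```python
import math

# The six (predicted, actual) pairs printed in the question's output.
rows = [
    (489066.76, 300334.0),
    (227458.2, 200352.0),
    (928249.59, 946729.0),
    (339032.27, 350116.0),
    (689668.21, 600322.0),
    (489179.58, 577936.0),
]

errors = [pred - actual for pred, actual in rows]

mse = sum(e ** 2 for e in errors) / len(errors)   # squared price units
rmse = math.sqrt(mse)                             # same units as the prices
mae = sum(abs(e) for e in errors) / len(errors)   # mean absolute error

print(f"MSE:  {mse:,.2f}")
print(f"RMSE: {rmse:,.2f}")
print(f"MAE:  {mae:,.2f}")

# The same square root applied to the MSE reported for the full test set.
reported_mse = 2375985640.8102403
print(f"RMSE (full test set): {math.sqrt(reported_mse):,.2f}")
```

On these six rows the RMSE comes out near 93,700, and the square root of the MSE reported for the full test set is roughly 48,700. In other words, the typical error is in the tens of thousands on prices in the hundreds of thousands.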

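If you are asking whether your model is good or bad, that depends on your main goal. In any case, I think your model performs poorly because it is a linear model. A more complex model might handle the problem and give better results.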
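One possible way to judge "good or bad" is to compare the model's squared error with that of a trivial baseline that always predicts the mean price; the ratio gives the coefficient of determination, R². The sketch below is only an illustration under stated assumptions: it uses just the six rows shown in the question and takes the baseline mean from those same rows, so the figure is indicative and not a proper evaluation. The variable names are hypothetical.

```python
# The six (predicted, actual) pairs printed in the question's output.
rows = [
    (489066.76, 300334.0),
    (227458.2, 200352.0),
    (928249.59, 946729.0),
    (339032.27, 350116.0),
    (689668.21, 600322.0),
    (489179.58, 577936.0),
]

y_true = [actual for _, actual in rows]
mean_true = sum(y_true) / len(y_true)

# Residual sum of squares: error of the model's predictions.
ss_res = sum((actual - pred) ** 2 for pred, actual in rows)
# Total sum of squares: error of always predicting the mean price.
ss_tot = sum((actual - mean_true) ** 2 for actual in y_true)

r2 = 1 - ss_res / ss_tot
print(f"R^2 on the six sample rows: {r2:.3f}")
```

An R² close to 1 means the model explains most of the variation in price, while a value near 0 means it does little better than predicting the mean. `sklearn.metrics.r2_score(y_true, y_pred)` computes the same quantity.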

[Comments]: