【Question title】: Why is my MSE so high when the test and predicted values are so close?
【Posted】: 2021-02-13 06:23:09
【Question】:

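In Python, I built a small multiple linear regression model that explains housing prices in a district from other variables (all of which are percentages multiplied by 100), such as the percentage of people in a district with a bachelor's degree and the percentage of the population who work from home. I have already done this in R and it worked well, but I am new to ML in Python. I have shown the output of y_pred = regressor.predict(X_test) and the MSE I get. I have included a sample of my data, where avgincome, PctSingleDetached and PctDrivetoWork are X and AvgHousingPrice is Y.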

import matplotlib.pyplot as plt
import numpy as np  # needed below for np.nan, np.set_printoptions and np.concatenate
import pandas as pd
from sklearn.impute import SimpleImputer

sample data:

      avgincome     PctSingleDetached   PctDrivetoWork    AvgHousingPrice 
0      44388.0          61.528497       81.151832          448954   
1      40650.0          54.372197       77.882798          349758  
2      43350.0          68.393782       79.553265          428740

X = hamiltondata.iloc[:, :-1].values
Y = hamiltondata.iloc[:, -1].values
imputer = SimpleImputer(missing_values = np.nan, strategy = 'mean') # Imputer object that replaces each missing value (NaN) with the mean of its column

# fit() computes the mean of every feature column in X
imputer.fit(X[:,:]) # All feature columns here are numeric, so none has to be excluded
X[:, :] = imputer.transform(X[:,:]) # transform() fills the missing values with those means

# from sklearn.compose import ColumnTransformer
# from sklearn.preprocessing import OneHotEncoder
# ct = ColumnTransformer(transformers = [('encoder', OneHotEncoder(), [0])], remainder = 'passthrough')
# X = np.array(ct.fit_transform(X))

print(X)
print(Y)


## Splitting into training and testing ##
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X,Y,test_size = 0.2, random_state = 0)

### Feature Scaling ###

from sklearn.preprocessing import StandardScaler
sc = StandardScaler() # Standardizes each feature: z = (x - mean) / std
X_train[:, 0:] = sc.fit_transform(X_train[:,0:])
# fit learns the mean and standard deviation from the training data; transform applies them. fit_transform does both in one call

X_test[:, 0:] = sc.transform(X_test[:, 0:]) # Reuse the training-set statistics on the test set

print(X_train)
print(X_test)

## Training ## 
from sklearn.linear_model import LinearRegression 

regressor = LinearRegression() # Ordinary least squares; it fits a coefficient for every feature and does not select variables
regressor.fit(X_train, Y_train)

### Predicting Test Set results ###

y_pred = regressor.predict(X_test)
np.set_printoptions(precision = 2) # Display numerical values with only 2 digits after the decimal point
print(np.concatenate((y_pred.reshape(len(y_pred),1), Y_test.reshape(len(Y_test),1 )), axis=1)) # Two columns: predictions on the left, actual values on the right

from sklearn.metrics import mean_squared_error 
mse = mean_squared_error(Y_test, y_pred)
print(mse)

OUTPUT: 
[[489066.76 300334.  ]
 [227458.2  200352.  ]
 [928249.59 946729.  ]
 [339032.27 350116.  ]
 [689668.21 600322.  ]
 [489179.58 577936.  ]]
...
...


MSE = 2375985640.8102403

【Comments】:

    Tags: python pandas numpy machine-learning scikit-learn


    【Solution 1】:

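    You can compute the MSE yourself to check whether something is wrong. In my opinion, the result you obtained is consistent. In any case, I built a simple my_mse function using your sample data to check the result that sklearn outputs.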

    from sklearn.metrics import mean_squared_error 
    
    list_ = [[489066.76, 300334.], 
    [227458.2,  200352.  ],
    [928249.59, 946729.  ],
    [339032.27, 350116.  ],
    [689668.21, 600322.  ],
    [489179.58, 577936.  ]]
    
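    # In the question's output the first column is the prediction and the second is the actual value.
    # MSE is symmetric in its two arguments, so the result does not depend on this order.
    y_pred = [y[0] for y in list_]
    y_true = [y[1] for y in list_]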
    
    mse = mean_squared_error(y_true, y_pred)
    print(mse)
    # 8779930962.14985
    
    def my_mse(y_true, y_pred):
      diff = 0
      for couple in zip(y_true, y_pred):
        diff+=pow(couple[0]-couple[1], 2)
      return diff/len(y_true)
    
    print(my_mse(y_true, y_pred))
    # 8779930962.14985
    

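    Keep in mind that MSE is the mean squared error: each error is squared before the squares are averaged. Large errors are therefore magnified, and the result is expressed in squared units of the target rather than in the units of the target itself.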
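To put the number back on the scale of the prices, one option is to look at the root mean squared error (RMSE) or the mean absolute error (MAE), both of which are in the same units as AvgHousingPrice. The following is a minimal sketch that uses only the six rows printed in the question, not the full test set, so the figures will differ from the MSE reported for the whole test set. The variable names are illustrative.

```python
import math

# (prediction, actual) pairs copied from the six rows printed in the question.
# This is only a sample of the test set, not all of it.
pairs = [
    (489066.76, 300334.0),
    (227458.20, 200352.0),
    (928249.59, 946729.0),
    (339032.27, 350116.0),
    (689668.21, 600322.0),
    (489179.58, 577936.0),
]

errors = [pred - actual for pred, actual in pairs]

mse = sum(e ** 2 for e in errors) / len(errors)   # squared price units
rmse = math.sqrt(mse)                             # same units as the price
mae = sum(abs(e) for e in errors) / len(errors)   # same units as the price

print(f"MSE:  {mse:,.2f}")
print(f"RMSE: {rmse:,.2f}")
print(f"MAE:  {mae:,.2f}")
```

On these six rows the RMSE comes out at about 93,700, which is far easier to compare with prices of a few hundred thousand than an MSE in the billions.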

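    If you are asking whether your model is good or bad, that depends on your main goal. In any case, I think your model performs poorly because it is a linear model. A more complex model might handle the problem better and output better results.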
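One way to judge the fit independently of the price scale is to compute scale-free measures such as the coefficient of determination (R²) and the mean absolute percentage error (MAPE). The sketch below computes both by hand on the six printed rows only, so it is an illustration of the calculation and not an evaluation of the full model. The names `r2` and `mape` are illustrative.

```python
# (prediction, actual) pairs copied from the six rows printed in the question.
# This is only a sample of the test set, so the numbers are illustrative.
pairs = [
    (489066.76, 300334.0),
    (227458.20, 200352.0),
    (928249.59, 946729.0),
    (339032.27, 350116.0),
    (689668.21, 600322.0),
    (489179.58, 577936.0),
]

preds = [p for p, _ in pairs]
actuals = [a for _, a in pairs]
n = len(pairs)
mean_actual = sum(actuals) / n

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
ss_res = sum((a - p) ** 2 for p, a in pairs)
ss_tot = sum((a - mean_actual) ** 2 for a in actuals)
r2 = 1 - ss_res / ss_tot

# MAPE: average absolute error as a fraction of the actual value
mape = sum(abs(a - p) / abs(a) for p, a in pairs) / n

print(f"R^2:  {r2:.3f}")
print(f"MAPE: {mape:.1%}")
```

An R² close to 1 and a small MAPE would suggest the predictions track the actual prices well even when the MSE looks large in absolute terms. With only six rows, both figures are dominated by individual points, such as the first row here.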

    【Comments】:
