Comparing Results from StandardScaler vs Normalizer in Linear Regression

【Question title】: Comparing Results from StandardScaler vs Normalizer in Linear Regression
【Posted】: 2019-06-01 17:02:05
【Question】:

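I'm working through some examples of linear regression under different scenarios, comparing the results from using Normalizer and StandardScaler, and the results are puzzling.

I'm using the Boston housing dataset, and prepping it this way: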

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

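# load the data (load_boston was removed in scikit-learn 1.2, so this needs an older release)
boston = load_boston()
df = pd.DataFrame(boston.data)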
df.columns = boston.feature_names
df['PRICE'] = boston.target

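I'm currently trying to reason about the results I get from the following scenarios:

Initializing linear regression with the parameter normalize=True vs. using Normalizer
Initializing linear regression with the parameter fit_intercept=False, with and without standardization.

Collectively, I find the results confusing.

Here's how I'm setting everything up: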

# Prep the data
X = df.iloc[:, :-1]
y = df.iloc[:, -1:]
normal_X = Normalizer().fit_transform(X)
scaled_X = StandardScaler().fit_transform(X)

#now prepare some of the models
reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)
reg3 = LinearRegression().fit(normal_X, y)
reg4 = LinearRegression().fit(scaled_X, y)
reg5 = LinearRegression(fit_intercept=False).fit(scaled_X, y)

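Then I created 3 separate dataframes to compare the R^2 score, coefficient values, and predictions from each model.

To create the dataframe comparing the coefficient values from each model, I did the following: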

#Create a dataframe of the coefficients
coef = pd.DataFrame({
    'coeff':                       reg1.coef_[0],
    'coeff_normalize_true':        reg2.coef_[0],
    'coeff_normalizer':            reg3.coef_[0],
    'coeff_scaler':                reg4.coef_[0],
    'coeff_scaler_no_int':         reg5.coef_[0]
})

Here's how I created the dataframe to compare the R^2 values from each model:

scores = pd.DataFrame({
    'score':                        reg1.score(X, y),
    'score_normalize_true':         reg2.score(X, y),
    'score_normalizer':             reg3.score(normal_X, y),
    'score_scaler':                 reg4.score(scaled_X, y),
    'score_scaler_no_int':          reg5.score(scaled_X, y)
    }, index=range(1)
)

Lastly, here's the dataframe that compares the predictions from each model:

predictions = pd.DataFrame({
    'pred':                        reg1.predict(X).ravel(),
    'pred_normalize_true':         reg2.predict(X).ravel(),
    'pred_normalizer':             reg3.predict(normal_X).ravel(),
    'pred_scaler':                 reg4.predict(scaled_X).ravel(),
    'pred_scaler_no_int':          reg5.predict(scaled_X).ravel()
}, index=range(len(y)))

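Here are the resulting dataframes:

Coefficients: (table not included)

Scores: (table not included)

Predictions: (table not included)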

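I have three questions that I can't reconcile:

1. Why is there absolutely no difference between the first two models? It appears that setting normalize=True does nothing. I can understand having identical predictions and R^2 values, but my features have different numerical scales, so I'm not sure why normalizing would have no effect at all. This is doubly confusing when you consider that using StandardScaler changes the coefficients considerably.
2. I don't understand why the model using Normalizer produces such radically different coefficient values from the others, especially when the model with LinearRegression(normalize=True) makes no change at all.

If you look at the documentation for each, they appear very similar, if not identical.

From the docs on sklearn.linear_model.LinearRegression():

normalize : boolean, optional, default False

This parameter is ignored when fit_intercept is set to False. If True, the regressors X will be normalized before regression by subtracting the mean and dividing by the l2-norm.

Meanwhile, the docs on sklearn.preprocessing.Normalizer state that it normalizes to the l2 norm by default.

I don't see a difference between what these two options do, and I don't see why one would produce such radical differences in coefficient values from the other.

3. The results from the model using StandardScaler are coherent to me, but I don't understand why the model using StandardScaler with fit_intercept=False performs so poorly.

From the docs on the Linear Regression module:

fit_intercept : boolean, optional, default True

Whether to calculate the intercept for this model. If set to False, no
intercept will be used in calculations (e.g. data is expected to be already
centered).

StandardScaler centers your data, so I don't understand why using it with fit_intercept=False produces incoherent results.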

【Comments】:

Tags: python machine-learning scikit-learn linear-regression

【Answer 1】:

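The last question (3), about the incoherent results with fit_intercept=0 and standardized data, has not been fully answered.

The OP probably expected StandardScaler to standardize both X and y, which would make the intercept necessarily 0 (proof 1/3 of the way down).

But StandardScaler ignores y. See the api.

TransformedTargetRegressor offers a solution. This approach also works for non-linear transformations of the dependent variable, such as a log transformation of y (but consider this).

Here is an example that resolves the OP's question #3: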

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import make_pipeline
from sklearn.datasets import make_regression
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler
import numpy as np

# define a custom transformer
class stdY(BaseEstimator,TransformerMixin):
    def __init__(self):
        pass
    def fit(self,Y):
        self.std_err_=np.std(Y)
        self.mean_=np.mean(Y)
        return self
    def transform(self,Y):
        return (Y-self.mean_)/self.std_err_
    def inverse_transform(self,Y):
        return Y*self.std_err_+self.mean_

# standardize X and no intercept pipeline
no_int_pipe=make_pipeline(StandardScaler(),LinearRegression(fit_intercept=0)) # only standardizing X, so not expecting a great fit by itself.

# standardize y pipeline
std_lin_reg=TransformedTargetRegressor(regressor=no_int_pipe, transformer=stdY()) # transforms y, estimates the model, then reverses the transformation for evaluating loss.

# after re-reading my answer, there's an even easier solution: use StandardScaler as the transformer
std_lin_reg_easy=TransformedTargetRegressor(regressor=no_int_pipe, transformer=StandardScaler())

# generate some simple data
X, y, w = make_regression(n_samples=100,
                          n_features=3, # x variables generated and returned 
                          n_informative=3, # x variables included in the actual model of y
                          effective_rank=3, # make less than n_informative for multicollinearity
                          coef=True,
                          noise=0.1,
                          random_state=0,
                          bias=10)

std_lin_reg.fit(X,y)
print('custom transformer on y and no intercept r2_score: ',std_lin_reg.score(X,y))

std_lin_reg_easy.fit(X,y)
print('standard scaler on y and no intercept r2_score: ',std_lin_reg_easy.score(X,y))

no_int_pipe.fit(X,y)
print('\nonly standard scalar and no intercept r2_score: ',no_int_pipe.score(X,y))

custom transformer on y and no intercept r2_score:  0.9999343800041816

standard scaler on y and no intercept r2_score:  0.9999343800041816

only standard scalar and no intercept r2_score:  0.3319175799267782

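【Comments】:

【Answer 2】:

The reason there is no difference in coefficients between the first two models is that sklearn de-normalizes the coefficients behind the scenes after computing them from the normalized input data. Reference

This de-normalization is done so that the coefficients can be applied directly to test data, giving predictions without normalizing the test data.

Hence, setting normalize=True does affect the coefficients computed internally, but it does not change the best-fit line.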
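
The de-normalization described above can be sketched with plain NumPy. This is an illustration of the idea, not scikit-learn's internal code: the helper `ols` and the synthetic data are made up for the example, and the column normalization follows the documented description of normalize=True (subtract the mean, divide by the l2 norm).

```python
import numpy as np

rng = np.random.default_rng(0)
# Three features on very different scales (synthetic, for illustration only)
X = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y = 3 + X @ np.array([2.0, 1.0, 0.5]) + rng.normal(size=200)


def ols(X, y):
    """Least-squares fit with an intercept; returns (coef, intercept)."""
    A = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    return beta[1:], beta[0]


coef_raw, intercept_raw = ols(X, y)

# Column-wise normalization: subtract the mean, divide by the l2 norm
x_mean = X.mean(axis=0)
x_norm = np.linalg.norm(X - x_mean, axis=0)
coef_z, intercept_z = ols((X - x_mean) / x_norm, y)

# De-normalize: express the fitted model in terms of the original features
coef_back = coef_z / x_norm
intercept_back = intercept_z - x_mean @ coef_back

print("coefficients on normalized columns:", coef_z)
print("after de-normalization:            ", coef_back)
print("fitted directly on raw data:       ", coef_raw)
```

The coefficients fitted on the normalized columns differ from the raw ones, but once they are divided by the column norms they agree with the direct fit, which is why reg1 and reg2 report the same values.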

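Normalizer normalizes each sample (that is, row-wise). See the reference code here.

From documentation:

Normalize samples individually to unit norm.

whereas normalize=True normalizes each column/feature. Reference

Here is an example to show the impact of normalization along the different dimensions of the data. Take two dimensions x1 and x2, with y as the target variable. The target value is color-coded in the figures.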

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import Normalizer, StandardScaler, normalize

n=50
x1 = np.random.normal(0, 2, size=n)
x2 = np.random.normal(0, 2, size=n)
noise = np.random.normal(0, 1, size=n)
y = 5 + 0.5*x1 + 2.5*x2 + noise

fig,ax=plt.subplots(1,4,figsize=(20,6))

ax[0].scatter(x1,x2,c=y)
ax[0].set_title('raw_data',size=15)

X = np.column_stack((x1,x2))

column_normalized=normalize(X, axis=0)
ax[1].scatter(column_normalized[:,0],column_normalized[:,1],c=y)
ax[1].set_title('column_normalized data',size=15)

row_normalized=Normalizer().fit_transform(X)
ax[2].scatter(row_normalized[:,0],row_normalized[:,1],c=y)
ax[2].set_title('row_normalized data',size=15)

standardized_data=StandardScaler().fit_transform(X)
ax[3].scatter(standardized_data[:,0],standardized_data[:,1],c=y)
ax[3].set_title('standardized data',size=15)

plt.subplots_adjust(left=0.3, bottom=None, right=0.9, top=None, wspace=0.3, hspace=None)
plt.show()

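You can see that the best-fit line for the data in figures 1, 2 and 4 would be the same, which means the R2 score does not change with column/feature normalization or standardization of the data. The only difference is that you end up with different coefficient values.

Note: the best-fit line for figure 3 would be different.

When you set fit_intercept=False, the bias term is dropped from the prediction. That is, the intercept is set to zero; with centered features it would otherwise have been the mean of the target variable.

A prediction with a zero intercept is expected to perform badly when the target variable is not centered (mean != 0). You can see a difference of 22.532 in every row, which shows the impact on the output.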
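
The constant per-row gap can be reproduced on synthetic data. The following is a minimal NumPy sketch, not the OP's Boston setup: the data, seed and variable names are assumptions, and the standardization uses the population standard deviation, which is what StandardScaler applies by default.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) * np.array([2.0, 5.0]) + np.array([1.0, 3.0])
y = 5 + 0.5 * X[:, 0] + 2.5 * X[:, 1] + rng.normal(size=100)

# Standardize the features column-wise (zero mean, unit variance)
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Fit with an intercept: prepend a column of ones
A = np.column_stack([np.ones(len(Z)), Z])
beta_int, *_ = np.linalg.lstsq(A, y, rcond=None)
pred_int = A @ beta_int

# Fit without an intercept
beta_no, *_ = np.linalg.lstsq(Z, y, rcond=None)
pred_no = Z @ beta_no

# Because the columns of Z have zero mean, the slopes are identical and
# the two sets of predictions differ by one constant: the mean of y
gap = pred_int - pred_no
print("gap range:", gap.min(), gap.max())
print("mean of y:", y.mean())
```

The slopes agree between the two fits, so the only thing the no-intercept model loses is the constant offset equal to the target mean.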

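【Comments】:

@Venkatachalam when you wrote "This de-normalization has been done because for any test data, we can directly apply the co-effs. and get the prediction with normalizing test data", did you mean "without normalizing test data"?
Yes, you are right, I meant without normalizing the test data.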

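【Answer 3】:

Answer to Q1

I am assuming that by the first 2 models you mean reg1 and reg2. Let us know if that is not the case.

A linear regression has the same predictive power whether or not you normalize the data. Therefore, using normalize=True has no impact on the predictions. One way to see this is that column-wise normalization is a linear operation on each column ((x-a)/b), and a linear transformation of the features does not change the fit of a linear regression; it only changes the coefficient values. Note that this statement does not hold for Lasso/Ridge/ElasticNet.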
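
Both halves of that claim can be checked numerically. The sketch below uses only NumPy and a closed-form fit with an unpenalized intercept and an L2 penalty on the coefficients; the helper `fit_predict`, the data and the penalty strength alpha=10 are assumptions made for the illustration, standing in for a ridge-style model.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2)) * np.array([1.0, 50.0])  # very different scales
y = 3 + 2 * X[:, 0] + 1 * X[:, 1] + rng.normal(size=100)


def fit_predict(X, y, alpha=0.0):
    """Fit y ~ X with an unpenalized intercept and an L2 penalty `alpha`
    on the coefficients (alpha=0 is ordinary least squares).
    Returns the fitted values on the training data."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]),
                           Xc.T @ (y - y_mean))
    return y_mean + Xc @ coef


# Column-wise standardization of the features
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Largest difference in fitted values between raw and standardized features
ols_gap = np.max(np.abs(fit_predict(X, y) - fit_predict(Z, y)))
ridge_gap = np.max(np.abs(fit_predict(X, y, alpha=10.0)
                          - fit_predict(Z, y, alpha=10.0)))

print("OLS, raw vs standardized:      ", ols_gap)
print("penalized, raw vs standardized:", ridge_gap)
```

Without a penalty the fitted values agree up to floating-point error. With the penalty they do not, because the penalty acts on the coefficient sizes, and those depend on the feature scale.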

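So why aren't the coefficients different? Well, normalize=True also takes into account that what the user normally wants is the coefficients on the original features, not on the normalized features. As such, it adjusts the coefficients. One way to check that this makes sense is to use a simpler example: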

# two features, normal distributed with sigma=10
x1 = np.random.normal(0, 10, size=100)
x2 = np.random.normal(0, 10, size=100)

# y is related to each of them plus some noise
y = 3 + 2*x1 + 1*x2 + np.random.normal(0, 1, size=100)

X = np.array([x1, x2]).T  # X has two columns

reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)

# check that the coefficients are the same and close to [2, 1]
# (the estimates carry sampling noise, so the tolerance is kept loose)
np.testing.assert_allclose(reg1.coef_, reg2.coef_)
np.testing.assert_allclose(reg1.coef_, np.array([2, 1]), rtol=0.05)

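This confirms that both methods correctly capture the real signal between [x1, x2] and y, namely 2 and 1 respectively.

Answer to Q2

Normalizer is not what you would expect. It normalizes each row, row-wise. So the results will change dramatically, and it will likely destroy the relationship between the features and the target, which is something you want to avoid except in specific cases (e.g. TF-IDF).

To see how, take the example above but add a different feature, x3, that is not related to y. Using Normalizer causes x1 to be modified by the value of x3, weakening its relationship with y.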
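
The x3 thought experiment can be run directly. A minimal sketch, assuming a current scikit-learn; the sample size, seed, and the choice to draw x3 from the same distribution as x1 and x2 are assumptions for the illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import Normalizer

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(0, 10, size=n)
x2 = rng.normal(0, 10, size=n)
x3 = rng.normal(0, 10, size=n)  # unrelated to y
y = 3 + 2 * x1 + 1 * x2 + rng.normal(0, 1, size=n)

X = np.column_stack([x1, x2, x3])
X_rows = Normalizer().fit_transform(X)  # each row scaled to unit l2 norm

# After row normalization the first column is no longer x1 alone:
# it is x1 divided by a quantity that also depends on x2 and x3
mixed_x1 = x1 / np.sqrt(x1**2 + x2**2 + x3**2)
print("first column equals x1 / row norm:", np.allclose(X_rows[:, 0], mixed_x1))

r2_raw = LinearRegression().fit(X, y).score(X, y)
r2_rows = LinearRegression().fit(X_rows, y).score(X_rows, y)
print("R^2 on raw features:           ", r2_raw)
print("R^2 on row-normalized features:", r2_rows)
```

Row normalization discards the length of each sample vector, and here y depends on that length, so the fit on the row-normalized features is noticeably worse than the fit on the raw features.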

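Difference in coefficients between models (1,2) and (4,5)

The discrepancy arises because when you standardize before fitting, the coefficients are with respect to the standardized features, the same coefficients I referred to in the first part of the answer. They can be mapped back to the original parameters using reg4.coef_ / scaler.scale_: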

x1 = np.random.normal(0, 10, size=100)
x2 = np.random.normal(0, 10, size=100)
y = 3 + 2*x1 + 1*x2 + np.random.normal(0, 1, size=100)
X = np.array([x1, x2]).T

reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)
scaler = StandardScaler()
reg4 = LinearRegression().fit(scaler.fit_transform(X), y)

np.testing.assert_allclose(reg1.coef_, reg2.coef_)
# loose tolerance: the estimates carry sampling noise
np.testing.assert_allclose(reg1.coef_, np.array([2, 1]), rtol=0.05)

# map the standardized-feature coefficients back to the original scale
coefficients = reg4.coef_ / scaler.scale_
np.testing.assert_allclose(coefficients, np.array([2, 1]), rtol=0.05)

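This is because, mathematically, setting z = (x - mu)/sigma, the model reg4 is solving y = a1*z1 + a2*z2 + a0. We can recover the relationship between y and x through simple algebra: y = a1*[(x1 - mu1)/sigma1] + a2*[(x2 - mu2)/sigma2] + a0, which simplifies to y = (a1/sigma1)*x1 + (a2/sigma2)*x2 + (a0 - a1*mu1/sigma1 - a2*mu2/sigma2).

reg4.coef_ / scaler.scale_ represents [a1/sigma1, a2/sigma2] in the notation above, which is exactly what normalize=True does to guarantee that the coefficients are the same.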
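
The constant term of the derivation, a0 - a1*mu1/sigma1 - a2*mu2/sigma2, can be checked the same way as the slopes. A minimal sketch, assuming a current scikit-learn; the names reg_raw and reg_std and the non-zero feature means are chosen for the illustration so that the intercept correction is visible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
x1 = rng.normal(5, 10, size=100)   # non-zero means so the correction matters
x2 = rng.normal(-2, 10, size=100)
y = 3 + 2 * x1 + 1 * x2 + rng.normal(0, 1, size=100)
X = np.column_stack([x1, x2])

reg_raw = LinearRegression().fit(X, y)

scaler = StandardScaler()
reg_std = LinearRegression().fit(scaler.fit_transform(X), y)

# Slopes: [a1/sigma1, a2/sigma2]
coef_back = reg_std.coef_ / scaler.scale_
# Intercept: a0 - a1*mu1/sigma1 - a2*mu2/sigma2
intercept_back = reg_std.intercept_ - np.sum(
    reg_std.coef_ * scaler.mean_ / scaler.scale_
)

print("slopes   :", coef_back, "vs", reg_raw.coef_)
print("intercept:", intercept_back, "vs", reg_raw.intercept_)
```

Note that reg_std.intercept_ on its own is the mean of y, not the intercept of the original model; the subtraction is what maps it back.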

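Difference in score for model 5.

Standardized features have zero mean, but the target variable does not necessarily. Therefore, not fitting the intercept causes the model to disregard the mean of the target. In the example I have been using, the "3" in y = 3 + ... is not fitted, which naturally decreases the predictive power of the model. :)

【Comments】: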