Linear Regression Contains NaN Values: Answers

【Question title】: Linear Regression Contain NaN Values
【Posted】: 2021-02-26 17:37:32
【Question description】:

I have an array that contains car prices like this:

| Date | Price($) |
| -------- | --------------  |
| 2019-09-01| NaN            |
| 2019-09-02| NaN            |
| 2019-09-03| 250            |
| 2019-09-04| 200            |
| 2019-09-05| 300            |

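The problem is that I want to run a linear regression to predict this car's price for the following month (for example: the price on 2019-10-01 is ... $). But when I try to fit the input to the linear regression model, I get this error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64'). The code is as follows: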

from sklearn.linear_model import LinearRegression

data = mydata  # load my data
X = data.iloc[:, 0].values.reshape(-1, 1)  # .values converts the column into a NumPy array
Y = data.iloc[:, 1].values.reshape(-1, 1)  # -1 lets NumPy infer the number of rows; 1 means a single column
linear_regressor = LinearRegression()  # create the model object
linear_regressor.fit(X, Y)  # perform linear regression; this is the call that raises the ValueError
Y_pred = linear_regressor.predict(X)  # make predictions

【Question comments】:

Tags: python linear-regression

【Solution 1】:

I think a simpler approach is to call dropna() at the DataFrame level itself.

data = data.dropna(axis=0, how='any')

After that, every row that contains an NA value is removed, and the regression can run without the error.
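As a rough end-to-end sketch of this approach, the snippet below rebuilds the table from the question, drops the NaN rows, and fits the model. The column names `Date` and `Price` and the choice to encode dates as ordinal day numbers are assumptions made for illustration; the question does not show how the date column is stored.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Sample data copied from the table in the question.
data = pd.DataFrame({
    "Date": ["2019-09-01", "2019-09-02", "2019-09-03", "2019-09-04", "2019-09-05"],
    "Price": [float("nan"), float("nan"), 250.0, 200.0, 300.0],
})

# Drop every row that has a missing value in any column.
clean = data.dropna(axis=0, how="any")

# Assumption: encode each date as an ordinal day number so the
# regressor receives a numeric feature instead of a date string.
X = pd.to_datetime(clean["Date"]).map(lambda d: d.toordinal()).to_numpy().reshape(-1, 1)
y = clean["Price"].to_numpy()

model = LinearRegression().fit(X, y)

# Predict the price for the date mentioned in the question.
target = pd.Timestamp("2019-10-01").toordinal()
predicted = model.predict([[target]])[0]
print(f"Predicted price on 2019-10-01: {predicted:.2f}")
```

With only three usable rows, the fitted line extrapolates a long way from the data, so the printed value is an illustration of the mechanics rather than a trustworthy forecast.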

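【Comments】:

This line of code helped me reduce the noise in my linear regression that was caused by the mean values. Thank you very much!!!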

【Solution 2】:

LinearRegression cannot be trained on points with missing data.

As a workaround, you can use SimpleImputer to fill in the missing data points.

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

data = mydata  # load my data
X = data.iloc[:, 0].values.reshape(-1, 1)  # .values converts the column into a NumPy array
Y = data.iloc[:, 1].values.reshape(-1, 1)  # -1 lets NumPy infer the number of rows; 1 means a single column
# impute Y to fill the missing values with the column mean
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer = imputer.fit(Y)
Y_imputed = imputer.transform(Y)
# use the imputed data for training
linear_regressor = LinearRegression()  # create the model object
linear_regressor.fit(X, Y_imputed)  # perform linear regression
Y_pred = linear_regressor.predict(X)  # make predictions

Here, the NaN values in Y are replaced with the mean of the non-missing values.
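To make the effect of mean imputation concrete, here is a small sketch that applies the same SimpleImputer settings to the five prices from the question's table. The array is typed in by hand for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Prices from the question's table, shaped as a single column.
Y = np.array([np.nan, np.nan, 250.0, 200.0, 300.0]).reshape(-1, 1)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
Y_imputed = imputer.fit_transform(Y)

# statistics_ holds the per-column fill value learned during fit.
print(imputer.statistics_)
print(Y_imputed.ravel())
```

The two missing prices are set to the mean of the three observed ones, so those rows pull the fitted line toward a flat trend. That is likely the "noise" the comment under Solution 1 refers to.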

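Note: if imputation is not a suitable way to fill these NaN values, you should avoid using those data points for training.

Update: the older Imputer class (sklearn.preprocessing.Imputer) was deprecated in scikit-learn 0.20 and removed in 0.22; SimpleImputer from sklearn.impute, as used above, is its replacement.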

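【Comments】:

It would be great if you could explain this in more detail using my code.
I tried it, but I still get this error: ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I updated the code to use np.nan. Ideally it should work and correctly fill your NaN values with the mean of the data.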
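The thread does not say why the error persisted. One possibility (an assumption, not something confirmed above) is that X also contained NaN or infinite values, or that the original Y was passed to fit() instead of Y_imputed. A quick check along these lines can narrow it down; `report_non_finite` is a hypothetical helper and the arrays are made-up stand-ins.

```python
import numpy as np

def report_non_finite(name, values):
    """Print and return the number of NaN and infinite entries in an array."""
    arr = np.asarray(values, dtype="float64")
    n_nan = int(np.isnan(arr).sum())
    n_inf = int(np.isinf(arr).sum())
    print(f"{name}: {n_nan} NaN, {n_inf} inf")
    return n_nan, n_inf

# Hypothetical stand-ins for the X and Y_imputed arrays from the answer.
X = np.array([1.0, 2.0, np.nan, 4.0]).reshape(-1, 1)
Y_imputed = np.array([250.0, 250.0, 200.0, 300.0]).reshape(-1, 1)

x_counts = report_non_finite("X", X)
y_counts = report_non_finite("Y_imputed", Y_imputed)
```

Running the helper on both arrays just before calling fit() shows which input still triggers the ValueError.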