Train and test split for multivariate and multi-step?

【Question title】: Train and test split for multivariate and multi-step?
【Posted】: 2020-06-22 18:22:18
【Question description】:

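Starting from this tutorial, which only covers the multivariate, one-step case, I have been trying to write a multivariate, multi-step version. Because the code is too long, I have attached it here (you can also find the dataset in the same repository).

The goal of the code is to predict the pollution values for the next 6 hours.

After preprocessing and normalizing the data, I split and reshape it as follows: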

# split into train and test sets
values = reframed.values
n_train_hours = 365 * 24 * 1 # 5 years data 1 year training
train = values[:n_train_hours, :]
test = values[n_train_hours:n_train_hours+50, :]

# split into input and outputs
n_obs = n_hours * n_features
train_X, train_y = train[:, :n_obs], train[:, :n_out]  # the problem is here 
test_X, test_y = test[:, :n_obs], test[:,:n_out] # and here

# reshape input to be 3D [samples, timesteps, features]
train_X = train_X.reshape((train_X.shape[0], n_hours, n_features))
train_y = train_y
test_X = test_X.reshape((test_X.shape[0], n_hours, n_features))
print(train_X.shape, train_y.shape, test_X.shape, test_y.shape)

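I have doubts about train_X, train_y, test_X and test_y. I am not sure whether I should use n_out and n_obs, or n_out*n_features and n_obs, or some other alternative, because in both cases, after applying inverse_transform to get inv_y, the values I get differ from the real values in the dataset: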

print('predicted: ',inv_yhat)
print('real:' ,inv_y)

predicted:  [  3.04286     7.406884    6.121824  ... -10.307352   -7.0151763
  -3.4667058]
real: [36.       30.999998 19.999998 ... 24.999998 48.       48.999996]
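
For reference, the column selection being asked about can be sketched as follows. This is only an illustration on synthetic data, and it rests on an assumption that the post does not confirm: that the columns of reframed are ordered as n_hours lagged steps followed by n_out future steps, each step holding n_features columns with pollution first. The sizes and the array named values below are stand-ins, not the asker's data.

```python
import numpy as np

# Hypothetical sizes; the real ones come from the asker's script.
n_hours, n_features, n_out = 3, 8, 6
n_obs = n_hours * n_features

# Stand-in for reframed.values.
# ASSUMPTION: columns are [t-n_hours .. t-1] followed by [t .. t+n_out-1],
# each step holding n_features columns, with pollution as the first one.
n_samples = 10
n_cols = n_obs + n_out * n_features
values = np.arange(n_samples * n_cols, dtype=float).reshape(n_samples, n_cols)

# Inputs: every lagged column.
X = values[:, :n_obs]

# Targets: the pollution column of each future step, i.e. one column
# every n_features columns, starting right after the input block.
target_cols = n_obs + np.arange(n_out) * n_features
y = values[:, target_cols]

# Reshape inputs to 3D [samples, timesteps, features].
X = X.reshape((X.shape[0], n_hours, n_features))
print(X.shape, y.shape)
```

Under that layout, a slice such as values[:, :n_out] would pick the first n_out columns of the oldest lagged step, which are input features rather than future pollution values. Whether this applies depends on how reframed was actually built.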

Let me know if you need more details.

【Question discussion】:

Tags: python machine-learning deep-learning dataset forecasting

【Solution 1】:

Running your exact code in PyCharm with Python 3.7.3 on my MacBook Pro, I get the following results with the same dataset:

Using TensorFlow backend.
                     pollution  dew  temp   press wnd_dir  wnd_spd  snow  rain
date                                                                          
2010-01-02 00:00:00      129.0  -16  -4.0  1020.0      SE     1.79     0     0
2010-01-02 01:00:00      148.0  -15  -4.0  1020.0      SE     2.68     0     0
2010-01-02 02:00:00      159.0  -11  -5.0  1021.0      SE     3.57     0     0
(8760, 30, 8) (8760, 6) (35005, 30, 8) (35005, 6)
2020-06-22 16:32:21.057171: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2020-06-22 16:32:21.072767: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7fd8396b94b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-06-22 16:32:21.072780: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
Epoch 1/3
122/122 - 3s - loss: 0.2229 - val_loss: 0.1685
Epoch 2/3
122/122 - 2s - loss: 0.1521 - val_loss: 0.1490
Epoch 3/3
122/122 - 2s - loss: 0.1433 - val_loss: 0.1415
(35005, 6)
(35005, 6)
Test RMSE: 191.383
predicted:  [57.93556  30.241518 41.339252 ... 39.774303 45.738094 49.2507  ]
real: [36.       30.999998 19.999998 ... 24.999998 48.       48.999996]

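My predicted values look a bit more realistic. Gee, I hope this isn't a processor-architecture issue.

Adding some additional information about the MinMaxScaler feature range: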

https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
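
As a rough sketch of how the scaling can be undone for the target alone: MinMaxScaler stores per-column min_ and scale_ attributes, and its transform is X * scale_ + min_, so a single column can be inverted by hand. The example below uses toy data, and assumes (this is not confirmed by the post) that the scaler was fit on all eight features with pollution in column 0.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Toy data standing in for the 8-feature table; column 0 plays pollution.
rng = np.random.default_rng(0)
raw = rng.uniform(0, 500, size=(100, 8))

scaler = MinMaxScaler(feature_range=(0, 1))
scaled = scaler.fit_transform(raw)

# Pretend these are multi-step predictions in scaled units: (samples, n_out).
n_out = 6
yhat_scaled = scaled[:12, 0].reshape(2, n_out)

# MinMaxScaler computes X_scaled = X * scale_ + min_ per column, so the
# pollution column (ASSUMED to be column 0) is inverted like this.
target_col = 0
inv_yhat = (yhat_scaled - scaler.min_[target_col]) / scaler.scale_[target_col]

print(inv_yhat.shape)
```

This avoids having to pad the predictions back to eight columns before calling inverse_transform. It only gives sensible values if the column index matches the one the predictions were trained against.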

You can write the results to a file like this:

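# inv_yhat holds the inverse-transformed predictions printed above
results = pd.DataFrame(inv_yhat)
# results.index = X_test.index only works if X_test is a pandas object;
# the NumPy test_X from the question has no .index, so the default index is kept
results.columns = ["prediction"]
results.to_csv("prediction_results.csv")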
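
The log above shows a (35005, 6) shape before the flattened printout, so if the predictions are kept two-dimensional, one possible variant is to write one column per forecast step. This is a sketch on made-up numbers. The column names, the inv_yhat_2d array, and the output file name are illustrative and not part of the original code.

```python
import os
import tempfile

import numpy as np
import pandas as pd

n_out = 6  # forecast horizon from the question

# Stand-in for inverse-transformed predictions of shape (samples, n_out).
inv_yhat_2d = np.arange(4 * n_out, dtype=float).reshape(4, n_out)

# One column per forecast step; these names are hypothetical.
columns = [f"t+{step}" for step in range(1, n_out + 1)]
results = pd.DataFrame(inv_yhat_2d, columns=columns)

# Write to a temporary directory so the sketch leaves the project untouched.
out_path = os.path.join(tempfile.gettempdir(), "prediction_results_multistep.csv")
results.to_csv(out_path, index_label="sample")
print(results.shape)
```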

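【Discussion】:

Hi @Jeanpierre, thanks for your reply. I noticed that the predicted values also change every time I re-run the code. However, the printed real values do not seem to exist: I checked my dataset to verify them and could not find them. Do you think this could be a MinMaxScaler problem?
I think that's possible; here is some information. Maybe the feature_range is too restrictive and the full range is needed on one or both axes:
As for viewing the result data, it is only printed to the screen; if you want to keep it, you have to write it to a file.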