Train / Val / Test split time LSTM: answers

【Question title】: Train / Val / Test split time LSTM
【Posted】: 2020-01-30 11:07:34
【Question description】:

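I have a dataset covering several months (from JAN-15 to SEP-17) that reports each customer's financial situation month by month. My task is to predict each customer's cumulative sales over the following 12 months.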

My dataset looks like this (this is the raw data; for training I will create lag features):

Month   CustomerID NetSales
JAN-15     A          10
JAN-15     B          10
JAN-15     C          10
FEB-15     A          10
FEB-15     B          10
FEB-15     C          10
...

How can I split it into TRAIN / VAL / TEST in a time-consistent way? Can I do it like this?

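TRAIN --> all customers/months from JAN-15 to MAR-16 (every calendar month appears at least once, so the model can learn seasonal patterns)
VAL --> all customers/months from APR-16 to JUN-16
TEST --> all customers/months from JUL-16 to SEP-16 (I stop here because I need the following 12 months to build the target variable)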

Is this a consistent splitting strategy? Otherwise, what would you suggest?

Thank you very much, Andrea

【Comments】:

Hi @andrea-barral, I don't have much experience, but an old Kaggle task used a very nice data-splitting strategy: "You are provided hourly rental data spanning two years. For this competition, the training set is comprised of the first 19 days of each month, while the test set is the 20th to the end of the month. You must predict the total count of bikes rented during each hour covered by the test set, using only information available prior to the rental period."

Tags: python machine-learning scikit-learn data-science train-test-split

【Solution 1】:

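Is this a consistent splitting strategy?

Yes. None of your validation data precedes your training data, and none of your test data precedes your validation data. That prevents data leakage, which is the right approach.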
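A minimal sketch of that chronological split with pandas, for illustration only. It assumes the `Month` column has already been parsed into month-start timestamps; the toy data and the helper name `split_by_month` are hypothetical, not part of the question.

```python
import pandas as pd

# Toy data in the same layout as the question (values are made up).
months = pd.date_range("2015-01-01", "2017-09-01", freq="MS")
df = pd.DataFrame({
    "Month": months.repeat(3),                   # each month once per customer
    "CustomerID": ["A", "B", "C"] * len(months),
    "NetSales": 10,
})

def split_by_month(frame, train_end, val_end, test_end):
    """Split rows chronologically; each boundary is the last month included."""
    train_end, val_end, test_end = (
        pd.Timestamp(d) for d in (train_end, val_end, test_end)
    )
    month = frame["Month"]
    train = frame[month <= train_end]
    val = frame[(month > train_end) & (month <= val_end)]
    test = frame[(month > val_end) & (month <= test_end)]
    return train, val, test

# Boundaries taken from the question: MAR-16, JUN-16, SEP-16.
train, val, test = split_by_month(df, "2016-03-01", "2016-06-01", "2016-09-01")

# Every training month is earlier than every validation month, and so on.
print(train["Month"].max(), val["Month"].min(), test["Month"].min())
```

Rows after SEP-16 are left out of all three sets here, matching the question's note that those months are only needed to build the 12-month target.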

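Otherwise, what would you suggest?

The only thing you could change is the share of the data that goes to the training, validation, and test sets, and that is something you can experiment with. Since this is a time series, you should take seasonal trends into account, which you already do because every calendar month is included in your training data.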
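When experimenting with different boundaries, one quick sanity check is whether the training window still contains every calendar month. The sketch below is only an illustration: the helper `covers_all_calendar_months` and the candidate end dates are hypothetical choices, not something prescribed by the answer.

```python
import pandas as pd

def covers_all_calendar_months(month_values):
    """Return True if every calendar month (1..12) occurs at least once."""
    return set(pd.DatetimeIndex(month_values).month) == set(range(1, 13))

# Months available in the question's dataset: JAN-15 to SEP-17.
all_months = pd.date_range("2015-01-01", "2017-09-01", freq="MS")

# Candidate last training months to compare (hypothetical choices).
coverage = {}
for train_end in ["2015-09-01", "2016-03-01", "2016-06-01"]:
    train_months = all_months[all_months <= pd.Timestamp(train_end)]
    coverage[train_end] = covers_all_calendar_months(train_months)
    print(train_end, len(train_months), coverage[train_end])
```

A training window that ends before DEC-15 fails the check, since the last months of the year would never be seen during training.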

【Discussion】: