Grouped time series forecasting with scikit-hts

【Question title】: Grouped Time Series forecasting with scikit-hts
【Posted】: 2021-10-26 13:52:33
【Question】:

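I am trying to forecast sales for multiple time series taken from Kaggle's Store Item Demand Forecasting Challenge. The data is in long format and covers 10 stores and 50 items, which gives 500 time series stacked on top of each other. For each store and each item I have 5 years of daily records, with both weekly and yearly seasonality.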

In total: 365.2 days * 5 years * 10 stores * 50 items = 913,000 records.
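
As a quick sanity check of the record count, the arithmetic can be reproduced in a few lines. The date range used here (2013-01-01 through 2017-12-31) is an assumption about the Kaggle training file, not something stated in the post, and the variable names are purely illustrative.

```python
from datetime import date

# Assumption: the training file covers 2013-01-01 to 2017-12-31 inclusive.
start, end = date(2013, 1, 1), date(2017, 12, 31)

n_days = (end - start).days + 1   # includes the 2016 leap day
n_series = 10 * 50                # stores * items
n_records = n_days * n_series

print(n_days, n_days / 5, n_records)
```

The average of 365.2 days per year comes from the single leap day inside the assumed five-year window.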

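From what I have read so far on Hierarchical and Grouped time series, my understanding is that the whole dataframe can be structured as a grouped time series rather than a strictly hierarchical one, because the aggregation can be done at the store level or at the item level interchangeably.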
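
To make the grouped structure concrete, here is a minimal sketch for a reduced case of 2 stores and 2 items. It builds the aggregation (summing) matrix that maps the bottom-level store/item series to the total, the per-store series and the per-item series. The node names and the `build_summing_matrix` helper are illustrative assumptions, not part of scikit-hts.

```python
def build_summing_matrix(stores, items):
    """Return (row_labels, matrix) for a two-way grouped structure."""
    bottom = [(s, i) for s in stores for i in items]
    labels, rows = [], []

    # Top level: every bottom series contributes to the total.
    labels.append("total")
    rows.append([1] * len(bottom))

    # First grouping: aggregate over items within each store.
    for s in stores:
        labels.append(f"store_{s}")
        rows.append([1 if b[0] == s else 0 for b in bottom])

    # Second grouping: aggregate over stores for each item.
    for i in items:
        labels.append(f"item_{i}")
        rows.append([1 if b[1] == i else 0 for b in bottom])

    # Bottom level: each store/item series maps to itself.
    for k, (s, i) in enumerate(bottom):
        labels.append(f"store_{s}_item_{i}")
        rows.append([1 if j == k else 0 for j in range(len(bottom))])

    return labels, rows


labels, S = build_summing_matrix(stores=[1, 2], items=[1, 2])
for label, row in zip(labels, S):
    print(f"{label:16s} {row}")
```

The store rows and the item rows both aggregate the same bottom series, which is what makes the structure grouped rather than a single tree.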

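I would like to find a way to forecast all 500 time series (store1_item1, store1_item2, ..., store10_item50) for the next year (from January 1, 2015 to December 31, 2015) using the scikit-hts library and its AutoArimaModel function, which is a wrapper around pmdarima's AutoArima function.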

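To handle the two levels of seasonality, I added Fourier terms as exogenous features for the yearly seasonality, while auto_arima handles the weekly seasonality.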
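
For reference, Fourier terms for a seasonal period are sine/cosine pairs of the time index. The sketch below computes them with the standard library only; `fourier_terms` is a hypothetical helper, not the pmdarima implementation. It also illustrates a point that matters at the prediction step: terms for the forecast horizon should continue the time index from the end of the training window rather than start over.

```python
import math


def fourier_terms(t_start, n_steps, period, k):
    """Return one dict of sin/cos features per time step.

    t_start : first time index to generate (1-based)
    n_steps : number of consecutive time steps
    period  : seasonal period, e.g. 365.2 for yearly seasonality on daily data
    k       : number of harmonics (each adds one sine and one cosine column)
    """
    rows = []
    for t in range(t_start, t_start + n_steps):
        row = {}
        for j in range(1, k + 1):
            angle = 2 * math.pi * j * t / period
            row[f"sin_{j}"] = math.sin(angle)
            row[f"cos_{j}"] = math.cos(angle)
        rows.append(row)
    return rows


# Two years of training data, then one year ahead on the same time axis.
train_terms = fourier_terms(t_start=1, n_steps=730, period=365.2, k=1)
future_terms = fourier_terms(t_start=731, n_steps=365, period=365.2, k=1)
print(len(train_terms), len(future_terms), sorted(future_terms[0]))
```

pmdarima's FourierFeaturizer appears to support the same idea through the `n_periods` argument of `transform`; that is worth checking against the documentation of the installed version before relying on it.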

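My problem is that I get an error at the prediction step.

Here is the error message:

ValueError: Provided exogenous values are not of the appropriate shape. Required (365, 4), got (365, 8).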

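I think something is wrong with the exogenous dictionary, but I don't know how to fix it, since this is my first time using scikit-hts. To build it, I followed the official scikit-hts documentation here.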
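
The two shapes in the error message can be related to the column counts produced by the code further down. The sketch below reproduces only the counting, under the assumption that each store/item pair contributes two Fourier columns (one sine and one cosine for a single harmonic). The `FOURIER_S` and `FOURIER_C` suffixes are placeholders rather than the exact names pmdarima generates.

```python
stores = ["store_1", "store_2"]
items = ["item_1", "item_2"]
store_items = [f"{s}_{i}" for s in stores for i in items]

# Placeholder suffixes: one sine and one cosine column per store/item pair.
suffixes = ["FOURIER_S", "FOURIER_C"]
exog_columns = [f"{si}_{suf}" for si in store_items for suf in suffixes]

# Same matching rules as the exogenous mapping in the full code.
exogenous = {si: [c for c in exog_columns if c.startswith(si)] for si in store_items}
exogenous.update({s: [c for c in exog_columns if c.startswith(s)] for s in stores})
exogenous.update({i: [c for c in exog_columns if i in c] for i in items})
exogenous["total"] = [c for c in exog_columns if "FOURIER" in c]

for node, cols in exogenous.items():
    print(f"{node:16s} {len(cols)} exogenous columns")
print("columns in the full test frame:", len(exog_columns))
```

Under these assumptions a store or item node is tied to 4 columns while the full test frame carries 8, which matches the "Required (365, 4), got (365, 8)" pattern. That would be consistent with the whole exogenous frame being handed to a node's model at prediction time, although the post itself does not confirm the cause.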

EDIT:______________________________________________________________

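I did not see that a similar bug had been reported on GitHub. After applying the proposed fix locally, I was able to get some results. However, even though the code now runs without errors, some of the predictions are negative, as pointed out in the comments below this post. We even get disproportionately high positive values.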

Here are the plots for all combinations of store and item. You can see that it seems to work for only one combination.

df.loc['2014','store_1_item_1'].plot()
predictions.loc['2015','store_1_item_1'].plot()

df.loc['2014','store_1_item_2'].plot()
predictions.loc['2015','store_1_item_2'].plot()

df.loc['2014','store_2_item_1'].plot()
predictions.loc['2015','store_2_item_1'].plot()

df.loc['2014','store_2_item_2'].plot()
predictions.loc['2015','store_2_item_2'].plot()

_____________________________________________________________________

Full code:

# imports
import pandas as pd
from pmdarima.preprocessing import FourierFeaturizer
import hts
from hts.hierarchy import HierarchyTree
from hts.model import AutoArimaModel
from hts import HTSRegressor


# read data from the csv file
data = pd.read_csv('train.csv', index_col='date', parse_dates=True)

# Train/Test split with reduced size
train_data = data.query('store == [1,2] and item == [1, 2]').loc['2013':'2014']
test_data = data.query('store == [1,2] and item == [1, 2]').loc['2015']


# Create the stores time series
# For each timestamp group by store and apply sum
stores_ts = train_data.drop(columns=['item']).groupby(['date','store']).sum()
stores_ts = stores_ts.unstack('store')
stores_ts.columns = stores_ts.columns.droplevel(0)
stores_ts.columns = ['store_' + str(i) for i in stores_ts.columns]

# Create the items time series
# For each timestamp group by item and apply sum
items_ts = train_data.drop(columns=['store']).groupby(['date','item']).sum()
items_ts = items_ts.unstack('item')
items_ts.columns = items_ts.columns.droplevel(0)
items_ts.columns = ['item_' + str(i) for i in items_ts.columns]


# Create the stores_items time series
# For each timestamp group by store AND by item and apply sum
store_item_ts = train_data.pivot_table(index= 'date', columns=['store', 'item'], aggfunc='sum')
store_item_ts.columns = store_item_ts.columns.droplevel(0)

# Rename the columns as store_i_item_j
col_names = []
for i in store_item_ts.columns:
    col_name = 'store_' + str(i[0]) + '_item_' + str(i[1])
    col_names.append(col_name)
    
# Assigning the flat names replaces the (store, item) MultiIndex directly
store_item_ts.columns = col_names

# Create a new dataframe and add the root level of the hierarchy as the sum of all stores (or all items)
df = pd.DataFrame()
df['total'] = stores_ts.sum(axis=1)

# Concatenate all created dataframes into one df
# df is the dataframe that will be used for model training
# (axis is passed by keyword; the positional form was removed in pandas 2.0)
df = pd.concat([df, stores_ts, items_ts, store_item_ts], axis=1)


# Build fourier terms for train and test sets
four_terms = FourierFeaturizer(365.2, 1)

# Build the exogenous features dataframe for training data
exog_train_df = pd.DataFrame()

for i in range(1, 3):
    for j in range(1, 3):
        _, exog = four_terms.fit_transform(train_data.query(f'store == {i} and item == {j}').sales)
        exog.columns= [f'store_{i}_item_{j}_'+ x for x in exog.columns]
        exog_train_df = pd.concat([exog_train_df, exog], axis=1)
exog_train_df['date'] = df.index
exog_train_df.set_index('date', inplace=True)

# add the exogenous features dataframe to df before training
df = pd.concat([df, exog_train_df], axis= 1)


# Build the exogenous features dataframe for test set
# It will be used only when using model.predict()
exog_test_df = pd.DataFrame()

for i in range(1, 3):
    for j in range(1, 3):
        _, exog_test = four_terms.fit_transform(test_data.query(f'store == {i} and item == {j}').sales)
        exog_test.columns= [f'store_{i}_item_{j}_'+ x for x in exog_test.columns]
        exog_test_df = pd.concat([exog_test_df, exog_test], axis=1)


# Build the hierarchy of the Grouped Time Series
stores = [i for i in stores_ts.columns]
items = [i for i in items_ts.columns]
store_items = col_names

# Exogenous features mapping
exog_store_items = {e: [v for v in exog_train_df.columns if v.startswith(e)] for e in store_items}  
exog_stores = {e:[v for v in exog_train_df.columns if v.startswith(e)] for e in stores}
exog_items = {e:[v for v in exog_train_df.columns if v.find(e) != -1] for e in items}
exog_total = {'total':[v for v in exog_train_df.columns if v.find('FOURIER') != -1]}

# Merge all dictionaries
exog_to_merge = [exog_store_items, exog_stores, exog_items, exog_total]
exogenous = {k:v for x in exog_to_merge for k,v in x.items()}

# Build hierarchy
total = {'total': stores + items}
store_h = {k: [v for v in store_items if v.startswith(k)] for k in stores}
hierarchy = {**total, **store_h}

# Hierarchy tree automatically created by hts
ht = HierarchyTree.from_nodes(nodes=hierarchy, df=df, exogenous=exogenous)

# Instantiate the auto arima model using HTSRegressor
autoarima = HTSRegressor(model='auto_arima', D=1, m=7, seasonal=True, revision_method='OLS', n_jobs=12)

# Fit the model to the training df that includes time series and exog_train_df
# Set exogenous param to the previously built dictionary
model = autoarima.fit(df, hierarchy, exogenous=exogenous)

# Make predictions
# Set the exogenous_df param 
predictions = model.predict(exogenous_df=exog_test_df, steps_ahead=365)
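
One detail of the exogenous mapping above may matter when moving from the reduced 2 x 2 subset to all 10 stores and 50 items: a plain `startswith('store_1')` or `find('item_1')` test also matches `store_10` or `item_10`. The sketch below shows a stricter match. The helper names and the toy column names are illustrative assumptions.

```python
import re


def columns_for_store(store, columns):
    """Match 'store_1_item_...' but not 'store_10_item_...'."""
    prefix = store + "_item_"
    return [c for c in columns if c.startswith(prefix)]


def columns_for_item(item, columns):
    """Match '..._item_1_...' but not '..._item_10_...'."""
    pattern = re.compile(r"^store_\d+_" + re.escape(item) + r"_")
    return [c for c in columns if pattern.match(c)]


columns = ["store_1_item_1_F", "store_10_item_1_F", "store_1_item_10_F"]

naive = [c for c in columns if c.startswith("store_1")]
print("naive prefix match: ", naive)
print("strict store match: ", columns_for_store("store_1", columns))
print("strict item match:  ", columns_for_item("item_1", columns))
```

With only stores 1 and 2 and items 1 and 2, both versions select the same columns, so this is not the cause of the shape error; it only becomes relevant at full scale.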

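Other approaches I have thought of, and that I have already implemented successfully for a single series (e.g. store 1 and item 1):

TBATS applied to each series independently, inside a loop over all 500 time series
auto_arima (SARIMAX) with exogenous features specific to each series (= Fourier terms to handle the weekly and yearly seasonality) + a loop over all 500 time series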
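
The per-series loop that both approaches rely on can be sketched as follows. A seasonal-naive forecaster (repeat the last observed week) stands in for TBATS or auto_arima, so the example shows only the looping pattern and not either of those models. The toy data and function names are assumptions.

```python
from collections import defaultdict


def seasonal_naive(history, horizon, season=7):
    """Repeat the last full season of observations over the horizon."""
    last_season = history[-season:]
    return [last_season[h % season] for h in range(horizon)]


# Toy long-format rows: (day, store, item, sales), two weeks per series.
rows = [
    (day, store, item, 10 * store + item + day % 7)
    for store in (1, 2)
    for item in (1, 2)
    for day in range(14)
]

# Split the long table into one chronologically ordered list per (store, item).
series = defaultdict(list)
for day, store, item, sales in sorted(rows):
    series[(store, item)].append(sales)

# Fit and forecast each series independently.
forecasts = {
    key: seasonal_naive(values, horizon=10) for key, values in series.items()
}
print(forecasts[(1, 1)])
```

Because every series is handled on its own, the loop parallelizes trivially, but the resulting forecasts are not guaranteed to add up across stores and items the way reconciled forecasts do.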

What do you think of these approaches? Do you have any other suggestions on how to scale ARIMA to multiple time series?

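I would also like to try an LSTM, but I am new to data science and deep learning and do not know how to prepare the data. Should I keep the data in its original (long) form and apply one-hot encoding to the train_data['store'] and train_data['item'] columns, or should I start from the df I ended up with here?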
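
As a rough illustration of the first option, the sketch below one-hot encodes a store identifier and cuts a single series into fixed-length input/target windows, which is a common supervised framing for sequence models. The window sizes, helper names and toy series are assumptions, and no deep learning library is involved.

```python
def one_hot(index, size):
    """Return a list of zeros with a single 1 at the given position."""
    vec = [0] * size
    vec[index] = 1
    return vec


def make_windows(values, lookback, horizon):
    """Slide over the series and pair each input window with its target."""
    X, y = [], []
    for start in range(len(values) - lookback - horizon + 1):
        X.append(values[start:start + lookback])
        y.append(values[start + lookback:start + lookback + horizon])
    return X, y


sales = list(range(10))                 # toy series for one store/item pair
X, y = make_windows(sales, lookback=3, horizon=1)

store_id = one_hot(index=0, size=10)    # store 1 out of 10 stores
# Static features are repeated at every time step of a window.
sample = [[value] + store_id for value in X[0]]

print(len(X), X[0], y[0])
print(sample[0])
```

Each window would then be one training sample of shape (lookback, number of features); the item identifier could be appended in the same way as the store identifier.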

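【Comments】:

I tried using the HTS package to reconcile hierarchical and grouped forecasts, but the results were not good because it returned some negative values. I am developing a feature to enable constrained forecast reconciliation, and I could use your problem as a test case. I will have to get back to you over the weekend, when I will have more time.
That would be great, thank you!
@PauloSchauGuerra: I edited the post and got some results (obtained after changing the package's source code). I now have results, but as you mentioned earlier, some of the values are negative.
Thanks for updating the question. I have not had time to work on it yet, but I will do so next week.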

Tags: python time-series lstm forecasting arima

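【Solution 1】:

I hope this helps you solve the problem with the exogenous regressors. To deal with the negative forecasts, I suggest you try a square-root transformation.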

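【Comments】:

I assume you are the one who raised the bug on GitHub and proposed the fix. Thank you, because it was very helpful! However, as you can see, I am not quite there yet. In my edited post I finally got some results, but even for the positive forecasts the values on the y-axis are not close to the scale of the actual values, not to mention the negative forecasts. Could you be more specific about where to apply the square-root transformation? And what do you think could fix the problem with the positive forecasts?
You can create a custom transformer using the transform parameter of the HTSRegressor function. See the Example here.
For question 2, you could look into hyperparameter tuning. I am currently also trying to improve the forecasts for my own problem. If you find anything in that direction, it would be very helpful.
Thanks, I will take a look at the transform module.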
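
The custom transformer suggested in the comments above could look roughly like the sketch below: a pair of functions, square root on the way in and squaring on the way out. The namedtuple layout with `func` and `inv_func` fields follows what the scikit-hts documentation describes for the `transform` parameter as far as I recall, so treat it as an assumption. The HTSRegressor call is left as a comment and is not executed here.

```python
from collections import namedtuple

import numpy as np

# Assumed layout for a custom transform: a forward and an inverse function.
Transform = namedtuple("Transform", ["func", "inv_func"])
sqrt_transform = Transform(func=np.sqrt, inv_func=np.square)

sales = np.array([0.0, 4.0, 9.0, 25.0])
transformed = sqrt_transform.func(sales)        # model would be fitted on this scale
restored = sqrt_transform.inv_func(transformed)  # forecasts are mapped back

print(transformed, restored)

# Hypothetical wiring (assumption, not run here):
# autoarima = HTSRegressor(model='auto_arima', D=1, m=7, seasonal=True,
#                          revision_method='OLS', transform=sqrt_transform)
```

Squaring always yields non-negative values, but whether this removes negative forecasts entirely depends on where the library applies the inverse relative to reconciliation, which this thread does not establish.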