Unable to reproduce H2O GBM predictions despite setting seed

【Question title】: Unable to reproduce H2O GBM predictions despite setting seed
【Posted】: 2019-05-26 11:22:51
【Question description】:

I am trying to run multiple H2O models in a for loop, each on a different response variable.

H2O cluster uptime:         53 mins 11 secs
H2O cluster timezone:       Etc/UTC
H2O data parsing timezone:  UTC
H2O cluster version:        3.22.1.1
H2O cluster version age:    2 hours and 15 minutes
H2O cluster name:           H2O_from_python_root_np3l2m
H2O cluster total nodes:    1
H2O cluster free memory:    13.01 Gb
H2O cluster total cores:    8
H2O cluster allowed cores:  8
H2O cluster status:         locked, healthy
H2O connection url:         http://localhost:54321
H2O connection proxy:
H2O internal security:      False
H2O API Extensions:         XGBoost, Algos, AutoML, Core V3, Core V4
Python version:             2.7.12 final

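I have set a seed both for the train/validation split and for the model itself. Early stopping is enabled, but according to the documentation the results should still be reproducible as long as score_tree_interval is set.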

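### This is the code that's defining the model

# Imports needed by the functions below
import h2o
import pandas as pd
from h2o.estimators.gbm import H2OGradientBoostingEstimator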

def append_probs(hframe, response_col, model):
  pd_df = h2o.as_list(hframe).copy()
  pd_df.loc[:,'pred'] = h2o.as_list(model.predict(hframe)).values
  pd_df.loc[:,'error'] = pd_df['pred'] - pd_df[response_col]
  return pd_df

def run_model(response_col, model_typ, hframe_train, hframe_pred):
  h2o_dtypes = [hframe_train.type(e) for e in hframe_train.columns]
  data = h2o.deep_copy(hframe_train,'data')
  mapping = {'new_email_ldsub':'live_pp',
             'new_call_ldsub':'live_pp',
             'used_email_ldsub':'live_usedplus',
             'used_call_ldsub':'live_usedplus',
             'myapp_edm_ldsub':'live_myapp',
             'cc_edm_ldsub':'live_cc',
             'fbm_call_ldsub':'live_fbm',
             'fbm_email_ldsub':'live_fbm'}
  data = data[data[mapping[response_col]]==1]

  train, valid = data.split_frame([0.8], seed=1234)

  X = hframe_train.col_names[:-14]
  print(X)
  y = response_col
  print(y)

  if model_typ == 'gbm':
    model = H2OGradientBoostingEstimator(
      ntrees=512,
      learn_rate=0.08,
      max_depth=7,
      col_sample_rate = 0.7,
      sample_rate = 0.9,
      stopping_tolerance=1e-05,
      stopping_rounds=2,
      score_tree_interval=5,
      #nfolds=5,
      #fold_assignment = "Random",
      distribution = 'poisson',
      seed=20000,
      stopping_metric='mae',
      min_rows = 10,
      nbins = 30
    )

  model.train(X, y, training_frame=train, validation_frame=valid)

  pred_df = append_probs(hframe_pred,response_col,model)

  return model, pred_df

### This is the code that runs the model

gbm_results = pd.DataFrame()

gbm_mapping = {'live_pp':['new_call_ldsub','new_email_ldsub'],
           'live_usedplus':['used_call_ldsub','used_email_ldsub'],
           'live_myapp':['myapp_edm_ldsub'],
           'live_cc':['cc_edm_ldsub'],
           'live_fbm':['fbm_call_ldsub','fbm_email_ldsub']}

gbm_train_err = {}
gbm_valid_err = {}
gbm_xval_err = {}


for k,v in gbm_mapping.items():
  for e in v:
    gbm_mod, gbm_pred_df = run_model(e,'gbm',hframe,hframe_forecast_pred)
    gbm_pred_df = gbm_pred_df[['id','month','pred']]
    gbm_pred_df = gbm_pred_df.groupby(['id','month'])['pred'].sum().reset_index()
    gbm_pred_df.loc[:,'product'] = str(e)
    gbm_train_err[str(e)] = [gbm_mod.mae(train=True),gbm_mod.rmse(train=True)]
    gbm_valid_err[str(e)] = [gbm_mod.mae(valid=True),gbm_mod.rmse(valid=True)]
    gbm_xval_err[str(e)] = [gbm_mod.mae(xval=True),gbm_mod.rmse(xval=True)]
    gbm_results = pd.concat([gbm_results, gbm_pred_df])

gbm_results['process_month'] = pd.to_datetime(gbm_results['process_month'],unit='ms')

Based on the documentation, I expected the results for each model to be reproducible, or at least very close.

【Comments】:

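Hi @Lee, are you seeing different results, or do you just want to confirm that you should see reproducible results? If you are seeing different results, could you post a few examples so we can see how far apart the runs are? Thanks!
Another request: could you update your question with a short, fully reproducible example? If you are running on a single node and set the seed along with score_tree_interval (plus the few other criteria listed here: docs.h2o.ai/h2o/latest-stable/h2o-docs/data-science/gbm-faq/…), your model should be reproducible.
Hi @Lauren, thanks for your help. I am seeing different results. When I turn early stopping off I can reproduce the results, but early stopping with score_tree_interval set gives me different predictions every time. I will try to upload some examples if I can replicate this on a toy dataset.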
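One way to produce the "how far apart are the runs" numbers requested in the comments is to join the prediction tables from two runs and summarise the differences. The following is only a sketch: the helper name `compare_runs` and the toy data are made up for illustration, and it assumes each run yields a pandas DataFrame with `id`, `month` and `pred` columns like the `gbm_pred_df` built in the question.

```python
import pandas as pd


def compare_runs(run_a, run_b, keys=("id", "month"), value="pred"):
    """Join two prediction tables on their key columns and summarise the gap."""
    merged = run_a.merge(run_b, on=list(keys), suffixes=("_a", "_b"))
    diff = (merged[value + "_a"] - merged[value + "_b"]).abs()
    return {
        "rows_compared": len(merged),
        "max_abs_diff": float(diff.max()),
        "mean_abs_diff": float(diff.mean()),
        "rows_differing": int((diff > 1e-9).sum()),
    }


# Toy data standing in for the predictions of two separate runs (illustrative only).
run_1 = pd.DataFrame({"id": [1, 1, 2], "month": [1, 2, 1], "pred": [0.5, 1.5, 2.0]})
run_2 = pd.DataFrame({"id": [1, 1, 2], "month": [1, 2, 1], "pred": [0.5, 1.25, 2.0]})

summary = compare_runs(run_1, run_2)
print(summary)
```

A `max_abs_diff` of exactly zero would indicate identical predictions; anything else gives a concrete figure to post alongside the question.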

Tags: python pandas h2o gbm

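【Solution 1】:

As of the latest H2O-3 release, 3.22.1.1, the reproducibility requirements are listed in the documentation.

For convenience, here are the reproducibility requirements for a model on a single node:

Note that in addition to the seed, you need to use the same data (the same split) and the same parameters, and either no early stopping or early stopping with score_tree_interval set.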

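How do I guarantee reproducibility on a single-node cluster?

The following criteria must be met to guarantee reproducibility in a single-node cluster:

Same training data

Note: If you have H2O import a whole directory containing multiple files instead of a single file, we do not guarantee reproducibility, because the data may be shuffled during import.

Same parameters used to train the model
Same seed set (required whenever any sampling is done)
No early stopping, or early stopping with score_tree_interval set and the same validation data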
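The criteria above can be checked directly by training the same model twice on the same split and comparing the predictions. This is a minimal sketch, not a definitive test: the helper names (`gbm_params`, `fit_and_predict`, `max_prediction_gap`) are hypothetical, the parameter values are copied from the question, and it assumes a single-node cluster is already running and that `frame` is an `H2OFrame` with a non-negative count response.

```python
def gbm_params(seed=20000):
    """Parameters from the question: early stopping with score_tree_interval set."""
    return {
        "ntrees": 512,
        "learn_rate": 0.08,
        "max_depth": 7,
        "col_sample_rate": 0.7,
        "sample_rate": 0.9,
        "distribution": "poisson",
        "min_rows": 10,
        "nbins": 30,
        "stopping_rounds": 2,
        "stopping_metric": "mae",
        "stopping_tolerance": 1e-5,
        "score_tree_interval": 5,
        "seed": seed,
    }


def fit_and_predict(train, valid, x, y, params):
    """Train one GBM and return its predictions on the validation frame."""
    # Imported lazily so gbm_params() can be inspected without h2o installed.
    from h2o.estimators.gbm import H2OGradientBoostingEstimator

    model = H2OGradientBoostingEstimator(**params)
    model.train(x=x, y=y, training_frame=train, validation_frame=valid)
    return model.predict(valid).as_data_frame()["predict"]


def max_prediction_gap(frame, x, y, split_seed=1234):
    """Train twice on the same seeded split; return the largest absolute difference."""
    train, valid = frame.split_frame(ratios=[0.8], seed=split_seed)
    params = gbm_params()
    first = fit_and_predict(train, valid, x, y, params)
    second = fit_and_predict(train, valid, x, y, params)
    return float((first - second).abs().max())
```

After `h2o.init()`, calling `max_prediction_gap(frame, x, y)` with the predictor names and response column would be expected to return 0.0 if all of the listed criteria hold; a non-zero value suggests one of them is being violated.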

【Comments】:

I think I figured it out. I was using two different clusters that were running two different versions of h2o. Thanks for your help!
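A version mismatch like the one described here can be caught before training by recording the versions in each environment and comparing them. The sketch below is an illustration under assumptions: `find_version_mismatches` and `current_versions` are hypothetical helper names, and the version strings in the example are placeholders rather than values from the original post. `h2o.__version__` and `h2o.cluster().version` are used to read the Python client and cluster versions.

```python
def find_version_mismatches(versions):
    """Return an empty dict if all version strings agree, otherwise a copy of the input."""
    distinct = set(versions.values())
    return {} if len(distinct) <= 1 else dict(versions)


def current_versions():
    """Versions of the Python client and of the cluster it is connected to."""
    # Imported lazily; requires h2o to be installed and h2o.init() to have been called.
    import h2o

    return {"python_client": h2o.__version__, "cluster": h2o.cluster().version}


# Placeholder version strings collected from two environments (illustrative only).
same = find_version_mismatches({"cluster_a": "3.22.1.1", "cluster_b": "3.22.1.1"})
different = find_version_mismatches({"cluster_a": "3.22.1.1", "cluster_b": "3.20.0.1"})
print(same)
print(different)
```

Running `current_versions()` on each cluster and passing the collected values to `find_version_mismatches` makes it easy to confirm that every run used the same H2O version before comparing predictions.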