【Question title】: Difference in memory usage between gbm and blackboost
【Posted】: 2014-06-02 23:15:30
【Question】:

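I am working on a data set with about 250,000 observations and 50 predictors (some are factors, which gives about 100 features in the end). I cannot fit it with the blackboost() function (from the mboost package), which fails with a memory allocation error.

Meanwhile, gbm() handles the same amount of data without problems. According to the documentation, blackboost uses the same algorithm as gbm (http://cran.r-project.org/web/packages/mboost/mboost.pdf).

It is not clear to me why one function can handle the data set and the other cannot. My guesses:

  • gbm has a subsampling strategy (set by the "bag.fraction" argument), which does not seem to be implemented in blackboost and which may affect memory usage.
  • gbm uses a CART-like function to build the trees, whereas blackboost uses ctree, which seems to have a huge memory footprint (see "How to remove training data from party:::ctree models?").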
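
To make the first guess concrete, here is a minimal base R sketch of what bag.fraction-style subsampling means: each boosting iteration draws a random subset of rows and grows its tree on that subset only. The function name draw_bags and its arguments are hypothetical and chosen for illustration; this is not gbm's actual implementation. Note that the full data set still has to be held in memory either way, so subsampling by itself may not explain the difference.

```r
# Illustration only: draw the row indices each boosting iteration would use.
# `draw_bags`, `n_iter` and `bag_fraction` are hypothetical names.
draw_bags <- function(n, bag_fraction = 0.5, n_iter = 3) {
  bag_size <- floor(bag_fraction * n)
  lapply(seq_len(n_iter), function(m) {
    # sample rows without replacement for iteration m
    sample.int(n, size = bag_size)
  })
}

set.seed(1)
bags <- draw_bags(n = 250000, bag_fraction = 0.5, n_iter = 3)

# Each iteration sees only half of the rows
lengths(bags)
```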

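I would like to use the AUC() loss function, which is available in mboost but not in gbm, so I am interested in any suggestion for overcoming the memory limitations of blackboost.

A second question: when I try to reduce the number of variables in the model, I get this new error from blackboost: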

Error in matrix(f[ind1], nrow = n0, ncol = n1, byrow = TRUE) : the length of the data [107324] is not a multiple of the number of lines [152107]

It seems to come from the AUC gradient function.
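
A back-of-the-envelope check of that call may help here. The sketch below assumes that n0 is the row count reported in the message, that n1 equals the reported data length, and that the result is a dense matrix of doubles; these are readings of the error message, not confirmed details of mboost. Under those assumptions a single n0 x n1 matrix would need well over 100 GB.

```r
# Assumption: n0 and n1 are taken from the numbers in the error message
n0 <- 152107  # number of rows reported in the message
n1 <- 107324  # reported data length, assumed to be the number of columns
bytes_per_double <- 8

# Approximate size of one dense n0 x n1 matrix of doubles, in GB
size_gb <- n0 * n1 * bytes_per_double / 10^9
size_gb
```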

Thanks for your help.

【Comments】:

    Tags: r gbm


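    【Solution 1】:

    You are right that ctree is one of the reasons. The script below illustrates this. As it shows, you can reduce the memory requirement by setting tree_controls = party::ctree_control(..., remove_weights = TRUE). However, as far as I can tell, you cannot avoid the extra stored data.frame or some of the other sources of memory usage.

    The example follows: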

    # Load data and set options
    options(digits = 4)
    data("BostonHousing", package = "mlbench")
    
    # Size of the training data
    # (the value is in MB; print() still labels it "bytes")
    object.size(BostonHousing) / 10^6 # in MB
    #> 0.1 bytes
    
    # blackboost and mboost store a ctree-like structure, not on the object
    # itself but in an environment in the background. These can be big!
    # First, we use some of the default settings
    ctrl_lrg_mem <- party::ctree_control(
      teststat = "max",
      testtype = "Teststatistic",
      mincriterion = 0,
      maxdepth = 3,
      stump = FALSE,
      minbucket = 20,
      savesplitstats = FALSE, # Default w/ mboost
      remove_weights = FALSE) # Default w/ mboost
    
    gc() # shows memory usage before
    #>           used  (Mb) gc trigger  (Mb) max used  (Mb)
    #> Ncells 2467924 131.9    3886542 207.6  3886542 207.6
    #> Vcells 4553719  34.8   14341338 109.5 22408297 171.0
    fit1 <- mboost::blackboost(
      medv ~ ., data = BostonHousing,
      tree_controls = ctrl_lrg_mem,
      control = mboost::boost_control(
        mstop = 100))
    gc() # shows memory usage after
    #>           used  (Mb) gc trigger  (Mb) max used  (Mb)
    #> Ncells 2494735 133.3    3886542 207.6  3886542 207.6
    #> Vcells 5608368  42.8   14341338 109.5 22408297 171.0
    
    # It is not the object itself that requires a lot of memory
    object.size(fit1) / 10^6
    #> 1.3 bytes
    
    # It is the objects stored in the environments behind it
    tmp_env <- environment(fit1$predict)
    length(tmp_env$ens) # The boosted trees
    #> [1] 100
    sum(unlist(lapply(tmp_env$ens, object.size))) / 10^6
    #> [1] 7.312
    
    # Moreover, there is also a model frame for the data stored in the baselearner 
    # function's environment which takes some space
    env <- environment(fit1$basemodel[[1]]$fit)
    str(env$df) # data frame of initial data
    #> 'data.frame':    506 obs. of  14 variables:
    #>  $ crim                     : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
    #>  $ zn                       : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
    #>  $ indus                    : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
    #>  $ chas                     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
    #>  $ nox                      : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
    #>  $ rm                       : num  6.58 6.42 7.18 7 7.15 ...
    #>  $ age                      : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
    #>  $ dis                      : num  4.09 4.97 4.97 6.06 6.06 ...
    #>  $ rad                      : num  1 2 2 3 3 3 5 5 5 5 ...
    #>  $ tax                      : num  296 242 242 222 222 222 311 311 311 311 ...
    #>  $ ptratio                  : num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
    #>  $ b                        : num  397 397 393 395 397 ...
    #>  $ lstat                    : num  4.98 9.14 4.03 2.94 5.33 ...
    #>  $ WLKJDJDQYBTDQCZDNHZMPZNCS: num  0 0 0 0 0 0 0 0 0 0 ...
    object.size(env$df) / 10^6
    #> 0.1 bytes
    # str(env$object) # output excluded for space reasons
    object.size(env$object) / 10^6
    #> 0.8 bytes
    
    # The above implies that if your data is 1 GB then the fit will require
    # at least another 1 GB, as far as I gather
    
    # We can though reduce the memory requirements
    ctrl_sml_mem <- party::ctree_control(
      teststat = "max",
      testtype = "Teststatistic",
      mincriterion = 0,
      maxdepth = 3,
      stump = FALSE,
      minbucket = 20,
      savesplitstats = FALSE,
      remove_weights = TRUE)  # Changed
    
    gc()
    #>           used  (Mb) gc trigger  (Mb) max used  (Mb)
    #> Ncells 2494810 133.3    3886542 207.6  3886542 207.6
    #> Vcells 5608406  42.8   14341338 109.5 22408297 171.0
    fit2 <- mboost::blackboost(
      medv ~ ., data = BostonHousing,
      tree_controls = ctrl_sml_mem,
      control = mboost::boost_control(
        mstop = 100))
    gc()
    #>           used  (Mb) gc trigger  (Mb) max used  (Mb)
    #> Ncells 2520425 134.7    3886542 207.6  3886542 207.6
    #> Vcells 6081411  46.4   14341338 109.5 22408297 171.0
    
    # Reduces the size of the objects in the back
    tmp_env <- environment(fit2$predict)
    length(tmp_env$ens) # The boosted trees
    #> [1] 100
    sum(unlist(lapply(tmp_env$ens, object.size))) / 10^6
    #> [1] 2.611
    
    #####
    # The version I run
    sessionInfo(package = c("party", "mboost"))
    #> R version 3.4.0 (2017-04-21)
    #> Platform: x86_64-w64-mingw32/x64 (64-bit)
    #> Running under: Windows >= 8 x64 (build 9200)
    #> 
    #> Matrix products: default
    #> 
    #> locale:
    #> [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United Kingdom.1252   
    #> [3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C                           
    #> [5] LC_TIME=English_United Kingdom.1252    
    #> 
    #> attached base packages:
    #> character(0)
    #> 
    #> other attached packages:
    #> [1] party_1.2-3  mboost_2.8-0
    #> 
    #> loaded via a namespace (and not attached):
    #>  [1] Rcpp_0.12.11        compiler_3.4.0      formatR_1.4         git2r_0.18.0        R.methodsS3_1.7.1  
    #>  [6] methods_3.4.0       R.utils_2.5.0       utils_3.4.0         tools_3.4.0         grDevices_3.4.0    
    #> [11] boot_1.3-19         digest_0.6.12       jsonlite_1.4        memoise_1.1.0       R.cache_0.12.0     
    #> [16] lattice_0.20-35     Matrix_1.2-9        shiny_1.0.2         parallel_3.4.0      curl_2.5           
    #> [21] mvtnorm_1.0-6       speedglm_0.3-2      coin_1.1-3          R.rsp_0.41.0        withr_1.0.2        
    #> [26] httr_1.2.1          stringr_1.2.0       knitr_1.15.1        stabs_0.6-2         graphics_3.4.0     
    #> [31] datasets_3.4.0      stats_3.4.0         devtools_1.12.0     stats4_3.4.0        dynamichazard_0.3.0
    #> [36] grid_3.4.0          base_3.4.0          data.table_1.10.4   R6_2.2.0            survival_2.41-2    
    #> [41] multcomp_1.4-6      TH.data_1.0-8       magrittr_1.5        nnls_1.4            codetools_0.2-15   
    #> [46] modeltools_0.2-21   htmltools_0.3.6     splines_3.4.0       MASS_7.3-47         rsconnect_0.7      
    #> [51] strucchange_1.5-1   mime_0.5            xtable_1.8-2        httpuv_1.3.3        quadprog_1.5-5     
    #> [56] sandwich_2.3-4      stringi_1.1.5       zoo_1.8-0           R.oo_1.21.0
    
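The script above finds the stored trees through environment(fit1$predict) because object.size() does not include a function's environment in its calculation. The following base R sketch shows the same effect without any packages. The names make_fit, ens and fit are hypothetical stand-ins for illustration and are not part of mboost.

```r
# Hypothetical stand-in for a fitted model: a list whose function closes
# over a large vector, similar in spirit to fit1$predict above.
make_fit <- function(n) {
  ens <- numeric(n)                       # large object kept in the closure
  list(predict = function() length(ens))
}

fit <- make_fit(1e6)

# object.size() does not follow the function's environment, so this is small
size_visible <- as.numeric(object.size(fit))

# Looking inside the environment reveals the hidden storage (about 8 MB here)
hidden_env  <- environment(fit$predict)
size_hidden <- as.numeric(object.size(hidden_env$ens))

# Both sizes in MB
c(visible = size_visible, hidden = size_hidden) / 10^6
```

The same pattern applies to a real fit: inspect the environments of the functions stored on the object rather than relying on object.size() of the object alone.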

    【Comments】:
