【Question title】: recipes::step_dummy + caret::train -> Error: Not all variables in the recipe are present
【Posted】: 2019-03-13 01:01:08
【Question】:

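I get the following error when using recipes::step_dummy with caret::train (this is my first attempt at combining the two packages):

Error: Not all variables in the recipe are present in the supplied training set

I am not sure what causes the error or how best to debug it. Any help getting the model to train would be much appreciated.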

library(caret)
library(tidyverse)
library(recipes)
library(rsample)

data("credit_data")

## Split the data into training (75%) and test sets (25%)
set.seed(100)
train_test_split <- initial_split(credit_data)
credit_train <- training(train_test_split)
credit_test <- testing(train_test_split)

# Create recipe for data pre-processing
rec_obj <- recipe(Status ~ ., data = credit_train) %>%
  step_knnimpute(all_predictors()) %>%
  #step_other(Home, Marital, threshold = .2, other = "other") %>%
  #step_other(Job, threshold = .2, other = "others") %>%
  step_dummy(Records)  %>% 
  step_center(all_numeric())  %>%
  step_scale(all_numeric()) %>%
  prep(training = credit_train, retain = TRUE) 

train_data <- juice(rec_obj)
test_data  <- bake(rec_obj, credit_test)

set.seed(1055)
# the glm function models the second factor level.
lrfit <- train(rec_obj, data = train_data,
                     method = "glm",
                     trControl = trainControl(method = "repeatedcv", 
                                              repeats = 5))

【Comments】:

    Tags: r r-caret r-recipes


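    【Solution 1】:

    Do not prep the recipe before handing it to train, and pass train the original (unprocessed) training set. train preps the recipe itself, so it needs the columns the recipe was defined on: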

    library(caret)
    #> Loading required package: lattice
    #> Loading required package: ggplot2
    library(tidyverse)
    library(recipes)
    #> 
    #> Attaching package: 'recipes'
    #> The following object is masked from 'package:stringr':
    #> 
    #>     fixed
    #> The following object is masked from 'package:stats':
    #> 
    #>     step
    library(rsample)
    
    data("credit_data")
    
    ## Split the data into training (75%) and test sets (25%)
    set.seed(100)
    train_test_split <- initial_split(credit_data)
    credit_train <- training(train_test_split)
    credit_test <- testing(train_test_split)
    
    # Create recipe for data pre-processing
    rec_obj <- 
      recipe(Status ~ ., data = credit_train) %>%
      step_knnimpute(all_predictors()) %>%
      #step_other(Home, Marital, threshold = .2, other = "other") %>%
      #step_other(Job, threshold = .2, other = "others") %>%
      step_dummy(Records)  %>% 
      step_center(all_numeric())  %>%
      step_scale(all_numeric()) 
    
    set.seed(1055)
    # the glm function models the second factor level.
    lrfit <- train(rec_obj, data = credit_train,
                   method = "glm",
                   trControl = trainControl(method = "repeatedcv", 
                                            repeats = 5))
    lrfit
    #> Generalized Linear Model 
    #> 
    #> 3341 samples
    #>   13 predictor
    #>    2 classes: 'bad', 'good' 
    #> 
    #> Recipe steps: knnimpute, dummy, center, scale 
    #> Resampling: Cross-Validated (10 fold, repeated 5 times) 
    #> Summary of sample sizes: 3006, 3008, 3007, 3007, 3007, 3007, ... 
    #> Resampling results:
    #> 
    #>   Accuracy   Kappa    
    #>   0.7965349  0.4546223
    

    Created on 2019-03-20 by the reprex package (v0.2.1)
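One way to debug the original error is to compare the variables the recipe was defined on with the columns of the juiced data. The sketch below does this on a small made-up data frame rather than credit_data; the toy values and the names toy, rec, juiced and missing_vars are assumptions for illustration only.

```r
library(recipes)

# Toy data standing in for credit_train (hypothetical values)
toy <- data.frame(
  Status  = factor(c("bad", "good", "good", "bad", "good", "good")),
  Records = factor(c("no", "yes", "no", "no", "yes", "no")),
  Income  = c(10, 20, 15, 30, 25, 18)
)

# Prep the recipe up front, as in the question
rec <- recipe(Status ~ ., data = toy) %>%
  step_dummy(Records) %>%
  prep(training = toy, retain = TRUE)

juiced <- juice(rec)

# Variables the recipe was originally defined on that are
# no longer present as columns in the processed data
missing_vars <- setdiff(summary(rec, original = TRUE)$variable, names(juiced))
missing_vars
```

step_dummy replaces Records with an indicator column, so the juiced data no longer has a column the recipe asks for. That is the mismatch train reports when it is given a recipe together with already-processed data.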

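    【Comments】:

    • In a larger application I get the message "In `[<-.factor`(`*tmp*`, !is_complete(data), value = "Missing") : invalid factor level, NA generated", and I wonder whether I need to pass strings_as_factors = FALSE through to train somehow? If I add step_factor2string(all_nominal()) as the first step of the recipe, the message goes away. Is there another way?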
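The "invalid factor level, NA generated" message in the comment above can be reproduced in base R, which may help explain why converting factors to strings makes it go away. This is only a minimal sketch of the general mechanism, not a trace of what caret or recipes do internally; the vectors f, g and h are made up for illustration.

```r
f <- factor(c("a", "b", NA))

# Assigning a value that is not an existing level only warns;
# the element stays NA
g <- f
g[is.na(g)] <- "Missing"   # warning: invalid factor level, NA generated
anyNA(g)

# A character vector has no level check, so the replacement sticks
h <- as.character(f)
h[is.na(h)] <- "Missing"
h <- factor(h)
levels(h)
```

Converting to character first, as step_factor2string does inside a recipe, sidesteps the level check; adding the new level to the factor before assigning would be another option.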
    【Solution 2】:

    It seems you still need the formula in the train function (even though it is already given in the recipe)?...

    glmfit <- train(Status ~ ., data = juice(rec_obj),
                         method = "glm",
                         trControl = trainControl(method = "repeatedcv", repeats = 5))
    

    【Comments】:

    • No, that is not the problem. train re-preps the data, so it needs the original data (not the juiced version).
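Because train preps the recipe itself, the same holds at prediction time: the fitted object can be given raw new data and applies the recipe before predicting. The sketch below illustrates this on simulated data. The data frame, the simulated relationship, and the names toy_train, toy_test, fit and pred are assumptions for illustration; it uses 5-fold CV rather than the repeated CV from the question only to keep the run short.

```r
library(caret)
library(recipes)

# Simulated stand-in for credit_data (hypothetical values)
set.seed(1)
n <- 200
toy <- data.frame(
  Records = factor(sample(c("no", "yes"), n, replace = TRUE)),
  Income  = rnorm(n, mean = 100, sd = 20)
)
score <- toy$Income + 15 * (toy$Records == "no") + rnorm(n, sd = 20)
toy$Status <- factor(ifelse(score > 105, "good", "bad"))

toy_train <- toy[1:150, ]
toy_test  <- toy[151:200, ]

# Unprepped recipe: no prep(), juice() or bake() calls by hand
rec <- recipe(Status ~ ., data = toy_train) %>%
  step_dummy(Records) %>%
  step_center(all_numeric()) %>%
  step_scale(all_numeric())

fit <- train(rec, data = toy_train,
             method = "glm",
             trControl = trainControl(method = "cv", number = 5))

# Raw test data goes in; the recipe is applied before predicting
pred <- predict(fit, newdata = toy_test)
table(pred)
```

With this pattern the manual juice and bake calls from the question are not needed for fitting or for scoring the test set.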