I can't figure out how to fix a tidymodels error

【Question title】: I can't figure out how to fix tidymodel error
【Posted】: 2021-06-19 03:16:43
【Question】:

# Partition the data:
library(tidymodels)

set.seed(1234)
uni_split <- initial_split(suspicious_match, strata = truth)
uni_train <- training(uni_split)
uni_test <- testing(uni_split)

uni_split

## Build a model recipe :
library(themis)

uni_rec <- recipe(truth ~ lv + lcs + qgram + jaccard + jw + cosine , data = uni_train)%>%
  step_normalize(all_numeric()) %>%
  step_smote(truth, skip = FALSE)%>%
  prep()

uni_rec

bake(uni_rec, new_data = uni_train)

I trained several models on the data (one example shown here):

# Train Logistic Regression :
glm_spec <- logistic_reg()%>%
  set_engine("glm")

glm_fit <- glm_spec %>%
  fit(truth ~ lv + lcs + qgram + cosine + jaccard + jw , data= juice(uni_rec))

glm_fit

## Model evaluation with resampling :
set.seed(123)

folds <- vfold_cv(juice(uni_rec), strata = truth)

folds

#1: Logistic Reg:
set.seed(234)

glm_rs <- glm_spec%>%
  fit_resamples(truth ~ lv + lcs + qgram + cosine + jaccard + jw, folds, 
                metrics = metric_set(roc_auc, sens, spec, accuracy),
                control = control_resamples(save_pred = TRUE))

## Model evaluation:

glm_rs  %>% collect_metrics()

> glm_rs  %>% collect_metrics()
# A tibble: 4 x 6
  .metric  .estimator  mean     n std_err .config             
  <chr>    <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy binary     0.851    10 0.00514 Preprocessor1_Model1
2 roc_auc  binary     0.898    10 0.00390 Preprocessor1_Model1
3 sens     binary     0.875    10 0.00695 Preprocessor1_Model1
4 spec     binary     0.827    10 0.00700 Preprocessor1_Model1

But when I try to apply the logistic regression model to the test data, I get this error:

> glm_fit %>%
+   predict(new_data = bake(uni_rec, new_data = uni_test),
+           type = "prob")%>%
+   mutate(truth = uni_test$truth)%>%
+   roc_auc(truth, .pred_correct)
Erreur : Problem with `mutate()` input `truth`.
x Input `truth` can't be recycled to size 2022.
i Input `truth` is `uni_test$truth`.
i Input `truth` must be size 2022 or 1, not 1373.
Run `rlang::last_error()` to see where the error occurred.

I think this is caused by the step_smote step in the recipe, but I don't know how to fix it. Please help!

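【Comments】:

You should keep skip = TRUE in step_smote. That ensures the step is applied only to the training set. With skip = FALSE, the data is also upsampled at prediction time, which does not make sense: you want the number of observations to stay the same throughout prediction.
Thanks, I tried your suggestion and that fixed it.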
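
The fix suggested in the comment can be sketched as follows. This is a minimal example under stated assumptions: `toy` is a made-up imbalanced data set standing in for `suspicious_match` (which is not available here), and only two of the original predictors are simulated. With skip = TRUE, SMOTE runs when the recipe is prepped on the training data but is skipped when baking new data, so the baked test set keeps its original number of rows.

```r
library(tidymodels)
library(themis)

# Hypothetical imbalanced data standing in for suspicious_match
set.seed(1234)
n <- 1000
toy <- tibble(
  lv = rnorm(n),
  jw = rnorm(n),
  truth = factor(ifelse(lv + rnorm(n) > 1.2, "correct", "incorrect"),
                 levels = c("correct", "incorrect"))
)

toy_split <- initial_split(toy, strata = truth)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# skip = TRUE (the default): SMOTE is applied in prep() to the training
# data, but bake() skips it for new data
toy_rec <- recipe(truth ~ lv + jw, data = toy_train) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_smote(truth, skip = TRUE) %>%
  prep()

# Fit on the prepped (upsampled) training data
glm_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(truth ~ lv + jw, data = bake(toy_rec, new_data = NULL))

test_baked <- bake(toy_rec, new_data = toy_test)

# Row counts match, so the original outcome column lines up with predictions
nrow(test_baked) == nrow(toy_test)

glm_fit %>%
  predict(new_data = test_baked, type = "prob") %>%
  mutate(truth = toy_test$truth) %>%
  roc_auc(truth, .pred_correct)
```

Here `bake(toy_rec, new_data = NULL)` returns the processed training set, including the synthetic SMOTE rows, while the test set is only normalized.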

Tags: r machine-learning tidymodels

【Solution 1】:

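Your test set changed when you used bake. (@Emil Hvitfeldt identified why.) I don't have the data you're using, but when I applied bake to the data I used, the outcome variable (truth in your data) was left in the result. So you can drop the call to mutate. Once that worked as expected, I found that truth was not recognized in roc_auc.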

To track down these errors I ran

fit.p <- glm_fit %>% predict(new_data = bake(uni_rec, new_data = uni_test),
                             type = "prob")

Then I looked at fit.p. This is what worked for my data (where the outcome column is vs):

nd = bake(uni_rec, new_data = uni_test)     

glm_fit %>%
  predict(new_data = nd,
          type = "prob") %>% 
  roc_auc(nd$vs, .pred_0)
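
To see the row-count change for yourself, a comparison like the one below may help. It is only a sketch: `toy` is invented data, not the asker's `suspicious_match`, and `smote_rec` is a hypothetical recipe name. It reproduces the skip = FALSE setup from the question, then takes the outcome from the baked data so that it lines up with the predictions.

```r
library(tidymodels)
library(themis)

# Hypothetical imbalanced data standing in for suspicious_match
set.seed(1234)
n <- 1000
toy <- tibble(
  lv = rnorm(n),
  jw = rnorm(n),
  truth = factor(ifelse(lv + rnorm(n) > 1.2, "correct", "incorrect"),
                 levels = c("correct", "incorrect"))
)

toy_split <- initial_split(toy, strata = truth)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# skip = FALSE mirrors the question: SMOTE also runs inside bake()
smote_rec <- recipe(truth ~ lv + jw, data = toy_train) %>%
  step_smote(truth, skip = FALSE) %>%
  prep()

glm_fit <- logistic_reg() %>%
  set_engine("glm") %>%
  fit(truth ~ lv + jw, data = bake(smote_rec, new_data = NULL))

nd <- bake(smote_rec, new_data = toy_test)

# The baked test set has gained synthetic rows
c(original = nrow(toy_test), baked = nrow(nd))

# Use the outcome from the baked data so it matches the predictions row for row
glm_fit %>%
  predict(new_data = nd, type = "prob") %>%
  bind_cols(nd %>% select(truth)) %>%
  roc_auc(truth, .pred_correct)
```

Note that this scores the model on an upsampled test set that contains synthetic rows. The skip = TRUE approach from the comments avoids that, so it is probably the cleaner fix.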

【Discussion】: