【问题标题】:Tidymodels - Help evaluating regression models made via recipesTidymodels - 帮助评估通过配方制作的回归模型
【发布时间】:2021-08-12 08:59:39
【问题描述】:

我正在处理当前关于工资的 tidytuesday 数据,并尝试使用 tidymodels 和食谱创建一个模型。我想用食谱代码中的许多其他因素来预测薪水,但我遇到了一个问题。

问题 1 - 我的食谱说有空行,但我不知道如何弄清楚。这不会给出错误,所以也许这不是问题。

问题 2 - 了解我的模型实际做了什么以及如何可视化性能。我想在初始数据上绘制模型性能。这是我的目标示例:https://indescribled.files.wordpress.com/2021/05/image-17.png?w=782

我不明白如何在我的食谱中使用预测功能。 juice(rec) 少于 1000 行,而测试数据约为 6000。也许我正在向后阅读,但有人可以尝试指出我正确的方向吗?

下面的代码应该是我的精确复制。

library(tidymodels)
library(tidyverse)

salary_raw <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-18/survey.csv')

# Could not figure out tidy way to do this
salary_raw$other_monetary_comp[is.na(salary_raw$other_monetary_comp)] <- 0
salary_raw$other_monetary_comp <- as.numeric(salary_raw$other_monetary_comp)

# Filter and convert to USD
# The mutations to industry were because of other errors, they may not be needed

salary_modeling <- salary_raw %>% 
  filter(
    how_old_are_you %in% c("55-64","45-54","35-44","25-34","18-24"),
    currency %in% c("AUD/NZD","CAD","EUR","GBP","USD")
  ) %>% 
  mutate(annual_salary = case_when(
    currency == "USD" ~ annual_salary * 1.00,
    currency == "GBP" ~ annual_salary * 1.42,
    currency == "AUD/NZD" ~ annual_salary * 0.75,
    currency == "CAD" ~ annual_salary * 0.83,
    currency == "EUR" ~ annual_salary * 1.22
  )) %>% 
  mutate(other_monetary_comp = case_when(
    currency == "USD" ~ other_monetary_comp * 1.00,
    currency == "GBP" ~ other_monetary_comp * 1.42,
    currency == "AUD/NZD" ~ other_monetary_comp * 0.75,
    currency == "CAD" ~ other_monetary_comp * 0.83,
    currency == "EUR" ~ other_monetary_comp * 1.22
  )) %>% 
  rename(age = how_old_are_you,
         prof_exp = overall_years_of_professional_experience,
         field_exp = years_of_experience_in_field,
         education = highest_level_of_education_completed
  ) %>% 
  mutate(total_comp = annual_salary + other_monetary_comp) %>% 
  filter(total_comp > 10000,
         total_comp < 350000)%>% 
  mutate(gender = case_when(
    gender == "Prefer not to answer" ~ "Other or prefer not to answer",
    TRUE ~ gender
  )) %>% 
  mutate(industry = case_when(
    industry == "Biotech pharmaceuticals" ~ "Biotech",
    industry == "Consumer Packaged Goods" ~ "Consumer packaged goods ",
    industry == "Real Estate Development" ~ "Real Estate",
    TRUE ~ industry
  ))

# Create initial splits

set.seed(123)
salary_split <- initial_split(salary_modeling)
salary_train <- training(salary_split)
salary_test <- testing(salary_split)

# I want to predict total comp with many of the other variables, listed below. Here is my logic
# downsample is because there are a lot more women than men in the data, unsure if necessary
# log is to many data more interpretable, not necessary
# an error message told me to use novel
# unknown is to change NA to unknown as far as I understand
# other is to change values that are less than 5% of the total dataset to "Other"
# unsure what the purpose of dummy is, but it seems to be necessary for modeling

rec <- salary_train %>% 
  recipe(total_comp ~ age + gender + field_exp + race + industry + job_title) %>%
  themis::step_downsample(gender) %>%
  step_log(total_comp, base = 2) %>%
  step_novel(race, industry) %>% 
  step_unknown(race, industry, gender) %>%
  step_other(race, industry, job_title, threshold = .005) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  prep()

# ISSUE 1 - Running rec says that there are 19,081 data points and 226 incomplete rows. I do not know how to fix the incomplete rows

test_proc <- bake(rec, new_data = salary_test)

# Linear model ------------------------------------------------------------

lm_spec <- linear_reg() %>% 
  set_engine("lm")

lm_fitted <- lm_spec %>% 
  fit(total_comp ~ ., data = juice(rec))
tidy(lm_fitted)

# RF MODEL ----------------------------------------------------------------

rf_spec <- rand_forest(mode = "regression", trees = 1500) %>% 
  set_engine("ranger")

rf_fit <- rf_spec %>% 
  fit(total_comp ~ ., 
      data = juice(rec))  
rf_fit

# QUESTIONS BEGIN HERE --------------------------------------------------------------------------------------------------------------------------------------------------

# Need to figure out what new data is for this portion
# I think it is juice(rec), but it seems weird to me
# juice(rec) is only about 900 rows while test_proc is multiple thousand. testing data should be smaller than training
asdf <- juice(rec)

results_train <- lm_fitted %>% 
  predict(new_data = asdf) %>% 
  mutate(
    truth = asdf$total_comp,
    model = "lm"
  ) %>% 
  bind_rows(rf_fit %>% 
              predict(new_data = asdf) %>% 
              mutate(
                truth = asdf$total_comp,
                model = "rf"
              ))
results_train

# Is the newdata and test proc correct?
results_test <- lm_fitted %>% 
  predict(new_data = test_proc) %>% 
  mutate(
    truth = test_proc$total_comp,
    model = "lm"
  ) %>%
  bind_rows(rf_fit %>%
              predict(new_data = test_proc) %>%
              mutate(
                truth = test_proc$total_comp,
                model = "rf"
              ))
results_test

# Goal is to run the following code to visualize the predictions, the code below probably will do nothing right now unless the two dataframes above are correct

results_test %>% 
  mutate(train = "testing") %>% 
  bind_rows(results_train %>% 
              mutate(train = "training")) %>% 
  ggplot(aes(truth, .pred, color = model)) +
    geom_abline(lty = 2, color = "gray80", size = 1.5) +
    geom_point(alpha = .75) +
    facet_wrap(~train)

【问题讨论】:

    标签: r tidymodels


    【解决方案1】:

    看来你们相处得很好!

    library(tidymodels)
    library(tidyverse)
    
    salary_raw <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-05-18/survey.csv')
    #> 
    #> ── Column specification ────────────────────────────────────────────────────────
    #> cols(
    #>   timestamp = col_character(),
    #>   how_old_are_you = col_character(),
    #>   industry = col_character(),
    #>   job_title = col_character(),
    #>   additional_context_on_job_title = col_character(),
    #>   annual_salary = col_double(),
    #>   other_monetary_comp = col_character(),
    #>   currency = col_character(),
    #>   currency_other = col_character(),
    #>   additional_context_on_income = col_character(),
    #>   country = col_character(),
    #>   state = col_character(),
    #>   city = col_character(),
    #>   overall_years_of_professional_experience = col_character(),
    #>   years_of_experience_in_field = col_character(),
    #>   highest_level_of_education_completed = col_character(),
    #>   gender = col_character(),
    #>   race = col_character()
    #> )
    
    salary_modeling <- salary_raw %>% 
      replace_na(list(other_monetary_comp = 0)) %>%
      filter(
        how_old_are_you %in% c("55-64","45-54","35-44","25-34","18-24"),
        currency %in% c("AUD/NZD","CAD","EUR","GBP","USD")
      ) %>% 
      mutate(annual_salary = case_when(
        currency == "USD" ~ annual_salary * 1.00,
        currency == "GBP" ~ annual_salary * 1.42,
        currency == "AUD/NZD" ~ annual_salary * 0.75,
        currency == "CAD" ~ annual_salary * 0.83,
        currency == "EUR" ~ annual_salary * 1.22
      )) %>% 
      mutate(other_monetary_comp = parse_number(other_monetary_comp),
             other_monetary_comp = case_when(
        currency == "USD" ~ other_monetary_comp * 1.00,
        currency == "GBP" ~ other_monetary_comp * 1.42,
        currency == "AUD/NZD" ~ other_monetary_comp * 0.75,
        currency == "CAD" ~ other_monetary_comp * 0.83,
        currency == "EUR" ~ other_monetary_comp * 1.22
      )) %>% 
      rename(age = how_old_are_you,
             prof_exp = overall_years_of_professional_experience,
             field_exp = years_of_experience_in_field,
             education = highest_level_of_education_completed
      ) %>% 
      mutate(total_comp = annual_salary + other_monetary_comp) %>% 
      filter(total_comp > 10000,
             total_comp < 350000) %>% 
      mutate(gender = case_when(
        gender == "Prefer not to answer" ~ "Other or prefer not to answer",
        TRUE ~ gender
      )) %>% 
      mutate(industry = case_when(
        industry == "Biotech pharmaceuticals" ~ "Biotech",
        industry == "Consumer Packaged Goods" ~ "Consumer packaged goods ",
        industry == "Real Estate Development" ~ "Real Estate",
        TRUE ~ industry
      ))
    
    set.seed(123)
    salary_split <- initial_split(salary_modeling)
    salary_train <- training(salary_split)
    salary_test <- testing(salary_split)
    
    rec <- salary_train %>% 
      recipe(total_comp ~ age + gender + field_exp + race + industry + job_title) %>%
      themis::step_downsample(gender) %>%
      step_log(total_comp, base = 2) %>%
      step_novel(race, industry) %>% 
      step_unknown(race, industry, gender) %>%
      step_other(race, industry, job_title, threshold = 0.005) %>%
      step_dummy(all_nominal_predictors())
    

    这里的配方说训练数据的行不完整是因为缺少数据;这就是你使用step_unknown()的原因,我猜。

    prep(rec)
    #> Data Recipe
    #> 
    #> Inputs:
    #> 
    #>       role #variables
    #>    outcome          1
    #>  predictor          6
    #> 
    #> Training data contained 19080 data points and 235 incomplete rows. 
    #> 
    #> Operations:
    #> 
    #> Down-sampling based on gender [trained]
    #> Log transformation on total_comp [trained]
    #> Novel factor level assignment for race, industry [trained]
    #> Unknown factor level assignment for race, industry, gender [trained]
    #> Collapsing factor levels for race, industry, job_title [trained]
    #> Dummy variables from age, gender, field_exp, race, industry, job_title [trained]
    

    这里处理的训练集因为下采样,不再有那么多的观察了;我们不会对测试集应用下采样,因为我们想计算测试集上的指标,因为它们会“在野外”出现。

    train_proc <- rec %>% prep() %>% bake(new_data = NULL)
    test_proc  <- rec %>% prep() %>% bake(new_data = salary_test)
    dim(train_proc)
    #> [1] 878  57
    dim(test_proc)
    #> [1] 6361   57
    
    
    lm_fitted <- linear_reg() %>% 
      set_engine("lm") %>%
      fit(total_comp ~ ., data = train_proc)
    
    rf_fitted <- rand_forest(mode = "regression", trees = 1500) %>% 
      set_engine("ranger") %>%
      fit(total_comp ~ ., data = train_proc)
    

    我会尝试使用augment() 函数来进行可视化

    bind_rows(
      augment(lm_fitted, new_data = train_proc) %>% mutate(model = "lm", train = "train"),
      augment(rf_fitted, new_data = train_proc) %>% mutate(model = "rf", train = "train"),
      augment(lm_fitted, new_data = test_proc) %>% mutate(model = "lm", train = "test"),
      augment(rf_fitted, new_data = test_proc) %>% mutate(model = "rf", train = "test")
    ) %>%
      ggplot(aes(total_comp, .pred, color = model)) +
      geom_abline(lty = 2, color = "gray80", size = 1.5) +
      geom_point(alpha = .5) +
      facet_wrap(~ train)
    
    

    reprex package (v2.0.0) 于 2021 年 5 月 24 日创建

    【讨论】:

      猜你喜欢
      • 2016-10-03
      • 2020-01-28
      • 2017-12-15
      • 1970-01-01
      • 2014-06-13
      • 1970-01-01
      • 1970-01-01
      • 2011-07-04
      相关资源
      最近更新 更多