【Question title】: Factor Levels and Modelling in R
【Posted】: 2020-01-27 23:24:23
【Question description】:

The following code runs a very simple lm() and tries to summarise the results (factor levels, coefficients) in a small data frame:

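# stringsAsFactors = TRUE is needed in R >= 4.0.0, where strings are no longer converted to factors by default
df <- data.frame(star_sign = c("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces"),
                 y = c(1.1, 1.2, 1.4, 1.3, 1.8, 1.6, 1.4, 1.3, 1.2, 1.1, 1.5, 1.3),
                 stringsAsFactors = TRUE)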

levels(df$star_sign) #alphabetical order

# fit a simple linear model

my_lm <- lm(y ~ 1 + star_sign, data = df)
summary(my_lm) # intercept is based on first level of factor, aquarius

# I want the levels to work properly 1..12 = Aries, Taurus...Pisces so I'm going to redefine the factor levels

df$my_levels <- c("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces")

df$star_sign <- factor(df$star_sign, levels = df$my_levels)

my_lm <- lm(y ~ 1 + star_sign, data = df)
summary(my_lm) # intercept is based on first level of factor which is now Aries

# but for my model fit I want the reference level to be Virgo (because reasons)

df$star_sign_2 <- relevel(df$star_sign, ref = "Virgo")

my_lm <- lm(y ~ 1 + star_sign_2, data = df)
summary(my_lm)

df_results <- data.frame(factor_level = names(my_lm$coefficients), coeff = my_lm$coefficients )

# tidy up
rownames(df_results) <- 1:12
df_results$factor_level <- as.factor(gsub("star_sign_2", "", df_results$factor_level))

# change label of "(Intercept)" to "Virgo"
df_results$factor_level <- plyr::revalue(df_results$factor_level, c("(Intercept)" = "Virgo"))

levels(df_results$factor_level) # the levels are alphabetical + Virgo at the front (not same as display order from lm)

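The order of the factor levels is not what I want: I would like to sort df_results so that the star signs appear in the same order as they did originally (Aries, Taurus, ..., Pisces). I don't think I have a good grasp of how to manipulate factors and their labels/levels, so I am struggling to work out how to do this.

This is also a rather long and clumsy piece of code. Is there a more concise way to do this kind of thing?

Thanks.

(P.S. Mathematically the model is obviously trivial, but that doesn't matter for these purposes; I'm only interested in how to manipulate the output.)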

【Question discussion】:

    Tags: r lm


    【Solution 1】:

    Here is how I would extract the model coefficients using the broom package (together with dplyr):

    library(broom)
    library(dplyr)
    broom::tidy(my_lm) %>%
      mutate(term = sub("star_sign_2", "", term),
             term = ifelse(term == "(Intercept)", "Virgo", term),
             term = factor(term, levels = unique(term)))
    # A tibble: 12 x 5
       term        estimate std.error statistic p.value
       <fct>          <dbl>     <dbl>     <dbl>   <dbl>
     1 Virgo          1.6         NaN       NaN     NaN
     2 Aries         -0.500       NaN       NaN     NaN
     3 Taurus        -0.4         NaN       NaN     NaN
     4 Gemini        -0.2         NaN       NaN     NaN
     5 Cancer        -0.300       NaN       NaN     NaN
     6 Leo            0.20        NaN       NaN     NaN
     7 Libra         -0.2         NaN       NaN     NaN
     8 Scorpio       -0.3         NaN       NaN     NaN
     9 Sagittarius   -0.4         NaN       NaN     NaN
    10 Capricorn     -0.500       NaN       NaN     NaN
    11 Aquarius      -0.1         NaN       NaN     NaN
    12 Pisces        -0.300       NaN       NaN     NaN
    

    Setting levels = unique(term) is a neat trick that puts the levels in their order of appearance.
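
    A minimal base-R sketch of why the trick works: by default factor() sorts the levels alphabetically, whereas passing levels = unique(x) keeps them in order of first appearance. The vector signs below is a made-up illustration, not part of the code above.

```r
# Hypothetical example vector (not taken from the question's data)
signs <- c("Virgo", "Aries", "Taurus", "Gemini")

# Default behaviour: levels are sorted alphabetically
alphabetical <- levels(factor(signs))

# levels = unique(x): levels follow the order of first appearance
by_appearance <- levels(factor(signs, levels = unique(signs)))

print(alphabetical)   # "Aries" "Gemini" "Taurus" "Virgo"
print(by_appearance)  # "Virgo" "Aries" "Taurus" "Gemini"
```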

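    My other suggestion is to keep a vector of the levels, in the order you want, separate from the data frame, and to refer to it whenever you need to establish the order. For example,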

    astro_order = c("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo", "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces")
    
    # messy but effective:
    astro_order_virgo1 = factor(astro_order, levels = astro_order) %>% 
      relevel("Virgo") %>%
      levels()
    

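    Then you can replace the last step above with term = factor(term, levels = astro_order_virgo1).

    This approach of keeping the level order separate is nice because (a) it does not change if you reorder the data frame, and (b) it still works if your data frame is long and contains repeated entries of the factor levels.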
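
    Point (a) can be illustrated with a short base-R sketch. The data frame obs is a hypothetical example with shuffled, repeated rows; only astro_order comes from the answer above.

```r
# Desired order, kept in its own vector outside any data frame
astro_order <- c("Aries", "Taurus", "Gemini", "Cancer", "Leo", "Virgo",
                 "Libra", "Scorpio", "Sagittarius", "Capricorn", "Aquarius", "Pisces")

# Hypothetical data: shuffled rows with repeated signs
obs <- data.frame(star_sign = c("Pisces", "Leo", "Aries", "Leo", "Pisces"),
                  stringsAsFactors = FALSE)

# unique() depends on the current row order: levels are Pisces, Leo, Aries
by_appearance <- factor(obs$star_sign, levels = unique(obs$star_sign))

# The separate vector always gives the same 12 levels, whatever the row order
by_vector <- factor(obs$star_sign, levels = astro_order)

# Sorting by the factor puts the rows in zodiac order
sorted_signs <- obs$star_sign[order(by_vector)]
print(sorted_signs)  # "Aries" "Leo" "Leo" "Pisces" "Pisces"
```

    Levels that do not occur in the data (for example Taurus here) are simply kept as unused levels; droplevels() can remove them if they get in the way.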

    【Discussion】:

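      【Solution 2】:

      If I understand what you need, this is quite simple. Just add the following code at the end of your script. I would also encourage you to dig into dplyr or the tidyverse. Let me know if you have any questions :)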

      ## ADDED: 
      
      # Create an id column that records the original row order of df
      df$id <- 1:nrow(df)
      
      
      library(dplyr)
      # Perform a left_join (an inner or right join gives the same result in this case)
      df_results = left_join(df_results,df, by=c('factor_level'='star_sign_2'))
      df_results = df_results %>% arrange(id)
      
      # Optionally, select only the desired columns
      df_results = df_results %>% select(factor_level,coeff) 
      
      
      head(df_results)
      
       factor_level coeff
      1        Aries  -0.5
      2       Taurus  -0.4
      3       Gemini  -0.2
      4       Cancer  -0.3
      5          Leo   0.2
      6        Virgo   1.6
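
      As an alternative sketch that avoids the join, the rows can be reordered in base R with match() against a vector holding the desired order. The data frame res is a hypothetical four-row stand-in for df_results, and wanted is an assumed name for the order vector; neither appears in the original code.

```r
# Hypothetical stand-in for df_results (coefficients relative to Virgo)
res <- data.frame(factor_level = c("Virgo", "Aries", "Taurus", "Gemini"),
                  coeff = c(1.6, -0.5, -0.4, -0.2),
                  stringsAsFactors = FALSE)

# Desired display order, kept in its own vector
wanted <- c("Aries", "Taurus", "Gemini", "Virgo")

# match() gives each row's position in the desired order; order() sorts by it
res_sorted <- res[order(match(res$factor_level, wanted)), ]
rownames(res_sorted) <- NULL

print(res_sorted)
```

      If factor_level is stored as a factor rather than as character, wrapping it in as.character() before match() keeps the comparison explicit.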
      

      【Discussion】:
