Subset a data frame by many factors at the same time in R: answers

【Question title】: Subset dataframe by many factors at the same time in R
【Posted】: 2019-01-31 12:40:31
【Question】:

I have a data frame with several variables: region, season, year, altitude and response (here is an example):

region   season   year   altitud   response
IT       wint     2013   800       45
IT       wint     2013   815       47
IT       wint     2013   840       54
IT       wint     2014   800       49
IT       wint     2014   815       59

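and so on. There are three regions, four seasons and two years. I would like to fit and plot several linear models relating altitude and response, subsetting the data by every possible combination, i.e.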

subset(region&season&year)   and get  altitud~response
IT&wint&2013
IT&wint&2014
IT&spring&2013
IT&spring&2014

and so on, for 24 combinations in total. Any ideas?

Thank you very much

David

【Comments】:

I would look at the "many models" approach with purrr.
You can use split() to get a list of your subsets, and then lapply().

Tags: r dataframe subset factors

【Solution 1】:

My solution uses dplyr together with the tidy() function from broom.

Reading the data:

library(readr)

data <- read_table("region   season   year   altitud   response
IT       wint     2013   800       45
IT       wint     2013   815       47
IT       wint     2013   840       54
IT       wint     2014   800       49
IT       wint     2014   815       59")

The actual solution:

library(dplyr)
library(broom)
data_fit <- data %>%
    group_by(region, season, year) %>%
    do(fit = lm(altitud ~ response, data = .))

dfCoefs <- tidy(data_fit, fit)
dfCoefs

This gives the following regression coefficients for the sample data:

# A tibble: 4 x 8
# Groups:   region, season, year [2]
  region season  year term        estimate std.error statistic  p.value
  <chr>  <chr>  <dbl> <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 IT     wint    2013 (Intercept)   613.      34.7       17.7    0.0360
2 IT     wint    2013 response        4.22     0.711      5.93   0.106 
3 IT     wint    2014 (Intercept)   726.     NaN        NaN    NaN     
4 IT     wint    2014 response        1.5    NaN        NaN    NaN
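
As far as I know, newer broom releases no longer provide the tidy() method for the rowwise data frames that do() returns, so tidy(data_fit, fit) may fail there. A minimal sketch of an equivalent that should work with current dplyr and broom, assuming dplyr >= 0.8.1 for group_modify(), is shown below. It uses the same formula as above (altitud ~ response) and rebuilds the sample data inline so that it is self-contained.

```r
library(dplyr)
library(broom)

# Sample data from the question, rebuilt inline
data <- tibble(
  region   = "IT",
  season   = "wint",
  year     = c(2013, 2013, 2013, 2014, 2014),
  altitud  = c(800, 815, 840, 800, 815),
  response = c(45, 47, 54, 49, 59)
)

# Fit one model per group and return its coefficient table;
# .x is the subset of rows belonging to the current group
dfCoefs <- data %>%
  group_by(region, season, year) %>%
  group_modify(~ tidy(lm(altitud ~ response, data = .x))) %>%
  ungroup()

dfCoefs
```

The 2014 group has only two rows, so the fit is exact and the standard errors come out as NaN, as in the output above.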

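However, do you want altitud ~ response (i.e. predicting altitude from the response) or response ~ altitud (predicting the response from altitude)?

【Discussion】:

Thank you very much. I am quite new to R programming, so I had written some less elegant code (using split and the tidy function). Anyway, I will try yours. Response is Y and altitude is X.
No problem, I hope it goes well. Remember to write the regression formula as Y ~ X (in this case response ~ altitud). :) You can read the documentation of the formula function for more information.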
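
To illustrate why the direction of the formula matters, here is a small sketch that fits both directions on the three IT / wint / 2013 rows from the question. The object names (d, fit_yx, fit_xy) are only illustrative.

```r
# The three IT / wint / 2013 rows from the question
d <- data.frame(
  altitud  = c(800, 815, 840),
  response = c(45, 47, 54)
)

fit_yx <- lm(response ~ altitud, data = d)  # response is Y, altitude is X
fit_xy <- lm(altitud ~ response, data = d)  # altitude is Y, response is X

# The two slopes differ: change in response per unit of altitude
# versus change in altitude per unit of response
slope_yx <- coef(fit_yx)[["altitud"]]
slope_xy <- coef(fit_xy)[["response"]]
c(response_on_altitud = slope_yx, altitud_on_response = slope_xy)
```

The first slope corresponds to the 0.231 reported for altitud in the other answers, the second to the 4.22 reported for response in the table above.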

【Solution 2】:

I hope I understood the question correctly. Here is a purrr solution:

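library(purrr)
library(dplyr)
library(tidyr)   # nest() and unnest()
library(broom)   # tidy(), used further below

# Convert character columns to factors, then nest the data per group
nested <- df %>%
  mutate_if(is.character, as.factor) %>%
  group_by(year, season, region) %>%
  nest()

# Fit response (Y) against altitud (X) for one group's data;
# this direction matches the output shown below
my_model <- function(df) {
  lm(response ~ altitud, data = df)
}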

nested %>% 
  mutate(Mod=map(data,my_model))

Result (the data was modified slightly to get more than one factor level):

# A tibble: 3 x 5
   year season region data             Mod     
  <int> <fct>  <fct>  <list>           <list>  
1  2013 wint   IT     <tibble [3 x 2]> <S3: lm>
2  2014 wint   IT     <tibble [1 x 2]> <S3: lm>
3  2014 Summer IF     <tibble [1 x 2]> <S3: lm>

Use modelr to get predictions. You can use broom to get the statistics, as shown in the other answer.

require(modelr)
nested %>% 
  mutate(Mod=map(data,my_model)) %>% 
  mutate(Preds=map2(data,Mod,add_predictions)) %>% 
  unnest(Preds)
# A tibble: 5 x 6
   year season region altitud response  pred
  <int> <fct>  <fct>    <int>    <int> <dbl>
1  2013 wint   IT         800       45  44.4
2  2013 wint   IT         815       47  47.9
3  2013 wint   IT         840       54  53.7
4  2014 wint   IT         800       49  49  
5  2014 Summer IF         815       59  59

Use broom and purrr to get summary statistics:

nested %>% 
  mutate(Mod=map(data,my_model)) %>% 
  mutate(Preds=map2(data,Mod,add_predictions),Tidy=map(Mod,tidy)) %>% 
  unnest(Tidy)

# A tibble: 4 x 8
   year season region term        estimate std.error statistic p.value
  <int> <fct>  <fct>  <chr>          <dbl>     <dbl>     <dbl>   <dbl>
1  2013 wint   IT     (Intercept) -140.      31.8        -4.40   0.142
2  2013 wint   IT     altitud        0.231    0.0389      5.93   0.106
3  2014 wint   IT     (Intercept)   49      NaN         NaN    NaN    
4  2014 Summer IF     (Intercept)   59      NaN         NaN    NaN

Data:

df<-read.table(text="region   season   year   altitud   response
IT       wint     2013   800       45
               IT       wint     2013   815       47
               IT       wint     2013   840       54
               IT       wint     2014   800       49
               IF       Summer     2014   815       59",header=T)


【Solution 3】:

For the sake of completeness, here are also solutions in base R and data.table.

Base R

One possible base R approach uses split() and lapply(), as suggested by Jogo:

result <- lapply(split(DT, list(DT$region, DT$season, DT$year)), 
                 lm, formula = response ~ altitud)
print(result)

$IT.wint.2013

Call:
FUN(formula = ..1, data = X[[i]])

Coefficients:
(Intercept)      altitud  
  -140.0510       0.2306  


$IT.wint.2014

Call:
FUN(formula = ..1, data = X[[i]])

Coefficients:
(Intercept)      altitud  
  -484.3333       0.6667

Or, using a pipe to improve readability:

library(magrittr)
result <- split(DT, list(DT$region, DT$season, DT$year)) %>% 
  lapply(lm, formula = response ~ altitud)
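
With real data, some combinations of region, season and year may have no rows, in which case lm() stops with an error on the empty subset. A minimal sketch that guards against this with drop = TRUE in split() and then collects the coefficients of all fitted models into one matrix is shown below. The sample data is rebuilt inline as a plain data frame, and the names groups and coef_table are only illustrative.

```r
# Sample data from the question as a plain data frame
DT <- data.frame(
  region   = "IT",
  season   = "wint",
  year     = c(2013, 2013, 2013, 2014, 2014),
  altitud  = c(800, 815, 840, 800, 815),
  response = c(45, 47, 54, 49, 59)
)

# drop = TRUE discards combinations of the factors that do not occur
groups <- split(DT, list(DT$region, DT$season, DT$year), drop = TRUE)

# One model per subset, named after the combination, e.g. "IT.wint.2013"
result <- lapply(groups, function(d) lm(response ~ altitud, data = d))

# One row per group, one column per coefficient
coef_table <- t(sapply(result, coef))
coef_table
```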

data.table

With the help of broom:

library(data.table)
library(magrittr)
setDT(DT)[, lm(response ~ altitud, .SD) %>% broom::tidy(), by = .(region, season, year)]

   region season year        term     estimate   std.error statistic   p.value
1:     IT   wint 2013 (Intercept) -140.0510204 31.82553603 -4.400586 0.1422513
2:     IT   wint 2013     altitud    0.2306122  0.03888277  5.930962 0.1063382
3:     IT   wint 2014 (Intercept) -484.3333333         NaN       NaN       NaN
4:     IT   wint 2014     altitud    0.6666667         NaN       NaN       NaN

setDT(DT)[, lm(response ~ altitud, .SD) %>% broom::glance(), by = .(region, season, year)]

   region season year r.squared adj.r.squared    sigma statistic   p.value df    logLik      AIC    BIC deviance df.residual
1:     IT   wint 2013 0.9723576     0.9447152 1.111168  35.17631 0.1063382  2 -2.925132 11.85026 9.1461 1.234694           1
2:     IT   wint 2014 1.0000000           NaN      NaN       NaN       NaN  2       Inf     -Inf   -Inf 0.000000           0

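If computing lm() for the different groups is time-consuming, it may be worthwhile to store the fitted models and reuse them in subsequent processing steps:

mod <- setDT(DT)[, .(models = .(lm(response ~ altitud, .SD))), by = .(region, season, year)]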
mod

   region season year models
1:     IT   wint 2013   <lm>
2:     IT   wint 2014   <lm>

mod$models is a list of models equivalent to result.

Now we can extract the required information from the fitted models, e.g.,

mod[, models[[1]] %>% broom::tidy(), by = .(region, season, year)]

   region season year        term     estimate   std.error statistic   p.value
1:     IT   wint 2013 (Intercept) -140.0510204 31.82553603 -4.400586 0.1422513
2:     IT   wint 2013     altitud    0.2306122  0.03888277  5.930962 0.1063382
3:     IT   wint 2014 (Intercept) -484.3333333         NaN       NaN       NaN
4:     IT   wint 2014     altitud    0.6666667         NaN       NaN       NaN

Data

library(data.table)
DT <- fread("
region   season   year   altitud   response
IT       wint     2013   800       45
IT       wint     2013   815       47
IT       wint     2013   840       54
IT       wint     2014   800       49
IT       wint     2014   815       59")
