Split a data frame into subsets based on a character value: answers

【Question title】: split dataframe into subsets based on character value
【Posted】: 2018-06-16 13:44:33
【Question description】:

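I want to run the same regression separately for each of 4 different socioeconomic levels of the respondents in my survey data.

For example: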

educational_level (of subset 1) = β0 + β1*educational_level_father + β2*race + ... + u

educational_level (of subset 2) = β0 + β1*educational_level_father + β2*race + ... + u

...and so on. How can I split a data.frame according to the values of one particular variable (column)?

【Question discussion】:

Possible duplicates: "fit model to multiple groupings or subsets and extract original factor columns for data frame output", "Splitting data and fitting distributions efficiently", "Fit a different model for each row of a list-columns data frame".
Please note that Stack Overflow (SO) is not a code-writing service but a question-and-answer site. Take some time to read the help pages, especially the sections named "What topics can I ask about here?" and "What types of questions should I avoid asking?". More importantly, read the Stack Overflow question checklist. You may also want to learn about Minimal, Complete, and Verifiable Examples.

Tags: r subset linear-regression logistic-regression

【Solution 1】:

One approach is to loop over the unique values in the subsetting column. Take a look at for and subset:

> data("iris")  ## A data set
> unique_species <- unique(iris$Species)  ## Get the unique values of the subsetting column
> results <- list()  ## Set up a list to store the regressions you will run within the loop
> for (species in unique_species) {  ## Loop over each unique value
+     data_subset <- subset(iris, iris$Species == species)  ## Subset based on the desired value
+     results[[species]] <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
+                                data=data_subset)  ## Run each regression
+ }

This produces:

> results
$setosa

Call:
lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
    data = data_subset)

Coefficients:
 (Intercept)   Sepal.Width  Petal.Length   Petal.Width  
      2.3519        0.6548        0.2376        0.2521  


$versicolor

Call:
lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
    data = data_subset)

Coefficients:
 (Intercept)   Sepal.Width  Petal.Length   Petal.Width  
      1.8955        0.3869        0.9083       -0.6792  


$virginica

Call:
lm(formula = Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width, 
    data = data_subset)

Coefficients:
 (Intercept)   Sepal.Width  Petal.Length   Petal.Width  
      0.6999        0.3303        0.9455       -0.1698

With only 4 levels, this should be reasonably efficient.
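Once the fitted models sit in a named list, comparing the groups mostly comes down to applying the same extractor to every element. The following is a minimal sketch of that idea, not part of the original answer: it rebuilds the `results` list from the loop above so that it runs on its own, and `coef_table` is just an illustrative name.

```r
data("iris")

# Rebuild the list of per-species models from the loop above
results <- list()
for (species in unique(iris$Species)) {
  data_subset <- subset(iris, iris$Species == species)
  results[[species]] <- lm(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
                           data = data_subset)
}

# Collect the coefficients side by side:
# one column per species, one row per model term
coef_table <- sapply(results, coef)
print(round(coef_table, 4))

# Full summary (standard errors, p-values, R-squared) for a single group
print(summary(results[["setosa"]]))
```

Indexing the list by group name (`results[["setosa"]]`) rather than by position keeps the code readable when the grouping column has more levels.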

【Discussion】:

【Solution 2】:

A base-R solution is:

dat.list <- split(x = YourData, f = as.factor(YourData$YourCharacter))
summary(lm(educ ~ educ_father, data=dat.list[[1]]))
summary(lm(educ ~ educ_father, data=dat.list[[2]]))
summary(lm(educ ~ educ_father, data=dat.list[[3]]))
summary(lm(educ ~ educ_father, data=dat.list[[4]]))

Alternatively, you can store the regression results in a list with a small for loop.
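A sketch of that list-based variant, under stated assumptions: the built-in `iris` data stands in for `YourData`, `Species` stands in for `YourCharacter`, and the formula and the name `fits` are purely illustrative.

```r
data("iris")  # stand-in for YourData; Species stands in for YourCharacter

# One data frame per level of the grouping column
dat.list <- split(x = iris, f = as.factor(iris$Species))

# Fit the same formula to every subset and keep the fits in a named list
fits <- list()
for (group in names(dat.list)) {
  fits[[group]] <- lm(Sepal.Length ~ Sepal.Width, data = dat.list[[group]])
}

# Print the coefficient table of each fit, labelled by group
for (group in names(fits)) {
  cat("\n==", group, "==\n")
  print(summary(fits[[group]])$coefficients)
}
```

The same list can be built without an explicit loop via `lapply(dat.list, function(d) lm(Sepal.Length ~ Sepal.Width, data = d))`; which form to use is largely a matter of taste.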

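If you are looking for a more efficient solution (i.e. you have big data), you should implement a nest-map-unnest workflow. My personal preference is to rely on the broom, purrr and dplyr packages, which are part of the tidyverse. You can check some code from this vignette. Other solutions are of course possible.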
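For orientation, here is a minimal sketch of what a nest-map-unnest workflow can look like. It is not the code from the vignette mentioned above. It assumes reasonably recent versions of dplyr, tidyr, purrr and broom are installed, and it uses `iris` with an illustrative one-predictor formula in place of the survey data; `coef_by_group` is a made-up name.

```r
library(dplyr)
library(tidyr)   # nest(), unnest()
library(purrr)   # map()
library(broom)   # tidy()

data("iris")  # stand-in data; Species plays the role of the grouping column

coef_by_group <- iris %>%
  group_by(Species) %>%
  nest() %>%                       # nest: one row per group, subset stored in list-column `data`
  mutate(
    model  = map(data, ~ lm(Sepal.Length ~ Sepal.Width, data = .x)),  # map: fit one model per group
    tidied = map(model, tidy)      # turn each fit into a data frame of coefficients
  ) %>%
  select(Species, tidied) %>%
  unnest(tidied) %>%               # unnest: back to one flat table
  ungroup()

print(coef_by_group)
```

The result is a single tibble with one row per group and model term, which is convenient for filtering or plotting coefficients across groups.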

【Discussion】: