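【Posted】: 2022-08-14 03:13:23
【Problem description】:
I have a large dataset (millions of observations), and I am using feols to fit a linear model. The model dropped a number of observations because of missing values. I have recovered the dropped row numbers from $obs_selection, but I cannot work out how to use the list that $obs_selection returns to remove those observations from the original dataset.
Ultimately, I want to remove the dropped observations and then join $residuals to the original data.
I initially tried this (shown in general form here; the specific version is in the code below):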
df[-object$obs_selection]
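But this produces the error "Error in -rows_to_delete : invalid argument to unary operator". The approach resembles the solution in the answer to this question (and gives the same error I get): How do you retrieve the estimation sample in R?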
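A minimal base-R sketch of what seems to trigger this error, assuming $obs_selection is a list holding a vector of negative row indices (the element name obsRemoved and the toy data frame are assumptions for illustration, not taken from my real data):

```r
# Hypothetical stand-ins: a 5-row data frame and a list of negative row indices
toy <- data.frame(id = 1:5, x = c(1.5, NA, 3.2, NA, 4.8))
dropped <- list(obsRemoved = c(-2L, -4L))

# Unary minus is not defined for a list, so this raises the same kind of error
msg <- tryCatch(-dropped, error = function(e) conditionMessage(e))
print(msg)

# Pulling the integer vector out of the list works. The indices are already
# negative, and the comma makes this a row subset rather than a column subset.
kept <- toy[dropped$obsRemoved, ]
print(kept$id)
```

If that assumption holds, the minus sign in my attempt is both unnecessary and applied to the wrong type, and the missing comma would have selected columns instead of rows.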
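In the sample data below, five observations are omitted from the model because of missing values. How would I use fake_lm$obs_selection to remove the dropped observations from my original dataset?
Thanks!
Data: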
fake_data <- structure(list(ExamType = c("A", "B", "C", "D", "E", "F", "G",
"A", "B", "C", "D", "E", "F", "G", "A", "B", "C", "D", "E", "F",
"G", "A", "B", "C", "D", "E", "F", "G", "A", "B"), ExamScore = c(1L,
2L, 2L, 3L, 1L, 4L, 4L, 5L, 2L, 1L, 4L, 3L, 2L, 5L, 1L, NA, 3L,
2L, 1L, 2L, 5L, 4L, 4L, 3L, 1L, 2L, 5L, 4L, 3L, 1L), State = c("CA",
"CA", "AL", "AK", "AK", "CA", "AL", "CO", "AL", "CA", "CA", "CA",
"CO", "CO", "AR", "AR", "AK", "CA", "CA", "CT", "AL", "CA", "AK",
"CA", "CA", "AL", "AR", "AR", "CA", "CT"), Male = c(1L, 1L, 0L,
0L, 1L, 0L, 0L, 0L, 1L, 1L, NA, 1L, 1L, 1L, 0L, 0L, 1L, 0L, 1L,
0L, 0L, 1L, 0L, 0L, 0L, 1L, 1L, 0L, 1L, 1L), White = c(1L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 1L, 1L, 0L, 0L, 1L, 0L, 0L, 0L), Black = c(0L,
1L, 0L, NA, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L,
0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L, 0L), Latinx = c(0L,
0L, 0L, 0L, 1L, 0L, NA, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L,
0L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 1L, 0L), X2.Race = c(0L,
0L, 0L, 1L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L,
0L, 0L, 0L, 0L, 0L, 0L, NA, 0L, 0L, 0L, 0L, 0L, 0L)), row.names = c(NA,
30L), class = "data.frame")
Code:
library(fixest)
fake_lm <- feols(ExamScore ~ Male + White + Black + Latinx + X2.Race | State, fake_data)
summary(fake_lm)
#These are the dropped observations
rows_to_delete <- fake_lm$obs_selection
# I would like to clean them from my dataset (fake_data), but this
# generates the error
fake_data[-rows_to_delete]
# Ultimately, once the original dataset only contains those used in the model, I'll add
# residuals as a column in my dataset
fake_data$resid <- fake_lm$residuals
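For reference, a hedged sketch of the end result I am after, written against a small made-up data frame so it runs on its own (toy, m, and est_sample are hypothetical names). It assumes that $obs_selection is a list of integer index vectors meant to be applied one after another to the original rows, and that resid() on a fixest model accepts na.rm = FALSE to return a vector as long as the original data. Both assumptions should be checked against the installed fixest version.

```r
library(fixest)

# Hypothetical toy data: rows 3 and 4 contain missing values
toy <- data.frame(
  y = c(1, 2, NA, 4, 5, 6, 7, 8),
  x = c(0.5, 1.1, 0.3, NA, 2.2, 1.9, 3.1, 2.7),
  g = c("a", "a", "b", "b", "a", "b", "a", "b")
)
m <- feols(y ~ x | g, toy)

# Option 1: rebuild the estimation sample by applying each index vector
# in obs_selection in turn, then attach the residuals
est_sample <- toy
for (idx in m$obs_selection) {
  est_sample <- est_sample[idx, , drop = FALSE]
}
est_sample$resid <- as.numeric(resid(m))

# Option 2: keep every original row and let dropped rows get NA residuals
toy$resid <- as.numeric(resid(m, na.rm = FALSE))
```

With the data in this question, the same pattern would use fake_data in place of toy and fake_lm in place of m. Option 2 avoids subsetting a dataset with millions of rows, at the cost of carrying NA residuals for the dropped observations.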