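【Posted】: 2026-02-13 14:50:01
【Question】:
I have a large dataset with K > 1 million observations and 2×N columns. N of the columns are the X's and the other N are the Y's. N is relatively small, N << K.
I want to run a regression of the form Y_i = beta * X_i for each observation, where i = 1, ..., N. So I need to run K regressions and get K betas: beta_1, ..., beta_K. I need to do this in a more or less efficient way.
Here is the most primitive code I could come up with.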
library(data.table)
library(tibble)
library(magrittr)  # provides %>%

set.seed(10)
# Data looks like this
K = 15 # number of observations, in reality K is much larger say K = 100,000
# here implicitly N = 5, but it can vary, so if you can suggest a code that uses N as parameter it will be helpful.
data =
  tibble(X1 = rnorm(K,0,1),
         X2 = rnorm(K,0,1),
         X3 = rnorm(K,0,1),
         X4 = rnorm(K,0,1),
         X5 = rnorm(K,0,1),
         Y1 = rnorm(K,0,1),
         Y2 = rnorm(K,0,1),
         Y3 = rnorm(K,0,1),
         Y4 = rnorm(K,0,1),
         Y5 = rnorm(K,0,1)) %>%
  as.data.table()
# Let's introduce some missing values
data[3,5] = NA
data[5,2] = NA
data[7,4] = NA
data[7,5] = NA
# A dumb loop code will be
beta = numeric(K)  # one slope per row
for(i in 1:K){
  # row i: columns 1:5 are the X's, columns 6:10 are the Y's
  reg_data = tibble(x = data[i,1:5] %>% t(),
                    y = data[i,6:10] %>% t() )
  colnames(reg_data) = c("x","y")
  # no-intercept fit; lm() drops incomplete (x, y) pairs by default
  beta[i] = lm(y ~ x + 0, reg_data)$coefficients
}
data$beta = beta
data
Data:
X1 X2 X3 X4 X5 Y1 Y2 Y3 Y4 Y5 beta
1: 0.01874617 0.08934727 -1.85374045 -0.02881534 -1.2375945 1.17270628 -0.41635467 0.56317466 -0.48136561 -0.98306919 0.03442037
2: -0.18425254 -0.95494386 -0.07794607 0.23252515 -0.4561763 -1.47982702 -0.19148234 0.66098669 0.20288178 0.49533171 0.18549600
3: -1.37133055 -0.19515038 0.96856634 -0.30120868 NA -0.43038782 0.06954478 -1.65805086 -0.03173974 0.72581750 -0.34597018
4: -0.59916772 0.92552126 0.18492596 -0.67761458 0.3401156 -1.05163864 1.15534832 1.02816798 -1.19558030 0.66729873 1.60396614
5: 0.29454513 NA -1.37994358 0.65522764 1.0663764 1.52258634 0.59495735 1.12795361 0.62368124 0.95478644 0.08960632
6: 0.38979430 -0.59631064 -1.43551436 -0.40063755 1.2161258 0.59282805 -1.41964511 -1.28015460 -0.91480448 -1.67533218 0.29574377
7: -1.20807618 -2.18528684 0.36208723 NA NA -0.22266151 -1.60667725 1.12886823 0.24875801 -1.20518539 0.65799077
8: -0.36367602 -0.67486594 -1.75908675 1.36795395 -0.4812086 0.71289428 0.89292590 -0.46413453 -1.06262279 -1.96325249 -0.09581759
9: -1.62667268 -2.11906119 -0.32454401 2.13776710 0.5627448 0.71660083 0.14816796 -0.31576021 -0.36398225 1.47075231 -0.10946286
10: -0.25647839 -1.26519802 -0.65156299 0.50581926 -1.2463197 0.44024186 1.22702839 0.92429315 -1.20699485 0.37247234 -0.85696018
11: 1.10177950 -0.37366156 1.08655140 0.78634238 0.3809222 0.15883062 -0.76180434 0.07714472 1.42921278 1.06587933 0.62874846
12: 0.75578151 -0.68755543 -0.76254488 -0.90221194 -1.4304273 0.65976414 0.41937541 1.03992361 0.63343589 0.53064987 -0.42653775
13: -0.23823356 -0.87215883 -0.82866254 0.53289699 -1.0484455 2.22051966 -1.03994336 0.74188621 -1.99681562 0.10198345 -0.48758134
14: 0.98744470 -0.10176101 0.83447390 -0.64589425 -0.2185036 -1.18394507 0.71157397 1.25554486 -0.68183217 1.33778247 -0.02128415
15: 0.74139013 -0.25378053 -0.96765199 0.29098749 -1.4899362 -0.07395583 -0.63321301 0.95091897 -0.46005548 0.08723477 -0.27967233
What is an efficient data.table way to do this?
【Comments】:
-
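I don't think data.table will save you much more time. The expensive part is the fitting. One suggestion is to use lm.fit instead of lm, since you only need the coefficients.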
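A minimal sketch of what the `lm.fit` suggestion could look like, assuming the X's and Y's are held in two plain numeric matrices. The helper name `fit_row_beta` and the toy matrices `X` and `Y` are hypothetical and not from the thread. Because `lm.fit` does no NA handling of its own, the sketch drops incomplete (x, y) pairs before fitting.

```r
# Hypothetical helper: fit y = beta * x (no intercept) for one row with lm.fit().
# lm.fit() does not handle NA values, so incomplete (x, y) pairs are dropped first.
fit_row_beta <- function(x, y) {
  ok <- complete.cases(x, y)
  if (!any(ok)) return(NA_real_)  # nothing left to fit
  unname(lm.fit(x = matrix(x[ok], ncol = 1), y = y[ok])$coefficients)
}

# Toy data (an assumption for illustration): K rows, N X-columns, N Y-columns
set.seed(1)
N <- 5
K <- 15
X <- matrix(rnorm(K * N), nrow = K)
Y <- matrix(rnorm(K * N), nrow = K)
X[3, 5] <- NA  # one missing value, as in the question

# One slope per row
beta <- vapply(seq_len(K), function(i) fit_row_beta(X[i, ], Y[i, ]), numeric(1))
```

This still makes one call per row, so it mainly removes the formula and model-frame overhead of `lm`; it does not remove the loop itself.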
-
Thank you, @StupidWolf. Yes, for N=10 I ended up writing something like ``` find_beta % .[, beta := find_beta(B1,B2, B3, B4 , B5, B6,B7, B8, B9, B10, V1, V2, V3,V4, V5, V6,V7, V8, V9, V10), by = seq_len(nrow(data)) ]```, which is odd, but it works unless some values are missing.
-
Oh, right, it can't handle missing values. Do you have many of them?
-
You should be able to do lm.fit(y = data[i, 6:10], x = data[i, 1:5])
-
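@StupidWolf Thank you for pointing out the lm.fit function. I think the solution given below, which essentially computes beta by hand, is a significant improvement.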
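The answer referred to here is not part of this excerpt, so the following is only a sketch of what computing beta by hand might look like, not that answer's code. For a single regressor without an intercept, the least-squares slope for one row is sum(x * y) / sum(x^2), taken over the pairs where both values are present, so all K slopes can be computed with two `rowSums` calls. The helper name `rowwise_beta` and the toy matrices are hypothetical.

```r
# Hypothetical helper: row-wise no-intercept slopes via the closed form
# beta = sum(x * y) / sum(x^2), using only pairs where both x and y are observed.
rowwise_beta <- function(X, Y) {
  X <- as.matrix(X)
  Y <- as.matrix(Y)
  ok <- !is.na(X) & !is.na(Y)
  X[!ok] <- 0  # zeroed entries drop out of both sums
  Y[!ok] <- 0
  rowSums(X * Y) / rowSums(X^2)
}

# Toy data (an assumption for illustration): K rows, N X-columns, N Y-columns
set.seed(1)
N <- 5
K <- 15
X <- matrix(rnorm(K * N), nrow = K)
Y <- matrix(rnorm(K * N), nrow = K)
X[3, 5] <- NA  # missing X
Y[7, 2] <- NA  # missing Y

beta <- rowwise_beta(X, Y)
```

With a data.table like the one in the question, and assuming the columns are named X1..XN and Y1..YN, the column sets can be built from N with `x_cols <- paste0("X", seq_len(N))` and `y_cols <- paste0("Y", seq_len(N))`, and the result assigned with `data[, beta := rowwise_beta(data[, x_cols, with = FALSE], data[, y_cols, with = FALSE])]`. A row with no complete pair yields NaN rather than an error.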
Tags: r data.table