【Question title】: Running regression for each line of data.table
【Posted】: 2026-02-13 14:50:01
【Question description】:

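I have a large dataset with K > 1 million observations and 2×N columns. N of the columns are the X's and the other N are the Y's. N is comparatively small, N << K.

For each observation I want to run a regression of the form Y_i = beta * X_i, where i = 1,...,N. So I need to run K regressions and get K betas: beta_1,...,beta_K. I need to do this in a reasonably efficient way.

Here is the most primitive code I could come up with.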

library(data.table)
library(tibble)
library(magrittr)  # provides %>%

set.seed(10)
# Data looks like this
K = 15 # number of observations; in reality K is much larger, say K = 100,000
# Here N = 5 implicitly, but it can vary, so code that takes N as a parameter would be helpful.
data = 
  tibble(X1 = rnorm(K,0,1),
       X2 = rnorm(K,0,1),
       X3 = rnorm(K,0,1),
       X4 = rnorm(K,0,1),
       X5 = rnorm(K,0,1),
       Y1 = rnorm(K,0,1),
       Y2 = rnorm(K,0,1),
       Y3 = rnorm(K,0,1),
       Y4 = rnorm(K,0,1),
       Y5 = rnorm(K,0,1)) %>%
  as.data.table()
# Let's introduce some missing values
data[3,5] = NA
data[5,2] = NA
data[7,4] = NA
data[7,5] = NA
# A dumb loop code will be
beta = 1:K
for(i in 1:K){
  reg_data = tibble(x = data[i,1:5] %>% t(),
         y = data[i,6:10] %>% t() )
  colnames(reg_data) = c("x","y")
  beta[i] = lm(y ~ x + 0, reg_data)$coefficients  
}
data$beta = beta
data

Data:

             X1          X2          X3          X4         X5          Y1          Y2          Y3          Y4          Y5        beta
 1:  0.01874617  0.08934727 -1.85374045 -0.02881534 -1.2375945  1.17270628 -0.41635467  0.56317466 -0.48136561 -0.98306919  0.03442037
 2: -0.18425254 -0.95494386 -0.07794607  0.23252515 -0.4561763 -1.47982702 -0.19148234  0.66098669  0.20288178  0.49533171  0.18549600
 3: -1.37133055 -0.19515038  0.96856634 -0.30120868         NA -0.43038782  0.06954478 -1.65805086 -0.03173974  0.72581750 -0.34597018
 4: -0.59916772  0.92552126  0.18492596 -0.67761458  0.3401156 -1.05163864  1.15534832  1.02816798 -1.19558030  0.66729873  1.60396614
 5:  0.29454513          NA -1.37994358  0.65522764  1.0663764  1.52258634  0.59495735  1.12795361  0.62368124  0.95478644  0.08960632
 6:  0.38979430 -0.59631064 -1.43551436 -0.40063755  1.2161258  0.59282805 -1.41964511 -1.28015460 -0.91480448 -1.67533218  0.29574377
 7: -1.20807618 -2.18528684  0.36208723          NA         NA -0.22266151 -1.60667725  1.12886823  0.24875801 -1.20518539  0.65799077
 8: -0.36367602 -0.67486594 -1.75908675  1.36795395 -0.4812086  0.71289428  0.89292590 -0.46413453 -1.06262279 -1.96325249 -0.09581759
 9: -1.62667268 -2.11906119 -0.32454401  2.13776710  0.5627448  0.71660083  0.14816796 -0.31576021 -0.36398225  1.47075231 -0.10946286
10: -0.25647839 -1.26519802 -0.65156299  0.50581926 -1.2463197  0.44024186  1.22702839  0.92429315 -1.20699485  0.37247234 -0.85696018
11:  1.10177950 -0.37366156  1.08655140  0.78634238  0.3809222  0.15883062 -0.76180434  0.07714472  1.42921278  1.06587933  0.62874846
12:  0.75578151 -0.68755543 -0.76254488 -0.90221194 -1.4304273  0.65976414  0.41937541  1.03992361  0.63343589  0.53064987 -0.42653775
13: -0.23823356 -0.87215883 -0.82866254  0.53289699 -1.0484455  2.22051966 -1.03994336  0.74188621 -1.99681562  0.10198345 -0.48758134
14:  0.98744470 -0.10176101  0.83447390 -0.64589425 -0.2185036 -1.18394507  0.71157397  1.25554486 -0.68183217  1.33778247 -0.02128415
15:  0.74139013 -0.25378053 -0.96765199  0.29098749 -1.4899362 -0.07395583 -0.63321301  0.95091897 -0.46005548  0.08723477 -0.27967233


What is an efficient data.table way to do this?

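【Comments】:

  • I don't think data.table will save you much more time. The expensive part is the model fitting. One suggestion is to use lm.fit instead of lm, since you only need the coefficients.
  • Thank you, @StupidWolf. Yes, for N=10 I ended up writing something like ``` find_beta % .[, beta := find_beta(B1,B2, B3, B4 , B5, B6,B7, B8, B9, B10, V1, V2, V3,V4, V5, V6,V7, V8, V9, V10), by = seq_len(nrow(data)) ]```, which is odd, but it works unless some values are missing.
  • Oh yes, it can't handle missing values. Do you have a lot of them?
  • You should be able to do lm.fit(y=data[i,6:10] ,x= data[i,1:5])
  • @StupidWolf Thank you for pointing out the lm.fit function. I think the solution given below, which basically computes beta by hand, is a significant improvement.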
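
As a rough illustration of the lm.fit suggestion in the comments above, here is a minimal sketch in base R. The helper name `row_beta_lmfit` is hypothetical (it does not appear in the thread). It assumes the x and y values of one row are passed as plain numeric vectors, and it drops incomplete pairs first because lm.fit does not accept NA values.

```r
# Hypothetical helper (not from the thread): slope of y ~ x + 0 for a single row.
row_beta_lmfit <- function(x, y) {
  ok <- !is.na(x) & !is.na(y)                 # keep complete (x, y) pairs only
  fit <- lm.fit(x = matrix(x[ok], ncol = 1),  # lm.fit expects a design matrix
                y = y[ok])
  unname(fit$coefficients[1])
}

# Small made-up example with one missing x value
set.seed(10)
N <- 5
x <- rnorm(N)
y <- rnorm(N)
x[2] <- NA

beta_fit <- row_beta_lmfit(x, y)
beta_lm  <- unname(coef(lm(y ~ x + 0)))       # lm() drops the NA pair by default
```

For row i of the question's table, a call would look like `row_beta_lmfit(unlist(data[i, 1:5]), unlist(data[i, 6:10]))`. This only removes the formula-handling overhead of lm; it still fits one model per row.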

Tags: r data.table


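【Solution 1】:

Here is an option that converts the data to long format and then computes the linear regression coefficients by group, using the equations found here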

DT <- melt(data[, rn := .I], id.vars="rn", measure.vars=patterns(c("X", "Y")),
    na.rm=TRUE, value.name=c("X","Y"))[order(rn)]

DT[, {
        sumx <- sum(X)
        sumy <- sum(Y)
        sumxsq <- sum(X^2)
        sumxy <- sum(X*Y)
        b <- (sumxy - sumx * sumy / .N) / (sumxsq - sumx^2 / .N)
        .(`(Intercept)`=sumy / .N - b * sumx / .N, X=b)
    }, 
    rn]

#for comparison
#DT[, as.list(lm(Y ~ X, .SD)$coefficients), rn]
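
For reference, the quantities computed inside the grouped expression above appear to be the usual closed-form least-squares estimates for a single predictor with an intercept. In the notation below (introduced here for illustration), n stands for `.N`, the number of (X, Y) pairs left for a given `rn` after melting, b is the slope reported as `X`, and a is the value reported as `(Intercept)`:

```latex
\[
b = \frac{\sum_{j} x_j y_j - \frac{1}{n}\left(\sum_{j} x_j\right)\left(\sum_{j} y_j\right)}
         {\sum_{j} x_j^2 - \frac{1}{n}\left(\sum_{j} x_j\right)^2},
\qquad
a = \frac{1}{n}\sum_{j} y_j - b \cdot \frac{1}{n}\sum_{j} x_j .
\]
```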

Output:

    rn (Intercept)           X
 1:  1 -0.01297769  0.02656658
 2:  2 -0.01363072  0.16932027
 3:  3 -0.63389115 -0.53933759
 4:  4  0.06518737  1.59775760
 5:  5  1.07353565 -0.10238037
 6:  6 -0.92042107  0.11494024
 7:  7  0.83134259  1.05384614
 8:  8 -0.47319334 -0.25212270
 9:  9  0.31078353 -0.07436378
10: 10 -0.26208273 -1.05275512
11: 11  0.04097118  0.59169896
12: 12  0.67359486  0.02802331
13: 13 -0.40133349 -0.82876842
14: 14  0.31281032 -0.14598437
15: 15 -0.14023767 -0.34075091
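
Note that this answer fits an intercept (Y ~ X), while the loop in the question fits y ~ x + 0, which is why the X column here differs from the beta column in the question. Without an intercept the estimate reduces to sum(x*y) / sum(x^2) over the complete pairs, so it can be computed for all rows at once with matrix operations. The following is a minimal base R sketch under that assumption. The helper name `row_betas_no_intercept` and the arguments `x_cols` and `y_cols` are hypothetical, chosen so that N is a parameter.

```r
# Hypothetical helper (not from the original answer): no-intercept slopes for all rows.
# x_cols / y_cols are the names of the paired X and Y columns, in matching order.
row_betas_no_intercept <- function(data, x_cols, y_cols) {
  X <- as.matrix(as.data.frame(data)[x_cols])
  Y <- as.matrix(as.data.frame(data)[y_cols])
  ok <- !is.na(X) & !is.na(Y)   # complete (X, Y) pairs
  X[!ok] <- 0                   # zeroed pairs contribute nothing to either sum
  Y[!ok] <- 0
  rowSums(X * Y) / rowSums(X^2)
}

# Made-up data shaped like the question's (values will not match the post)
set.seed(10)
K <- 15
N <- 5
x_cols <- paste0("X", seq_len(N))
y_cols <- paste0("Y", seq_len(N))
df <- as.data.frame(matrix(rnorm(K * 2 * N), nrow = K,
                           dimnames = list(NULL, c(x_cols, y_cols))))
df[3, 5] <- NA
df[5, 2] <- NA
df[7, 4] <- NA
df[7, 5] <- NA

beta_fast <- row_betas_no_intercept(df, x_cols, y_cols)

# Cross-check against lm() row by row, as in the question's loop
beta_lm <- vapply(seq_len(K), function(i) {
  x <- unlist(df[i, x_cols])
  y <- unlist(df[i, y_cols])
  unname(coef(lm(y ~ x + 0)))
}, numeric(1))
```

Because the helper converts its input with as.data.frame, it should also accept a data.table, and the result can be assigned back as a new column. A row with no complete pairs would yield NaN rather than an error.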

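【Comments】:

  • Splitting the expression into separate grouped steps might be faster, so that at least part of it uses GForce-optimized grouping.
  • Yes, agreed. The reason I didn't do that here is that each X and Y is used only once, so I'm not sure how much difference it would make to pull out all the sumx and sumy and use GForce.
  • Thank you, @chinsoon12. This is a very elegant answer. Here is a performance comparison for K=10000: my initial dumb code takes 13.7 seconds, DT[, as.list(lm(Y ~ X, .SD)$coefficients), rn] takes 4.5 seconds, and your "manual" coefficient calculation takes 0.037 seconds. That is almost a 500x improvement.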