替代R中的递归for循环答案

【问题标题】：Alternate to recursive for-loop in R替代R中的递归for循环
【发布时间】：2017-08-04 18:24:31
【问题描述】：

我对 R 比较陌生。

我有一个数据框 test，看起来像这样（纯文本只有 1 个变量 X1，但最多可以有 2000 万行）：

DP  - 2017 Jan 01
TI  - Case Report of Severe Antithrombin Deficiency During Extracorporeal Membrane
      Oxygenation and Therapeutic Plasma Exchange for Double Lung Transplantation.
PG  - 11-13
LID - 10.1213/XAA.0000000000000412 [doi]
AB  - Acquired antithrombin (AT) deficiency is not uncommon in cardiothoracic surgery
      because of heparin exposure and dilutional or consumptive losses. We report a
      case of acquired AT deficiency and resultant multiple deep vein thrombosis in a
      patient with pulmonary fibrosis on veno-venous extracorporeal membrane
AD  - From the Departments of *Anesthesiology and daggerCardiothoracic Surgery,
      University of Maryland, Baltimore, Maryland.
JT  - Saudi journal of kidney diseases and transplantation : an official publication of
      the Saudi Center for Organ Transplantation, Saudi Arabia
JID - 9436968

我想使用前面的标签为没有的行（也就是开头有 3 个空格）重新创建“标签”。但是，我只需要为 TI 和 JT 重新创建标签，因为这些将是我最终需要提取的唯一行。

所以基本上，我生成的数据框应该如下所示：

DP  - 2017 Jan 01
TI  - Case Report of Severe Antithrombin Deficiency During Extracorporeal Membrane
TI  - Oxygenation and Therapeutic Plasma Exchange for Double Lung Transplantation.
PG  - 11-13
LID - 10.1213/XAA.0000000000000412 [doi]
AB  - Acquired antithrombin (AT) deficiency is not uncommon in cardiothoracic surgery
      because of heparin exposure and dilutional or consumptive losses. We report a
      case of acquired AT deficiency and resultant multiple deep vein thrombosis in a
      patient with pulmonary fibrosis on veno-venous extracorporeal membrane
AD  - From the Departments of *Anesthesiology and daggerCardiothoracic Surgery,
      University of Maryland, Baltimore, Maryland.
JT  - Saudi journal of kidney diseases and transplantation : an official publication of
JT  - the Saudi Center for Organ Transplantation, Saudi Arabia
JID - 9436968

在没有“标签”的行前面有 3 个空格，所以这是我当前的代码：

for (n in 1:nrow(test))
{
  if (substr(test$X1[n], 1, 3) == "   " && (substr(test$X1[n-1], 1, 2) == "TI" || substr(test$X1[n-1], 1, 2) == "JT"))
  {
    if (n > 1)
    {
      subs <- substr(test$X1[[n-1]], 1, 6)
    }
    subs <- substr(test$X1[[n-1]], 1, 6)
    test$X1[n] <- sub("      ", subs, test$X1[n])
  }
}

我当前的解决方案有效，但要在超过 2000 万行的文本上运行需要很长时间。请告知，因为我需要在多个大文件上运行此脚本。

谢谢。

【问题讨论】：

第一个问：AB 和AD 从操作中排除了什么？第二问：需要操作后的数据顺序一致吗？ dput(head(df,8)) 在这里会有所帮助
我最终将重塑数据，使 TI 和 JT 成为变量名。我不需要AB 和AD，所以没有必要对它们执行它。是的，顺序也是一样的。
所以我从文件中提取了随机行，所以它根本不匹配，但代码输出如下：structure(list(X1 = c("STAT- MEDLINE", "IP - 23", "JT - The New England journal of medicine", "CIN - N Engl J Med. 2016 Dec 8;375(23 ):2286-2289. PMID: 27959676", "CIN - N Engl J Med. ;376(7):e11. PMID: 28207208", "CIN - N Engl J Med. ;376(7):e11. PMID: 28207209", "CIN - N Engl J Med. 2017 Feb 16;376(7):e11. PMID: 28199803", "DA - 20161213")), .Names = "X1", row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))
令人惊讶的是，for 循环运行良好。我尝试了dplyr、map_df 和lapply 解决方案，但它们在microbenchmarking 中的平均速度都较慢。接下来你应该考虑并行化...

标签： r performance for-loop recursion text

【解决方案1】：

1) 我稍微改写了你的函数：

yourFunction <- function(test) {
  for (n in 2:nrow(test)) {
    if (substr(test$X1[n], 1, 3) == "   " &&
        (substr(test$X1[n - 1], 1, 2) == "TI" ||
         substr(test$X1[n - 1], 1, 2) == "JT")) {
      subs <- substr(test$X1[[n - 1]], 1, 6)
      test$X1[n] <- sub("      ", subs, test$X1[n])
    }
  }
  test
}

2) 让我们创建一个小数据集来看看我们的两个函数是如何工作的：

# small test dataset:
require(data.table)

variants <-
  c("TI  - text", "      text2", "AD  - text3", "JT  - text4")
n <- 10
set.seed(26)
dt <- data.table(X1 = sample(variants, size = n, replace = T))
dt
             X1
 1:  TI  - text
 2:       text2
 3: JT  - text4
 4: JT  - text4
 5:       text2
 6:       text2
 7: JT  - text4
 8: AD  - text3
 9:       text2
10: AD  - text3

3) yourFunction 的结果：

yourFunction(dt)
             X1
 1:  TI  - text
 2: TI  - text2
 3: JT  - text4
 4: JT  - text4
 5: JT  - text2
 6: JT  - text2
 7: JT  - text4
 8: AD  - text3
 9:       text2
10: AD  - text3

4) 我使用zoo、data.table 和stringi 编写了这个函数~~（可能没有最后两个包你可以做得很好）~~

myFunction1 <- function(dt) {
  require(zoo)
  require(stringi)
  require(data.table)
  d <- copy(dt)
  d[, v6 := substr(X1, 1, 6)]
  # d[, v3 := substr(v6, 1, 3)]
  # d[, emty := ifelse(v3 == "   ", T, F)]
  d[v6 == "      ", v6 := NA]
  d[, v6 := na.locf(v6, na.rm = F)]
  d[is.na(v6), v6 := "      "]
  stri_sub(d$X1, 1, 6) <- d$v6
  d[, "X1", with = F]
}

5) 审核结果：

r1 <- yourFunction(dt)
r2 <- myFunction1(dt)
all.equal(r1, r2)
[1] "Column 'X1': 1 string mismatch"

r2
             X1
 1:  TI  - text
 2: TI  - text2
 3: JT  - text4
 4: JT  - text4
 5: JT  - text2
 6: JT  - text2
 7: JT  - text4
 8: AD  - text3
 9: AD  - text2
10: AD  - text3

结果不一样，我还重新创建了您不想要/不需要的标签。如果您需要移除它们，那么您可以找到一些方法，但这种方法要快得多。

6) 基准测试：（当 n 非常小时，您的函数会更快）

# when n = 10
require(rbenchmark)
benchmark(myFunction1(dt),
          yourFunction(dt), replications = 100,
          columns = c("test", "replications", "elapsed", "relative"))
# test replications elapsed relative
# 1  myFunction1(dt)          100    0.21    2.333
# 2 yourFunction(dt)          100    0.09    1.000

# when 1k / with 10 replications
n <-  1 * 1000
set.seed(231)
test <- sample(variants, size = n, replace = T)
dt <- data.table(X1 = test)
benchmark(myFunction1(dt),
          yourFunction(dt), replications = 10,
          columns = c("test", "replications", "elapsed", "relative"))
# test replications elapsed relative
# 1  myFunction1(dt)           10    0.03    1.000
# 2 yourFunction(dt)           10    0.52   17.333

# when 50k
n <-  50 * 1000
set.seed(231)
test <- sample(variants, size = n, replace = T)
dt <- data.table(X1 = test)
dt
benchmark(myFunction1(dt),
          yourFunction(dt), replications = 1,
          columns = c("test", "replications", "elapsed", "relative"))
# test replications elapsed relative
# 1  myFunction1(dt)            1    0.01        1
# 2 yourFunction(dt)            1    7.09      709


# time for 20 mil rows:
n <-  20e6
set.seed(231)
test <- sample(variants, size = n, replace = T)
dt <- data.table(X1 = test)
dt
system.time(myFunction1(dt))
# user  system elapsed 
# 6.23    0.78    7.04

【讨论】：

天哪！这出奇的快！在几秒钟内，而早些时候我的 R 会在运行 30 多分钟后不断中止......谢谢你，所以，非常感谢！！！