【Posted】: 2017-08-04 18:24:31
【Question】:
I'm fairly new to R.
I have a data frame test that looks like this (plain text with a single variable X1, but it can have up to 20 million rows):
DP - 2017 Jan 01
TI - Case Report of Severe Antithrombin Deficiency During Extracorporeal Membrane
   Oxygenation and Therapeutic Plasma Exchange for Double Lung Transplantation.
PG - 11-13
LID - 10.1213/XAA.0000000000000412 [doi]
AB - Acquired antithrombin (AT) deficiency is not uncommon in cardiothoracic surgery
   because of heparin exposure and dilutional or consumptive losses. We report a
   case of acquired AT deficiency and resultant multiple deep vein thrombosis in a
   patient with pulmonary fibrosis on veno-venous extracorporeal membrane
AD - From the Departments of *Anesthesiology and daggerCardiothoracic Surgery,
   University of Maryland, Baltimore, Maryland.
JT - Saudi journal of kidney diseases and transplantation : an official publication of
   the Saudi Center for Organ Transplantation, Saudi Arabia
JID - 9436968
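I want to recreate the "tag" for lines that don't have one (that is, lines that start with 3 spaces) by reusing the tag of the preceding line. However, I only need to recreate the tags for TI and JT, because those are the only lines I will eventually need to extract.
So basically, my resulting data frame should look like this: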
DP - 2017 Jan 01
TI - Case Report of Severe Antithrombin Deficiency During Extracorporeal Membrane
TI - Oxygenation and Therapeutic Plasma Exchange for Double Lung Transplantation.
PG - 11-13
LID - 10.1213/XAA.0000000000000412 [doi]
AB - Acquired antithrombin (AT) deficiency is not uncommon in cardiothoracic surgery
   because of heparin exposure and dilutional or consumptive losses. We report a
   case of acquired AT deficiency and resultant multiple deep vein thrombosis in a
   patient with pulmonary fibrosis on veno-venous extracorporeal membrane
AD - From the Departments of *Anesthesiology and daggerCardiothoracic Surgery,
   University of Maryland, Baltimore, Maryland.
JT - Saudi journal of kidney diseases and transplantation : an official publication of
JT - the Saudi Center for Organ Transplantation, Saudi Arabia
JID - 9436968
Lines without a "tag" start with 3 spaces, so this is my current code:
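# Start at row 2: every row is compared with the row before it
for (n in seq_len(nrow(test))[-1])
{
  # A continuation line starts with 3 spaces; only TI and JT are handled
  if (substr(test$X1[n], 1, 3) == "   " && substr(test$X1[n-1], 1, 2) %in% c("TI", "JT"))
  {
    # Take the first 6 characters of the previous line as the tag prefix
    subs <- substr(test$X1[[n-1]], 1, 6)
    # Replace the leading 3 spaces with that prefix
    test$X1[n] <- sub("   ", subs, test$X1[n])
  }
}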
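My current solution works, but it takes a very long time to run on more than 20 million lines of text. Please advise, as I need to run this script on several large files.
Thanks.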
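For illustration, the row-by-row logic described above can also be written without an explicit loop. The following is only a sketch in base R, not benchmarked here: `fill_tags` is a hypothetical helper name, and it assumes that tagged lines look like `TI - text` (upper-case tag, optional spaces, a dash) and that continuation lines start with the 3-space indent described in the question.

```r
# Hypothetical helper (not from the original post): copies the tag prefix
# of the most recent tagged line onto continuation lines, but only when
# that tag is one of `tags`.
fill_tags <- function(x, tags = c("TI", "JT"), indent = "   ") {
  is_cont <- startsWith(x, indent)
  # Index of the most recent tagged line at or before each position
  # (0 if no tagged line has been seen yet).
  last_tagged <- cummax(ifelse(is_cont, 0L, seq_along(x)))
  fix <- is_cont & last_tagged > 0L
  src <- x[last_tagged[fix]]
  # Tag name of the source line; lines that do not match the assumed
  # pattern are returned unchanged by sub() and so never match `tags`.
  tag <- sub("^([A-Z]+) *-.*$", "\\1", src)
  keep <- tag %in% tags
  pos <- which(fix)[keep]
  # Prefix up to and including the dash and one following space.
  prefix <- sub("^([A-Z]+ *- ?).*$", "\\1", src[keep])
  x[pos] <- paste0(prefix, substring(x[pos], nchar(indent) + 1L))
  x
}

# Made-up sample in the same shape as the data in the question
sample_lines <- c(
  "DP - 2017 Jan 01",
  "TI - Case Report of",
  "   Severe Deficiency.",
  "AB - Acquired deficiency",
  "   is not uncommon.",
  "JT - Saudi journal of",
  "   kidney diseases",
  "   and transplantation"
)
filled <- fill_tags(sample_lines)
print(filled)
```

With a data frame like the one in the question, this would be applied as `test$X1 <- fill_tags(test$X1)`. Runs of several continuation lines are handled because every continuation line looks back to the last tagged line rather than to the line directly above it.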
【Comments】:
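- First question: what excludes AB and AD from the operation? Second question: does the data need to stay in the same order after the operation? dput(head(df,8)) would help here.
- I will eventually reshape the data so that TI and JT become variable names. I don't need AB and AD, so there is no need to run it on them. Yes, the order stays the same.
- I extracted random rows from the file, so they don't match the example at all, but the output of that code is:
structure(list(X1 = c("STAT- MEDLINE", "IP - 23", "JT - The New England journal of medicine", "CIN - N Engl J Med. 2016 Dec 8;375(23 ):2286-2289. PMID: 27959676", "CIN - N Engl J Med. ;376(7):e11. PMID: 28207208", "CIN - N Engl J Med. ;376(7):e11. PMID: 28207209", "CIN - N Engl J Med. 2017 Feb 16;376(7):e11. PMID: 28199803", "DA - 20161213")), .Names = "X1", row.names = c(NA, -8L), class = c("tbl_df", "tbl", "data.frame"))
- Surprisingly, the for loop runs well. I tried dplyr, map_df and lapply solutions, but they were all slower on average in microbenchmarking. Next you should consider parallelizing...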
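The comments mention that TI and JT will eventually become variables. As a rough sketch of that later step, once every TI and JT line carries its tag, the lines for one tag can be selected and joined back into a single string. `extract_tag` is a hypothetical helper, the sample lines are made up, and the sketch covers a single record only, since the question does not show how records are delimited in the file.

```r
# Hypothetical helper (not from the original post): returns the text of
# all lines that carry the given tag, with the "TAG - " prefix removed.
extract_tag <- function(x, tag) {
  hit <- grepl(paste0("^", tag, " *- "), x)
  sub(paste0("^", tag, " *- ?"), "", x[hit])
}

# Made-up lines for a single record, with tags already filled in
record <- c(
  "DP - 2017 Jan 01",
  "TI - Case Report of",
  "TI - Severe Deficiency.",
  "JT - Saudi journal of",
  "JT - kidney diseases"
)

# Join the pieces of each field into one string per variable
ti <- paste(extract_tag(record, "TI"), collapse = " ")
jt <- paste(extract_tag(record, "JT"), collapse = " ")
print(c(TI = ti, JT = jt))
```

For a full file, the same idea would need to be applied per record, using whatever line marks the start of a record in the real data.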
Tags: r performance for-loop recursion text