[Posted]: 2016-03-15 01:18:08
[Question]:
I have a long file that I read into a list using readLines/strsplit:
> head(edges.split)
[[1]]
[1] "1" "1263895" "4415645" "1798592" "576013" "1315720" "1179526"
[8] "4257735" "4368477" "4045891" "336813" "4257736" "1179526" "3494186"
[15] "4257735" "4257735"
[[2]]
[1] "2" "4831424" "2070750" "3" "798464" "1208032" "351213"
[8] "2816552" "1484206" "4493159" "5" "1" "4" "4493043"
[15] "3126743" "1207504" "1499874" "214487" "173486" "1484207"
[[3]]
[1] "3" "2" "4" "3648046" "1872711" "1275714" "702512"
[8] "1275655" "1667650" "1484207"
[[4]]
[1] "4" "4463893" "3618982" "3624614" "3299496" "4348657" "4104419"
[8] "3070955" "2707725" "5" "4463739" "4158900" "1135360" "653364"
[15] "806185" "2465873" "3299496" "3060623" "1965801" "1005013" "3070955"
[22] "3103098" "4283482" "1951317" "1487656" "4632995" "4402849" "2707725"
[29] "1564441" "576420" "1972753" "1740415" "3070390" "2391329" "3827055"
[36] "996590" "4267592" "3787645" "1857269" "4348657" "3491190" "3787645"
[43] "3149658" "3159019" "3787645" "1135358" "2183685" "2303714" "3159019"
[50] "2465873" "4276571" "4446386" "2854060" "3299496" "1740415" "4402849"
[57] "4632995" "3494237" "2050300" "1135358" "3787645"
[[5]]
[1] "5" "336813" "4" "3159019" "2303714" "1740415" "4"
[8] "305277" "2707725" "2303714" "1740415" "3494237" "1135358" "4"
[[6]]
[1] "6" "499620" "3622792" "1315540" "576013" "1798592" "3965874"
[8] "752451" "1017219" "1762253" "3693356" "348788" "4038359" "336813"
[15] "3449680" "4717601" "3545052" "4494041" "748702" "1093005" "3143747"
[22] "1648572" "1093005" "1648572" "3143747"
Now I want to convert it into a 3-column data.frame/data.table:
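edges.df <- do.call(rbind, lapply(edges.split, function (l)
  # skip elements that have a source node but no destinations
  if (length(l) <= 1) NULL
  else {
    # count how often each destination occurs for this source
    tab <- table(tail(l, -1))
    data.table(src = as.integer(l[1]),
               dst = as.integer(names(tab)),
               weight = as.numeric(tab))
  }))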
str(edges.df)
Classes ‘data.table’ and 'data.frame': 116330611 obs. of 3 variables:
$ src : int 1 1 1 1 1 1 1 1 1 1 ...
$ dst : int 1179526 1263895 1315720 1798592 336813 3494186 4045891 4257735 4257736 4368477 ...
$ weight: num 2 1 1 1 1 1 1 3 1 1 ...
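To make the per-element step concrete, here is a small sketch of what the function body does for the first list element shown above. It uses only base R (a plain data.frame instead of data.table) so it runs without extra packages; the names `l` and `one.src` are just for this illustration.

```r
# First element of edges.split, copied from the head() output above:
# l[1] is the source node, the rest are destination nodes (with repeats).
l <- c("1", "1263895", "4415645", "1798592", "576013", "1315720", "1179526",
       "4257735", "4368477", "4045891", "336813", "4257736", "1179526",
       "3494186", "4257735", "4257735")

# Count how often each destination occurs; names(tab) holds the destination
# ids as strings (sorted as text, not as numbers).
tab <- table(tail(l, -1))

# One row per distinct destination, with the repeat count as the weight.
one.src <- data.frame(src    = as.integer(l[1]),
                      dst    = as.integer(names(tab)),
                      weight = as.numeric(tab))
print(one.src)
```

The first rows match the str() output above (for example, 4257735 gets weight 3). The slow part is that this table/rbind work is repeated once per list element.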
This takes 5.5 hours and consumes 20 GB of RAM (the data.frame version is still running: 15 hours and counting).
The simpler matrix version
edges.df <- do.call(rbind,lapply(edges.split,function (l)
cbind(as.integer(l[1]),as.integer(tail(l,-1)))))
finishes in 10 minutes and produces a 156716688x2 matrix.
Is the table call what makes the huge time difference?
How can I speed this up?
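[Comments]:
-
Why the downvote? What is wrong with this question?
-
I'm curious about the overall timing on your actual data compared with your approach. Would you mind sharing?
-
Your version takes 3.5 minutes; collapsing multiple links into weights takes another 4 minutes. Thanks!
Tags: r performance data.table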
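The code this last comment refers to is not quoted in the thread. As a rough sketch of the two steps it times (flatten everything into one long src/dst table, then collapse repeated links into weights), something along these lines may work. It assumes the data.table package is installed and uses a toy `edges.split`; the names `n.dst`, `edges.long` and `edges.dt` are made up for this example.

```r
library(data.table)

# Toy stand-in for edges.split: the first entry of each vector is the source
# node, the remaining entries are destination nodes (repeats allowed).
edges.split <- list(
  c("1", "10", "20", "10"),
  c("2", "1", "3"),
  c("3")                      # a source with no destinations
)

# Number of destinations per source (0 for the lone "3").
n.dst <- lengths(edges.split) - 1L

# Step 1: build the long two-column table in one go, rather than creating
# and rbind-ing one small table per list element.
edges.long <- data.table(
  src = rep(as.integer(vapply(edges.split, `[`, "", 1L)), times = n.dst),
  dst = as.integer(unlist(lapply(edges.split, `[`, -1L)))
)

# Step 2: collapse repeated (src, dst) pairs into a weight column.
edges.dt <- edges.long[, .(weight = .N), by = .(src, dst)]
print(edges.dt)
```

Note that `.N` yields an integer weight, whereas the code in the question stores it as numeric; wrap it in `as.numeric()` if that matters.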