Fast mapping between two data.table datasets: answers

【Question title】: Fast mapping between two data.table datasets
【Posted】: 2017-01-23 21:00:06
【Question】:

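I want to get the county name associated with every zip code in a dataset. With data frames I get relatively fast results (though I feel it could be faster), but not with data.table, even after some optimization. Is there a way to speed this up further, using either data frames or data.tables?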

Here is my initialization (based on this answer):

library(noncensus)
data(zip_codes)
data(counties)
counties$fips <- as.numeric(paste0(counties$state_fips, counties$county_fips))

Computation using data frames (the second call is slightly faster, as expected), 20 and 16 seconds:

system.time(sapply(zip_codes$fips, function(x) subset(counties, fips == x)$county_name))
system.time(sapply(zip_codes$fips, function(x) counties[counties$fips==x,]$county_name))

Computation using data.table, 60 and 43 seconds:

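library(data.table)
zip_codes.dt <- data.table(zip_codes)
counties.dt <- data.table(counties)  # the lookup table must hold county_name
# unkeyed: subset() scans counties.dt once per zip code
system.time(sapply(zip_codes.dt$fips, function(x) subset(counties.dt, fips == x)$county_name))
setkey(counties.dt, fips)  # optimizing: sort by fips so lookups use the key
system.time(sapply(zip_codes.dt$fips, function(x) counties.dt[.(x)]$county_name))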

【Comments】:

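Your first three lines, the construction of counties$fips, can be done in one line with counties$fips <- interaction(counties$state_fips, counties$county_fips).
@lmo That introduces a . into the values of the factor variable, which then no longer match the fips values in zip_codes.
The sep argument of interaction defaults to ".". Use sep="" to get rid of it.
We can just use paste, i.e. paste0(counties$state_fips, counties$county_fips), which gives a character vector.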
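The difference between the two suggestions above can be checked on a few values. This is a minimal sketch on made-up toy vectors (state_fips and county_fips here are stand-ins, not the noncensus data), assuming both inputs are character strings:

```r
# Toy stand-ins for counties$state_fips and counties$county_fips (hypothetical values)
state_fips  <- c("01", "01", "06")
county_fips <- c("001", "003", "037")

# interaction() returns a factor; with the default sep = "." the dot ends up in the labels
f_dot <- interaction(state_fips, county_fips, drop = TRUE)

# sep = "" removes the dot, but the result is still a factor
f_nosep <- interaction(state_fips, county_fips, sep = "", drop = TRUE)

# paste0() returns a character vector directly
fips_chr <- paste0(state_fips, county_fips)

# Converting to numeric, as in the question, drops any leading zero
fips_num <- as.numeric(fips_chr)
```

Whichever form is chosen, the key column should have the same type in both tables before it is used for matching.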
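Most of what you are doing here is quite inconsistent with standard data.table syntax. Maybe go through the vignettes first. For starters: counties.dt[.(x)]$county_name should use , county_name rather than $; and I doubt there is any reason to use sapply there instead of a single join...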
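To illustrate the last comment, here is a small sketch on made-up toy tables (the rows below are illustrative, not taken from noncensus). It compares the per-row sapply lookup, written with the column in j, against a single keyed join that passes every fips value at once, plus a base R match() equivalent for plain data frames:

```r
library(data.table)

# Hypothetical toy tables standing in for counties and zip_codes
counties.dt  <- data.table(fips = c(1001, 1003, 6037),
                           county_name = c("Autauga", "Baldwin", "Los Angeles"))
zip_codes.dt <- data.table(zip  = c("36003", "36507", "90001", "36006"),
                           fips = c(1001, 1003, 6037, 1001))
setkey(counties.dt, fips)

# One keyed lookup per zip code, selecting the column in j instead of with $
one_by_one <- sapply(zip_codes.dt$fips, function(x) counties.dt[.(x), county_name])

# A single keyed join: all fips values are passed in one call
all_at_once <- counties.dt[.(zip_codes.dt$fips), county_name]

# Base R equivalent: match() returns the row position of each fips value
via_match <- counties.dt$county_name[match(zip_codes.dt$fips, counties.dt$fips)]
```

All three give the same vector on this toy input; the single join and match() avoid calling into the lookup table once per row, which is likely where most of the time in the question goes.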

Tags: r data.table

【Solution 1】:

Following @Frank's suggestion, reading the vignettes and the package documentation helped me find the answer using data.table.

Here it is:

zip_codes.dt[counties.dt, on="fips", county_name := county_name]
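The one-liner above is an update join: rows of zip_codes.dt are matched to counties.dt on fips, and county_name is added by reference. A minimal sketch on made-up toy tables (illustrative rows only, including one fips value with no match) shows the behaviour. The i. prefix is used here only to make explicit that the value comes from counties.dt:

```r
library(data.table)

# Hypothetical toy tables standing in for counties and zip_codes
counties.dt  <- data.table(fips = c(1001, 1003, 6037),
                           county_name = c("Autauga", "Baldwin", "Los Angeles"))
zip_codes.dt <- data.table(zip  = c("36003", "36507", "90001", "00000"),
                           fips = c(1001, 1003, 6037, 99999))

# Update join: x is zip_codes.dt, i is counties.dt; the new column is added in place
zip_codes.dt[counties.dt, on = "fips", county_name := i.county_name]

# Rows of zip_codes.dt whose fips has no match in counties.dt keep NA
result <- zip_codes.dt$county_name
```

Because := modifies zip_codes.dt by reference, no copy of the table is made and no key has to be set beforehand.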

【Discussion】: