【Posted】: 2020-09-27 18:21:16
【Question】:
I have a data frame with 5 million observations. A simplified version looks like this:
df <- data.frame(date=as.Date(c("2020-05-05","2020-05-05","2020-05-05")), buyer=c("A","B","C"), seller =c("B","A","D"),amount=c(1,4,2))
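The example above reads as follows: on 2020-05-05, agent A bought an amount of 1 from agent B, and so on.
The dataset contains roughly 500k unique buyers and sellers across more than 800 distinct dates.
For each date, I want to create an n x n matrix that represents the daily inventory change between the agents that traded on that date, where n is the number of agents active that day. Each daily matrix should be stored in a list. For the example above, the result would be: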
╔══════════════╗
║ A B C D ║
╠══════════════╣
║ A 0 -3 0 0 ║
║ B +3 0 0 0 ║
║ C 0 0 0 2 ║
║ D 0 0 -2 0 ║
╚══════════════╝
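Agent A first buys 1 unit from agent B, then sells 4 units back to B, so A's net position against B is 1 - 4 = -3 (and B's against A is +3).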
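The arithmetic above can be sketched without looping over pairs. This is only an illustration under stated assumptions, not the code from the post: `net_matrix` is a hypothetical helper name, and it assumes the matrix entry for row i, column j means "amount i bought from j minus amount i sold to j", which is what the expected table shows.

```r
df <- data.frame(date = as.Date(c("2020-05-05", "2020-05-05", "2020-05-05")),
                 buyer = c("A", "B", "C"),
                 seller = c("B", "A", "D"),
                 amount = c(1, 4, 2))

# Hypothetical helper (not from the original post): net inventory change
# between every pair of agents appearing in one day's trades.
net_matrix <- function(d) {
  buyer  <- as.character(d$buyer)
  seller <- as.character(d$seller)
  agents <- sort(unique(c(buyer, seller)))
  # Shared factor levels force the cross-tabulation to be square (n x n).
  trades <- data.frame(amount = d$amount,
                       b = factor(buyer, levels = agents),
                       s = factor(seller, levels = agents))
  # bought[i, j] = total amount agent i bought from agent j
  bought <- xtabs(amount ~ b + s, data = trades)
  bought <- matrix(bought, nrow = length(agents),
                   dimnames = list(agents, agents))
  # Purchases from j minus sales to j
  bought - t(bought)
}

# One matrix per date, stored in a list named by date
daily_matrices <- lapply(split(df, df$date), net_matrix)
daily_matrices[["2020-05-05"]]
```

For the example data this reproduces the -3 / +3 and 2 / -2 entries of the table. With about 1,000 agents per day, each dense matrix has 10^6 cells, so memory across 800 days may become a concern of its own.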
My code looks like this:
library("tidyverse")
df <- data.frame(date=as.Date(c("2020-05-05","2020-05-05","2020-05-05")), buyer=as.character(c("A","B","C")), seller =as.character(c("B","A","D")),amount=c(1,4,2))

# Defined before the loop so that it exists when the loop calls it.
# Relies on the globals daily_subset, daily_matrices and i set by the loop.
trading_partner <- function(x, y) {
  # filter trades for each pair
  agent_daily_subset <- daily_subset %>%
    filter((buyer == x & seller == y) | (buyer == y & seller == x))
  agent_daily_subset_long <- agent_daily_subset %>%
    gather(key = "var", value = "agent", buyer, seller)
  # calculate the change in inventory for each trade
  agent_daily_subset_long <- agent_daily_subset_long %>%
    group_by(agent) %>%
    mutate(delta_inventory = case_when(var == "buyer" ~ amount,
                                       var == "seller" ~ -amount))
  # summarise the change in inventory for each of the two agents in a pair
  subgroup_inventory <- agent_daily_subset_long %>%
    group_by(agent) %>%
    summarise(inventory = sum(delta_inventory))
  # if the pair traded, write the inventory change into the list of matrices,
  # locating the row and column by agent name
  if (dim(subgroup_inventory)[1] > 0) {
    daily_matrices[[i]][as.character(subgroup_inventory[1,1]), as.character(subgroup_inventory[2,1])] <<- as.double(subgroup_inventory[1,2])
    daily_matrices[[i]][as.character(subgroup_inventory[2,1]), as.character(subgroup_inventory[1,1])] <<- as.double(subgroup_inventory[2,2])
  }
}

daily_matrices <- list() # create empty list to store matrices
dates <- unique(as.Date(df$date))
for (i in seq_along(dates)) { # loop over every date
  loop_date <- dates[i]
  daily_subset <- df %>% filter(date == loop_date) # filter data for each date
  daily_subset_long <- daily_subset %>%
    gather(key = "var", value = "agent", buyer, seller)
  daily_agents <- distinct(daily_subset_long, agent) # find unique agents
  daily_pairs <- combn(daily_agents$agent, 2) # find each possible pair
  ndim <- dim(daily_agents)[1]
  daily_matrices[[i]] <- matrix(data = 0, nrow = ndim, ncol = ndim) # span matrix
  colnames(daily_matrices[[i]]) <- daily_agents$agent # name columns with agents
  rownames(daily_matrices[[i]]) <- daily_agents$agent # name rows with agents
  for (j in seq_len(ncol(daily_pairs))) { # call trading_partner for each possible pair
    trading_partner(daily_pairs[1,j], daily_pairs[2,j])
  }
  print(i) # just to track progress
}
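This works as expected, but I run into problems because the original dataset has around 1,000 distinct agents per day, so the matrices I create are huge.
I know that loops are not the preferred approach in R to begin with, but I could not come up with another solution. The code above takes about 30 minutes per daily matrix. With 800 days, the full run would take more than two weeks.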
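As a rough back-of-the-envelope check on these figures, the snippet below uses only the numbers quoted in this post (about 1,000 agents per day, 30 minutes per matrix, 800 days). The variable names are made up for illustration, and the per-pair time is an average implied by those numbers, not a measurement.

```r
agents_per_day  <- 1000   # approximate figure quoted above
minutes_per_day <- 30     # observed runtime per daily matrix
n_days          <- 800

# combn(agents, 2) yields one column per unordered pair of agents
pairs_per_day <- choose(agents_per_day, 2)

# Average time implied for each pair-level call
seconds_per_pair <- minutes_per_day * 60 / pairs_per_day

# Total runtime in days for the whole dataset
total_days <- minutes_per_day * n_days / 60 / 24

pairs_per_day
seconds_per_pair
total_days
```

So the inner loop runs close to half a million times per day of data, almost always for pairs that never traded, which suggests the cost comes from the number of pair-level calls rather than from any single call being slow.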
Is there a best-practice way to speed this up?
【Discussion】: