[Posted]: 2021-02-21 17:33:33
[Question]:
I have a data frame in R with the following structure:
ID Date
ID-1 2020-02-10 13:12:04
ID-2 2020-02-12 15:02:24
ID-3 2020-02-14 12:25:32
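I am using the following query to fetch data from MySQL, and this is where I run into trouble, because I have a large number of IDs (about 90K). When I pass 500-1000 IDs it works fine, but passing all 90K IDs throws an error.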
Data_frame<-paste0("
SELECT c.ID, e.name,d.output
FROM Table1 c
left outer join Table2 d ON d.ID=c.ID
LEFT outer JOIN Table1 e ON e.ID_2=d.ID_2
WHERE e.name in ('Name1','Name2')
AND c.ID IN (", paste(shQuote(DF$ID, type = "sh"),collapse = ', '), ")
;")
The query returns output in the following form, and I need to combine it with DF by ID.
Query output:
ID Name output
ID-1 Name1 23
ID-1 Name2 20
ID-2 Name1 40
ID-2 Name2 97
ID-3 Name1 34
ID-3 Name2 53
Desired output:
ID Date Name1 Name2
ID-1 2020-02-10 13:12:04 23 20
ID-2 2020-02-12 15:02:24 40 97
ID-3 2020-02-14 12:25:32 34 53
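For reference, the step from the query output to the desired output is a long-to-wide reshape followed by a join on ID rather than a row bind. A minimal base R sketch, assuming the query result is already in a data frame (the name `query_out` is hypothetical) and using the sample values shown above:

```r
# Sample data copied from the tables above; query_out is a hypothetical name
# for the data frame returned by the query.
DF <- data.frame(
  ID = c("ID-1", "ID-2", "ID-3"),
  Date = c("2020-02-10 13:12:04", "2020-02-12 15:02:24", "2020-02-14 12:25:32"),
  stringsAsFactors = FALSE
)
query_out <- data.frame(
  ID = rep(c("ID-1", "ID-2", "ID-3"), each = 2),
  Name = rep(c("Name1", "Name2"), times = 3),
  output = c(23, 20, 40, 97, 34, 53),
  stringsAsFactors = FALSE
)

# Long to wide: one row per ID, one column per Name.
wide <- reshape(query_out, idvar = "ID", timevar = "Name", direction = "wide")

# reshape() names the new columns "output.Name1", "output.Name2"; drop the prefix.
names(wide) <- sub("^output\\.", "", names(wide))

# Left join onto DF so every ID in DF is kept, even without a query match.
result <- merge(DF, wide, by = "ID", all.x = TRUE)
print(result)
```

With the tidyverse, `tidyr::pivot_wider()` followed by `dplyr::left_join()` would be the equivalent pair of calls.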
I have tried the code below:
createIDBatchVector <- function(x, batchSize){
  paste0(
    "'"
    , sapply(
      split(x, ceiling(seq_along(x) / batchSize))
      , paste
      , collapse = "','"
    )
    , "'"
  )
}
# second helper function: builds one SQL query per batch of IDs
createQueries <- function(IDbatches){
  paste0("
SELECT c.ID, e.name,d.output
FROM Table1 c
left outer join Table2 d ON d.ID=c.ID
LEFT outer JOIN Table1 e ON e.ID_2=d.ID_2
WHERE e.name in ('Name1','Name2')
AND c.ID IN (", IDbatches, ")
;")
}
# ------------------------------------------------------------------
# and now the actual script
# first we create a vector that contains one batch per element
IDbatches <- createIDBatchVector(DF$ID, 2)
# It looks like this:
# [1] "'ID-1','ID-2'" "'ID-3','ID-4'" "'ID-5'"
# now we create a vector of SQL-queries out of that
queries <- createQueries(IDbatches)
df_final <- data.frame() # initialize a dataframe
conn <- database # open a connection
for (query in queries){ # iterate over the queries
  df_final <- rbind(df_final, dbGetQuery(conn, query))
}
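The batching idea above can be sketched end to end as follows. This is only an illustration under stated assumptions: `build_queries`, `run_batches`, and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in that never touches a database, so the block runs without a connection. With a real connection the fetch function would be something like `function(q) DBI::dbGetQuery(conn, q)`.

```r
# Build one SQL string per batch of IDs (hypothetical helper).
build_queries <- function(ids, batch_size) {
  batches <- split(ids, ceiling(seq_along(ids) / batch_size))
  vapply(batches, function(b) {
    # Quote each ID, doubling any embedded single quote.
    in_list <- paste0("'", gsub("'", "''", b), "'", collapse = ", ")
    paste0(
      "SELECT c.ID, e.name, d.output ",
      "FROM Table1 c ",
      "LEFT OUTER JOIN Table2 d ON d.ID = c.ID ",
      "LEFT OUTER JOIN Table1 e ON e.ID_2 = d.ID_2 ",
      "WHERE e.name IN ('Name1', 'Name2') ",
      "AND c.ID IN (", in_list, ");"
    )
  }, character(1), USE.NAMES = FALSE)
}

# Run every query through `fetch` and bind the pieces once at the end,
# instead of growing a data frame inside a loop.
run_batches <- function(queries, fetch) {
  parts <- lapply(queries, fetch)
  do.call(rbind, parts)
}

# Stand-in for a database call (assumption): returns the IDs found in the query text.
fake_fetch <- function(q) {
  data.frame(ID = regmatches(q, gregexpr("ID-[0-9]+", q))[[1]],
             stringsAsFactors = FALSE)
}

ids <- paste0("ID-", 1:5)
queries <- build_queries(ids, batch_size = 2)
res <- run_batches(queries, fake_fetch)
print(length(queries))
print(res)
```

Collecting the pieces in a list and calling `rbind` once avoids repeatedly copying the growing result, which matters more as the number of batches increases.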
[Discussion]:
Tags: r dataframe dplyr tidyverse