【Question title】: SQL querying dataframes inside list
【Posted】: 2012-09-02 23:39:06
【Question】:

Given the data frames

df1 <- data.frame(CustomerId=c(1:6),Product=c(rep("Toaster",3),rep("Radio",3)))
df2 <- data.frame(CustomerId=c(2,4,6),State=c(rep("Alabama",2),rep("Ohio",1)))

stored in a list

dflist <- c(df1,df2)

How can I run an sqldf query (a join) on these data frames?

Failed attempts:

test <- sqldf("select a.CustomerId, a.Product, b.State from dflist[1] a
          inner join dflist[2] b on b.id = a.id")

test <- sqldf("select a.CustomerId, a.Product, b.State from dflist$df1 a
          inner join dflist$df2 b on b.CustomerId = a.CustomerId")

【Comments】:

    Tags: r list dataframe data.table sqldf


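    【Solution 1】:

    If you copy the data.frames from the list into a new environment, you can pass that environment to sqldf through its envir argument. Alternatively, name the list elements and use with.

    A few things to note:

    • I created dflist with list rather than c

    Note the difference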

    str(c(df1,df2))
    ##List of 4
    ## $ CustomerId: int [1:6] 1 2 3 4 5 6
    ## $ Product   : Factor w/ 2 levels "Radio","Toaster": 2 2 2 1 1 1
    ## $ CustomerId: num [1:3] 2 4 6
    ## $ State     : Factor w/ 2 levels "Alabama","Ohio": 1 1 2
    
    str(list(df1,df2))
    ##List of 2
    ## $ :'data.frame': 6 obs. of  2 variables:
    ##  ..$ CustomerId: int [1:6] 1 2 3 4 5 6
    ##  ..$ Product   : Factor w/ 2 levels "Radio","Toaster": 2 2 2 1 1 1
    ## $ :'data.frame': 3 obs. of  2 variables:
    ##  ..$ CustomerId: num [1:3] 2 4 6
    ##  ..$ State     : Factor w/ 2 levels "Alabama","Ohio": 1 1 2
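
The str output above can also be checked programmatically. The following is a minimal sketch in base R, assuming the same df1 and df2 as in the question; the names flat and nested are only illustrative.

```r
# Same data as in the question
df1 <- data.frame(CustomerId = c(1:6),
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), rep("Ohio", 1)))

# c() concatenates the columns of both data frames into one flat list
flat <- c(df1, df2)
# list() keeps each data frame as a single element
nested <- list(df1, df2)

length(flat)                # 4 column vectors
length(nested)              # 2 data frames
is.data.frame(flat[[1]])    # FALSE: the data frames no longer exist as such
is.data.frame(nested[[1]])  # TRUE
```

After c(), no element of the list is a data frame any more, so there is nothing for a table name in the query to refer to.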
    
    • I adjusted the SQL query to use the column names in the data.frames (following your second attempt)

    Naming the list elements

    dflist <- list(df1,df2)
    names(dflist) <- c('df1','df2')
    

    Creating a new environment

    # create a new environment
    
    e <- new.env()
    # assign the elements of dflist to this new environment
    for(.x in names(dflist)){
      assign(value = dflist[[.x]], x=.x, envir = e)
    }
    
    # this could also be done using mapply / lapply
    # eg
    # invisible(mapply(assign, value = dflist, x = names(dflist), MoreArgs =list(envir = e)))
    # run the sql query
    sqldf("select a.CustomerId, a.Product, b.State from df1 a
              inner join df2 b on b.CustomerId = a.CustomerId", envir = e)
    
    ##  CustomerId Product   State
    ## 1          2 Toaster Alabama
    ## 2          4   Radio Alabama
    ## 3          6   Radio    Ohio
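
As a possible shortcut for the assign loop above, base R's list2env can copy every element of a named list into an environment in one call. This is a sketch of that alternative, not part of the original answer; it assumes dflist is named as shown earlier, and the resulting e can then be passed to sqldf via envir = e in the same way.

```r
# Same data as in the question
df1 <- data.frame(CustomerId = c(1:6),
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), rep("Ohio", 1)))
dflist <- list(df1 = df1, df2 = df2)

# list2env() assigns each named element of dflist into the given environment
e <- list2env(dflist, envir = new.env())

# Both data frames are now visible in e under their list names
ls(e)
is.data.frame(get("df1", envir = e))
```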
    

    A simpler approach using with

    You can simply use with to evaluate the call locally (it is important that dflist is a named list here)

    # this is far simpler!!
    with(dflist,sqldf("select a.CustomerId, a.Product, b.State from df1 a
               inner join df2 b on b.CustomerId = a.CustomerId"))
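
To see why the names matter, here is a small base R sketch of how with resolves list names. It assumes the named dflist from above and swaps in base merge for the sqldf call so that it runs without extra packages; merge is a stand-in here, not the method the answer uses.

```r
# Same data as in the question
df1 <- data.frame(CustomerId = c(1:6),
                  Product = c(rep("Toaster", 3), rep("Radio", 3)))
df2 <- data.frame(CustomerId = c(2, 4, 6),
                  State = c(rep("Alabama", 2), rep("Ohio", 1)))

# Named list: with() makes the names df1 and df2 visible to the expression.
# The elements are copied into dflist so the lookup cannot fall back on
# global objects of the same name.
dflist <- list(df1 = df1, df2 = df2)
rm(df1, df2)

# merge() stands in for the sqldf inner join (assumption: base R only)
joined <- with(dflist, merge(df1, df2, by = "CustomerId"))
joined
```

With an unnamed list, with has no names to expose, so df1 and df2 would only be found if objects with those names happened to exist in the calling environment.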
    

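    Another simple approach using proto

    • Thanks to @G.Grothendieck (see the comments)

    This uses the proto package, which is loaded along with sqldf

    dflist <- list(a = df1, b = df2)
    # the list names (a and b) are the table names the query refers to
    sqldf("select a.CustomerId, a.Product, b.State from a
             inner join b on b.CustomerId = a.CustomerId",
             envir = as.proto(dflist))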
    

    Using data.table

    Alternatively, you can use data.table, which offers an SQL-like approach (see FAQ 2.16)

    library(data.table)
    dflist <- list(data.table(df1),data.table(df2))
    names(dflist) <- c('df1','df2')
    invisible(lapply(dflist, setkeyv, 'CustomerId'))
    with(dflist, df1[df2])
    ##    CustomerId Product   State
    ## 1:          2 Toaster Alabama
    ## 2:          4   Radio Alabama
    ## 3:          6   Radio    Ohio
    

    【Discussion】:

    • I have shown how to use assign with mapply instead of a for loop
    • Here is a variation on the envir = e approach that uses proto (sqldf already pulls it in, so you do not have to load it separately): dflist <- list(a = df1, b = df2); sqldf("select a.CustomerId, a.Product, b.State from a inner join b on b.CustomerId = a.CustomerId", envir = as.proto(dflist)). Alternatively, we can create the proto object directly in the sqldf envir= argument: envir = proto(a = df1, b = df2)