【Question title】: Transpose dplyr::tbl object
【Posted】: 2018-04-26 18:31:57
【Question】:

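I am connecting with src_postgres and using the dplyr::tbl function to fetch data from a Redshift database. I have applied some filters and a top-n step to it using dplyr itself. My data now looks like this: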

   riid   day         hour 
   <dbl> <chr>       <chr>
 1 5542. "THURSDAY " 12   
 2 5862. "FRIDAY   " 15   
 3 5982. "TUESDAY  " 15   
 4 6022. WEDNESDAY   16 

My final output should be as below:
riid    MON   TUES  WED   THUR   FRI   SAT  SUN
5542                       12
5862                             15
5982           15
6022                 16

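I tried spread. Because of the object's class, it throws the following error:

Error in UseMethod("spread_") : no applicable method for 'spread_' applied to an object of class "c('tbl_dbi', 'tbl_sql', 'tbl_lazy', 'tbl')"

Since this is a very large table, I do not want to pull it into a data frame, because that takes much longer. I could do it as follows: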

df_mon <- df2 %>% filter(day == 'MONDAY') %>% mutate(MONDAY = hour) %>% select(riid,MONDAY)
df_tue <- df2 %>% filter(day == 'TUESDAY') %>% mutate(TUESDAY = hour) %>% select(riid,TUESDAY)
df_wed <- df2 %>% filter(day == 'WEDNESDAY') %>% mutate(WEDNESDAY = hour) %>% select(riid,WEDNESDAY)
df_thu <- df2 %>% filter(day == 'THURSDAY') %>% mutate(THURSDAY = hour) %>% select(riid,THURSDAY)
df_fri <- df2 %>% filter(day == 'FRIDAY') %>% mutate(FRIDAY = hour) %>% select(riid,FRIDAY)

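Is it possible to write all of the above in a single statement?

Any help with doing this transformation in a faster way is much appreciated.

Edit: adding the dput of the tbl object: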

structure(list(src = structure(list(con = <S4 object of class structure("PostgreSQLConnection", package = "RPostgreSQL")>, 
    disco = <environment>), .Names = c("con", "disco"), class = c("src_dbi", 
"src_sql", "src")), ops = structure(list(name = "select", x = structure(list(
    name = "filter", x = structure(list(name = "filter", x = structure(list(
        name = "group_by", x = structure(list(x = structure("SELECT riid,day,hour,sum(weightage) AS score FROM\n  (SELECT riid,day,hour,\n  POWER(2,(cast(datediff (seconds,convert_timezone('UTC','PKT',SYSDATE),TO_DATE(TO_CHAR(event_captured_dt,'mm/dd/yyyy hh24:mi:ss'),'mm/dd/yyyy hh24:mi:ss')) as decimal) / cast(7862400 as decimal))) AS weightage\n  FROM (\n  SELECT riid,convert_timezone('GMT','PKT',event_captured_dt) AS EVENT_CAPTURED_DT,\n  TO_CHAR(convert_timezone('GMT','PKT',event_captured_dt),'DAY') AS day,\n  TO_CHAR(convert_timezone('GMT','PKT',event_captured_dt),'HH24') AS hour\n  FROM Zameen_STO_DATA WHERE EVENT_CAPTURED_DT >= TO_DATE((sysdate -30),'yyyy-mm-dd') and LIST_ID = 4282\n  )) group by riid,day,hour", class = c("sql", 
        "character")), vars = c("riid", "day", "hour", "score"
        )), .Names = c("x", "vars"), class = c("op_base_remote", 
        "op_base", "op")), dots = structure(list(riid = riid, 
            day = day), .Names = c("riid", "day")), args = structure(list(
            add = FALSE), .Names = "add")), .Names = c("name", 
    "x", "dots", "args"), class = c("op_group_by", "op_single", 
    "op")), dots = structure(list(~min_rank(desc(~score)) <= 
        1), .Names = ""), args = list()), .Names = c("name", 
    "x", "dots", "args"), class = c("op_filter", "op_single", 
    "op")), dots = structure(list(~row_number() == 1), .Names = ""), 
    args = list()), .Names = c("name", "x", "dots", "args"), class = c("op_filter", 
"op_single", "op")), dots = structure(list(~riid, ~day, ~hour), class = "quosures", .Names = c("", 
"", "")), args = list()), .Names = c("name", "x", "dots", "args"
), class = c("op_select", "op_single", "op"))), .Names = c("src", 
"ops"), class = c("tbl_dbi", "tbl_sql", "tbl_lazy", "tbl"))

【Question comments】:

    Tags: r tidyr dplyr spread


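    【Solution 1】:

    I think what you are looking for is the ability to run the tidyr::spread() function against a remote source or database. I have a PR for dbplyr that attempts to implement this: https://github.com/tidyverse/dbplyr/pull/72. You can try it with: devtools::install_github("tidyverse/dbplyr", ref = devtools::github_pull(72)).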
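For readers who cannot install the PR, a spread against a database can also be approximated with conditional aggregation, which dbplyr can translate to SQL. The following is a minimal sketch under stated assumptions, not the PR's implementation: `remote` is a hypothetical stand-in built with dbplyr::lazy_frame() and a simulated Postgres connection, so only the generated SQL is inspected and nothing is executed. It also assumes the day values carry no trailing padding, unlike the "THURSDAY " values printed in the question.

```r
library(dplyr)
library(dbplyr)

# Hypothetical stand-in for the lazy tbl in the question; no database is contacted.
remote <- lazy_frame(
  riid = 5542, day = "THURSDAY", hour = "12",
  con = simulate_postgres()
)

# One output column per weekday: keep `hour` where the day matches, NULL otherwise,
# then collapse to one row per riid with an aggregate that ignores NULLs.
wide_query <- remote %>%
  group_by(riid) %>%
  summarise(
    MON  = max(ifelse(day == "MONDAY",    hour, NA), na.rm = TRUE),
    TUES = max(ifelse(day == "TUESDAY",   hour, NA), na.rm = TRUE),
    WED  = max(ifelse(day == "WEDNESDAY", hour, NA), na.rm = TRUE),
    THUR = max(ifelse(day == "THURSDAY",  hour, NA), na.rm = TRUE),
    FRI  = max(ifelse(day == "FRIDAY",    hour, NA), na.rm = TRUE),
    SAT  = max(ifelse(day == "SATURDAY",  hour, NA), na.rm = TRUE),
    SUN  = max(ifelse(day == "SUNDAY",    hour, NA), na.rm = TRUE)
  )

# Render the SQL that would be sent to the database.
wide_sql <- as.character(sql_render(wide_query))
cat(wide_sql, "\n")
```

Applied to the real lazy tbl instead of `remote`, the same pipeline should stay on the database side until collect() is called, which is the point of avoiding a local data frame.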

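    【Comments】:

    • Thank you, I will try it and let you know.
    • df2 %>% group_by(name_id, month, year) %>% tidyr::spread(key = interaction_type,value = 1,fill=0) This worked like a charm. Thanks again!!
    • A quick question: is it possible to combine multiple columns and then apply spread to them?
    【Solution 2】:

    Use dcast from the reshape2 package.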

    > data
    # A tibble: 4 x 3
       riid day    hour
      <dbl> <chr> <dbl>
    1  1.00 TH     12.0
    2  2.00 FR     15.0
    3  3.00 TU     15.0
    4  4.00 WE     16.0
    
    > dcast(data, riid~day, value.var = "hour")
    
      riid FR TH TU WE
    1    1 NA 12 NA NA
    2    2 15 NA NA NA
    3    3 NA NA 15 NA
    4    4 NA NA NA 16
    

    Further, if you want to remove the NAs:

    > z <- dcast(data, riid~day, value.var = "hour")
    > z[is.na(z)] <- ""
    > z
      riid FR TH TU WE
    1    1    12      
    2    2 15         
    3    3       15   
    4    4          16
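The console output above does not show how `data` was built, so here is a small reproducible sketch of the same steps. It assumes an in-memory data frame with the abbreviated day codes shown in the answer; it does not apply directly to a lazy database tbl, which would have to be brought into memory first.

```r
library(reshape2)

# Local sample data mirroring the printed tibble in the answer.
data <- data.frame(
  riid = c(1, 2, 3, 4),
  day  = c("TH", "FR", "TU", "WE"),
  hour = c(12, 15, 15, 16),
  stringsAsFactors = FALSE
)

# Long to wide: one row per riid, one column per day, filled with hour.
z <- dcast(data, riid ~ day, value.var = "hour")

# Optional: blank out the NAs for display. This turns the columns into character.
z_display <- z
z_display[is.na(z_display)] <- ""
print(z_display)
```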
    

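    【Comments】:

    • This throws an error: Error: value.var (hour) not found in input
    • Did you load reshape2 with library("reshape2")?
    • Yes, it is already among my loaded packages. I am just wondering whether using a tibble makes a difference. This is what I currently see when I print it: `df2 # Source: lazy query [?? x 3] # Database: postgres 8.0.2 # Groups: riid, day`
    • I am using a tibble, and what I posted is the direct output of my RStudio console. Can you try this: dcast(data, data$riid~data$day, value.var = "hour")
    • Did you put hour in quotes, as in value.var = "hour"?
    【Solution 3】:

    I tried to combine your multi-line attempt into a single statement. Could you try this and let us know the result?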

    library(dplyr)
    
    df %>%
      rowwise() %>%
      mutate(Mon = ifelse(day=='MONDAY', hour[day=='MONDAY'], NA),
             Tue = ifelse(day=='TUESDAY', hour[day=='TUESDAY'], NA),
             Wed = ifelse(day=='WEDNESDAY', hour[day=='WEDNESDAY'], NA),
             Thu = ifelse(day=='THURSDAY', hour[day=='THURSDAY'], NA),
             Fri = ifelse(day=='FRIDAY', hour[day=='FRIDAY'], NA),
             Sat = ifelse(day=='SATURDAY', hour[day=='SATURDAY'], NA),
             Sun = ifelse(day=='SUNDAY', hour[day=='SUNDAY'], NA)) %>%
      select(-day, -hour)
    

    The output is:

       riid Mon     Tue   Wed   Thu   Fri Sat   Sun  
    1  5542 NA       NA    NA    12    NA NA    NA   
    2  5862 NA       NA    NA    NA    15 NA    NA   
    3  5982 NA       15    NA    NA    NA NA    NA   
    4  6022 NA       NA    16    NA    NA NA    NA 
    

    Sample data:

    # A tibble: 4 x 3
       riid day        hour
    * <dbl> <chr>     <int>
    1  5542 THURSDAY     12
    2  5862 FRIDAY       15
    3  5982 TUESDAY      15
    4  6022 WEDNESDAY    16
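The same columns can also be produced without rowwise(), since the comparison is already vectorized. This is a minimal sketch on a local tibble that mirrors the sample data above; using if_else() with NA_integer_ is my substitution for the answer's ifelse() calls, chosen so every new column keeps an integer type.

```r
library(dplyr)

# Local copy of the sample data shown above.
df <- tibble(
  riid = c(5542, 5862, 5982, 6022),
  day  = c("THURSDAY", "FRIDAY", "TUESDAY", "WEDNESDAY"),
  hour = c(12L, 15L, 15L, 16L)
)

# One column per weekday: hour where the day matches, NA otherwise.
wide <- df %>%
  mutate(
    Mon = if_else(day == "MONDAY",    hour, NA_integer_),
    Tue = if_else(day == "TUESDAY",   hour, NA_integer_),
    Wed = if_else(day == "WEDNESDAY", hour, NA_integer_),
    Thu = if_else(day == "THURSDAY",  hour, NA_integer_),
    Fri = if_else(day == "FRIDAY",    hour, NA_integer_),
    Sat = if_else(day == "SATURDAY",  hour, NA_integer_),
    Sun = if_else(day == "SUNDAY",    hour, NA_integer_)
  ) %>%
  select(-day, -hour)

print(wide)
```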
    


    Update: could you try the following approach using data.table?

    library(data.table)
    
    dt <- setDT(df)[, c("Mon","Tue","Wed","Thu","Fri","Sat","Sun") := 
                      list(ifelse(day=='MONDAY', hour[day=='MONDAY'], NA),
                           ifelse(day=='TUESDAY', hour[day=='TUESDAY'], NA),
                           ifelse(day=='WEDNESDAY', hour[day=='WEDNESDAY'], NA),
                           ifelse(day=='THURSDAY', hour[day=='THURSDAY'], NA),
                           ifelse(day=='FRIDAY', hour[day=='FRIDAY'], NA),
                           ifelse(day=='SATURDAY', hour[day=='SATURDAY'], NA),
                           ifelse(day=='SUNDAY', hour[day=='SUNDAY'], NA))][, !c("day","hour"), with=F]
    

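    【Comments】:

    • When I run it as you suggested, I get an error: Error: is.data.frame(data) is not TRUE. I think the fix you provided works well for a tibble. Mine is a tbl object from a src_postgres connection.
    • Could you share the output of dput(your_tbl_object)?
    • I could not use your dput output, so I added another approach using data.table. I do not think it should take any longer to process.
    • I have already tried that, but it did not work :( Here is what I have done so far: df <- tbl(conn,sql( "SELECT ... )")) df2 <- df %>% group_by(riid,day) %>% top_n(n=1,wt=score) %>% filter(row_number()==1) %>% select(riid,day,hour) I am trying to transpose the output of the above statement based on the day column.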