[Question title]: Delete rows based on a match between two different columns in a data frame
[Posted]: 2020-12-02 12:00:00
[Question]:

I have a data frame containing daily revenue from several channels. It looks like this:

orders_dataframe:

    Order |Channel | Revenue |
    1     |TV      | 120     |
    2     |Email   | 30      |
    3     |Retail  | 300     |
    4     |Shop1   | 50      |
    5     |Shop2   | 90      |
    6     |Email   | 20      |
    7     |Retail  | 250     |

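What I want to do is split the revenue from Retail between Shop1 and Shop2 according to a predefined ratio (for example, a 60%/40% split). That is, for every row whose revenue comes from "Retail", 60% should be attributed to Shop1 and 40% to Shop2. This can be done by replacing each Retail row with two new rows, as shown for Order 3 and Order 7 in the final table I would like to end up with: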

orders_dataframe:  

    Order |Channel | Revenue |
    1     |TV      | 120     |
    2     |Email   | 30      |
    3     |Shop1   | 180     |
    3     |Shop2   | 120     |
    4     |Shop1   | 50      |
    5     |Shop2   | 90      |
    6     |Email   | 20      |
    7     |Shop1   | 150     |
    7     |Shop2   | 100     |

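Ideally, since I do this with various datasets, I would like to take the percentages from a data frame (split_dataframe) rather than assigning the numbers 60% and 40% by hand. I would like to use data from a dataset like this: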

split_dataframe:
    Channel |Percent  |
    Shop1   |60%      | 
    Shop2   |40%      | 

Here is a reproducible example of the two data frames:

orders_dataframe <- data.frame(Order = c(1,2,3,4,5,6,7),
                              Channel = c("TV", "Email", "Retail", "Shop1", "Shop2", "Email", "Retail"), 
                              Revenue = c(120,30,300,50,90,20,250))

split_dataframe <- data.frame(Channel = c("Shop1", "Shop2"),
                              Percent = c(0.6, 0.4))

Thanks a lot!

[Comments]:

    Tags: r dataframe filter match aggregate


    [Answer 1]:

    You can do this in base R.

    orders_dataframe <- data.frame(Order = c(1,2,3,4,5,6,7),
                                   Channel = c("TV", "Email", "Retail", "Shop1", "Shop2", "Email", "Retail"), 
                                   Revenue = c(120,30,300,50,90,20,250))
    
    # Coerce the channel factor to a string.
    # Do you really want this as a factor?
    orders_dataframe$Channel <- as.character(orders_dataframe$Channel)
    
    # Create a vector of the replacement values.
    # The prob = c() argument lets you pick the
    # probabilities of each replacement.
    # Note: this assigns each Retail row, as a whole, to one shop at
    # random; it does not divide the row's revenue between the shops.
    replacement <- sample(x = c("Shop1","Shop2"),
                          size = length(which(orders_dataframe$Channel == "Retail")),
                          replace = TRUE, prob = c(0.6, 0.4))
    
    # Replace the Channel column with the replacement vector.
    orders_dataframe$Channel[which(orders_dataframe$Channel == "Retail")] <- replacement
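
The snippet above reassigns each Retail row to a single shop at random, whereas the question asks for each Retail row to be replaced by one row per shop with the revenue scaled by the percentage. A minimal base R sketch of that deterministic split follows; it assumes the column names from the question, and the names `is_retail`, `retail`, `retail_split`, and `result_base` are made up for illustration.

```r
orders_dataframe <- data.frame(
  Order   = c(1, 2, 3, 4, 5, 6, 7),
  Channel = c("TV", "Email", "Retail", "Shop1", "Shop2", "Email", "Retail"),
  Revenue = c(120, 30, 300, 50, 90, 20, 250),
  stringsAsFactors = FALSE
)
split_dataframe <- data.frame(
  Channel = c("Shop1", "Shop2"),
  Percent = c(0.6, 0.4),
  stringsAsFactors = FALSE
)

# Separate the rows that need splitting; drop their Channel column because
# the shop name will come from split_dataframe instead.
is_retail <- orders_dataframe$Channel == "Retail"
retail    <- orders_dataframe[is_retail, c("Order", "Revenue")]

# merge() with by = NULL returns the Cartesian product, so every Retail
# row is paired with every row of split_dataframe.
retail_split <- merge(retail, split_dataframe, by = NULL)
retail_split$Revenue <- retail_split$Revenue * retail_split$Percent

# Put the untouched rows and the split rows back together, sorted by order.
result_base <- rbind(
  orders_dataframe[!is_retail, ],
  retail_split[, c("Order", "Channel", "Revenue")]
)
result_base <- result_base[order(result_base$Order, result_base$Channel), ]
rownames(result_base) <- NULL
```

Because the shops and shares are read from `split_dataframe`, the same code should work for a different number of shops, as long as the percentages add up to 1.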
    

    [Comments]:

      [Answer 2]:

      Using dplyr:

      library(dplyr)

      split_dataframe  %>% 
      mutate(Index="Retail") %>%
      merge(.,orders_dataframe,by.x="Index",by.y="Channel") %>%
      mutate(Revenue=Revenue*Percent) %>%
      select(Order,Channel,Revenue) %>%
      bind_rows(orders_dataframe %>% filter(Channel !="Retail"),.)%>%
      arrange(.,Order)
      

      which gives:

        Order Channel Revenue
      1     1      TV     120
      2     2   Email      30
      3     3   Shop1     180
      4     3   Shop2     120
      5     4   Shop1      50
      6     5   Shop2      90
      7     6   Email      20
      8     7   Shop1     150
      9     7   Shop2     100
      

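      [Comments]:

      • Hello! Thank you very much for the quick answer! In my example I only included 3 columns. My real dataset, however, contains more than these 3 columns; most importantly, one of them is order_date (the date the order was processed). With the approach above, the information in that column is unfortunately lost. Is there a way to keep order_date in the new split?
      • You can add it in the select step, like select(order_date,Order,Channel,Revenue)
      • Thanks again for the quick reply! That was very helpful! I marked your reply as the answer because I really like the short code! I'm not sure whether I can ask this here or whether I should edit my question and add it as a follow-up, but one more thing: if I have a different channel split for each day, is it easy to adapt the code? So "order_dataframe" has an order_date column and "split_dataframe" has a "date" column, and I want to multiply by the ratio that matches the corresponding date?
      • You can do that by applying group_by, merge, join, and so on. I think you should put some effort into this yourself; the answer above gives you the idea. My advice: try running the code above step by step (I mean %>% by %>%, to see each step) and apply your adaptations.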
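
The per-day split raised in the comments could be approached with a join on the date. The following is only a sketch under assumptions: the data frames `orders` and `daily_split`, their dates, and the 50/50 split on the second day are invented for illustration, and the column names `order_date` and `date` are taken from the comment.

```r
library(dplyr)

# Hypothetical data: orders with a processing date ...
orders <- data.frame(
  Order      = c(1, 2, 3, 4),
  order_date = as.Date(c("2020-12-01", "2020-12-01", "2020-12-02", "2020-12-02")),
  Channel    = c("TV", "Retail", "Retail", "Shop1"),
  Revenue    = c(120, 300, 250, 50)
)
# ... and one set of shares per day (made-up values).
daily_split <- data.frame(
  date    = as.Date(c("2020-12-01", "2020-12-01", "2020-12-02", "2020-12-02")),
  Channel = c("Shop1", "Shop2", "Shop1", "Shop2"),
  Percent = c(0.6, 0.4, 0.5, 0.5)
)

# Each Retail row is matched to the shares of its own date, which yields
# one row per shop; order_date is kept by listing it in select().
retail_split <- orders %>%
  filter(Channel == "Retail") %>%
  select(-Channel) %>%
  inner_join(daily_split, by = c("order_date" = "date")) %>%
  mutate(Revenue = Revenue * Percent) %>%
  select(Order, order_date, Channel, Revenue)

# Recombine with the rows that were not split.
result_daily <- orders %>%
  filter(Channel != "Retail") %>%
  bind_rows(retail_split) %>%
  arrange(Order, Channel)
```

An inner join drops Retail orders whose date has no entry in `daily_split`, so it is worth checking that every order date is covered. With several Retail orders on the same date, recent dplyr versions may also warn about a many-to-many relationship, which is expected for this kind of split.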
      [Answer 3]:

      Here is a data.table approach; see the comments in the code for an explanation.

      library( data.table )
      # convert both data frames to data.tables (by reference)
      setDT( orders_dataframe ); setDT( split_dataframe )
      # split into retail and non-retail orders
      orders_retail    <- orders_dataframe[ Channel == "Retail", ]
      orders_no_retail <- orders_dataframe[ Channel != "Retail", ]
      # divide the retail orders over the two shops (multiple steps)
      # create a new column for each shop
      shop_cols <- split_dataframe$Channel
      orders_retail[, (shop_cols) := Revenue ]
      #melt to long format
      orders_retail.melt <- melt( orders_retail, 
                                  id.vars = "Order", 
                                  measure.vars = (shop_cols),
                                  variable.name = "Channel",
                                  value.name = "Revenue")
      #and update the molten data with the percentages in the split_dataframe
      orders_retail.melt[ split_dataframe, 
                          Revenue := Revenue * i.Percent,
                          on = .( Channel )]
      #merge everything back together and order on Order id
      ans <- rbind( orders_no_retail, orders_retail.melt )
      setorder( ans, Order )
      #    Order Channel Revenue
      # 1:     1      TV     120
      # 2:     2   Email      30
      # 3:     3   Shop1     180
      # 4:     3   Shop2     120
      # 5:     4   Shop1      50
      # 6:     5   Shop2      90
      # 7:     6   Email      20
      # 8:     7   Shop1     150
      # 9:     7   Shop2     100
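
Whichever approach is used, the result can be sanity-checked: the shares should add up to 1, no Retail rows should remain, and revenue should be unchanged both in total and per order. A small base R sketch follows; `check_split` is a hypothetical helper written for this illustration and is not part of any package.

```r
# Hypothetical helper: stops with an error if the split changed any totals.
check_split <- function(original, result, split, from = "Retail") {
  stopifnot(
    # The shares must add up to 100%.
    isTRUE(all.equal(sum(split$Percent), 1)),
    # No row of the replaced channel may remain.
    !any(result$Channel == from),
    # Total revenue must be unchanged overall ...
    isTRUE(all.equal(sum(result$Revenue), sum(original$Revenue))),
    # ... and within every order.
    isTRUE(all.equal(
      as.vector(tapply(result$Revenue, result$Order, sum)),
      as.vector(tapply(original$Revenue, original$Order, sum))
    ))
  )
  invisible(TRUE)
}

# Input data from the question.
original <- data.frame(
  Order   = c(1, 2, 3, 4, 5, 6, 7),
  Channel = c("TV", "Email", "Retail", "Shop1", "Shop2", "Email", "Retail"),
  Revenue = c(120, 30, 300, 50, 90, 20, 250)
)
split <- data.frame(Channel = c("Shop1", "Shop2"), Percent = c(0.6, 0.4))

# The desired output from the question.
result <- data.frame(
  Order   = c(1, 2, 3, 3, 4, 5, 6, 7, 7),
  Channel = c("TV", "Email", "Shop1", "Shop2", "Shop1", "Shop2", "Email", "Shop1", "Shop2"),
  Revenue = c(120, 30, 180, 120, 50, 90, 20, 150, 100)
)

ok <- check_split(original, result, split)
```

The comparisons use `all.equal()` rather than `==` because multiplying by fractions such as 0.6 can introduce small floating-point differences.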
      

      [Comments]:
