R - Split quarterly data into monthly data using R

【Question title】: R - Split quarterly data into monthly data using R
【Posted】: 2018-11-27 09:31:53
【Question description】:

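Please see the sample data below.

I want to convert quarterly sales data (with a start date and an end date) into monthly sales data.

For example:

Dataset A, row 1 will be split into Dataset B, rows 1, 2 and 3 for June, July and August, with sales pro-rated by the number of days in each month; all other columns stay the same;
Dataset A, row 2 will pick up what is left of row 1 (which ends on 5 September 2017) and form a complete September.

Is there an efficient way to do this? The actual data is a csv file of about 100K x 15, which will be split into a new dataset of roughly 300K x 15 for monthly analysis.

Some key characteristics of the sample data:

The start date of a customer's first quarterly record is the date the customer joined, so it can be any day;
All sales are calculated per quarter, and a quarter can cover 90, 91 or 92 days, but a quarterly record can also be incomplete because the customer left during that quarter.

Sample question data: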

  Customer.ID Country       Type Sale Start.Date  End.Date Days
1           1      US Commercial   91  7/06/2017 5/09/2017   91
2           1      US Commercial   92  6/09/2017 6/12/2017   92
3           2      US     Casual   25 10/07/2017 3/08/2017   25
4           3      UK Commercial   64  7/06/2017 9/08/2017   64

Sample answer:

   Customer.ID Country       Type Sale Start.Date   End.Date Days
1           1      US Commercial   24  7/06/2017 30/06/2017   24
2           1      US Commercial   31  1/07/2017 31/07/2017   31
3           1      US Commercial   31  1/08/2017 31/08/2017   31
4           1      US Commercial   30  1/09/2017 30/09/2017   30
5           1      US Commercial   31  1/10/2017 31/10/2017   31
6           1      US Commercial   30  1/11/2017 30/11/2017   30
7           1      US Commercial    6  1/12/2017  6/12/2017    6
8           2      US     Casual   22 10/07/2017 31/07/2017   22
9           2      US     Casual    3  1/08/2017  3/08/2017    3
10          3      UK Commercial   24  7/06/2017 30/06/2017   24
11          3      UK Commercial   31  1/07/2017 31/07/2017   31
12          3      UK Commercial    9  1/08/2017  9/08/2017    9
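
Note that row 4 of the sample answer is a full September: the last 5 days of the first quarter and the first 25 days of the second quarter have been merged into one row. A minimal sketch of that merge step in base R, assuming the quarters have already been split at month boundaries (the `pieces` data frame and its column names are hypothetical, made up for illustration):

```r
# Two September pieces for the same customer, one from each quarter
# (toy data; column names are assumptions, not taken from the real file)
pieces <- data.frame(
  customer = c(1, 1),
  start    = as.Date(c("2017-09-01", "2017-09-06")),
  end      = as.Date(c("2017-09-05", "2017-09-30")),
  days     = c(5, 25),
  sale     = c(5, 25)
)

# Calendar month each piece falls in, e.g. "2017-09"
pieces$month <- format(pieces$start, "%Y-%m")

# Sum sales and days per customer and calendar month
merged <- aggregate(cbind(sale, days) ~ customer + month,
                    data = pieces, FUN = sum)
merged
```

Whether this merge is wanted depends on the analysis; without it, a month that straddles two quarters simply appears as two rows.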

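【Question discussion】:

Welcome to StackOverflow! Please read about how to ask a good question and how to provide a reproducible example. That will make it much easier for others to help you.
Take a look here: stackoverflow.com/questions/25062408/…
Thanks CIAndrews, but I don't think this is the same question. I did search the web, including Stack Overflow, and the only answer that came close was done in Excel VBA, which always ends up freezing given the actual data size.
In your sample answer, is the expansion done by country and by type? And is it correct that the customer ID should be at most 3?
Hi, the customer ID is unique across all countries and types, but each customer ID can have multiple rows of data, since the data is quarterly. The customer ID is not capped at 3; there are in fact 100s of thousands of distinct customer IDs in the actual data. Country and Type can both be carried along in the expansion; they are in the data mainly for the analysis stage.

Tags: r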

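【Solution 1】:

I just ran CIAndrews' code. It seems to work for the most part, but it is very slow on a dataset with 10,000 rows; after waiting a few minutes I cancelled the run. There is also a problem with the day counts: for example, July has 31 days, but the days variable shows only 30. 31 - 1 = 30 is correct arithmetic, but the first day has to be counted as well.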
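
The off-by-one is easy to reproduce. A minimal illustration (the dates are picked for the example):

```r
first <- as.Date("2017-07-01")
last  <- as.Date("2017-07-31")

# difftime() measures the gap between the two dates
gap <- as.numeric(difftime(last, first, units = "days"))

# counting both the first and the last day needs + 1
inclusive <- gap + 1

gap        # 30
inclusive  # 31
```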

The code below takes only about 21 seconds on my 2015 MacBook Pro (not counting data generation) and also fixes the other issues.

library(tidyverse)
library(lubridate)


# generate data -------------------------------------------------------------

set.seed(666)

# assign variables
customer <- sample.int(n = 2000, size = 10000, replace = TRUE)
country <- sample(c("US", "UK", "DE", "FR", "IS"), 10000, replace = TRUE)
type <- sample(c("commercial", "casual", "other"), 10000, replace = TRUE)
start <- sample(seq(dmy("7/06/2011"), today(), by = "day"), 10000, replace = TRUE)
days <- sample(85:105, 10000, replace = TRUE)
end <- start + days
sale <- sample(500:3000, 10000, replace = T)

# generate dataframe of artificial data
df_quarterly <- tibble(customer, country, type, sale, start, end, days)



# split quarters into months ----------------------------------------------

# initialize empty list with length == nrow(dataframe)
list_date_dfs <- vector(mode = "list", length = nrow(df_quarterly))

# for-loop generates new dates and adds as dataframe to list
for (i in seq_along(list_date_dfs)) {

    # transfer dataframe row to variable `row`
    row <- df_quarterly[i,]

    # use lubridate to compute first and last days of relevant months;
    # both endpoints are floored to the 1st before calling seq(), because
    # stepping by month from a day such as the 31st rolls over short months
    m_start <- seq(floor_date(row$start, unit = "month"),
                   floor_date(row$end, unit = "month"),
                   by = "month")
    m_end <- m_start + days_in_month(m_start) - 1

    # replace first and last elements with original dates
    m_start[1] <- row$start
    m_end[length(m_end)] <- row$end

    # compute the number of days per month as well as sales per month
    # correct difference by adding 1
    m_days <- as.integer(m_end - m_start) + 1
    m_sale <- (row$sale / sum(m_days)) * m_days

    # add tibble to list
    list_date_dfs[[i]] <- tibble(customer = row$customer,
                                 country = row$country,
                                 type = row$type,
                                 sale = m_sale,
                                 start = m_start,
                                 end = m_end,
                                 days = m_days
    )
}

# bind dataframe list elements into single dataframe
df_monthly <- bind_rows(list_date_dfs)
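
For larger inputs the per-row loop can be avoided altogether. The following is a sketch of a vectorised alternative in base R, not the method used in this answer. It assumes a data frame with `Date` columns `start` and `end` (both inclusive, `end >= start`) and a numeric `sale` column; the helper name `split_by_month` is made up for illustration.

```r
# Vectorised month split (sketch; column names are assumptions)
split_by_month <- function(df) {
  # months counted from year 0, so consecutive months differ by 1
  ym0 <- as.integer(format(df$start, "%Y")) * 12L +
    as.integer(format(df$start, "%m")) - 1L
  ym1 <- as.integer(format(df$end, "%Y")) * 12L +
    as.integer(format(df$end, "%m")) - 1L
  n_months <- ym1 - ym0 + 1L

  # repeat each row once per calendar month it touches
  idx <- rep(seq_len(nrow(df)), times = n_months)
  ym  <- ym0[idx] + sequence(n_months) - 1L

  # first day of each month and of the month after it
  month_first <- as.Date(sprintf("%04d-%02d-01", ym %/% 12L, ym %% 12L + 1L))
  next_first  <- as.Date(sprintf("%04d-%02d-01",
                                 (ym + 1L) %/% 12L, (ym + 1L) %% 12L + 1L))

  out <- df[idx, , drop = FALSE]
  # clip each month to the original interval
  out$start <- as.Date(pmax(as.numeric(month_first), as.numeric(df$start[idx])),
                       origin = "1970-01-01")
  out$end   <- as.Date(pmin(as.numeric(next_first) - 1, as.numeric(df$end[idx])),
                       origin = "1970-01-01")
  out$days  <- as.integer(out$end - out$start) + 1L

  # pro-rate sales by inclusive day count
  total_days <- as.integer(df$end - df$start) + 1L
  out$sale <- df$sale[idx] * out$days / total_days[idx]
  rownames(out) <- NULL
  out
}

# Usage with the first row of the sample question data
quarters <- data.frame(customer = 1, sale = 91,
                       start = as.Date("2017-06-07"),
                       end   = as.Date("2017-09-05"))
monthly <- split_by_month(quarters)
monthly
```

Because every step works on whole columns, the cost grows with the number of output rows rather than with one data frame construction per input row.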

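【Discussion】:

Thanks @gersht, this is very clean and fast.

【Solution 2】:

It isn't pretty, since it uses several functions and loops, because the task consists of several operations: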

# Creating the dataset
library(tidyr)
customer <- c(1,1,2,3)
country <- c("US","US","US","UK")
type <- c("Commercial","Commercial","Casual","Commercial")
sale <- c(91,92,25,64)
Start <- as.Date(c("7/06/2017","6/09/2017","10/07/2017","7/06/2017"),"%d/%m/%Y")
Finish <- as.Date(c("5/09/2017","6/12/2017","3/08/2017","9/08/2017"),"%d/%m/%Y")
days <- c(91,92,25,64)
df <- data.frame(customer,country, type,sale, Start,Finish,days)

# Function to split per month
library(zoo)
addrowFun <- function(y){
    temp <- do.call("rbind", by(y, 1:nrow(y), function(x) with(x, {
    eom <- as.Date(as.yearmon(Start), frac = 1)
    if (eom < Finish)
       data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
    else x
    })))
    return(temp)
 }
loop <- df
for(i in 1:10){ #not all months are split up at once
   loop <- addrowFun(loop)
}
# Calculating the days per month (Start and Finish are both counted, hence + 1)
loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days")) + 1

# Creating the function to get the monthly sales pro rata
sumFun <- function(x){
   tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
   totalSale <- sum(tempSum$sale)
   totalDays <- sum(tempSum$days)
   return(x$days / totalDays * totalSale)
 }

for(i in 1:length(loop$customer)){
   loop$sale[i] <- sumFun(loop[i,])
}  

loop

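【Discussion】:

Thanks mate, that's a detailed answer. I'll take a look and get back to you soon.
Hi, your code reproduces the sample answer very quickly. However, when I apply it to an extract of the actual file, it returns an error in the first loop. Could you please help check what went wrong? I have attached the code.
Hi, if several variables need to be pro-rated, where should I modify the original code to make it work? For example, instead of only sales, suppose cost, revenue, usage, etc. all need to be pro-rated.
I can get around this by adding more "sumFun"s for the different variables, but is there a way to simplify the process?
Hi, could you please help explain the following code: by(y, 1:nrow(y), function(x) with(x, { eom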
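
On the question of pro-rating several variables: one option, sketched here under the assumption that each monthly row already carries its own day count and the total days of its quarter, is to compute the weight once and apply it to every value column in one step. The data frame and the column names `cost`, `usage` and `total_days` are invented for illustration.

```r
# Monthly pieces of one quarter; value columns still hold quarterly totals
# (toy data, hypothetical column names)
pieces <- data.frame(
  days       = c(24, 31, 31, 5),
  total_days = 91,
  sale       = 91,
  cost       = 182,
  usage      = 45.5
)

# Columns to pro-rate
value_cols <- c("sale", "cost", "usage")

# Share of the quarter that falls in each month
weight <- pieces$days / pieces$total_days

# Apply the same weight to every value column
pieces[value_cols] <- lapply(pieces[value_cols], function(col) col * weight)
pieces
```

Adding another variable then only means adding its name to `value_cols`, rather than writing another copy of `sumFun`.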

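【Solution 3】:

CIAndrews,

Thank you for your help and patience. I managed to get the answer with one small change: I replaced "rbind" with "rbind.fill" from the "plyr" package, and after that everything ran smoothly.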
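
A likely explanation, inferred from the code rather than confirmed by the author: when a row is split, `addrowFun` builds a new data frame with only five columns, while a row that is not split keeps all seven, and base `rbind` refuses to stack data frames whose columns differ. `plyr::rbind.fill` fills the missing columns with `NA` instead. A small base R illustration of the failure (toy data frames, not the real file):

```r
# A row that was split (fewer columns) and a row that was left alone
split_row   <- data.frame(customer = 1, Start = as.Date("2017-06-07"))
unsplit_row <- data.frame(customer = 2, Start = as.Date("2017-07-10"),
                          sale = 25, days = 25)

# rbind() on data frames with different columns raises an error;
# tryCatch() captures the message instead of stopping the script
result <- tryCatch(rbind(split_row, unsplit_row),
                   error = function(e) conditionMessage(e))
is.character(result)  # TRUE when rbind() failed
```

In the small sample data every row spans more than one month, so every row is split on the first pass and the mismatch never shows up.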

Please see the head of sample2.csv below:

     customer country       type sale      Start     Finish days
1 43108181108      US Commercial 3330 17/11/2016 24/02/2017   99
2 43108181108      US Commercial 2753 24/02/2017 23/05/2017   88
3 43108181108      US Commercial 3043 13/02/2018 18/05/2018   94
4 43108181108      US Commercial 4261 23/05/2017 18/08/2017   87
5 43103703637      UK     Casual  881  4/11/2016 15/02/2017  103
6 43103703637      UK     Casual 1172 26/07/2018  1/11/2018   98

Please see the code below:

library(tidyr)

# read data and convert Start and Finish to Date type

data <- read.csv("Sample2.csv")
data$Start <- as.Date(data$Start, "%d/%m/%Y")
data$Finish <- as.Date(data$Finish, "%d/%m/%Y")
customer <- data$customer
country <- data$country
days <- data$days
Finish <- data$Finish
Start <- data$Start
sale <- data$sale
type <- data$type
df <- data.frame(customer, country, type, sale, Start, Finish, days)

# Function to split per month

library(zoo)
library(plyr)
addrowFun <- function(y){
    temp <- do.call("rbind.fill", by(y, 1:nrow(y), function(x) with(x, {
        eom <- as.Date(as.yearmon(Start), frac = 1)
        if (eom < Finish)
            data.frame(customer, country, type, Start = c(Start, eom+1), Finish = c(eom, Finish))
        else x
    })))
    return(temp)
}
loop <- df
for(i in 1:10){ #not all months are split up at once
    loop <- addrowFun(loop)
}

# Calculating the days per month

loop$days <- as.numeric(difftime(loop$Finish,loop$Start, units="days"))

# Creating the function to get the monthly sales pro rata

sumFun <- function(x){
    tempSum <- df[x$Start >= df$Start & x$Finish <= df$Finish & df$customer == x$customer,]
    totalSale <- sum(tempSum$sale)
    totalDays <- sum(tempSum$days)
    return(x$days / totalDays * totalSale)
}

for(i in 1:length(loop$customer)){
    loop$sale[i] <- sumFun(loop[i,])
}  

loop

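【Discussion】:

The sample code uses Start and Finish, while your sample data has End instead of Finish. The code runs once every Finish is changed to End, or vice versa.
Hi, I changed every End in the code to Finish, but it still reports the same error. The error only occurs in the first loop; maybe for(i in 1:10) does not suit the actual sample.
Are you sure your data has only these columns and no more?
Hi, there are 7 columns in total: "customer", "country", "type", "sale", "start", "finish" and "days", but the sample csv I generated has an extra index column first; the sample size is 300 rows.
Could you please clarify what 1:10 is for in the first loop?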
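
On the purpose of `1:10`: judging from the code and its comment ("not all months are split up at once"), each call to `addrowFun` cuts a row at its first month end only, so an interval covering four calendar months needs three passes, and ten passes is simply a generous upper bound. A sketch of the same idea in base R that repeats until nothing changes instead of a fixed ten times (`end_of_month` and `split_once` are hypothetical helpers written for this illustration, not part of the answers above):

```r
# Last day of the month containing a single date d
end_of_month <- function(d) {
  first <- as.Date(format(d, "%Y-%m-01"))
  # first day of the next month, minus one day
  seq(first, by = "month", length.out = 2)[2] - 1
}

# Split every row at most once, at the end of its starting month
split_once <- function(df) {
  parts <- lapply(seq_len(nrow(df)), function(i) {
    row <- df[i, ]
    eom <- end_of_month(row$Start)
    if (eom < row$Finish) {
      data.frame(customer = row$customer,
                 Start    = c(row$Start, eom + 1),
                 Finish   = c(eom, row$Finish))
    } else {
      row
    }
  })
  do.call(rbind, parts)
}

df <- data.frame(customer = 1,
                 Start    = as.Date("2017-06-07"),
                 Finish   = as.Date("2017-09-05"))

# Keep splitting until a pass adds no new rows
pieces <- df
passes <- 0
repeat {
  nxt <- split_once(pieces)
  if (nrow(nxt) == nrow(pieces)) break
  pieces <- nxt
  passes <- passes + 1
}
pieces
passes
```

With a fixed `1:10`, any interval that touches more than eleven calendar months would be left partly unsplit, while short intervals pay for passes that do nothing.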