Better way to calculate a new variable from a complex calculation of multiple variables, some NAs: answers

【Question title】: Better way to calculate new variable from complex calculation of multiple variables, some NAs
【Posted】: 2016-02-16 19:05:49
【Question description】:

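I'm trying to find a clean, efficient way to create a new variable from a complex calculation over 5 existing variables. My problem is that one of the variables is a factor and the other 4 contain NAs.

I have a dataset containing several groups of variables structured like this: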

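expenditure_period - factor with 1 = daily, 2 = weekly, 3 = monthly, 4 = yearly
expenditure1 - integer, amount spent per day
expenditure2 - integer, amount spent per week
expenditure3 - integer, amount spent per month
expenditure4 - integer, amount spent per year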

For each row/observation, only one of the 4 integer fields has a value, depending on the value of expenditure_period; the rest are NA.

For example:

   expenditure_period  expenditure1  expenditure2  expenditure3  expenditure4
1             monthly            NA            NA             5            NA
2              weekly            NA             5            NA            NA
3             monthly            NA            NA             2            NA
4             monthly            NA            NA             5            NA
5             monthly            NA            NA            58            NA

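I want to create a new variable containing the standardized monthly expenditure. So if expenditure_period is daily, then expenditure1 * 30. If weekly, then expenditure2 * 4. If monthly, then expenditure3 * 1. If yearly, then expenditure4 / 12.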
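For anyone who wants to try this, here is a minimal sketch that rebuilds the five example rows as a data frame and works out the expected monthly values by hand from the rule above. The order of the factor levels and the object name `expected` are assumptions for illustration; they are not stated in the post.

```r
# Rebuild the five example rows shown above.
# Assumption: the factor levels are ordered daily, weekly, monthly, yearly.
data <- data.frame(
  expenditure_period = factor(
    c("monthly", "weekly", "monthly", "monthly", "monthly"),
    levels = c("daily", "weekly", "monthly", "yearly")
  ),
  expenditure1 = rep(NA_integer_, 5),
  expenditure2 = c(NA, 5L, NA, NA, NA),
  expenditure3 = c(5L, NA, 2L, 5L, 58L),
  expenditure4 = rep(NA_integer_, 5)
)

# Expected result, worked out by hand:
# monthly values stay as they are (x 1), the weekly value is multiplied by 4.
expected <- c(5 * 1, 5 * 4, 2 * 1, 5 * 1, 58 * 1)
```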

The best solution I could come up with is the following mess:

data$expenditure_factor[data$expenditure_period=="daily"] <- 30
data$expenditure_factor[data$expenditure_period=="weekly"] <- 4
data$expenditure_factor[data$expenditure_period=="monthly"] <- 1
data$expenditure_factor[data$expenditure_period=="yearly"] <- 1/12
data$expenditure_month <- apply(data[,c("expenditure1", "expenditure2",
 "expenditure3", "expenditure4", "expenditure_factor")], 1, 
function(x) { sum(x[1:4], na.rm=TRUE) * x[5]} )

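I tried adding expenditure1, 2, 3 and 4 together with the + operator, but because that adds 1 number to 3 NAs, every result was NA. I tried using the sum function with na.rm to create a temporary variable, but that gave the same sum for every row. I tried mutate from the dplyr package, but without success.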
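The two failed attempts described above can be reproduced on a small example. This is only a sketch with made-up values; the likely explanation is that `+` propagates NA, while `sum()` collapses all cells into one total that is then recycled to every row. `rowSums()` sums within each row and accepts `na.rm = TRUE`.

```r
# Two rows in the layout described above (values are made up for illustration)
data <- data.frame(
  expenditure1 = c(NA, 3),
  expenditure2 = c(5, NA),
  expenditure3 = c(NA_real_, NA_real_),
  expenditure4 = c(NA_real_, NA_real_)
)
cols <- paste0("expenditure", 1:4)

# `+` propagates NA, so every row ends up NA
plus_result <- data$expenditure1 + data$expenditure2 +
  data$expenditure3 + data$expenditure4

# sum() returns a single grand total over all cells, not one value per row
sum_result <- sum(data[, cols], na.rm = TRUE)

# rowSums() adds up each row separately and can skip the NAs
row_result <- rowSums(data[, cols], na.rm = TRUE)
```

Note that with `na.rm = TRUE` a row whose four fields are all NA sums to 0 rather than NA, which may or may not be what is wanted.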

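Is there a simpler, more elegant way to do this? I have to do the same thing for about 12 different expenditure categories. Apologies if this has been asked before; I couldn't find a similar thread. If there is one, please point me to it.

I'm using RStudio and R 3.2.3 on Windows 7.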

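【Question comments】:

It would be better if your example were easy to reproduce and you also showed the result you want/expect. Here is some guidance: stackoverflow.com/a/28481250/1191259
Use an apply statement together with switch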
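One possible reading of the apply/switch suggestion, as a sketch only: `apply()` over the rows of a mixed data frame would coerce everything to character, so this version uses `mapply()` instead. The helper name `to_monthly` and the sample values are made up for illustration.

```r
# Sample rows in the layout from the question (values are made up)
data <- data.frame(
  expenditure_period = c("monthly", "weekly", "daily", "yearly"),
  expenditure1 = c(NA, NA, 2, NA),
  expenditure2 = c(NA, 5, NA, NA),
  expenditure3 = c(5, NA, NA, NA),
  expenditure4 = c(NA, NA, NA, 120),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pick the field that matches the period and scale it
to_monthly <- function(period, daily, weekly, monthly, yearly) {
  switch(period,
         daily   = daily * 30,
         weekly  = weekly * 4,
         monthly = monthly,
         yearly  = yearly / 12,
         NA_real_)  # fallback for an unrecognised period
}

# Call the helper once per row; as.character() guards against factor input
data$expenditure_month <- mapply(
  to_monthly,
  as.character(data$expenditure_period),
  data$expenditure1, data$expenditure2,
  data$expenditure3, data$expenditure4,
  USE.NAMES = FALSE
)
```

This is row-by-row and therefore slower than a vectorised lookup on large data, but it keeps the rule for each period in one readable place.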

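Tags: r dataframe dplyr

【Solution 1】:

"Clean, efficient" is a matter of opinion, but the following will be easy to maintain and understand if you haven't looked at the code for a while. It keeps the conversion data in a separate table, does one thing at a time, and can be inspected between steps.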

# conversion table: replaces the bulk of the mess with a lookup that is easy to inspect
expenditure_factor <- data.frame(expenditure_period = c('daily','weekly','monthly','yearly'),
                                 pfactor = c(30,4,1,1/12),
                                 stringsAsFactors = FALSE)

# sum total expenditure (expenditurex) and remove extra columns
data$sumexpenditure <- apply(data[ ,2:5],1,sum,na.rm = TRUE)
data$expenditure1 <- data$expenditure2 <- data$expenditure3 <- data$expenditure4 <- NULL

# add factor from conversion table (note: merge() sorts the rows by expenditure_period)
data <- merge(data,expenditure_factor,by = 'expenditure_period',all.x = TRUE)

# calculate final answer
data$expenditure_month <- data$sumexpenditure * data$pfactor

Or this could be crammed into a one-liner.

Assuming expenditure_period is a character variable:

data$expenditure_period <- as.character(data$expenditure_period)

Then:

# sum total expenditure
data$sumexpenditure <- apply(data[ ,2:5],1,sum,na.rm = TRUE)

# use an index
data$expenditure_factor <- c(30,4,1,1/12)[match(data$expenditure_period,c('daily','weekly','monthly','yearly'))]

# calculate final answer
data$expenditure_month <- data$sumexpenditure * data$expenditure_factor
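The same lookup can also be written with a named vector and wrapped in a function, which may help when the calculation has to be repeated for several expenditure categories. This is a sketch under assumptions: `add_monthly` and `period_factor` are hypothetical names, and the function assumes every category follows the `<prefix>_period`, `<prefix>1` ... `<prefix>4` naming used in the question.

```r
# Monthly multiplier for each period, looked up by name
period_factor <- c(daily = 30, weekly = 4, monthly = 1, yearly = 1/12)

# Hypothetical helper: adds a <prefix>_month column to df
add_monthly <- function(df, prefix) {
  value_cols <- paste0(prefix, 1:4)
  # as.character() matters: indexing with a factor would use its integer codes
  period <- as.character(df[[paste0(prefix, "_period")]])
  df[[paste0(prefix, "_month")]] <-
    unname(rowSums(df[, value_cols], na.rm = TRUE)) * unname(period_factor[period])
  df
}

# Small made-up example
data <- data.frame(
  expenditure_period = factor(c("monthly", "weekly", "yearly")),
  expenditure1 = rep(NA_real_, 3),
  expenditure2 = c(NA, 5, NA),
  expenditure3 = c(5, NA, NA),
  expenditure4 = c(NA, NA, 120)
)
data <- add_monthly(data, "expenditure")
```

For other categories the call would be repeated with a different prefix, for example in a loop over a vector of prefixes, provided their columns follow the same pattern.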

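【Discussion】:

I like both, but something like the last example is what I was after. Thanks! It seems less cumbersome and more readable when repeated a few times. I like that the first one has a reference table for the factor, but since I would have to add it to the base data anyway, depending on the value of each expenditure period, it doesn't really make the code itself any shorter.

【Solution 2】:

OK, this may be a somewhat unorthodox approach, but what if you rename the columns so that they contain the multipliers, reshape the data, and extract the multipliers to calculate the new variable: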

library(dplyr)
library(tidyr)

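# New cols: encode each period's monthly multiplier in the column name
data <- rename(data, expenditure.30 = expenditure1,
               expenditure.4 = expenditure2,
               expenditure.1 = expenditure3,
               `expenditure.1/12` = expenditure4)

# Reshape and calculate new col
# (as.numeric("1/12") would return NA, so the multiplier text is evaluated instead)
data %>% gather(exp_new, exp_val, expenditure.30:`expenditure.1/12`) %>%
        mutate(mont_exp = exp_val * vapply(sub('.*\\.', '', exp_new),
                                           function(s) eval(parse(text = s)),
                                           numeric(1), USE.NAMES = FALSE)) %>%
        na.omit()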
#   expenditure_period       exp_new exp_val mont_exp
#7              weekly expenditure.4       5       20
#11            monthly expenditure.1       5        5
#13            monthly expenditure.1       2        2
#14            monthly expenditure.1       5        5
#15            monthly expenditure.1      58       58

【Discussion】:

Unorthodox but interesting! I like that it makes use of dplyr and tidyr. Thanks very much for your help.