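【Posted】: 2019-01-10 17:58:01
【Question】:
The platform I import my data into R from does not support specifying data types, so all my columns are character. I have an Excel file that specifies which columns are factors, including the relevant labels and levels. I am now trying to write a function that dynamically changes the data types of the columns of my data.frame.
Thanks to the excellent answer to this question (dplyr - mutate: use dynamic variable names), I managed to write the following function, in which I dynamically set the column name passed to the mutate function.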
readFactorData <- function(filepath) {
  t <- read.xlsx(filepath)
  sapply(nrow(t), function(i) {
    colname <- as.character(t[i, "Item"])
    factorLevels <- t[i, 3:ncol(t)][which(!is.na(t[i, 3:ncol(t)]))]
    totalLevels <- length(factorLevels)
    listOfLabels <- as.character(unlist(factorLevels))
    mutate(d, !!colname := factor(d[[colname]], labels=(1:totalLevels), levels=listOfLabels))
    # requires dplyr v.0.7+
    # the syntax `!!variablename:=` forces evaluation of the variablename before evaluating the rest of the function
  })
}
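It works: each iteration returns the whole data frame with the relevant column (colname) converted to a factor. However, each iteration overwrites the previous one, so the function only returns the result for the last i. How can I make sure I end up with a single data frame in which all the relevant columns have been converted?
Sample data (make sure to comment out the first line of the function above, since we define t here):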
d <- data.frame("id" = sample(100:999, 10), "age" = sample(18:80, 10), "factor1" = c(rep("a", 3), rep("b", 3), rep("c", 4)), "factor2" = c("x","y","y","y","y","x","x","x","x","y"), stringsAsFactors = FALSE)
t <- data.frame("Item" = c("factor1","factor2"), "Label" = c("This is factor 1", "This is factor 2"), "level1" = c("a","x"), "level2" = c("b","y"), "level3" = c("c","NA"))
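For reference, here is a minimal sketch of one way the accumulation could work. It is an illustration under assumptions, not a definitive answer. Two things stand out in the function above: `sapply(nrow(t), ...)` visits only a single index, because `nrow(t)` is one number rather than a sequence (`seq_len(nrow(t))` would visit every row), and the result of `mutate()` is never assigned back to `d`, so no change carries over from one iteration to the next. The sketch uses a plain `for` loop with base R assignment; the same pattern should also work with `d <- mutate(d, !!colname := ...)`. The name `applyFactorSpec`, the argument `spec`, and the use of a real `NA` (rather than the string "NA") in `level3` are assumptions made for this example.

```r
# Hypothetical helper: converts every column listed in `spec` to a factor
# and returns one data frame containing all the converted columns.
applyFactorSpec <- function(d, spec) {
  for (i in seq_len(nrow(spec))) {            # visit every row, not just the last
    colname <- as.character(spec[i, "Item"])
    lv <- as.character(unlist(spec[i, 3:ncol(spec)]))
    lv <- lv[!is.na(lv)]                      # drop unused level slots
    # Assign back into d so the change survives into the next iteration.
    # labels = seq_along(lv) mirrors the labels = 1:totalLevels of the question.
    d[[colname]] <- factor(d[[colname]], levels = lv, labels = seq_along(lv))
  }
  d
}

# Sample data modelled on the question (a real NA in level3 is an assumption)
set.seed(1)
d <- data.frame(id = sample(100:999, 10), age = sample(18:80, 10),
                factor1 = c(rep("a", 3), rep("b", 3), rep("c", 4)),
                factor2 = c("x","y","y","y","y","x","x","x","x","y"),
                stringsAsFactors = FALSE)
spec <- data.frame(Item = c("factor1", "factor2"),
                   Label = c("This is factor 1", "This is factor 2"),
                   level1 = c("a", "x"), level2 = c("b", "y"),
                   level3 = c("c", NA), stringsAsFactors = FALSE)

result <- applyFactorSpec(d, spec)
str(result)
```

Because `d` is passed in as an argument and returned at the end, the function no longer depends on a global `d`, and the caller decides whether to overwrite the original data frame.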
【Comments】: