Using R to parse out SurveyMonkey CSV files: answers

【Question title】: Using R to parse out Surveymonkey csv files
【Posted】: 2011-12-14 11:16:54
【Question description】:

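I'm trying to analyze a large survey created with SurveyMonkey. The CSV export has hundreds of columns, and because the headers span two rows, the output format is hard to work with.

Has anyone found a simple way to handle the headers in the CSV file so that the analysis is manageable?
How do other people analyze results from SurveyMonkey?

Thanks!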

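【Question discussion】:

Can you post a small example of SurveyMonkey output that demonstrates the problem? I can imagine a solution that uses readLines with n=2 to read (and massage) the headers, and read.csv with skip=2, header=FALSE to get the data...
Next time you run a survey, use LimeSurvey (limesurvey.org). It is open source and has an export-to-R tool that works well (disclosure: I wrote the export module).
@Ben, the headers in the file span two rows: the question name/number, with the sub-questions written out on the row below. Overall, it is always a headache to deal with.
@Sean, within my organization I usually pull the *.sav file (you need a paid account) because the CSV is so hard to work with. The SPSS files can be a bit flaky, but they are not too bad to clean up (@Andrie, working on that too :)).
@Ben, while trying to create a small example, I found that the second line of the SurveyMonkey CSV file appears to start with a null character, and R ignores this line when I use read.csv() or readLines(). LibreOffice can read the line! This drove me crazy for a while! Any suggestions?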
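
The approach suggested in the comments above (read the two header lines separately from the data) can be sketched as follows. This is a minimal sketch, not a tested recipe for real SurveyMonkey exports: the file contents are made up, and it assumes the stray null character can simply be dropped using the `skipNul = TRUE` argument of `readLines()`.

```r
# Build a small stand-in for a SurveyMonkey export (made-up content):
# two header lines, the second starting with an embedded NUL byte.
path <- tempfile(fileext = ".csv")
con <- file(path, "wb")
writeBin(charToRaw("RespondentID,Contact,,Opinion\n"), con)
writeBin(as.raw(0), con)
writeBin(charToRaw(",Name,Company,Response\n"), con)
writeBin(charToRaw("1,Ann,Acme,Agree\n2,Bob,Initech,Disagree\n"), con)
close(con)

# Read every line, skipping NUL bytes instead of truncating the line at them.
all_lines <- readLines(path, skipNul = TRUE)
header_lines <- all_lines[1:2]

# Parse the two header lines; blank cells become NA.
header <- read.csv(text = header_lines, header = FALSE,
                   stringsAsFactors = FALSE, na.strings = "")

# Parse the remaining lines as the responses.
responses <- read.csv(text = all_lines[-(1:2)], header = FALSE,
                      stringsAsFactors = FALSE)
```

Reading the lines first and passing them to `read.csv(text = ...)` keeps the null-character handling in one place; `header` can then be massaged into column names for `responses`.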

Tags: r parsing csv surveymonkey

【Solution 1】:

You can export the data from SurveyMonkey in a convenient form suitable for R; see Download Responses under "Advanced Spreadsheet Format".

【Discussion】:

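【Solution 2】:

What I ended up doing was using LibreOffice to print out the headers labeled V1, V2, and so on, and then I just read the file in as

 m1 <- read.csv('Sheet1.csv', header=FALSE, skip=1)

and then ran the analysis on m1$V10, m1$V23, etc.

To sort out the mess of questions spread over multiple columns, I used the following small function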

# function to merge columns into one with a space separator and then
# remove multiple spaces
mcols <- function(df, cols) {
    # e.g. mcols(df, c(14:18))
    exp <- paste('df[,', cols, ']', sep='', collapse=',')
    # this creates something like...
    # "df[,14],df[,15],df[,16],df[,17],df[,18]"
    # now we just want to do a paste of this expression...
    nexp <- paste(" paste(", exp, ", sep=' ')")
    # so now nexp looks something like...
    # " paste( df[,14],df[,15],df[,16],df[,17],df[,18] , sep=' ')"
    # now we just need to parse this text... and eval() it...
    newcol <- eval(parse(text=nexp))
    newcol <- gsub('  *', ' ', newcol) # replace duplicate spaces by a single one
    newcol <- gsub('^ *', '', newcol) # remove leading spaces
    gsub(' *$', '', newcol) # remove trailing spaces
}
# mcols(df, c(14:18))

No doubt someone will be able to clean this up!
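
One possible cleanup, sketched here under the assumption that the goal is exactly what the comments in `mcols` describe (paste the selected columns together with a space, then collapse and trim spaces), is to build the call with `do.call()` instead of `eval(parse())`. The name `mcols2` and the toy data frame are invented for illustration.

```r
# Merge several columns into one string per row, separated by single spaces.
# mcols2 is a hypothetical name; it is meant to mirror mcols() without eval(parse()).
mcols2 <- function(df, cols) {
  # unname() so column names are not mistaken for paste() arguments such as 'sep'
  columns <- unname(as.list(df[, cols, drop = FALSE]))
  merged <- do.call(paste, c(columns, sep = " "))
  merged <- gsub(" +", " ", merged)  # collapse runs of spaces into one
  gsub("^ +| +$", "", merged)        # remove leading and trailing spaces
}

# Toy "select all that apply" block: blank cells mean the option was not ticked.
toy <- data.frame(a = c("Red", "", ""),
                  b = c("", "Green", ""),
                  c = c("Blue", "", ""),
                  stringsAsFactors = FALSE)
merged_toy <- mcols2(toy, 1:3)
```

Passing the columns as a list to `do.call()` avoids building and parsing R code as text, which tends to be easier to debug when a column index is wrong.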

To tidy up the Likert-style scales I used:

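# function to tidy c('Strongly Agree', 'Agree', 'Disagree', 'Strongly Disagree')
# into an ordered factor, treating blank answers as missing
tidylik4 <- function(x) {
  xlevels <- c('Strongly Disagree', 'Disagree', 'Agree', 'Strongly Agree')
  x <- as.character(x) # ifelse() on a factor would return its integer codes
  y <- ifelse(x == '', NA, x)
  ordered(y, levels=xlevels)
}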

for (i in 44:52) {
  m2[,i] <- tidylik4(m2[,i])
}
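
As an illustration of what the tidied columns allow, the sketch below applies an equivalent conversion to a made-up data frame with `lapply()` instead of a loop over column numbers. The data frame `toy`, its column names, and the helper `to_lik4` are invented; the level labels are the ones used above.

```r
# Four-point Likert levels, lowest to highest (same labels as tidylik4 above).
lik4_levels <- c("Strongly Disagree", "Disagree", "Agree", "Strongly Agree")

# Convert one column to an ordered factor; blank answers become NA.
to_lik4 <- function(x) {
  x <- as.character(x)
  x[!is.na(x) & x == ""] <- NA
  factor(x, levels = lik4_levels, ordered = TRUE)
}

# Made-up responses for two questions.
toy <- data.frame(q1 = c("Agree", "", "Strongly Disagree"),
                  q2 = c("Disagree", "Strongly Agree", "Agree"),
                  stringsAsFactors = FALSE)

# Convert all Likert columns in one step, selecting them by name.
lik_cols <- c("q1", "q2")
toy[lik_cols] <- lapply(toy[lik_cols], to_lik4)

# Ordered factors support comparisons against a level label.
agrees_q2 <- toy$q2 >= "Agree"
```

Selecting columns by name rather than by position (such as 44:52) is a design choice that survives columns being added or reordered in a later export.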

Feel free to comment, as this will no doubt come up again!

【Discussion】:

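【Solution 3】:

As of November 2013, the web page layout seems to have changed. Choose Analyze results > Export All > All Responses Data > Original View > XLS+ (Open in advanced statistical and analytical software). Then go to Exports and download the file. You will get the raw data, with the first row = question headers and each following row = one response, possibly split across several files if you have a lot of responses/questions.

【Discussion】: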

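【Solution 4】:

I have to deal with this pretty frequently, and having the headers spread over two rows is a bit of a pain. This function fixes that, so you only have a one-row header to deal with. It also joins up the multiple-choice questions, so you get "top - bottom" style naming.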

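#' Build one row of column names from the two header rows of the export
#' @param x The path to a surveymonkey csv file
#' @return A character vector of column names, one per column
fix_names <- function(x) {
  rs <- read.csv(
    x,
    nrows = 2,
    stringsAsFactors = FALSE,
    header = FALSE,
    check.names = FALSE,
    na.strings = "",
    encoding = "UTF-8"
  )

  rs[rs == ""] <- NA
  rs[rs == "NA"] <- "Not applicable"
  rs[rs == "Response"] <- NA
  rs[rs == "Open-Ended Response"] <- NA

  nms <- rep(NA_character_, ncol(rs))
  pre <- NULL  # top heading carried across a multi-column question

  for (i in seq_len(ncol(rs))) {

    current_top <- rs[1, i]
    current_bottom <- rs[2, i]

    # Look ahead one column; after the last column there is nothing to see
    if (i < ncol(rs)) {
      coming_top <- rs[1, i + 1]
      coming_bottom <- rs[2, i + 1]
    } else {
      coming_top <- NA
      coming_bottom <- NA
    }

    # A heading followed by a blank heading starts a multi-column question
    if (is.na(coming_top) && !is.na(current_top) &&
        (!is.na(current_bottom) || grepl("^Other", coming_bottom)))
      pre <- current_top

    # Any column with a sub-heading is named "heading - sub-heading"
    if (!is.na(current_bottom))
      nms[i] <- paste0(c(pre, current_bottom), collapse = " - ")

    # A heading with no sub-heading is used as-is
    if (!is.na(current_top) && is.na(current_bottom))
      nms[i] <- current_top

  }

  nms
}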

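Note that it only returns the names. I usually just read the .csv with ..., skip = 2, header = FALSE, save it to a variable, and overwrite the variable's names. It also helps to set your na.strings and stringsAsFactors = FALSE.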

nms = fix_names("path/to/csv")
d = read.csv("path/to/csv", skip = 2, header = FALSE)
names(d) = nms

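【Discussion】:

【Solution 5】:

The problem with the headers is that columns from "select all that apply" questions have a blank top row, and the column heading is in the row below. This is only an issue for that type of question.

With that in mind, I wrote a loop that goes through all the columns and, if the column name is blank (fewer than two characters long), replaces it with the value from the second row of the file, which is the first row of the data frame.

Then you can drop that row of the data and get a tidy data frame.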

for(i in 1:ncol(df)){
  newname <- colnames(df)[i]
  if(nchar(newname) < 2){
    # as.character() guards against the header cell being a factor
    colnames(df)[i] <- as.character(df[1,i])
  }
}

df <- df[-1,]
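
The same idea can be written without an explicit loop. The sketch below uses an invented data frame in which the blank column names are already empty or single-space strings; how blank headers actually appear depends on how the file was read (for example, `read.csv()` with its default `check.names = TRUE` would rename them), so treat the blank-name test as an assumption.

```r
# Made-up example: the first data row holds the sub-question labels.
df <- data.frame(c("Response", "Agree", "Disagree"),
                 c("Option A", "x", ""),
                 c("Option B", "", "x"),
                 stringsAsFactors = FALSE)
colnames(df) <- c("Do you like snow?", "", " ")

# Columns whose name is blank (fewer than 2 characters, as in the loop above).
blank <- nchar(colnames(df)) < 2

# Take the replacement names from the first row, then drop that row.
colnames(df)[blank] <- as.character(unlist(df[1, blank]))
df <- df[-1, , drop = FALSE]
rownames(df) <- NULL
```

Columns that already have a name, such as the single-answer question here, keep it; only the blank ones are renamed.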

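【Discussion】:

【Solution 6】:

Late to the party, but this is still an issue, and the best workaround I have found is a function that pastes the column names and sub-column names together based on a repeated value.

For example, if you export to .csv, the repeated column names are automatically replaced with X in RStudio. If you export to .xlsx, the repeated value is "...".

Here is a base R solution: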

sm_header_function <- function(x, rep_val){
  
  orig <- x
  
  sv <- x
  sv <- sv[1,]
  sv <- sv[, sapply(sv, Negate(anyNA)), drop = FALSE]
  sv <- t(sv)
  sv <- cbind(rownames(sv), data.frame(sv, row.names = NULL))
  names(sv)[1] <- "name"
  names(sv)[2] <- "value"
  sv$grp <- with(sv, ave(name, FUN = function(x) cumsum(!startsWith(name, rep_val))))
  sv$new_value <- with(sv, ave(name, grp, FUN = function(x) head(x, 1)))
  sv$new_value <- paste0(sv$new_value, " ", sv$value)
  new_names <- as.character(sv$new_value)
  colnames(orig)[which(colnames(orig) %in% sv$name)] <- sv$new_value
  orig <- orig[-c(1),]
  return(orig)
}

sm_header_function(df, "X")
sm_header_function(df, "...")

For some sample data, the column names change as follows:

Original export from SurveyMonkey:

> colnames(sample)
 [1] "Respondent ID"                                 "Please provide your contact information:"      "...11"                                        
 [4] "...12"                                         "...13"                                         "...14"                                        
 [7] "...15"                                         "...16"                                         "...17"                                        
[10] "...18"                                         "...19"                                         "I wish it would have snowed more this winter."

Cleaned export from SurveyMonkey:

> colnames(sample_clean)
 [1] "Respondent ID"                                            "Please provide your contact information: Name"           
 [3] "Please provide your contact information: Company"         "Please provide your contact information: Address"        
 [5] "Please provide your contact information: Address 2"       "Please provide your contact information: City/Town"      
 [7] "Please provide your contact information: State/Province"  "Please provide your contact information: ZIP/Postal Code"
 [9] "Please provide your contact information: Country"         "Please provide your contact information: Email Address"  
[11] "Please provide your contact information: Phone Number"    "I wish it would have snowed more this winter. Response"

Sample data:

structure(list(`Respondent ID` = c(NA, 11385284375, 11385273621, 
11385258069, 11385253194, 11385240121, 11385226951, 11385212508
), `Please provide your contact information:` = c("Name", "Benjamin Franklin", 
"Mae Jemison", "Carl Sagan", "W. E. B. Du Bois", "Florence Nightingale", 
"Galileo Galilei", "Albert Einstein"), ...11 = c("Company", "Poor Richard's", 
"NASA", "Smithsonian", "NAACP", "Public Health Co", "NASA", "ThinkTank"
), ...12 = c("Address", NA, NA, NA, NA, NA, NA, NA), ...13 = c("Address 2", 
NA, NA, NA, NA, NA, NA, NA), ...14 = c("City/Town", "Philadelphia", 
"Decatur", "Washington", "Great Barrington", "Florence", "Pisa", 
"Princeton"), ...15 = c("State/Province", "PA", "Alabama", "D.C.", 
"MA", "IT", "IT", "NJ"), ...16 = c("ZIP/Postal Code", "19104", 
"20104", "33321", "1230", "33225", "12345", "8540"), ...17 = c("Country", 
NA, NA, NA, NA, NA, NA, NA), ...18 = c("Email Address", "benjamins@gmail.com", 
"mjemison@nasa.gov", "stargazer@gmail.com", "dubois@web.com", 
"firstnurse@aol.com", "galileo123@yahoo.com", "imthinking@gmail.com"
), ...19 = c("Phone Number", "215-555-4444", "221-134-4646", 
"999-999-4422", "999-000-1234", "123-456-7899", "111-888-9944", 
"215-999-8877"), `I wish it would have snowed more this winter.` = c("Response", 
"Strongly disagree", "Strongly agree", "Neither agree nor disagree", 
"Strongly disagree", "Disagree", "Agree", "Strongly agree")), row.names = c(NA, 
-8L), class = c("tbl_df", "tbl", "data.frame"))

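【Discussion】:

【Solution 7】:

How about the following: use read.csv() with header=FALSE. Make two arrays, one holding the two header rows and one holding the survey answers. Then paste() the two rows/sentences together. Finally, use colnames().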
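
A minimal sketch of this idea follows. It assumes a made-up export in which a blank cell in the first header row means "same question as the column to the left", and it fills those blanks forward before pasting; the file contents and the variable names are hypothetical, and the sketch does not deal with a null character in the second line.

```r
# Made-up two-row header followed by two responses.
csv_text <- c("RespondentID,Contact,,Opinion",
              ",Name,Company,Response",
              "1,Ann,Acme,Agree",
              "2,Bob,Initech,Disagree")

raw <- read.csv(text = csv_text, header = FALSE,
                stringsAsFactors = FALSE, colClasses = "character")

first.line  <- unlist(raw[1, ], use.names = FALSE)  # question titles
second.line <- unlist(raw[2, ], use.names = FALSE)  # sub-question labels
answers     <- raw[-(1:2), , drop = FALSE]          # survey responses

# Fill blank titles forward from the nearest non-blank title on the left.
filled <- first.line
for (i in seq_along(filled)[-1]) {
  if (filled[i] == "") filled[i] <- filled[i - 1]
}

# Paste the two header rows together and use the result as column names.
colnames(answers) <- trimws(paste(filled, second.line))
rownames(answers) <- NULL
```

Reading everything as character keeps the header cells and the answers in one data frame; the answer columns can be converted afterwards, for example with `type.convert()`.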

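【Discussion】:

Since the second line starts with a null character, I'm afraid this won't work.
How about if(!is.null(second.line)) { paste(first.line, second.line) }?
Unfortunately, even though second.line starts with a null character, it still has useful information in it!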