How to tidy this messy dataset in R: answers

【Question title】: How to tidy this messy dataset in R
【Posted】: 2018-07-17 13:22:21
【Question description】:

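I'm still quite new to tidyr, dplyr and the like, and I have some data that I don't know how to tidy in R.

The variables are mixed across rows and columns, and the spreadsheet looks as if it were split in two, so the top rows and the bottom rows hold different kinds of information.

A simplified version is below.

You can think of it as an exam with 4 questions:

The first few rows give some information about each question.
The remaining rows show whether each student (identified by IDNum) got each question right (1) or wrong (0).

Here is the raw data: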

Question    Q1         Q2         Q3         Q4
Topic       English    English    Math       Math
Subtopic    Grammar    Vocabulary Algebra    Geometry
Difficulty  2          4          3          4
IDNum               
512         1          1          1          0
102         0          1          0          1
321         1          1          1          1
246         1          1          0          1
248         1          0          1          0
136         1          1          1          1
290         0          1          1          1
753         1          0          0          0
752         1          0          1          1

I would like to tidy this dataset so that it looks like the following:

IDNum   Question    Topic   Subtopic    Difficulty  Correct
512     Q1          English Grammar     2           1
512     Q2          English Vocabulary  4           1
512     Q3          Math    Algebra     3           1
512     Q4          Math    Geometry    4           0
102     Q1          English Grammar     2           0
102     Q2          English Vocabulary  4           1
102     Q3          Math    Algebra     3           0
102     Q4          Math    Geometry    4           1
321     Q1          English Grammar     2           1
321     Q2          English Vocabulary  4           1
321     Q3          Math    Algebra     3           1
321     Q4          Math    Geometry    4           1

And so on.

Thanks!

【Comments】:

It would be best to share your data using dput().
If you are reading it from Excel, please include the code you use to read it.

Tags: r tidyr

【Solution 1】:

It's not entirely clear what format your data is in, but hopefully the following will help:

Data

df <- read.table(text="
Question    Q1         Q2         Q3         Q4
Topic       English    English    Math       Math
Subtopic    Grammar    Vocabulary Algebra    Geometry
Difficulty  2          4          3          4
IDNum       ''        ''          ''         ''
512         1          1          1          0
102         0          1          0          1
321         1          1          1          1
246         1          1          0          1
248         1          0          1          0
136         1          1          1          1
290         0          1          1          1
753         1          0          0          0
752         1          0          1          1", header = FALSE, stringsAsFactors = FALSE)

Solution

library(tidyverse)
df %>%
  # collapse the first rows into column names to prepare for gather/separate combo
  setNames(apply(.[1:4,],2,paste,collapse="|")) %>% 
  rename_at(1,~"IDNum")   %>%
  # remove useless rows
  slice(-(1:5))           %>%
  # change IDNum to factor, only useful if the order of IDNum is important (probably it's not)
  mutate_at("IDNum",~factor(.x,levels=unique(.x))) %>%
  # wide to long
  gather(key,correct,-1)  %>%
  # build your columns (convert to TRUE so Difficulty will be numeric)
  separate(key,df[1:4,1],convert = TRUE) %>%
  # convert correct to numeric
  mutate_at("correct",as.numeric) %>%
  # sort
  arrange(IDNum)

# # A tibble: 36 x 6
#     IDNum Question   Topic   Subtopic Difficulty correct
#    <fctr>    <chr>   <chr>      <chr>      <int>   <dbl>
#  1    512       Q1 English    Grammar          2       1
#  2    512       Q2 English Vocabulary          4       1
#  3    512       Q3    Math    Algebra          3       1
#  4    512       Q4    Math   Geometry          4       0
#  5    102       Q1 English    Grammar          2       0
#  6    102       Q2 English Vocabulary          4       1
#  7    102       Q3    Math    Algebra          3       0
#  8    102       Q4    Math   Geometry          4       1
#  9    321       Q1 English    Grammar          2       1
# 10    321       Q2 English Vocabulary          4       1
# # ... with 26 more rows
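
If you are on tidyr 1.1 or later, where gather() is superseded, the same idea can probably be written with pivot_longer(). This is only a sketch under that assumption and not part of the original answer: the names raw, new_names and tidy are made up for illustration, and only the first three students are included to keep it short.

```r
library(dplyr)
library(tidyr)

# Same layout as the data above: five header rows, then one row per student.
# (Assumption: only three students, to keep the sketch short.)
raw <- read.table(text = "
Question    Q1         Q2         Q3         Q4
Topic       English    English    Math       Math
Subtopic    Grammar    Vocabulary Algebra    Geometry
Difficulty  2          4          3          4
IDNum       ''        ''          ''         ''
512         1          1          1          0
102         0          1          0          1
321         1          1          1          1",
  header = FALSE, stringsAsFactors = FALSE)

# Paste the four header rows into one name per question column,
# e.g. "Q1|English|Grammar|2".
new_names <- apply(raw[1:4, -1], 2, paste, collapse = "|")

tidy <- raw %>%
  # drop the five header rows, keep the student rows
  slice(-(1:5)) %>%
  setNames(c("IDNum", new_names)) %>%
  # wide to long, splitting the pasted names back into four columns
  pivot_longer(
    cols = -IDNum,
    names_to = raw[1:4, 1],   # "Question", "Topic", "Subtopic", "Difficulty"
    names_sep = "\\|",
    names_transform = list(Difficulty = as.integer),
    values_to = "correct",
    values_transform = list(correct = as.numeric)
  )

tidy
```

In this sketch IDNum stays a character column, and pivot_longer() keeps the students in their original row order, so the factor and arrange steps are not needed.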

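Another way, with a few more steps but perhaps more intuitive, is to separate the header from the core of the table right from the start.

We build a lookup from the header (which we transpose), and use it later on the gathered data: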

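# lookup table: one row per question, built by transposing the four header rows
header_lkp <-
  as_tibble(t(df[1:4,])) %>%
  # the first transposed row holds the column names (Question, Topic, ...)
  setNames(unlist(.[1,])) %>%
  slice(-1)

# core of the table: one row per student, one column per question
df_core <-
  df %>%
  setNames(unlist(.[1,])) %>%
  slice(-(1:5))   %>%
  rename_at(1,~"IDNum") %>%
  mutate_at("IDNum",~factor(.x,levels=unique(.x)))

# wide to long, then attach the question metadata from the lookup
df_core %>%
  gather(Question,correct,-IDNum) %>%
  mutate_at("correct",as.numeric) %>%
  left_join(header_lkp,by="Question") %>%
  arrange(IDNum)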

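(Same content as above, though here correct comes before the joined columns and Difficulty stays a character column unless you convert it.)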
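
For what it's worth, the header/core split might look like this with pivot_longer() in place of gather(), assuming tidyr 1.0 or later. It is a sketch rather than the answer's own code: raw, core and tidy are illustrative names, the data is cut down to three students, and the explicit Difficulty conversion and final select() are additions to get the column types and order the question asks for.

```r
library(dplyr)
library(tidyr)

# Same layout as the data above, cut down to three students (assumption).
raw <- read.table(text = "
Question    Q1         Q2         Q3         Q4
Topic       English    English    Math       Math
Subtopic    Grammar    Vocabulary Algebra    Geometry
Difficulty  2          4          3          4
IDNum       ''        ''          ''         ''
512         1          1          1          0
102         0          1          0          1
321         1          1          1          1",
  header = FALSE, stringsAsFactors = FALSE)

# Header lookup: transpose rows 1-4 (without the label column),
# giving one row per question.
header_lkp <- as.data.frame(t(raw[1:4, -1]), stringsAsFactors = FALSE) %>%
  setNames(raw[1:4, 1]) %>%
  mutate(Difficulty = as.integer(Difficulty))

# Core: student rows only, with columns named IDNum, Q1, ..., Q4.
core <- raw %>%
  slice(-(1:5)) %>%
  setNames(c("IDNum", unlist(raw[1, -1])))

tidy <- core %>%
  pivot_longer(-IDNum, names_to = "Question", values_to = "correct") %>%
  mutate(correct = as.numeric(correct)) %>%
  # attach Topic, Subtopic and Difficulty for each question
  left_join(header_lkp, by = "Question") %>%
  select(IDNum, Question, Topic, Subtopic, Difficulty, correct)

tidy
```

Keeping the question metadata in its own table like this also makes it easy to reuse for other joins or summaries.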

【Discussion】: