Importing a dataset into a matrix

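【Question title】: Importing a dataset into a matrix
【Posted】: 2012-08-08 03:53:02
【Question】:

Dear StackOverflow community,

I have a dataset from a university project that I am trying to parse and run some calculations on. It looks something like this: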

Month,1,2,3,3,4,4,5,6,7
x.1,0,0,0,0,0,0,0,0,0
x.2,0,0,0,0,0,0,0,0,0
x.3,0,0,0,6,5,5,,,15
x.4,0,0,0,7,7,,,,15
x.5,1,1,1,11,7,5,,,0
x.6,1,1,1,14,6,,,,0
x.7,1,1,1,17,5,,,,15
x.8,1,1,1,21,4,,,,15
x.9,0,0,0,1,1,1,1,1,0
x.10,0,0,0,1,1,1,1,1,0
x.11,1,0,0,1,1,1,1,1,0
x.12,0,0,0,0,0,0,0,0,1
x.13,0,0,0,0,0,0,0,0,0
x.14,0,1,0,0,0,0,0,0,0
x.20,orchid,,,orchid,rose,orchid,orchid,orchid,
x.23,0,0,0,1,1,1,1,1,1
x.24,,,,,buttercup,buttercup,buttercup,buttercup,lilac
x.25,0,0,0,1,1,0,1,1,1
x.26,,,,17,,,,,15
x.27,,,,999,,,,,15

I tried importing it like this:

data <- read.csv("~/data_munging/data.csv", header=F)
my_matrix <- as.matrix(data)

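The problem is that the first column of the dataset actually holds the variable names, and as.matrix() does not read it as row (variable) names.

(Some of the data also has gaps, but I'll leave that for another question.)

I'm new to R and would like to know What I'm Doing Wrong™?

Update: Following Justin's comments, here is how I import the dataset, along with the resulting str():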

> sample_data <- read.csv("~/data_munging/sample_data.csv", header=F)
> str(sample_data)
'data.frame':   28 obs. of  10 variables:
 $ V1 : Factor w/ 28 levels "Month","x.1","x.10",..: 1 2 13 22 23 24 25 26 27 28 ...
 $ V2 : Factor w/ 4 levels "","0","1","orchid": 3 2 2 2 2 3 3 3 3 2 ...
 $ V3 : int  2 0 0 0 0 1 1 1 1 0 ...
 $ V4 : int  3 0 0 0 0 1 1 1 1 0 ...
 $ V5 : Factor w/ 12 levels "","0","1","11",..: 8 2 2 9 10 4 5 6 7 3 ...
 $ V6 : Factor w/ 9 levels "","0","1","4",..: 4 2 2 5 7 7 6 5 4 3 ...
 $ V7 : Factor w/ 7 levels "","0","1","4",..: 4 2 2 5 1 5 1 1 1 3 ...
 $ V8 : Factor w/ 6 levels "","0","1","5",..: 4 2 2 1 1 1 1 1 1 3 ...
 $ V9 : Factor w/ 6 levels "","0","1","6",..: 4 2 2 1 1 1 1 1 1 3 ...
 $ V10: Factor w/ 6 levels "","0","1","15",..: 5 2 2 4 4 2 2 4 4 2 ...

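The reason I think it should be a matrix is that, read this way, Month is treated as a factor whose levels are the row names rather than the months (of the year).

Update 2: The question now uses the original dataset in CSV format.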

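【Comments】:

It's hard to tell from this what your data actually looks like. Can you show us its structure with str(data) and dput(head(data))? Why do you think it should be in a matrix? You have both character and numeric data, and a matrix can only hold one data type. It sounds like a data.frame (which it already is) would suit your data better.
As it stands it isn't really a proper data frame; its transpose is one (because each row shares a single type).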
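
The point about a matrix holding only one data type can be checked with a small sketch. The two-row data frame `df` below is made up for illustration and is not the asker's data:

```r
# Hypothetical mixed-type data frame: one character column, one numeric column
df <- data.frame(id = c("x.1", "x.20"),
                 value = c(0, 15),
                 stringsAsFactors = FALSE)

# A data frame keeps a separate type for each column
col_classes <- sapply(df, class)
print(col_classes)

# as.matrix() has to pick a single type, so the numbers become strings
m <- as.matrix(df)
print(is.character(m))
```

Once the numbers have become strings, they have to be converted back before any calculation can be run on them.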

Tags: r csv dataset

【Solution 1】:

Matrices and data frames both have a transpose method, t(), which returns a matrix:

tdat <- t( read.table(text="Month,1,2,3,3,4,4,5,6,7
 x.1,0,0,0,0,0,0,0,0,0
 x.2,0,0,0,0,0,0,0,0,0
 x.3,0,0,0,6,5,5,,,15
 x.4,0,0,0,7,7,,,,15
 x.5,1,1,1,11,7,5,,,0
 x.6,1,1,1,14,6,,,,0
 x.7,1,1,1,17,5,,,,15
 x.8,1,1,1,21,4,,,,15
 x.9,0,0,0,1,1,1,1,1,0
 x.10,0,0,0,1,1,1,1,1,0
 x.11,1,0,0,1,1,1,1,1,0
 x.12,0,0,0,0,0,0,0,0,1
 x.13,0,0,0,0,0,0,0,0,0
 x.14,0,1,0,0,0,0,0,0,0
 x.20,orchid,,,orchid,rose,orchid,orchid,orchid,
 x.23,0,0,0,1,1,1,1,1,1
 x.24,,,,,buttercup,buttercup,buttercup,buttercup,lilac
 x.25,0,0,0,1,1,0,1,1,1
 x.26,,,,17,,,,,15
 x.27,,,,999,,,,,15", sep=",", header=FALSE, as.is=TRUE) )
# It might not be immediately obvious that the transpose function converts to matrix
newdat <- tdat[-1, ]
# The first row of the transposed matrix holds the variable names;
# trimws() drops any stray leading spaces from them
colnames(newdat) <- trimws(tdat[1, ])
newdat <- as.data.frame(newdat)
# When converted back, every column is a factor (character in R >= 4.0),
# so the numeric columns need to be converted
num_cols <- -grep("20|24", names(newdat))
newdat[ , num_cols] <- lapply(newdat[ , num_cols],
                              function(x) as.numeric(as.character(x)))
# grep() turns the character names into numeric positions so that negative indexing works.
# as.numeric(as.character(x)) is redundant for character columns, but it is the safe way
# to convert a factor and is used here to illustrate good practice.

This results in:

> newdat
    Month x.1 x.2 x.3 x.4 x.5 x.6 x.7 x.8 x.9 x.10 x.11 x.12 x.13 x.14   x.20 x.23      x.24 x.25 x.26 x.27
V2      3   2   2   3   3   4   4   3   3   2    2    3    2    2    3 orchid    2              2    1    1
V3      1   1   1   2   2   2   2   2   2   1    1    1    1    1    2   <NA>    1      <NA>    1   NA   NA
V4      2   1   1   2   2   2   2   2   2   1    1    1    1    1    1   <NA>    1      <NA>    1   NA   NA
V5      4   2   2   6   5   5   5   5   5   3    3    3    2    2    3 orchid    3              3    3    3
V6      5   2   2   5   5   7   6   6   6   3    3    3    2    2    3   rose    3 buttercup    3    1    1
V7      5   2   2   5   1   6   1   1   1   3    3    3    2    2    3 orchid    3 buttercup    2    1    1
V8      6   2   2   1   1   1   1   1   1   3    3    3    2    2    3 orchid    3 buttercup    3    1    1
V9      7   2   2   1   1   1   1   1   1   3    3    3    2    2    3 orchid    3 buttercup    3    1    1
V10     8   2   2   4   4   3   3   4   4   2    2    2    3    2    3           3     lilac    3    2    2

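I did notice that there is a 999 value that may be a missing-value indicator, and that missing values are represented in two different ways in the factor columns. This is a side effect of how read.table types the columns on input. It "thinks" columns V3 and V4 are numeric and treats consecutive commas as true missing values, whereas all the other columns (before transposition) are treated as factor or character variables, and there the consecutive commas become "", which is not the same as NA_character_ or a factor NA.

【Comments】:

Thanks for the input, @DWin! I pasted the original CSV dataset I was using there to better illustrate the problem.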
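
One way to avoid the two kinds of missing value is to tell the reader up front that an empty field means NA. The sketch below is not part of the original answer. It uses a shortened, made-up excerpt of the data (`csv_text`), reads the first column as row names with `row.names = 1`, and passes `na.strings` so that empty fields become NA in character columns too. Treating 999 as missing is only an assumption here, since the answer says it "may be" an indicator.

```r
# Shortened, illustrative excerpt in the same layout as the question's CSV
csv_text <- "Month,1,2,3
x.3,0,,15
x.20,orchid,,rose
x.27,999,,15"

# row.names = 1 uses the first column as row names;
# na.strings makes empty fields NA in every column type
raw <- read.csv(text = csv_text, header = FALSE, row.names = 1,
                na.strings = c("", "NA"), stringsAsFactors = FALSE)

# t() returns a character matrix; convert it back to a data frame
tr <- as.data.frame(t(raw), stringsAsFactors = FALSE)

# Convert every column except the text one (x.20) to numeric
num_cols <- setdiff(names(tr), "x.20")
tr[num_cols] <- lapply(tr[num_cols], as.numeric)

# Assumption: 999 is a missing-value code in x.27
tr$x.27[tr$x.27 %in% 999] <- NA

str(tr)
```

With this approach each variable ends up as a column with a single type, and missing entries are NA throughout rather than a mix of NA and "".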