Table frequency from multiple columns and multiple rows in R: answers

【Question title】: Table frequency from multiple col and multiple row in R
【Posted】: 2017-03-25 03:41:59
【Question】:

I am trying to get a frequency table from this data frame:

tmp2 <- structure(list(a1 = c(1L, 0L, 0L), a2 = c(1L, 0L, 1L),
                       a3 = c(0L, 1L, 0L), b1 = c(1L, 0L, 1L),
                       b2 = c(1L, 0L, 0L), b3 = c(0L, 1L, 1L)),
                       .Names = c("a1", "a2", "a3", "b1", "b2", "b3"),
                       class = "data.frame", row.names = c(NA, -3L))


# Alternatively, read the same data from a CSV file:
# tmp2 <- read.csv("tmp2.csv", sep=";")
> tmp2
  a1 a2 a3 b1 b2 b3
1  1  1  0  1  1  0
2  0  0  1  0  0  1
3  0  1  0  1  0  1

I tried to get the frequency table like this:

table(tmp2[,1:3], tmp2[,4:6])

But I get:

Error in sort.list(y) : 'x' must be atomic for 'sort.list'
Have you called 'sort' on a list?

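Expected output:

Info: the result does not need to be a square matrix; for example, I should be able to add b4 and b5 while keeping a1, a2, a3.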

【Comments】:

Why is a2 b1 2?
Suppose that in tmp2 1 row = 1 client. Then 2 clients have both a2 and b1.
crossprod is also useful here: crossprod(as.matrix(tmp2[1:3]), as.matrix(tmp2[4:6]))
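To make the crossprod suggestion concrete: for 0/1 indicator columns, the cross-product t(A) %*% B counts, for every a/b pair, the rows in which both indicators are 1. The following is a minimal sketch on the question's data; the names `a`, `b` and `counts` are illustrative and not part of the original comment.

```r
# Rebuild the question's data frame (same values as the dput output)
tmp2 <- data.frame(a1 = c(1L, 0L, 0L), a2 = c(1L, 0L, 1L), a3 = c(0L, 1L, 0L),
                   b1 = c(1L, 0L, 1L), b2 = c(1L, 0L, 0L), b3 = c(0L, 1L, 1L))

a <- as.matrix(tmp2[1:3])   # indicator columns a1..a3
b <- as.matrix(tmp2[4:6])   # indicator columns b1..b3

# crossprod(a, b) is t(a) %*% b: entry [i, j] sums a_i * b_j over rows,
# i.e. the number of rows where both a_i and b_j equal 1
counts <- crossprod(a, b)
counts
#    b1 b2 b3
# a1  1  1  0
# a2  2  1  1
# a3  0  0  1
```

Because the shape of the result comes from the two column selections, a non-square case (say, extra b4 and b5 columns) would only require widening the second selection.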

Tags: r frequency

【Solution 1】:

One option:

matrix(colSums(tmp2[,rep(1:3,3)] & tmp2[,rep(4:6,each=3)]),
       ncol=3,nrow=3,
       dimnames=list(colnames(tmp2)[1:3],colnames(tmp2)[4:6]))
#   b1 b2 b3
#a1  1  1  0
#a2  2  1  1
#a3  0  0  1

If the numbers of a and b columns differ, you can try:

acols<-1:3 #state the indices of the a columns
bcols<-4:6 #same for b; if you add a column this should be 4:7
matrix(colSums(tmp2[,rep(acols,length(bcols))] & tmp2[,rep(bcols,each=length(acols))]),
           ncol=length(bcols),nrow=length(acols),
           dimnames=list(colnames(tmp2)[acols],colnames(tmp2)[bcols]))

【Comments】:

Hi, thanks, this is interesting. I have a question: if I have, for example, a1 a2 a3 and b1 b2 b3 b4, would that still work (that is, after adding b4)?
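The non-square case asked about here can be tried directly. The sketch below applies the same colSums idea to the question's data plus a made-up `b4` column; the `b4` values, the helper matrix `x` and the name `res` are assumptions for illustration only. It converts to a matrix first so that `&` operates on plain integer matrices.

```r
tmp2 <- data.frame(a1 = c(1L, 0L, 0L), a2 = c(1L, 0L, 1L), a3 = c(0L, 1L, 0L),
                   b1 = c(1L, 0L, 1L), b2 = c(1L, 0L, 0L), b3 = c(0L, 1L, 1L))
tmp2$b4 <- c(1L, 1L, 0L)   # hypothetical extra column, not in the original data

acols <- 1:3               # indices of the a columns
bcols <- 4:7               # indices of the b columns, now including b4

x <- as.matrix(tmp2)
# Pair every a column with every b column; a columns vary fastest,
# which matches the column-major fill order of matrix()
both <- x[, rep(acols, length(bcols))] & x[, rep(bcols, each = length(acols))]

res <- matrix(colSums(both),
              nrow = length(acols), ncol = length(bcols),
              dimnames = list(colnames(tmp2)[acols], colnames(tmp2)[bcols]))
res
#    b1 b2 b3 b4
# a1  1  1  0  1
# a2  2  1  1  1
# a3  0  0  1  1
```

Only `bcols` had to change; the first three columns of the result are the same as in the square case.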

【Solution 2】:

Here is a possible solution:

aIdxs <- 1:3
bIdxs <- 4:6  # tmp2 has only six columns; use 4:7 once a b4 column exists

# init matrix
m <- matrix(0,
            nrow = length(aIdxs), ncol=length(bIdxs),
            dimnames = list(colnames(tmp2)[aIdxs],colnames(tmp2)[bIdxs]))

# create all combinations of a's and b's column indexes
idxs <- expand.grid(aIdxs,bIdxs)

# for each line and for each combination we add 1
# to the matrix if both a and b column are 1 
for(r in 1:nrow(tmp2)){
  m <- m + matrix(apply(idxs,1,function(x){ all(tmp2[r,x]==1) }),
                  nrow=length(aIdxs), byrow=FALSE)
}
> m
   b1 b2 b3
a1  1  1  0
a2  2  1  1
a3  0  0  1

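【Comments】:

【Solution 3】:

Here is another possible solution. Your input is a bit tricky for `table` because you inherently have two groups, 'a' and 'b', and the binary indicators in each row only mark pairwise co-occurrences between the 'a's and the 'b's, which you want to loop over. Below is a generic (though perhaps not so elegant) function that works for different numbers of 'a's and 'b's: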

tmp2 <- structure(list(a1 = c(1L, 0L, 0L), a2 = c(1L, 0L, 1L),
                       a3 = c(0L, 1L, 0L), b1 = c(1L, 0L, 1L),
                       b2 = c(1L, 0L, 0L), b3 = c(0L, 1L, 1L)),
                  .Names = c("a1", "a2", "a3", "b1", "b2", "b3"),
                  class = "data.frame", row.names = c(NA, -3L))
fun = function(x) t(do.call("cbind", lapply(x[,grep("a", colnames(x))], 
    function(p) rowSums(do.call("rbind", lapply(x[,grep("b", colnames(x))], 
    function(q) q*p ))))))
fun(tmp2)
#> fun(tmp2)
#   b1 b2 b3
#a1  1  1  0
#a2  2  1  1
#a3  0  0  1

# let's do a bigger example
set.seed(1)
m = matrix(rbinom(size=1, n=50, prob=0.75), ncol=10, dimnames=list(paste("instance_", 1:5, sep=""), c(paste("a",1:4,sep=""), paste("b",1:6,sep=""))))

# Notice that the count of possible a and b elements are not equal
#> m
#           a1 a2 a3 a4 b1 b2 b3 b4 b5 b6
#instance_1  1  0  1  1  0  1  1  1  0  0
#instance_2  1  0  1  1  1  1  1  0  1  1
#instance_3  1  1  1  0  1  1  1  1  0  1
#instance_4  0  1  1  1  1  0  1  1  1  1
#instance_5  1  1  0  0  1  1  0  1  1  1

fun(as.data.frame(m))
#> fun(as.data.frame(m))
#   b1 b2 b3 b4 b5 b6
#a1  3  4  3  3  2  3
#a2  3  2  2  3  2  3
#a3  3  3  4  3  2  3
#a4  2  2  3  2  2  2
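As a sanity check, the nested-lapply function from this answer can be compared with the crossprod approach mentioned in the question's comments. This is a rough sketch on freshly generated random indicators; the seed, the matrix size, and the names `df`, `res_fun` and `res_cp` are arbitrary choices for illustration.

```r
# The function from this answer, copied unchanged
fun = function(x) t(do.call("cbind", lapply(x[,grep("a", colnames(x))],
    function(p) rowSums(do.call("rbind", lapply(x[,grep("b", colnames(x))],
    function(q) q*p ))))))

# Random 0/1 data with 3 'a' columns and 5 'b' columns (non-square on purpose)
set.seed(42)
m <- matrix(rbinom(n = 8 * 6, size = 1, prob = 0.5), ncol = 8,
            dimnames = list(NULL, c(paste0("a", 1:3), paste0("b", 1:5))))
df <- as.data.frame(m)

res_fun <- fun(df)                         # nested-lapply approach
res_cp  <- crossprod(m[, 1:3], m[, 4:8])   # t(A) %*% B on the same columns

# Both should give the same 3 x 5 table of co-occurrence counts
all(res_fun == res_cp)
```

If the two approaches ever disagreed, the first thing to check would be the column selection: `grep("a", ...)` matches any column name that contains an "a", not only those starting with it.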

【Comments】: