Apply function only to certain level of factor? Answers

【Question title】: Apply function only to certain level of factor?
【Posted】: 2014-05-07 14:57:36
【Question description】:

I have a data frame like this:

df <- structure(list(year = c(1990, 1990, 1990, 1990, 1990, 1990, 1990, 
1990, 1990, 1990, 1990, 1990, 1990, 1990, 1990, 1991, 1991, 1991, 
1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 1991, 
1991), group = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), 
    value = c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 
    13L, 14L, 15L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L, 13L, 14L, 
    15L, 16L, 17L, 18L, 19L)), .Names = c("year", "group", "value"
), row.names = c(NA, -30L), class = "data.frame")


   > df
   year group value
1  1990     A     1
2  1990     A     2
3  1990     A     3
4  1990     A     4
5  1990     A     5
6  1990     A     6
7  1990     B     7
8  1990     B     8
9  1990     B     9
10 1990     B    10
11 1990     B    11
12 1990     B    12
13 1990     B    13
14 1990     B    14
15 1990     B    15
16 1991     A     5
17 1991     A     6
18 1991     A     7
19 1991     A     8
20 1991     A     9
21 1991     A    10
22 1991     A    11
23 1991     A    12
24 1991     A    13
25 1991     A    14
26 1991     B    15
27 1991     B    16
28 1991     B    17
29 1991     B    18
30 1991     B    19

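I need to apply a function for each year (I intend to do this with plyr and summarise), but only to the factor level (A or B) that has the most rows in that year. Is there a way to select this level (A or B) automatically for each year?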

df2 <- ddply(df, .(year), summarise, result="some operation on longest level")

Desired output:

> df2
   year group value result
1  1990     B     7     5
2  1990     B     8     4
3  1990     B     9     5
4  1990     B    10     3
5  1990     B    11     3
6  1990     B    12     8
7  1990     B    13    11
8  1990     B    14     7  
9  1990     B    15     2
10 1991     A     5    10
11 1991     A     6    13
12 1991     A     7     9
13 1991     A     8     7
14 1991     A     9     6
15 1991     A    10     1
16 1991     A    11    15 
17 1991     A    12     5
18 1991     A    13     5
19 1991     A    14     2

【Comments】:

You could start with table, e.g. lapply(split(df, df$year), function(x) table(x$group))
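The per-year tables can be taken one step further to name the largest level in each year. This is a minimal base-R sketch, not code from the thread: it rebuilds the question's df compactly so the snippet runs on its own, and counts_by_year and largest are illustrative names.

```r
# Rebuild the question's data frame compactly (same 30 rows as the dput output)
df <- data.frame(
  year  = rep(c(1990, 1991), each = 15),
  group = factor(rep(c("A", "B", "A", "B"), times = c(6, 9, 10, 5))),
  value = c(1:15, 5:19)
)

# One frequency table of group per year, as the comment suggests
counts_by_year <- lapply(split(df, df$year), function(x) table(x$group))

# Name of the most frequent level in each year (the first level wins on ties)
largest <- sapply(counts_by_year, function(tab) names(tab)[which.max(tab)])
largest
```

For this data the result names B for 1990 and A for 1991, which matches the groups shown in the desired output.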

Tags: r

【Solution 1】:

Here is another possible approach, using dplyr:

library(dplyr)

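# The original answer used dplyr's older %.% chain operator, which current
# dplyr releases no longer provide; %>% is the equivalent pipe.
df <- df %>% group_by(year, group) %>% mutate(count = n()) %>% ungroup()
df <- df %>% group_by(year) %>% filter(count %in% max(count)) %>% mutate(result = sqrt(value))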
df$count <- NULL

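Since I'm not sure what function you want to apply for result, I used sqrt(value), as in @rbatt's answer.

【Comments】:

Awesome, thanks a lot! I was able to adapt it to my real data. I was a bit reluctant to use dplyr before, but now I'm convinced :D

【Solution 2】:

Sorry, I don't use plyr myself, but here is how I would do it with base functions. Maybe it will lead you to a plyr solution.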

#find largest groups for each year
maxgroups <- tapply(df$group, df$year, function(x) which.max(table(x)))
#create group names
maxpairs <- paste(names(maxgroups),levels(df$group)[maxgroups], sep=".")

#helper function
ifnotin<-function(val,set,ifnotin) {out<-val; out[!val%in%set]<-ifnotin; droplevels(out)}
#new factor indicating best group
tgroups <- ifnotin(interaction(df$year, df$group), maxpairs, NA)

#now transform the best groups by adding year to result (or whatever transformation you need to do)
transform(df, value=ifelse(!is.na(tgroups), value+year, value))

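I wasn't sure whether your transformation needs to know which group/year it is being applied to. If you only need to know whether a row belongs to a group that needs transforming, you can skip tgroups and use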

needstransform <- interaction(df$year, df$group) %in% maxpairs

But tgroups, with its NA values, is handy for summaries such as tapply(df$value, droplevels(tgroups), mean) and the like.
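To make the two variants concrete, here is a sketch under stated assumptions: the question's df is rebuilt compactly so the snippet is self-contained, and pairs, best and group_means are illustrative names that do not appear in the answer. The logical vector is enough to pick rows, while a factor that is NA outside the largest groups can drive a grouped summary.

```r
# Rebuild the question's data frame compactly (same 30 rows as the dput output)
df <- data.frame(
  year  = rep(c(1990, 1991), each = 15),
  group = factor(rep(c("A", "B", "A", "B"), times = c(6, 9, 10, 5))),
  value = c(1:15, 5:19)
)

# Largest group per year, encoded as "year.group" labels
maxgroups <- tapply(df$group, df$year, function(x) which.max(table(x)))
maxpairs  <- paste(names(maxgroups), levels(df$group)[maxgroups], sep = ".")

# Variant 1: logical flag, TRUE for rows in the largest group of their year
pairs <- interaction(df$year, df$group)
needstransform <- pairs %in% maxpairs

# Variant 2: keep the pair label for flagged rows, NA everywhere else
best <- droplevels(replace(pairs, !needstransform, NA))

# Mean of value within each year's largest group (NA rows are ignored)
group_means <- tapply(df$value, best, mean)
group_means
```

The flag is the simpler choice when rows only need to be selected; the factor keeps track of which year/group pair each selected row came from.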

【Comments】:

Nice idea to use interaction.

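【Solution 3】:

I don't think this is a great answer, because it is super obfuscated (and it doesn't use the plyr approach you wanted), but maybe it will inspire someone else:

Basically, you just need to know which value of group to look at for each year. Suppose you have worked out those values and stored them in a variable called m, in the same order in which year splits the original data. You can then mapply a function that subsets each split (the data split by year) by group and performs whatever other calculation you want.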

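do.call(rbind, mapply(function(x,y) { 
                          tmp <- x[x$group==y,] # rows of this year's largest group
                          #fun(tmp) # apply your function to the relevant subset
                          tmp # return the subset (replace with fun(tmp) once fun is defined)
                      }, split(df,df$year), m, SIMPLIFY=FALSE))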

I thought of three different ways to generate m. Here they are:

m <- with(df, levels(group)[apply(table(group, year), 2, which.max)])

m <- levels(df$group)[sapply(split(df, df$year), function(x) which.max(sapply(split(x, x$group), nrow)))]

m <- with(df, levels(group)[apply(tapply(year, list(group, year), length),2,which.max)])
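Putting the pieces together, the following is a runnable sketch of this approach rather than the answerer's final code. It assumes the question's df (rebuilt compactly here), uses sqrt as a stand-in for the real function as elsewhere in this thread, and res is an illustrative name.

```r
# Rebuild the question's data frame compactly (same 30 rows as the dput output)
df <- data.frame(
  year  = rep(c(1990, 1991), each = 15),
  group = factor(rep(c("A", "B", "A", "B"), times = c(6, 9, 10, 5))),
  value = c(1:15, 5:19)
)

# Most frequent level of group per year, in the order split(df, df$year) uses
m <- with(df, levels(group)[apply(table(group, year), 2, which.max)])

# For each year's piece, keep only the rows of its largest group and
# add a result column (sqrt is a placeholder for the real function)
res <- do.call(rbind, mapply(function(x, y) {
  tmp <- x[x$group == y, ]
  tmp$result <- sqrt(tmp$value)
  tmp
}, split(df, df$year), m, SIMPLIFY = FALSE))

rownames(res) <- NULL
res
```

The result has the same shape as the desired output in the question: the 9 rows of group B for 1990 followed by the 10 rows of group A for 1991, each with a result column.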

【Comments】:

Just a small tidbit: I believe ddply comes from plyr, while dplyr is an "improved" plyr package.

【Solution 4】:

Here is what I came up with:

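library(plyr)

df2 <- ddply(
        df, 
        .(year), 
        summarise, 
        # table(group) counts levels within the current year's piece;
        # table(df$group) would count over the whole data frame instead
        result=sqrt(
            value[group==names(which.max(table(group)))]
        )
    )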

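【Comments】:

You seem to be doing some aggregation here, which I don't think is what the OP asked for.
I changed the function being applied: now I take the square root. But I may have misunderstood something...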