【Question title】: R: determining difference between max and min for certain subgroups within groups
【Posted】: 2018-08-21 19:25:28
【Question】:

Sample data below:

Event        Ethnicity        Score
50 yd dash    Asian             7
50 yd dash    Afr. Am           8
50 yd dash    White             5
Hurdle        Asian             6
Hurdle        Afr. Am           8
Hurdle        White             9

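I'm trying to determine the difference between certain ethnicities within each event, ideally using dplyr or something else in the tidyverse, but I'll take any answer or help. For instance, the difference between the Asian group and the White group within each event,

e.g. Asian (7) - White (5) = Difference (2),

yielding output similar to the following: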

Event          Difference
50 yd dash         2
Hurdle            -3

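【Comments】:

  • Also, the title says max and min, but that isn't strictly necessary, since I will probably pick the groups myself rather than take the max and min. A max and min solution would still be helpful, though.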
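For the max-minus-min variant mentioned in the title and in the comment above, a minimal sketch with dplyr could look like the following. The data frame is retyped by hand from the question's sample table, and the names `range_by_event` and `range_subset` are illustrative only.

```r
library(dplyr)

# Sample data retyped from the question (assumption: one score per
# Event/Ethnicity pair, as in the posted table)
df <- data.frame(
  Event     = rep(c("50 yd dash", "Hurdle"), each = 3),
  Ethnicity = rep(c("Asian", "Afr. Am", "White"), times = 2),
  Score     = c(7, 8, 5, 6, 8, 9),
  stringsAsFactors = FALSE
)

# Max minus min of Score across all ethnicities within each event
range_by_event <- df %>%
  group_by(Event) %>%
  summarise(Difference = max(Score) - min(Score))

# Same idea, restricted to a chosen subset of ethnicities
range_subset <- df %>%
  filter(Ethnicity %in% c("Asian", "White")) %>%
  group_by(Event) %>%
  summarise(Difference = max(Score) - min(Score))

range_by_event
range_subset
```

Note that max minus min is never negative, so it loses the sign that the Asian - White difference in the desired output carries.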

Tags: r dplyr grouping tidyverse difference


【Solution 1】:

The following should help:

library(tidyverse)

df %>%
    spread(Ethnicity, Score) %>%
    mutate("Difference" = Asian - White) %>%
    select(-Asian, -White, -`Afr. Am`)
#       Event Difference
#1 50 yd dash          2
#2     Hurdle         -3
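
Since tidyr 1.0.0, `spread()` has been superseded by `pivot_wider()`. As a sketch, the same reshaping could be written as below; the data frame is retyped by hand from the question and `wide_diff` is an illustrative name, not part of the original answer.

```r
library(dplyr)
library(tidyr)

# Sample data retyped from the question
df <- data.frame(
  Event     = rep(c("50 yd dash", "Hurdle"), each = 3),
  Ethnicity = rep(c("Asian", "Afr. Am", "White"), times = 2),
  Score     = c(7, 8, 5, 6, 8, 9),
  stringsAsFactors = FALSE
)

wide_diff <- df %>%
  # one column per ethnicity, one row per event
  pivot_wider(names_from = Ethnicity, values_from = Score) %>%
  mutate(Difference = Asian - White) %>%
  # keep only the columns of interest instead of dropping the others by name
  select(Event, Difference)

wide_diff
```

Selecting `Event` and `Difference` directly avoids having to list every ethnicity column that should be dropped.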

Data.

df <-
structure(list(Event = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("50 yd dash", 
"Hurdle"), class = "factor"), Ethnicity = structure(c(2L, 1L, 
3L, 2L, 1L, 3L), .Label = c("Afr. Am", "Asian", "White"), class = "factor"), 
    Score = c(7L, 8L, 5L, 6L, 8L, 9L)), class = "data.frame", row.names = c(NA, 
-6L))

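@AntoniosK has already posted a read.table way of reading the data posted by the OP, but mine is a bit different. Instead of removing the spaces from the column values, I put those values between single quotes. (They have to be single quotes because the call already wraps the value of the text argument in double quotes.)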

df <- read.table(text = "
Event        Ethnicity        Score
'50 yd dash'    Asian             7
'50 yd dash'    'Afr. Am'           8
'50 yd dash'    White             5
Hurdle        Asian             6
Hurdle        'Afr. Am'           8
Hurdle        White             9
", header = TRUE)

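【Comments】:

  • Tested, and yes, it works. I will edit your answer with the data in dput format.
  • Out of curiosity: how did you get the data? I think I'm missing an important function here that I still don't know about.
  • This is perfect! Thank you. Next time I will write the code for the df.
  • That was actually my fault. I didn't know the function to retrieve it. Happy coding!
  • See the edit; I believe it answers part of your question about reading data posted in tabular format.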
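
The `dput` format mentioned in these comments comes from base R's `dput()`, which prints an R expression that recreates an object. A minimal sketch, using a small hand-typed data frame (the name `df_small` is illustrative):

```r
# A small data frame to share
df_small <- data.frame(
  Event = c("50 yd dash", "Hurdle"),
  Score = c(7, 6),
  stringsAsFactors = FALSE
)

# Prints a structure(...) expression that can be pasted into a question
# or answer and assigned back to a variable to rebuild the data frame
dput(df_small)
```

Pasting the printed `structure(...)` call after `df <-` is how the "Data" blocks in the answers here were produced.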
【Solution 2】:

Data

df = read.table(text = "
Event        Ethnicity        Score
50yddash    Asian             7
50yddash    Afr.Am           8
50yddash    White             5
Hurdle        Asian             6
Hurdle        Afr.Am           8
Hurdle        White             9
", header = TRUE, stringsAsFactors = FALSE)

First approach: specify the ethnicities of interest manually:

library(dplyr)

df %>%
  group_by(Event) %>%
  summarise(Diff = Score[Ethnicity=="Asian"] - Score[Ethnicity=="White"])

# # A tibble: 2 x 2
#   Event     Diff
#   <chr>    <int>
# 1 50yddash     2
# 2 Hurdle      -3

You can wrap this piece of code in a function that takes the two ethnicities of interest as input.
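
One possible way to do that is sketched below. The helper name `score_diff` and its argument names are hypothetical, and the sketch assumes each ethnicity appears exactly once per event, as in the sample data.

```r
library(dplyr)

# Same data as above, built directly instead of via read.table
df <- data.frame(
  Event     = rep(c("50yddash", "Hurdle"), each = 3),
  Ethnicity = rep(c("Asian", "Afr.Am", "White"), times = 2),
  Score     = c(7L, 8L, 5L, 6L, 8L, 9L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: per-event difference eth1 - eth2.
# Assumes exactly one row per Event/Ethnicity combination.
score_diff <- function(data, eth1, eth2) {
  data %>%
    group_by(Event) %>%
    summarise(Diff = Score[Ethnicity == eth1] - Score[Ethnicity == eth2])
}

asian_white  <- score_diff(df, "Asian", "White")
afram_white  <- score_diff(df, "Afr.Am", "White")

asian_white
afram_white
```

If an event is missing one of the two ethnicities, the subtraction has length zero and that event will not produce a usable row, so it may be worth checking the data first.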

Second approach: compute the differences for all unique combinations of ethnicities and events:

library(tidyverse)

# create vectorised function that calculates the difference
# based on a given event and ethnicities
f = function(event, eth1, eth2) {
  df$Score[df$Event==event & df$Ethnicity==eth1] -
  df$Score[df$Event==event & df$Ethnicity==eth2] }
f = Vectorize(f)


data.frame(t(combn(unique(df$Ethnicity), 2)), stringsAsFactors = FALSE) %>% # create combinations of ethnicities
  mutate(Event = list(unique(df$Event))) %>%                                # attach all events to each pair
  unnest(Event) %>%                                                         # one row per pair and event
  mutate(Diff = f(Event, X1, X2))                                           # apply the function

#    X1     X2    Event Diff
# 1  Asian Afr.Am 50yddash   -1
# 2  Asian Afr.Am   Hurdle   -2
# 3  Asian  White 50yddash    2
# 4  Asian  White   Hurdle   -3
# 5 Afr.Am  White 50yddash    3
# 6 Afr.Am  White   Hurdle   -1

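This procedure produces each pair only once, ordered by where the ethnicities first appear in the data (combn() is applied to unique(df$Ethnicity)). If you want both directions (i.e. Asian - White and White - Asian), you can use this: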

expand.grid(Event = unique(df$Event),
            X1 = unique(df$Ethnicity),
            X2 = unique(df$Ethnicity)) %>%
  filter(X1 != X2) %>%
  mutate(Diff = f(Event, X1, X2))                                     

#    Event     X1     X2 Diff
# 1  50yddash Afr.Am  Asian    1
# 2    Hurdle Afr.Am  Asian    2
# 3  50yddash  White  Asian   -2
# 4    Hurdle  White  Asian    3
# 5  50yddash  Asian Afr.Am   -1
# 6    Hurdle  Asian Afr.Am   -2
# 7  50yddash  White Afr.Am   -3
# 8    Hurdle  White Afr.Am    1
# 9  50yddash  Asian  White    2
# 10   Hurdle  Asian  White   -3
# 11 50yddash Afr.Am  White    3
# 12   Hurdle Afr.Am  White   -1

【Comments】:

  • This is great! Thanks!
【Solution 3】:
df %>%
  mutate(rn = row_number()) %>%
  spread(Ethnicity, Score) %>%
  group_by(Event) %>%
  summarise(Difference = max(Asian, na.rm = TRUE) - max(White, na.rm = TRUE))

# # A tibble: 2 x 2
#   Event      Difference
#   <chr>           <dbl>
# 1 50 yd dash          2
# 2 Hurdle             -3

Data:

df <- 
structure(list(Event = c("50 yd dash", "50 yd dash", "50 yd dash", 
"Hurdle", "Hurdle", "Hurdle"), Ethnicity = c("Asian", "Afr. Am", 
"White", "Asian", "Afr. Am", "White"), Score = c(7, 8, 5, 6, 
8, 9)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))

【Comments】:

  • Awesome! Thanks also for coding the data as a proper df.