[Question title]: Grouping by multiple variables and summarizing character frequencies
[Posted]: 2021-03-29 05:38:02
[Question]:

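I am trying to group my dataset by several variables and build a frequency table of how often each value of a character variable occurs. Here is a sample dataset: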

Location    State     County  Job       Pet
            Ohio      Miami   Data      Dog
Urban       Ohio      Miami   Business  Dog, Cat
Urban       Ohio      Miami   Data      Cat
Rural       Kentucky  Clark   Data      Cat, Fish
City        Indiana   Shelby  Business  Dog
Rural       Kentucky  Clark   Data      Dog, Fish
            Ohio      Miami   Data      Dog, Cat
Urban       Ohio      Miami   Business  Dog, Cat
Rural       Kentucky  Clark   Data      Fish
City        Indiana   Shelby  Business  Cat

I would like my output to look like this:

Location    State     County  Job       Frequency  Pet:Cat  Pet:Dog  Pet:Fish
            Ohio      Miami   Data      2          1        2        0
Urban       Ohio      Miami   Business  2          2        2        0
Urban       Ohio      Miami   Data      1          1        0        0
Rural       Kentucky  Clark   Data      3          1        1        3
City        Indiana   Shelby  Business  2          1        1        0

I have tried several variations of the following code. I am close, but it is not quite right:

Output<-df%>%group_by(Location, State, County, Job)%>%
  dplyr::summarise(
    Frequency= dplyr::n(),
    Pet:Cat = count(str_match(Pet, "Cat")),
    Pet:Dog = count(str_match(Pet, "Dog")),
    Pet:Fish = count(str_match(Pet, "Fish")),
    )

Any help would be greatly appreciated! Thank you in advance.

[Comments]:

    Tags: r dplyr plyr stringr data-wrangling


    [Solution 1]:

    Try this:

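    library(dplyr)
    library(tidyr)

    new <- df %>%
      # One row per pet: split the comma-separated Pet column
      separate_rows(Pet, sep = ',') %>%
      mutate(Pet = trimws(Pet)) %>%
      # Count each pet within each Location/State/County/Job group
      group_by(Location, State, County, Job, Pet) %>%
      summarise(N = n()) %>%
      mutate(Pet = paste0('Pet:', Pet)) %>%
      # Freq = number of distinct pets per group (matches the row count for this sample data)
      group_by(Location, State, County, Job, .drop = FALSE) %>%
      mutate(Freq = n()) %>%
      # Spread the pets into columns, filling absent pets with 0
      pivot_wider(names_from = Pet, values_from = N, values_fill = 0)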
    

    Output:

    # A tibble: 5 x 8
    # Groups:   Location, State, County, Job [5]
      Location State    County Job       Freq `Pet:Cat` `Pet:Dog` `Pet:Fish`
      <chr>    <chr>    <chr>  <chr>    <int>     <int>     <int>      <int>
    1 ""       Ohio     Miami  Data         2         1         2          0
    2 "City"   Indiana  Shelby Business     2         1         1          0
    3 "Rural"  Kentucky Clark  Data         3         1         1          3
    4 "Urban"  Ohio     Miami  Business     2         2         2          0
    5 "Urban"  Ohio     Miami  Data         1         1         0          0
    
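In the pipeline above, `Freq = n()` is evaluated after the data has already been collapsed to one row per pet, so it counts distinct pets per group rather than original rows. The two happen to agree for this sample data. If the goal is the number of original rows per group, a minimal sketch (assuming the same column names as the sample data; `result` and `Frequency` are names chosen here for illustration) could compute the count before splitting `Pet`:

```r
library(dplyr)
library(tidyr)

# Same sample data as used in the answer
df <- data.frame(
  Location = c("", "Urban", "Urban", "Rural", "City",
               "Rural", "", "Urban", "Rural", "City"),
  State    = c("Ohio", "Ohio", "Ohio", "Kentucky", "Indiana",
               "Kentucky", "Ohio", "Ohio", "Kentucky", "Indiana"),
  County   = c("Miami", "Miami", "Miami", "Clark", "Shelby",
               "Clark", "Miami", "Miami", "Clark", "Shelby"),
  Job      = c("Data", "Business", "Data", "Data", "Business",
               "Data", "Data", "Business", "Data", "Business"),
  Pet      = c("Dog", "Dog, Cat", "Cat", "Cat, Fish", "Dog",
               "Dog, Fish", "Dog, Cat", "Dog, Cat", "Fish", "Cat")
)

result <- df %>%
  # Count original rows per group BEFORE Pet is split into several rows
  group_by(Location, State, County, Job) %>%
  mutate(Frequency = n()) %>%
  ungroup() %>%
  # One row per pet; the regex also swallows the space after each comma
  separate_rows(Pet, sep = ",\\s*") %>%
  mutate(Pet = paste0("Pet:", Pet)) %>%
  # Occurrences of each pet within each group
  count(Location, State, County, Job, Frequency, Pet, name = "N") %>%
  # Pets become columns; pets absent from a group get 0
  pivot_wider(names_from = Pet, values_from = N, values_fill = 0)

print(result)
```

Carrying `Frequency` through `count()` as a grouping column keeps it attached to each group without recomputing it after the reshape.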

    Data used:

    #Data
    df <- structure(list(Location = c("", "Urban", "Urban", "Rural", "City", 
    "Rural", "", "Urban", "Rural", "City"), State = c("Ohio", "Ohio", 
    "Ohio", "Kentucky", "Indiana", "Kentucky", "Ohio", "Ohio", "Kentucky", 
    "Indiana"), County = c("Miami", "Miami", "Miami", "Clark", "Shelby", 
    "Clark", "Miami", "Miami", "Clark", "Shelby"), Job = c("Data", 
    "Business", "Data", "Data", "Business", "Data", "Data", "Business", 
    "Data", "Business"), Pet = c("Dog", "Dog, Cat", "Cat", "Cat, Fish", 
    "Dog", "Dog, Fish", "Dog, Cat", "Dog, Cat", "Fish", "Cat")), row.names = c(NA, 
    -10L), class = "data.frame")
    
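For comparison, here is a sketch that stays closer to the attempt in the question. It assumes the pet names are known in advance and do not appear as substrings of one another. Column names containing a colon need backticks, and `sum(str_detect(...))` stands in for `count(str_match(...))`, since `count()` operates on data frames rather than vectors:

```r
library(dplyr)
library(stringr)

# Same sample data as used in the answer
df <- data.frame(
  Location = c("", "Urban", "Urban", "Rural", "City",
               "Rural", "", "Urban", "Rural", "City"),
  State    = c("Ohio", "Ohio", "Ohio", "Kentucky", "Indiana",
               "Kentucky", "Ohio", "Ohio", "Kentucky", "Indiana"),
  County   = c("Miami", "Miami", "Miami", "Clark", "Shelby",
               "Clark", "Miami", "Miami", "Clark", "Shelby"),
  Job      = c("Data", "Business", "Data", "Data", "Business",
               "Data", "Data", "Business", "Data", "Business"),
  Pet      = c("Dog", "Dog, Cat", "Cat", "Cat, Fish", "Dog",
               "Dog, Fish", "Dog, Cat", "Dog, Cat", "Fish", "Cat")
)

out <- df %>%
  group_by(Location, State, County, Job) %>%
  summarise(
    Frequency  = n(),                                   # rows per group
    # str_detect() returns TRUE/FALSE per row; sum() counts the TRUEs
    `Pet:Cat`  = sum(str_detect(Pet, fixed("Cat"))),
    `Pet:Dog`  = sum(str_detect(Pet, fixed("Dog"))),
    `Pet:Fish` = sum(str_detect(Pet, fixed("Fish"))),
    .groups = "drop"
  )

print(out)
```

This avoids reshaping altogether, at the cost of one hand-written line per pet. The `separate_rows()` approach scales better when the set of values is not known up front.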

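    [Discussion]:

    • I get this error: Error: `n()` must only be used inside dplyr verbs.
    • I assume the n() function is being masked by another package?
    • @JeffB There is a conflict with another package. I suggest restarting R, loading only the packages mentioned, and running the code with the sample data df.
    • This works on my sample data, but with my real dataset I get this message: Error: Problem with `mutate()` input `Source`. x object 'Source' not found i Input `Source` is `paste0("Source:", Source)`. i The error occurred in group 1: Location = "", Race = "", Gender = "". Source is the equivalent of Pet in my mock dataset. Thanks for your help!
    • @JeffB It looks like Source has missing values. Can you check unique(yourdata$Source)?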
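On the `n()` error discussed above: one common cause (an assumption here, suggested by the question's plyr tag) is plyr being loaded after dplyr, so that `plyr::summarise()` masks the dplyr version. Besides restarting R, a minimal sketch of a workaround is to qualify the calls with `dplyr::`. The small `df` below is illustrative only:

```r
library(dplyr)

# Illustrative data, not the question's dataset
df <- data.frame(
  State = c("Ohio", "Ohio", "Kentucky"),
  Job   = c("Data", "Data", "Data")
)

# Shows which package currently provides summarise() on the search path
print(environmentName(environment(summarise)))

# Namespace-qualified calls use dplyr's verbs even if another
# package has masked summarise() or group_by()
counts <- df %>%
  dplyr::group_by(State, Job) %>%
  dplyr::summarise(N = dplyr::n(), .groups = "drop")

print(counts)
```

If the first `print()` reports a package other than dplyr, masking is the likely culprit.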