【Question title】: Calculate percentage share of values against a value which is a row observation in the data frame
【Posted】: 2019-05-28 00:47:32
【Question】:

I want to calculate percentage shares and create new columns using mutate. I have the following data:

country, metric, segment, value1990, value2000, value2010
canada, abc, rural, 10, 15, 16
canada, abc, urban, 12, 12, 18
canada, abc, total, 22, 27, 34
canada, xyz, rural, 6, 9, 10
canada, xyc, urban, 7, 8, 8
canada, xyc, total, 13, 17, 18
canada, population, rural, 80, 86, 95
canada, population, urban, 102, 110, 121
canada, population, total, 182, 196, 216

The data frame contains data for multiple countries and multiple years. I want to create new columns with the following values:

country, metric, segment, value1990, value2000, value2010, percent1990, percent2000, percent2010

canada, abc, rural, 10, 15, 16, 12.5%, 17.4%, 16.8%
canada, abc, urban, 12, 12, 18, 11.7%, 10.9%, 14.8%
canada, abc, total, 22, 27, 34, 12.1%, 13.7%, 15.7%
canada, xyz, rural, 6, 9, 10, 7.5%, 10.4%, 10.5%
canada, xyc, urban, 7, 8, 8, 6.8%, 7.2%, 6.6%
canada, xyc, total, 13, 17, 18, 7.1%, 8.6%, 8.3%
canada, population, rural, 80, 86, 95, 100%, 100%, 100%
canada, population, urban, 102, 110, 121, 100%, 100%, 100%
canada, population, total, 182, 196, 216, 100%, 100%, 100%

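Basically, I want to calculate each value as a percentage share of the population for the matching segment (rural/urban/total), across multiple years.

For example, (row 1) percent_share = (10/80)*100 = 12.5%

(row 2) percent_share = (12/102)*100 = 11.76%

(row 3) percent_share = (22/182)*100 = 12.09%

I can't get past the group_by step in the chain and work out which function to supply to mutate: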

df = df %>%
     group_by (country, metric) %>%
     mutate(...)

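【Comments】:

    Tags: r dplyr percentage


    【Solution 1】:

    Edit: for the updated question data, which includes years

    This is easier if you move the years and the population totals into columns of their own. Here is one way to do it.

    Assuming your example data is in a data frame named df1, first gather the years into a single column.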

    library(dplyr)
    library(tidyr)
    
    df1 <- df1 %>% gather(Year, Value, 4:6)
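
The same reshaping step can also be sketched with pivot_longer(), which tidyr now recommends over gather(). This is only an illustration built on the sample data from the question; the name df1_long and the choice to strip the "value" prefix so that Year becomes an integer are assumptions, not part of the original answer. The rows come out in a different order than with gather(), which does not affect the later join.

```r
library(dplyr)
library(tidyr)

# Sample data copied from the question (metric labels kept exactly as posted).
df1 <- tribble(
  ~country, ~metric,      ~segment, ~value1990, ~value2000, ~value2010,
  "canada", "abc",        "rural",   10,  15,  16,
  "canada", "abc",        "urban",   12,  12,  18,
  "canada", "abc",        "total",   22,  27,  34,
  "canada", "xyz",        "rural",    6,   9,  10,
  "canada", "xyc",        "urban",    7,   8,   8,
  "canada", "xyc",        "total",   13,  17,  18,
  "canada", "population", "rural",   80,  86,  95,
  "canada", "population", "urban",  102, 110, 121,
  "canada", "population", "total",  182, 196, 216
)

# Stack the three year columns into Year/Value pairs.
# names_prefix drops the leading "value" from each column name, and
# names_transform turns the remaining digits into an integer year.
df1_long <- df1 %>%
  pivot_longer(
    cols = starts_with("value"),
    names_to = "Year",
    names_prefix = "value",
    names_transform = list(Year = as.integer),
    values_to = "Value"
  )
```

Selecting the columns by name with starts_with("value") rather than by position (4:6) keeps the call working if more year columns are added later.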
    

    Then filter for metric == "population" and join those rows back to the remaining (non-population) rows.

    df1 %>% filter(metric == "population") %>% 
      left_join(filter(df1, metric != "population"), 
                by = c("country", "segment", "Year")) %>% 
      select(country, segment, Year, population = Value.x, metric = metric.y, value = Value.y)
    

    Result:

       country segment      Year population metric value
    1   canada   rural value1990         80    abc    10
    2   canada   rural value1990         80    xyz     6
    3   canada   urban value1990        102    abc    12
    4   canada   urban value1990        102    xyc     7
    5   canada   total value1990        182    abc    22
    6   canada   total value1990        182    xyc    13
    7   canada   rural value2000         86    abc    15
    8   canada   rural value2000         86    xyz     9
    9   canada   urban value2000        110    abc    12
    10  canada   urban value2000        110    xyc     8
    11  canada   total value2000        196    abc    27
    12  canada   total value2000        196    xyc    17
    13  canada   rural value2010         95    abc    16
    14  canada   rural value2010         95    xyz    10
    15  canada   urban value2010        121    abc    18
    16  canada   urban value2010        121    xyc     8
    17  canada   total value2010        216    abc    34
    18  canada   total value2010        216    xyc    18
    

    Then add a mutate:

    df1 %>% filter(metric == "population") %>% 
      left_join(filter(df1, metric != "population"), 
                by = c("country", "segment", "Year")) %>% 
      select(country, segment, Year, population = Value.x, metric = metric.y, value = Value.y) %>% 
      mutate(percent_share = 100 * (value / population))
    

    Result:

       country segment      Year population metric value percent_share
    1   canada   rural value1990         80    abc    10     12.500000
    2   canada   rural value1990         80    xyz     6      7.500000
    3   canada   urban value1990        102    abc    12     11.764706
    4   canada   urban value1990        102    xyc     7      6.862745
    5   canada   total value1990        182    abc    22     12.087912
    6   canada   total value1990        182    xyc    13      7.142857
    7   canada   rural value2000         86    abc    15     17.441860
    8   canada   rural value2000         86    xyz     9     10.465116
    9   canada   urban value2000        110    abc    12     10.909091
    10  canada   urban value2000        110    xyc     8      7.272727
    11  canada   total value2000        196    abc    27     13.775510
    12  canada   total value2000        196    xyc    17      8.673469
    13  canada   rural value2010         95    abc    16     16.842105
    14  canada   rural value2010         95    xyz    10     10.526316
    15  canada   urban value2010        121    abc    18     14.876033
    16  canada   urban value2010        121    xyc     8      6.611570
    17  canada   total value2010        216    abc    34     15.740741
    18  canada   total value2010        216    xyc    18      8.333333
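
Since the question started from a group_by() chain, the same shares can also be sketched without a join: once the data is in long format, each (country, segment, Year) group contains exactly one population row, so a grouped mutate can divide by it directly. This is an alternative sketch, not the approach used in the answer above; the name shares is hypothetical, and the code assumes every group has exactly one row with metric == "population".

```r
library(dplyr)
library(tidyr)

# Sample data copied from the question (metric labels kept exactly as posted).
df1 <- tribble(
  ~country, ~metric,      ~segment, ~value1990, ~value2000, ~value2010,
  "canada", "abc",        "rural",   10,  15,  16,
  "canada", "abc",        "urban",   12,  12,  18,
  "canada", "abc",        "total",   22,  27,  34,
  "canada", "xyz",        "rural",    6,   9,  10,
  "canada", "xyc",        "urban",    7,   8,   8,
  "canada", "xyc",        "total",   13,  17,  18,
  "canada", "population", "rural",   80,  86,  95,
  "canada", "population", "urban",  102, 110, 121,
  "canada", "population", "total",  182, 196, 216
)

shares <- df1 %>%
  # Long format: one row per country/metric/segment/year.
  pivot_longer(starts_with("value"), names_to = "Year", values_to = "Value") %>%
  # Within each group, the denominator is the single population row.
  group_by(country, segment, Year) %>%
  mutate(percent_share = 100 * Value / Value[metric == "population"]) %>%
  ungroup()
```

Unlike the join version, this keeps the population rows in the result, each with a share of 100.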
    

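    【Comments】:

    • Thanks! I had thought about doing the same thing, but gave up on it because I have values for multiple years (in hindsight, I should have mentioned that in my question). Best, R
    • If you'd like to edit the question with better example data, we can try again.
    • Just updated the question. Thanks for your help with this!
    • Awesome! Thanks a lot!
    【Solution 2】:

    You can also simply group by country and segment and divide by max(value), since the population value should be the largest in each group. This was run on the original single-year version of the data, which had one value column, and it returns a proportion rather than a percentage (multiply by 100 for percent):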

    df %>% 
      group_by(country, segment) %>% 
      mutate(percent_share = value / max(value))
    
    # A tibble: 9 x 5
    # Groups:   segment [3]
      country metric     segment value percent_share
      <chr>   <chr>      <chr>   <dbl>         <dbl>
    1 canada  abc        rural      10        0.125 
    2 canada  abc        urban      12        0.118 
    3 canada  abc        total      22        0.121 
    4 canada  xyz        rural       6        0.075 
    5 canada  xyc        urban       7        0.0686
    6 canada  xyc        total      13        0.0714
    7 canada  population rural      80        1     
    8 canada  population urban     102        1     
    9 canada  population total     182        1
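
Dividing by max(value) relies on the population being the largest number in each group, which would break if some other metric could exceed it. A variant that picks the population row explicitly, and that handles the three year columns from the updated question in one step, could look like the sketch below. It is an illustration rather than part of the original answer: the name result and the intermediate pct_ prefix are assumptions, and it requires exactly one population row per country and segment.

```r
library(dplyr)

# Sample data copied from the updated question (metric labels kept as posted).
df <- tribble(
  ~country, ~metric,      ~segment, ~value1990, ~value2000, ~value2010,
  "canada", "abc",        "rural",   10,  15,  16,
  "canada", "abc",        "urban",   12,  12,  18,
  "canada", "abc",        "total",   22,  27,  34,
  "canada", "xyz",        "rural",    6,   9,  10,
  "canada", "xyc",        "urban",    7,   8,   8,
  "canada", "xyc",        "total",   13,  17,  18,
  "canada", "population", "rural",   80,  86,  95,
  "canada", "population", "urban",  102, 110, 121,
  "canada", "population", "total",  182, 196, 216
)

result <- df %>%
  group_by(country, segment) %>%
  # For every value* column, divide by that column's population row
  # within the current country/segment group.
  mutate(across(
    starts_with("value"),
    ~ 100 * .x / .x[metric == "population"],
    .names = "pct_{.col}"
  )) %>%
  ungroup() %>%
  # Rename pct_value1990 -> percent1990, and so on.
  rename_with(~ sub("^pct_value", "percent", .x), starts_with("pct_value"))
```

This keeps the wide layout asked for in the question, with percent1990, percent2000 and percent2010 added next to the original value columns.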
    

    【Comments】:
