Adding a variable to a data frame in R and sorting by that variable

【Question title】: Adding a variable to a dataframe in R and sorting by said variable
【Posted】: 2020-10-02 15:44:38
【Question】:

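I am using the Cigarette data frame from the Ecdat library. I first want to create a new variable in the data frame, per capita income (income divided by population), using the mutate function from dplyr. I then want to rank the data by state personal income per capita, so that the row with rank 1 has the highest per capita income.

It seems I can create the variable with mutate(Cigarette, income_population = income/pop). However, the rank function does not seem to work when I tell it to rank by the new income_population variable.

Any suggestions?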

【Comments】:

Can you share a reproducible example using dput()?
@sehoskins This looks like a job for arrange(Cigarette, income_population).

Tags: r dplyr

【Solution 1】:

Given the full Cigarette dataset (https://github.com/cran/Ecdat/blob/master/data/Cigarette.rda):

library(Ecdat)  # provides the Cigarette data frame
library(dplyr)

Cigarette %>%
  mutate(income_population = income / pop) %>%  # per capita income
  arrange(desc(income_population)) %>%          # highest value first
  head()
#   state year   cpi     pop   packpc    income   tax   avgprs     taxs income_population
# 1    CT 1995 1.524 3265293 79.47219 104315120 74.00 218.2805 86.35550          31.94663
# 2    CT 1994 1.482 3268346 77.62336  99787808 71.00 215.9573 83.22400          30.53159
# 3    CT 1993 1.445 3272325 79.79036  96866464 67.00 214.8885 79.16350          29.60172
# 4    NJ 1995 1.524 7965523 80.37137 233208576 64.00 203.0872 75.49550          29.27725
# 5    CT 1992 1.403 3274997 84.24435  93778704 63.75 209.2263 75.59300          28.63475
# 6    MA 1995 1.524 6062335 76.62064 170051568 75.00 217.1050 85.33833          28.05051

A small sample of the data:

# dput(head(Cigarette))
structure(list(state = structure(1:6, .Label = c("AL", "AR", "AZ", "CA", "CO", "CT", "DE", "FL", "GA", "IA", "ID", "IL", "IN", "KS", "KY", "LA", "MA", "MD", "ME", "MI", "MN", "MO", "MS", "MT", "NC", "ND", "NE", "NH", "NJ", "NM", "NV", "NY", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX", "UT", "VA", "VT", "WA", "WI", "WV", "WY"), class = "factor"), year = c(1985L, 1985L, 1985L, 1985L, 1985L, 1985L), cpi = c(1.07599997520447, 1.07599997520447, 1.07599997520447, 1.07599997520447, 1.07599997520447, 1.07599997520447), pop = c(3973000L, 2327000L, 3184000L, 26444000L, 3209000L, 3201000L), packpc = c(116.486282348633, 128.534591674805, 104.522613525391, 100.363037109375, 112.963539123535, 109.278350830078), income = c(46014968L, 26210736L, 43956936L, 447102816L, 49466672L, 60063368L), tax = c(32.5000038146973, 37, 31, 26, 31, 42), avgprs = c(102.181671142578, 101.474998474121, 108.578750610352, 107.837341308594, 94.2666625976563, 128.024993896484), taxs = c(33.3483352661133, 37, 36.1704177856445, 32.1040000915527, 31, 51.4833335876465)), row.names = c("1", "2", "3", "4", "5", "6"), class = "data.frame")

And the result on that abridged data:

head(Cigarette) %>%
  mutate(income_population = income / pop) %>%
  arrange(desc(income_population))
#   state year   cpi      pop   packpc    income  tax    avgprs     taxs income_population
# 1    CT 1985 1.076  3201000 109.2784  60063368 42.0 128.02499 51.48333          18.76394
# 2    CA 1985 1.076 26444000 100.3630 447102816 26.0 107.83734 32.10400          16.90753
# 3    CO 1985 1.076  3209000 112.9635  49466672 31.0  94.26666 31.00000          15.41498
# 4    AZ 1985 1.076  3184000 104.5226  43956936 31.0 108.57875 36.17042          13.80557
# 5    AL 1985 1.076  3973000 116.4863  46014968 32.5 102.18167 33.34834          11.58192
# 6    AR 1985 1.076  2327000 128.5346  26210736 37.0 101.47500 37.00000          11.26375
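
Note that a pipeline like the ones above returns a new data frame and leaves Cigarette itself unchanged, so the new column is only kept if the result is assigned to a name. As a minimal sketch of the same two steps in base R (not part of the original answer), the following assumes the six rows from the dput() output above and uses the hypothetical names cig_small and cig_sorted:

```r
# Six rows copied from the dput() output above (only the columns needed here).
cig_small <- data.frame(
  state  = c("AL", "AR", "AZ", "CA", "CO", "CT"),
  pop    = c(3973000, 2327000, 3184000, 26444000, 3209000, 3201000),
  income = c(46014968, 26210736, 43956936, 447102816, 49466672, 60063368)
)

# Add the per capita income column; assigning into the data frame keeps it.
cig_small$income_population <- cig_small$income / cig_small$pop

# order() with decreasing = TRUE gives row indices from highest to lowest value.
cig_sorted <- cig_small[order(cig_small$income_population, decreasing = TRUE), ]

print(cig_sorted)
```

The row order should match the abridged dplyr result above, with CT first and AR last.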

【Comments】:

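【Solution 2】:

Assuming you actually want to add a variable that contains the rank, with 1 as the highest rank (for clarity, only some of the columns and only the first 10 rows are shown):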

library(Ecdat)
library(dplyr)

Cigarette %>% 
   mutate(income_population = income/pop) %>% 
   arrange(desc(income_population)) %>% 
   mutate(inc_pop_rank = row_number(-income_population)) %>%
   slice(1:10) %>%
   select(state, year, income_population, inc_pop_rank)

   state year income_population inc_pop_rank
1     CT 1995          31.94663            1
2     CT 1994          30.53159            2
3     CT 1993          29.60172            3
4     NJ 1995          29.27725            4
5     CT 1992          28.63475            5
6     MA 1995          28.05051            6
7     NJ 1994          27.88522            7
8     NY 1995          27.72108            8
9     NJ 1993          27.10118            9
10    MD 1995          26.89587           10
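
Because the data contain one row per state and year, the overall ranking above mixes years (CT appears several times). If the intent is instead to rank states within each year, which is only an assumption about the question, one possible sketch groups by year before ranking. The data frame cig_panel and the column rank_in_year are hypothetical names, and the six rows are copied from the outputs shown earlier:

```r
library(dplyr)

# Rows copied from the outputs shown above: three states in each of two years.
cig_panel <- data.frame(
  state  = c("CT", "CA", "CO", "CT", "NJ", "MA"),
  year   = c(1985, 1985, 1985, 1995, 1995, 1995),
  pop    = c(3201000, 26444000, 3209000, 3265293, 7965523, 6062335),
  income = c(60063368, 447102816, 49466672, 104315120, 233208576, 170051568)
)

ranked <- cig_panel %>%
  mutate(income_population = income / pop) %>%
  group_by(year) %>%                                          # rank separately per year
  mutate(rank_in_year = min_rank(desc(income_population))) %>% # 1 = highest in that year
  ungroup() %>%
  arrange(year, rank_in_year)

print(ranked)
```

min_rank() gives tied values the same rank, whereas row_number() breaks ties by position; which one is appropriate depends on how ties should be treated.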

【Comments】: