Select nth element in data frame by factor

【Question title】: Select nth element in data frame by factor
【Posted】: 2023-03-14 05:00:01
【Question description】:

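I have a data frame with a text column name and a factor city. It is sorted alphabetically, first by city and then by name. Now I need a data frame that contains only the nth element within each city, keeping this order. How can this be done neatly, without a loop?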

I have:

name    city
John    Atlanta
Josh    Atlanta
Matt    Atlanta
Bob     Boston
Kate    Boston
Lily    Boston
Matt    Boston

I want a function that returns the nth element by city, i.e., if n is 3, then:

name    city
Matt    Atlanta
Lily    Boston

If n is out of range for a given city, e.g. for the 4th element, it should return NULL for name:

name    city
NULL    Atlanta
Matt    Boston

Using base R only, please?

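【Question comments】:

Can you give a reproducible example? Say, a short sample data frame similar to the one you have, and another showing what you want it to become?
With plyr: ddply(yourdata, .(city), function(x, n) x[n,], n=10). But what should happen if the n you choose is larger than the number of entries for a city?
Can this be done with dplyr?

Tags: r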

【Solution 1】:

Using by in base R:

Set up some test data, including an extra city for which the requested position is out of range:

test <- read.table(text="name    city
John    Atlanta
Josh    Atlanta
Matt    Atlanta
Bob     Boston
Kate    Boston
Lily    Boston
Matt    Boston
Bob     Seattle
Kate    Seattle",header=TRUE)

Get the 3rd item for each city:

do.call(rbind,by(test,test$city,function(x) x[3,]))

Result:

        name    city
Atlanta Matt Atlanta
Boston  Lily  Boston
Seattle <NA>    <NA>
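The all-NA row for Seattle follows from how base R indexes data frames: asking for a row past the last one returns a row of NA values instead of raising an error, so the city label is lost along with the name. A minimal sketch of that behaviour, using a hypothetical two-row data frame called seattle (not part of the original answer):

```r
# Hypothetical two-row group, mirroring the Seattle rows of the test data
seattle <- data.frame(
  name = c("Bob", "Kate"),
  city = c("Seattle", "Seattle"),
  stringsAsFactors = FALSE
)

# Row 3 does not exist; base R returns a single row filled with NA
third <- seattle[3, ]
print(third)

# Every column of that row is NA, including city
print(all(is.na(third)))
```

This is why the helper function in this answer has to restore the city value from the row names that rbind attaches to the result.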

To get exactly what you want, here is a small function:

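nthrow <- function(dset, splitvar, n) {
    # take row n from each group; groups with fewer than n rows yield an all-NA row
    result <- do.call(rbind, by(dset, dset[splitvar], function(x) x[n, ]))
    # restore the group label lost in all-NA rows (rbind keeps it as the row name)
    lost <- is.na(result[, splitvar])
    result[, splitvar][lost] <- row.names(result)[lost]
    row.names(result) <- NULL
    return(result)
}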

Call it like this:

nthrow(test,"city",3)

Result:

  name    city
1 Matt Atlanta
2 Lily  Boston
3 <NA> Seattle
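The question also asks about the 4th element, where Atlanta runs out of rows. As a sketch, here is the same test data and nthrow function called with n = 4. Note that the missing name comes back as NA, not NULL, because a data frame column cannot hold NULL.

```r
# Same test data as in the answer
test <- read.table(text = "name    city
John    Atlanta
Josh    Atlanta
Matt    Atlanta
Bob     Boston
Kate    Boston
Lily    Boston
Matt    Boston
Bob     Seattle
Kate    Seattle", header = TRUE)

# The function from the answer, reproduced so this block runs on its own
nthrow <- function(dset, splitvar, n) {
    result <- do.call(rbind, by(dset, dset[splitvar], function(x) x[n, ]))
    result[, splitvar][is.na(result[, splitvar])] <- row.names(result)[is.na(result[, splitvar])]
    row.names(result) <- NULL
    return(result)
}

# Only Boston has a 4th row; Atlanta and Seattle get NA for name
fourth <- nthrow(test, "city", 4)
print(fourth)
```

Every city still appears once in the result, in alphabetical order, which matches the layout requested in the question.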

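【Comments】:

Beat me to it. @sashkello Please try to be as specific as possible in your initial question, especially when using additional packages is not an option, since much of R is built on user-contributed features.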

【Solution 2】:

A data.table solution

library(data.table)
DT <- data.table(test)

# return all columns from the subset data.table
n <- 4
DT[,.SD[n,] ,by = city]
##      city name
## 1: Atlanta   NA
## 2:  Boston Matt
## 3: Seattle   NA

# if you just want the nth element of `name` 
# (excluding other columns that might be there)
# any of the following would work

DT[,.SD[n,] ,by = city, .SDcols = 'name']


DT[, .SD[n, list(name)], by = city]


DT[, list(name = name[n]), by = city]

【Comments】:

With selectedCol = "city" and step = 4, DT[, .SD[seq(1, .N, by = step), ], by = selectedCol] works, even though I don't understand why, for better or worse.
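On why that expression works: in data.table, .N is the number of rows in the current group, so seq(1, .N, by = step) yields the positions 1, 1 + step, 1 + 2 * step, and so on within each group. That selects every step-th row per group, which is a different task from picking only the nth row. Below is a rough base R sketch of the same indexing idea. It assumes the test data from Solution 1 and uses step = 2 instead of 4 so that the small groups return more than their first row; the names pieces and every_nth are made up for illustration.

```r
# Test data from Solution 1, rebuilt so this block runs on its own
test <- data.frame(
  name = c("John", "Josh", "Matt", "Bob", "Kate", "Lily", "Matt", "Bob", "Kate"),
  city = c(rep("Atlanta", 3), rep("Boston", 4), rep("Seattle", 2)),
  stringsAsFactors = FALSE
)

step <- 2  # assumption: smaller than the comment's 4, to suit this tiny data set

# nrow(g) plays the role of data.table's .N: the row count of one group
pieces <- lapply(split(test, test$city), function(g) {
  g[seq(1, nrow(g), by = step), , drop = FALSE]
})

# Stack the per-group pieces and drop the generated row names
every_nth <- do.call(rbind, pieces)
row.names(every_nth) <- NULL
print(every_nth)
```

With step = 2 this keeps rows 1 and 3 of Atlanta and Boston and row 1 of Seattle.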

【Solution 3】:

You can use plyr for this:

dat <- structure(list(name = c("John", "Josh", "Matt", "Bob", "Kate",
"Lily", "Matt"), city = c("Atlanta", "Atlanta", "Atlanta", "Boston", "Boston", "Boston", "Boston")), .Names = c("name", "city"), class= "data.frame", row.names = c(NA, -7L))

library(plyr)

ddply(dat, .(city), function(x, n) x[n,], n=3)

> ddply(dat, .(city), function(x, n) x[n,], n=3)
  name    city
1 Matt Atlanta
2 Lily  Boston
> ddply(dat, .(city), function(x, n) x[n,], n=4)
  name   city
1 <NA>   <NA>
2 Matt Boston
>

There are plenty of other options using base R, data.table, or sqldf...
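As one possible base R variant, here is a minimal sketch built on split and vapply. The function name nth_by_group and its arguments are hypothetical, and it assumes the value column is a character vector. Because it takes the city labels from the names of the split, an out-of-range position keeps its city label and only the name becomes NA.

```r
# Same data as dat above, written out with data.frame()
dat <- data.frame(
  name = c("John", "Josh", "Matt", "Bob", "Kate", "Lily", "Matt"),
  city = c("Atlanta", "Atlanta", "Atlanta", "Boston", "Boston", "Boston", "Boston"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: nth value of a character column within each group
nth_by_group <- function(df, group, value, n) {
  # split the value column by group; indexing past the end of a vector gives NA
  picked <- vapply(split(df[[value]], df[[group]]), function(v) v[n], character(1))
  # group labels come from the names of the split, so they are never lost
  out <- data.frame(unname(picked), names(picked), stringsAsFactors = FALSE)
  names(out) <- c(value, group)
  out
}

res3 <- nth_by_group(dat, "city", "name", 3)
res4 <- nth_by_group(dat, "city", "name", 4)
print(res3)
print(res4)
```

Unlike the ddply output for n=4 above, the Atlanta row here keeps its city value.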

【Comments】: