Choose closest x elements by index in a list/vector: answers

【Question title】: Choose closest x elements by index in a list/vector
【Posted】: 2018-10-11 10:04:45
【Question】:

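If I have a vector such as x <- c(1,2,3,4,5,6,7,8,9), I want a function f such that f(vector, index, num) takes the vector and gives me the num elements "closest" to the one at that index. Examples: f(x,3,4) = c(1,2,4,5), f(x,1,5) = c(2,3,4,5,6), f(x,8,3) = c(6,7,9)

There is also the issue that when num is odd we need to decide whether the extra element comes from the left or the right. Let's choose the left (but the right would be fine too), i.e. f(x,4,5) = c(1,2,3,5,6) and f(x,7,3) = c(5,6,8)

I hope my question is clear; thanks for any help/responses!

EDIT: the original vector c(1:9) is arbitrary. The vector could be a vector of strings, or a vector of length 1000 containing shuffled numbers with repeats, etc.

i.e. c(1,7,4,2,3,7,2,6,234,56,8)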

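【Comments】:

Can you tell us more about your application? If x is always a contiguous integer range like your example 1:9, we can come up with a closed-form solution. Can we assume the vector is sorted? No duplicates? I don't see the point of writing a recursive search if we can find a simple closed form.
Hi, my bad: the vector could be a bunch of strings, e.g. c("a","b","c"), and in any order! I only chose 1:9 for simplicity.
Please don't pick such a simple example as 1:9; can you give a harder one? Oh, so when you say "closest" you mean "closest by index"; you don't want us to compare the element values.
That's right! Sorry, I should have picked a different vector. I'll edit the original question to reflect this.
Look, if num is even, there is always a closed-form solution: index - num/2 ... index + num/2, unless the index is near the start/end of the vector. If num is odd, you need to tell us how to break ties.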
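
The closed form mentioned in the last comment can be sketched as below. This is only an illustration under the comment's own restrictions: num is even (and at least 2) and idx lies at least num/2 positions away from both ends. The name closed_form_even is hypothetical and does not appear in the thread.

```r
# Hypothetical helper: closed form for even num, away from the edges.
closed_form_even <- function(v, idx, num) {
  half <- num / 2
  # Only valid when num is even and the window fits inside the vector
  stopifnot(num >= 2, num %% 2 == 0, idx - half >= 1, idx + half <= length(v))
  # num/2 positions to the left of idx, then num/2 to the right
  v[c((idx - half):(idx - 1), (idx + 1):(idx + half))]
}

closed_form_even(letters[1:7], 4, 4)  # "b" "c" "e" "f"
```

Because only positions are used, the same call works for strings, unsorted values and duplicates.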

Tags: r vector indices closest

【Solution 1】:

num_closest_by_indices <- function(v, idx, num) {
  # Distance (in index positions) of every element from idx
  i <- abs(seq_along(v) - idx)
  i[idx] <- +Inf # sentinel, so the element at idx itself is never returned

  # Start with the base case, where idx is not within (num/2) of the edge.
  # If there are not enough elements, incrementally widen the cutoff.
  for (cutoff_idx in seq(floor(num/2), num)) {
    keep <- which(i <= cutoff_idx)
    if (length(keep) >= num) {
      # Each iteration adds at most two indices, so we can overshoot by one.
      # Dropping the rightmost index breaks the tie towards the left.
      return(v[head(keep, num)])
    }
  }
}

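Here is a walk-through of the algorithm, using idx = 3 and num = 4: we rank the indices in order of desirability, then pick the lowest num legal ones: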

> seq_along(x)
  1 2 3 4 5 6 7 8 9
> seq_along(x) - idx
  -2 -1  0  1  2  3  4  5  6
> i <- abs(seq_along(x) - idx)
   2  1  0  1  2  3  4  5  6
> i[idx] <- +Inf # sentinel to prevent us returning the element itself
   2   1 Inf   1   2   3   4   5   6

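Now we can find the num elements with the smallest values (breaking ties arbitrarily, unless you have a preference (left)). Our first guess is all indices within (num/2) of idx; this might not be enough if idx is within (num/2) of the start/end.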

> i <= 2
  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
> v[i <= 2]
  1 2 4 5
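
The ranking idea in the walk-through above can also be written directly with order(). The following is a minimal sketch, not the answerer's code: closest_by_rank is a hypothetical name, and ties are broken towards the left by using the position itself as the second sort key.

```r
# Hypothetical helper: rank positions by distance from idx, keep the best num.
closest_by_rank <- function(v, idx, num) {
  pos <- seq_along(v)[-idx]          # candidate positions, idx excluded
  d <- abs(pos - idx)                # distance in index steps
  # Sort by distance, then by position, so the left neighbour wins a tie
  ranked <- pos[order(d, pos)]
  keep <- ranked[seq_len(min(num, length(ranked)))]
  v[sort(keep)]                      # return in original vector order
}

x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9)
closest_by_rank(x, 3, 4)  # 1 2 4 5
closest_by_rank(x, 4, 5)  # 1 2 3 5 6
closest_by_rank(x, 7, 3)  # 5 6 8
```

Sorting the whole distance vector costs more than the cutoff loop on long vectors, but it always returns exactly num elements (or every other element if fewer than num are available).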

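So, adapt @dash2's code to handle the corner cases where some indices are illegal (non-positive, or > length(x)), i.e. ! %in% 1:L. Then min(elems) tells us how many illegal indices we cannot pick, hence we must pick abs(min(elems)) more.

Notes:

In the end, the code is simpler and faster if it handles this via three piecewise cases. Aww.
It actually seems to simplify things if we pick (num+1) indices and then remove idx before returning the answer. Use result[-idx] to remove it.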
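
The two notes above (three piecewise cases, and picking num + 1 positions before dropping idx) might be combined as in the sketch below. The function name closest_by_window and the input checks are assumptions made for illustration; they are not part of the original answer.

```r
# Hypothetical sketch: a window of num + 1 positions, shifted to stay in bounds.
closest_by_window <- function(v, idx, num) {
  n <- length(v)
  stopifnot(idx >= 1, idx <= n, num >= 0, num < n)
  start <- idx - ceiling(num / 2)    # the extra element goes left when num is odd
  if (start < 1) {
    start <- 1                       # case 1: window would start before position 1
  } else if (start + num > n) {
    start <- n - num                 # case 2: window would run past the end
  }                                  # case 3: window already fits
  window <- start:(start + num)      # num + 1 positions, idx among them
  v[setdiff(window, idx)]            # drop idx itself, leaving num positions
}

x <- c(1, 7, 4, 2, 3, 7, 2, 6, 234, 56, 8)
closest_by_window(x, 1, 5)   # 7 4 2 3 7
closest_by_window(x, 11, 3)  # 6 234 56
```

Here idx is removed by position with setdiff() before subsetting, so the function never compares element values.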

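【Comments】:

Wow, thanks everyone for the answers! It looks like this problem is trickier than I originally thought (probably why I was struggling a bit, haha), but the solutions look good :)

【Solution 2】:

Like this: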

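f <- function (vec, elem, n) {
  # Window of n + 1 positions around elem; the extra one goes left when n is odd
  elems <- seq(elem - ceiling(n/2), elem + floor(n/2))
  # Shift the window left if it runs past the end of vec
  if (max(elems) > length(vec)) elems <- elems - (max(elems) - length(vec))
  # Shift the window right if it starts before position 1
  if (elems[1] < 1) elems <- elems + (1 - elems[1])
  # Drop elem itself, leaving n positions
  elems <- setdiff(elems, elem)
  vec[elems]
}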

This gives:

> f(1:9, 1, 5)
[1] 2 3 4 5 6
> f(1:9, 9, 5)
[1] 4 5 6 7 8
> f(1:9, 2, 5)
[1] 1 3 4 5 6
> f(1:9, 4, 5)
[1] 1 2 3 5 6
> f(1:9, 4, 4)
[1] 2 3 5 6
> f(1:9, 2, 4)
[1] 1 3 4 5
> f(1:9, 1, 4)
[1] 2 3 4 5
> f(1:9, 9, 4)
[1] 5 6 7 8

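【Comments】:

In the edge cases, some of those indices will be illegal (negative or > length). So you have to choose num of the legal indices, either by iterating or by special-casing.
Edited. I thought the original poster would be happy to get an error in those cases; I hadn't noticed f(1:9, 1, 5).
So, three piecewise cases then.

【Solution 3】:

Start a function with the variable argument x, followed by the reference table and n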

.nearest_n <- function(x, table, n) {

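The algorithm assumes that table is numeric, has no duplicates, and contains only finite values; n must be less than or equal to the length of table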

    ## assert & setup
    stopifnot(
        is.numeric(table), !anyDuplicated(table), all(is.finite(table)),
        n <= length(table)
    )

Sort the table, then 'clamp' it by adding -Inf and Inf at the ends

    ## sort and clamp
    table <- c(-Inf, sort(table), Inf)
    len <- length(table)

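Find the interval of table in which x falls; findInterval() uses an efficient search. Use the interval index as the initial lower index, and add 1 for the upper index, making sure to stay in bounds.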

    ## where to start?
    lower <- findInterval(x, table)
    upper <- min(lower + 1L, len)

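Find the n nearest neighbours by comparing the distance from x to the values at the lower and upper indices, recording whichever is nearer, and then decrementing the lower index or incrementing the upper index as appropriate, making sure to stay in bounds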

    ## find
    nearest <- numeric(n)
    for (i in seq_len(n)) {
        if (abs(x - table[lower]) < abs(x - table[upper])) {
            nearest[i] <- table[lower]
            lower <- max(1L, lower - 1L)
        } else {
            nearest[i] <- table[upper]
            upper <- min(len, upper + 1L)
        }
    }

Then return the solution and finish the function

    nearest
}

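The code might look verbose, but it is actually relatively efficient, because the only operations on the entire vector (sort(), findInterval()) are implemented efficiently in R.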
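
To make the "where to start?" step more concrete, here is a small sketch of what findInterval() returns on a clamped table. The values and the variable name sorted are made up for illustration; only findInterval(), sort() and pmin() come from the answer.

```r
# Illustrative values only; 'sorted' plays the role of the clamped table.
sorted <- c(-Inf, sort(c(29, 5, 41, 50)), Inf)   # -Inf 5 29 41 50 Inf

# findInterval() returns, for each query, the index of the largest entry <= it
lower <- findInterval(c(30, 20, 100), sorted)    # 3 2 5
upper <- pmin(lower + 1L, length(sorted))        # 4 3 6

sorted[lower]   # 29  5 50  : nearest candidate at or below each query
sorted[upper]   # 41 29 Inf : nearest candidate above each query
```

The -Inf and Inf sentinels mean a query outside the range of the table still gets a valid lower and upper index; the sentinel is infinitely far away, so the comparison never picks it while finite candidates remain.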

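A particular advantage of this approach is that it can be vectorized in its first argument, by computing the test for using lower (use_lower = ...) as a vector and using pmin() / pmax() for the clamping.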

.nearest_n <- function(x, table, n) {
    ## assert & setup
    stopifnot(
        is.numeric(table), !anyDuplicated(table), all(is.finite(table)),
        n <= length(table)
    )

    ## sort and clamp
    table <- c(-Inf, sort(table), Inf)
    len <- length(table)

    ## where to start?
    lower <- findInterval(x, table)
    upper <- pmin(lower + 1L, len)

    ## find
    nearest <- matrix(0, nrow = length(x), ncol = n)
    for (i in seq_len(n)) {
        use_lower <- abs(x - table[lower]) < abs(x - table[upper])
        nearest[,i] <- ifelse(use_lower, table[lower], table[upper])
        lower[use_lower] <- pmax(1L, lower[use_lower] - 1L)
        upper[!use_lower] <- pmin(len, upper[!use_lower] + 1L)
    }

    # return
    nearest
}

For example

> set.seed(123)
> table <- sample(100, 10)
> sort(table)
 [1]  5 29 41 42 50 51 79 83 86 91
> .nearest_n(c(30, 20), table, 4)
     [,1] [,2] [,3] [,4]
[1,]   29   41   42   50
[2,]   29    5   41   42

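Generalize this by accepting arguments of any type and coercing them to the required form, using a reference look-up table table0 and the indices into it, table1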

nearest_n <- function(x, table, n) {
    ## coerce to common form
    table0 <- sort(unique(c(x, table)))
    x <- match(x, table0)
    table1 <- match(table, table0)

    ## find nearest
    m <- .nearest_n(x, table1, n)

    ## result in original form
    matrix(table0[m], nrow = nrow(m))
}

For example...

> set.seed(123)
> table <- sample(c(letters, LETTERS), 30)
> nearest_n(c("M", "Z"), table, 5)
     [,1] [,2] [,3] [,4] [,5]
[1,] "o"  "L"  "O"  "l"  "P" 
[2,] "Z"  "z"  "Y"  "y"  "w"
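
The coercion step inside nearest_n() may be easier to follow on a tiny input. The sketch below uses made-up values and the hypothetical names query and ref; it only shows how sort(), unique() and match() turn arbitrary sortable values into integer ranks and back.

```r
# Made-up inputs for illustration; 'query' and 'ref' are hypothetical names.
query <- c("M", "Z")
ref <- c("a", "Q", "z", "B")

# Common look-up table: every distinct value, in sort order
table0 <- sort(unique(c(query, ref)))

# Replace each value by its rank in table0; ranks are plain integers,
# so the numeric-only helper can work on them
query_rank <- match(query, table0)
ref_rank <- match(ref, table0)

# Indexing table0 by rank recovers the original values
table0[query_rank]  # "M" "Z"
table0[ref_rank]    # "a" "Q" "z" "B"
```

The actual ranks depend on the collation order of the current locale, which is why upper-case and lower-case letters interleave in the example output above.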
