Python: How to get the closest key from the given key?

【Question title】: Python: How to get the closest key from the given key?
【Posted】: 2016-10-17 11:52:56
【Question】:

Given a dictionary:

sample = {
    '123': 'Foo', 
    '456': 'Bar', 
    '789': 'Hello', 
    '-111': 'World'
}

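What is the most efficient way (approach and/or data structure) to get the value for the closest (equal or lesser) key from the dictionary?

Note:
1. Even though the keys are strings, the comparison should be numeric.
2. Keys can be "negative".

Example: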

get_nearest_less_element(sample, '456') # returns 'Bar'
get_nearest_less_element(sample, '235') # returns 'Foo'
get_nearest_less_element(sample, '455') # returns 'Foo'
get_nearest_less_element(sample, '999') # returns 'Hello'
get_nearest_less_element(sample, '0') # returns 'World'
get_nearest_less_element(sample, '-110') # returns 'World'
get_nearest_less_element(sample, '-999') # should return an error since it is beyond the lower bound

Additional question:
Given the same data set, would a sorted OrderedDict, a List of Tuples, or any other Python data structure be a better approach?

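【Question comments】:

Please clarify your question.
Should '1000' get 'Foo' or 'Hello'?
For example, should "500" be considered higher than "1000" (string comparison) or lower (numeric comparison)?
The dictionary is mostly a red herring. Consider a list of tuples. How do you find the "closest" such tuple? I would first sort the data using a key function on abs(int(targetvalue) - int(itemvalue)) or some variant, then pick the first one.
The comparison should be numeric.

Tags: python python-2.7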

【Solution 1】:

def get_nearest_less_element(d, k):
    k = int(k)
    return d[str(max(key for key in map(int, d.keys()) if key <= k))]

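Edited to update with @Paul Hankin's code, but using <=; I'm not sure it needs a branch at all. Convert all the keys to numbers, find the ones less than or equal to k, and take the largest one (if k is there you get it, otherwise you get the next largest), then convert it back to a string and look it up in the dictionary.

Test: https://repl.it/C2dN/0

I don't know whether this is the most efficient idea; since the dictionary you get is unordered, you have to iterate over every element because any one of them could be the next largest, and since you need a numeric comparison you have to convert them all to integers. It seems to me that any other structure would cost more to initialise, because you would have to inspect every item first to put it into your structure.

But it depends on your use case: if k is very likely to be in the dictionary, it makes sense to change my code to have an if k in d: return d[k] else: ... branch, because skipping the generator expression in that case will be faster. If it is most likely not in the dictionary, that won't help.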
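A minimal sketch of the branch described above. The question only says a key below the lower bound "should return an error", so raising KeyError here is my assumption rather than something the answer specifies.

```python
def get_nearest_less_element(d, k):
    """Return the value for the largest key that is numerically <= k."""
    if k in d:                       # fast path: exact string match, O(1)
        return d[k]
    target = int(k)
    try:
        # key=int ranks the string keys by numeric value; O(n) time, O(1) space
        best = max((key for key in d if int(key) <= target), key=int)
    except ValueError:               # max() of an empty sequence
        raise KeyError("%s is below the smallest key" % k)
    return d[best]


sample = {'123': 'Foo', '456': 'Bar', '789': 'Hello', '-111': 'World'}
print(get_nearest_less_element(sample, '456'))    # Bar
print(get_nearest_less_element(sample, '235'))    # Foo
print(get_nearest_less_element(sample, '-110'))   # World
```

The fast path only pays off when exact hits are common; on a miss the whole dictionary is still scanned once.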

An untested sketch of a version that sorts the keys is below; it is slower for a single use, but may be faster when you query many times:

# cache to store sorted keys between function calls
# nb. you will have to invalidate this cache (reset to [])
# when you get a new dictionary
sorted_keys = []

def get_nearest_less_element(d, k):
    global sorted_keys       # the cache is rebound below, so declare it global

    if k in d:               # quick return if key is in dict
        return d[k]

    # costly sort of the keys, only do this once
    if not sorted_keys:
        sorted_keys = sorted(int(key) for key in d.keys())

    # quick run through the sorted key list up
    # to the last item less than or equal to k
    k = int(k)
    nearest = None
    for item in sorted_keys:
        if item <= k:
            nearest = item
        else:
            break

    if nearest is None:      # k is below the smallest key
        raise KeyError(k)
    return d[str(nearest)]

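【Comments】:

Why not max(key for key in d.keys() if int(key) < int(k)) for the second branch? That is O(1) space, O(N) time, compared to your O(N) space, O(N log N) time.
@PaulHankin I was reconsidering whether the sort is necessary, but your solution is even more concise than I had in mind. I think it can be a single return rather than an if.
Using the same algorithm, would a sorted OrderedDict or a list of tuples be faster?
An OrderedDict keeps the order in which you added things to it; it does not sort them into order. That doesn't help at all. The real trade-off I can see is that a sorted list is slower to build and faster to query. It depends on whether you are going to query the same data a lot, or fetch new dictionaries from redis a lot and query each one only a little.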

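【Solution 2】:

The following function returns the value if the key exists; otherwise it finds the largest key among the keys that are less than the input key.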

def get_nearest_less_element(sample, key):
    if key in sample:
        return sample[key]
    else:
        # key=int makes max() compare the string keys numerically
        # instead of character by character
        return sample[max((x for x in sample if int(x) < int(key)), key=int)]

print(get_nearest_less_element(sample, '456'))
print(get_nearest_less_element(sample, '235'))
print(get_nearest_less_element(sample, '455'))
print(get_nearest_less_element(sample, '999'))

Output:

Bar
Foo
Foo
Hello

Edit: Edited the answer based on Paul's comment.

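【Comments】:

I just left the same comment on a now-deleted answer, but there is no need to use sort to find the maximum of a collection. max(x for x in sample.keys() if int(x) < int(key)) is O(1) space, O(N) time, whereas sorted uses O(N) space, O(N log N) time.
@PaulHankin, sure. I will incorporate the suggestion and edit the answer.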
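One caveat with the max(...) expression in the comment above: while the keys are still strings, max compares them character by character, which only coincides with numeric order for some inputs. A small sketch with hypothetical keys (largest_key_below is an illustrative name, not part of the answer):

```python
def largest_key_below(keys, target):
    """Return the key whose integer value is the largest one below target."""
    # key=int makes max() rank the strings by numeric value
    return max((k for k in keys if int(k) < int(target)), key=int)


keys = ['9', '10', '-5']   # hypothetical keys chosen to expose the difference

print(max(k for k in keys if int(k) < 100))   # 9  (string comparison)
print(largest_key_below(keys, '100'))         # 10 (numeric comparison)
```

The complexity stays O(1) space and O(N) time, since key=int only changes how the candidates are ranked.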

【Solution 3】:

Here is a solution. It finds the closest key based on a numeric comparison:

sample = {'123': 'Foo', '456': 'Bar', '789': 'Hello'}

def get_nearest_less_element(inpDict, targetNum):
    diff = 2**32 - 1 # Very big number.
    currentKey = None
    for i in inpDict.keys():  # iterate the argument, not the global sample
        # absolute distance, so the match may lie above or below targetNum
        newDiff = abs(int(i) - targetNum)
        if newDiff < diff:
            currentKey = i
            diff = newDiff
    return inpDict[currentKey]

print(get_nearest_less_element(sample, 500))
# Prints Bar

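This is just a single loop over the dictionary, so it runs in O(n) time and O(1) extra space.

【Comments】:

Ah, my mistake, this just gets the nearest key, not the nearest one at or below the target.
You can speed it up by adding if not diff or newDiff < diff: and setting diff=None at the start.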
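For comparison, a sketch of the same single loop restricted to keys at or below the target, which is what the question asks for. It follows the comment's idea of starting from None instead of a very big number, but tests with "is None" so that a legitimate value of 0 is not mistaken for "unset". Raising KeyError when nothing qualifies is my assumption.

```python
def get_nearest_less_element(inp_dict, target_num):
    best_key = None   # string key with the largest numeric value <= target so far
    best_num = None
    for key in inp_dict:
        num = int(key)
        if num <= target_num and (best_num is None or num > best_num):
            best_key, best_num = key, num
    if best_key is None:
        raise KeyError("%s is below the smallest key" % target_num)
    return inp_dict[best_key]


sample = {'123': 'Foo', '456': 'Bar', '789': 'Hello'}
print(get_nearest_less_element(sample, 500))   # Bar
print(get_nearest_less_element(sample, 455))   # Foo
```

This keeps the O(n) time and O(1) extra space of the loop above.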

【Solution 4】:

This is how I did it:

def get_nearest_less_element(sample, key):
    try:
        if key not in sample:
            candidates = []
            for keys in sample:
                if int(keys) < int(key):
                    candidates.append(keys)
            # key=int compares the string keys numerically;
            # max() raises ValueError when candidates is empty
            return sample[max(candidates, key=int)]
        return sample[key]
    except ValueError:
        print("key is beyond lower bounds")

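【Comments】:

【Solution 5】:

If you create or update the sample only once or infrequently, but look up values repeatedly, it is most efficient to precompute a sorted list of the numbers in O(n log n) time. Then the whole dictionary does not need to be scanned; binary search gives O(log n) access. The Python standard library has a module for this, bisect.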

from bisect import bisect

def nearest_index(sorted_keys, elem):
    # insertion point to the right of any entries equal to elem
    idx = bisect(sorted_keys, elem)
    if idx >= len(sorted_keys):
        idx = len(sorted_keys) - 1
    elif idx > 0:
        # find closest of the two neighbors
        if elem <= (sorted_keys[idx-1] + sorted_keys[idx])/2.0:
            idx -= 1
    return idx

sample = {'123': 'Foo', '456': 'Bar', '789': 'Hello'}
sorted_keys = sorted(int(k) for k in sample.keys())

def get_nearest_element(sample, sorted_keys, elem):
    elem_int = int(elem)
    idx_nearest = nearest_index(sorted_keys, elem_int)
    return sample[str(sorted_keys[idx_nearest])]

for elem in ['456', '235', '455', '999']:
    print(get_nearest_element(sample, sorted_keys, elem))
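The function above returns the nearest key in either direction. For the "nearest lower" lookup the question asks for, a sketch along the same lines could use bisect_right directly. Here build_index is a hypothetical helper name, KeyError for out-of-range input is my assumption, and the keys are assumed to be canonical integer strings (so that str(int(k)) == k).

```python
from bisect import bisect_right


def build_index(d):
    """Sort the numeric keys once: O(n log n)."""
    return sorted(int(k) for k in d)


def get_nearest_less_element(d, sorted_keys, elem):
    """Return the value for the largest key <= elem in O(log n)."""
    # bisect_right returns how many sorted keys are <= the target
    idx = bisect_right(sorted_keys, int(elem))
    if idx == 0:
        raise KeyError("%s is below the smallest key" % elem)
    return d[str(sorted_keys[idx - 1])]


sample = {'123': 'Foo', '456': 'Bar', '789': 'Hello', '-111': 'World'}
sorted_keys = build_index(sample)

for elem in ['456', '235', '455', '999', '0', '-110']:
    print(get_nearest_less_element(sample, sorted_keys, elem))
```

The sorted list has to be rebuilt (or updated with bisect.insort) whenever the dictionary changes.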

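【Comments】:

【Solution 6】:

Given your data set, the most efficient data structure in terms of setup and lookup time complexity is a binary search tree, which gives you O(n log n) setup and O(log n) lookup time complexity with O(n) space complexity. These bounds assume the tree stays reasonably balanced; the plain BST below can degrade to O(n) lookups if keys are inserted in sorted order.

The standard BST algorithm does not include your two special constraints (as I understand them):

1. Return the value for the largest key that does not exceed the requested key
2. Limit the search space to between the minimum and maximum values in the map

Here is a BST implementation based on this implementation: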

class Node(object):

    def __init__(self, key, value, parent):
        self.left = None
        self.right = None
        self.value = value
        self.key = key
        self.parent = parent

    def __str__(self):
        return ":".join(map(str, (self.key, self.value)))



class BinarySearchTree(object):

    def __init__(self):
        self.root = None


    def getRoot(self):
        return self.root


    def __setitem__(self, key, value):
        if(self.root == None):
            self.root = Node(key, value, None)
        else:
            self._set(key, value, self.root)


    def _set(self, key, value, node):
        if key == node.key:
            node.value = value
        elif key < node.key:
            if(node.left != None):
                self._set(key, value, node.left)
            else:
                node.left = Node(key, value, node)
        else:
            if(node.right != None):
                self._set(key, value, node.right)
            else:
                node.right = Node(key, value, node)


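    def __contains__(self, key):
        # _get needs a start node, and an empty tree contains nothing
        return self.root is not None and self._get(key, self.root) is not None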


    def __getitem__(self, key):
        if(self.root != None):
            return self._get(key, self.root)
        else:
            return None


    def _get(self, key, node):
        if key == node.key:
            return node.value
        elif key < node.key and node.left != None:
            return self._get(key, node.left)
        elif key > node.key and node.right != None:
            return self._get(key, node.right)

Here is a subclass that satisfies requirement 1:

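class FuzzySearchTree(BinarySearchTree):

    def _get(self, key, node, floor=None):
        # floor is the last node passed on the way down whose key is below
        # the search key; it holds the largest such key seen so far
        if key == node.key:
            return node.value
        elif key < node.key:
            if node.left is not None:
                return self._get(key, node.left, floor)
        else:
            floor = node
            if node.right is not None:
                return self._get(key, node.right, floor)
        if floor is not None:
            return floor.value
        # every key on the path was larger, so node holds the minimum key
        return self._checkMin(key, node)


    def _checkMin(self, key, node):
        return node.value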

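To satisfy requirement 2, you need to keep track of the minimum value in the tree. You should probably do that by tracking the minimum on insert, but here is a different approach. This approach is not very efficient, but it should still be O(3 log n) == O(log n), so it is not bad. If you don't really need this, I wouldn't bother with it.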

class MinBoundedFuzzySearchTree(FuzzySearchTree):

    def _checkMin(self, key, node):
        # Not advised: walks from the root down the left spine on every call
        # to find the minimum key in the tree
        smallest = self.root
        while smallest.left is not None:
            smallest = smallest.left # Go down the tree to the left
        if smallest.key > key:
            return None # outside the range of the tree

        # Return the max value less than the key, which this answer takes to
        # be the parent (fall back to the node itself when it is the root)
        if node.parent is not None:
            return node.parent.value
        return node.value
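The text above suggests tracking the minimum on insert instead. A self-contained sketch of that idea follows; FloorTree and min_key are hypothetical names, the tree is unbalanced, and returning None for out-of-range keys simply mirrors the behaviour of the classes above. The lookup remembers the last node at which it turned right, which holds the largest key below the target.

```python
class FloorTree(object):
    """Unbalanced BST with a 'largest key <= target' lookup (sketch)."""

    class _Node(object):
        def __init__(self, key, value):
            self.key, self.value = key, value
            self.left = self.right = None

    def __init__(self):
        self.root = None
        self.min_key = None          # smallest key inserted so far

    def __setitem__(self, key, value):
        if self.min_key is None or key < self.min_key:
            self.min_key = key       # O(1) bookkeeping on every insert
        if self.root is None:
            self.root = self._Node(key, value)
            return
        node = self.root
        while True:
            if key == node.key:
                node.value = value
                return
            side = 'left' if key < node.key else 'right'
            child = getattr(node, side)
            if child is None:
                setattr(node, side, self._Node(key, value))
                return
            node = child

    def __getitem__(self, key):
        if self.min_key is None or key < self.min_key:
            return None              # empty tree or below the lower bound
        node, floor = self.root, None
        while node is not None:
            if key == node.key:
                return node.value
            if key < node.key:
                node = node.left
            else:
                floor = node         # best candidate so far
                node = node.right
        return floor.value


tree = FloorTree()
for k, v in [(123, 'Foo'), (456, 'Bar'), (789, 'Hello'), (-111, 'World')]:
    tree[k] = v
print(tree[235])    # Foo
print(tree[0])      # World
print(tree[-999])   # None
```

Because the bound check is a single comparison, the lookup no longer needs the extra walk down the left side of the tree.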

Here are some pseudo tests:

tree = BinarySearchTree()
tree[123] = 'Foo'
tree[456] = 'Bar'
tree[789] = 'Hello'
tree[-111] = 'World'

print("BST(456) == 'Bar': " + str(tree[456]))
print("BST(235) == None: " + str(tree[235]))
print("BST(455) == None: " + str(tree[455]))
print("BST(999) == None: " + str(tree[999]))
print("BST(0) == None: " + str(tree[0]))
print("BST(123) == 'Foo': " + str(tree[123]))
print("BST(-110) == None: " + str(tree[-110]))
print("BST(-999) == None: " + str(tree[-999]))

tree = FuzzySearchTree()
tree[123] = 'Foo'
tree[456] = 'Bar'
tree[789] = 'Hello'
tree[-111] = 'World'

print("")
print("FST(456) == 'Bar': " + str(tree[456]))
print("FST(235) == 'Foo': " + str(tree[235]))
print("FST(455) == 'Foo': " + str(tree[455]))
print("FST(999) == 'Hello': " + str(tree[999]))
print("FST(0) == 'World': " + str(tree[0]))
print("FST(123) == 'Foo': " + str(tree[123]))
print("FST(-110) == 'World': " + str(tree[-110]))
print("FST(-999) == 'World': " + str(tree[-999]))


tree = MinBoundedFuzzySearchTree()
tree[123] = 'Foo'
tree[456] = 'Bar'
tree[789] = 'Hello'
tree[-111] = 'World'

print("")
print("MBFST(456) == 'Bar': " + str(tree[456]))
print("MBFST(235) == 'Foo': " + str(tree[235]))
print("MBFST(455) == 'Foo': " + str(tree[455]))
print("MBFST(999) == 'Hello': " + str(tree[999]))
print("MBFST(0) == 'World': " + str(tree[0]))
print("MBFST(123) == 'Foo': " + str(tree[123]))
print("MBFST(-110) == 'World': " + str(tree[-110]))
print("MBFST(-999) == None: " + str(tree[-999]))

Here is what it prints:

"""
BST(456) == 'Bar': Bar
BST(235) == None: None
BST(455) == None: None
BST(999) == None: None
BST(0) == None: None
BST(123) == 'Foo': Foo
BST(-110) == None: None
BST(-999) == None: None

FST(456) == 'Bar': Bar
FST(235) == 'Foo': Foo
FST(455) == 'Foo': Foo
FST(999) == 'Hello': Hello
FST(0) == 'World': World
FST(123) == 'Foo': Foo
FST(-110) == 'World': World
FST(-999) == 'World': World

MBFST(456) == 'Bar': Bar
MBFST(235) == 'Foo': Foo
MBFST(455) == 'Foo': Foo
MBFST(999) == 'Hello': Hello
MBFST(0) == 'World': World
MBFST(123) == 'Foo': Foo
MBFST(-110) == 'World': World
MBFST(-999) == None: None
"""

【Comments】:

Better lookup efficiency than my sorted-list answer.