Why is my implementation of binary search very inefficient? Answers

【Question title】: Why is my implementation of binary search very inefficient?
【Posted】: 2016-10-24 08:33:03
【Question】:

I am working on a Python exercise that searches for a word in a given sorted wordlist containing more than 100,000 words.

Using bisect_left from the Python bisect module is very efficient, but the binary search method I wrote myself is very inefficient. Could anyone explain why?

This is the search method that uses the Python bisect module:

from bisect import bisect_left

def in_bisect(word_list, word):
    """Checks whether a word is in a list using bisection search.

    Precondition: the words in the list are sorted

    word_list: list of strings
    word: string
    """
    i = bisect_left(word_list, word)
    if i != len(word_list) and word_list[i] == word:
        return True
    else:
        return False

My implementation is really very inefficient (I don't know why):

def my_bisect(wordlist,word):
    """search the given word in a wordlist using
    bisection search, also known as binary search
    """
    if len(wordlist) == 0:
        return False
    if len(wordlist) == 1:
        if wordlist[0] == word:
            return True
        else:
            return False

    if word in wordlist[len(wordlist)/2:]:
        return True

    return my_bisect(wordlist[len(wordlist)/2:],word)

【Comments on the question】:

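Because you aren't actually doing a binary search?
@jonrsharpe, I am trying to implement binary search: I search the first half, and if the word is not in the first half, I search the other half.
The problem here is that you are making a copy of the list at every level, which dwarfs any benefit you get from doing a binary search. Try using only indices to delimit the part to search.
Also, you are doing "if word in xxx", which loops over the elements and compares each one. That is not a binary search at all.
if word in wordlist[len(wordlist)/2:] makes Python search through half of your wordlist, which completely defeats the purpose of writing a binary search. Note that binary search only works on sorted lists.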
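
A rough way to see the cost described in the comments above: the sketch below mimics, in plain Python, the element-by-element scan that the in operator performs on a list, and counts the comparisons. The helper name linear_probe_count and the synthetic word list are assumptions made for illustration; they do not come from the question.

```python
def linear_probe_count(items, target):
    """Mimic `target in items` for a list: compare element by element.

    Returns a (found, comparisons) tuple. Hypothetical helper for illustration.
    """
    comparisons = 0
    for item in items:
        comparisons += 1
        if item == target:
            return True, comparisons
    return False, comparisons


# Synthetic sorted word list (zero-padded so string order matches numeric order).
words = ["w%06d" % i for i in range(100000)]

# Slicing builds a brand-new 50,000-element list before any search happens.
upper_half = words[len(words) // 2:]

# Looking for the last word forces a scan over the whole copied half.
found, comparisons = linear_probe_count(upper_half, "w099999")
print(found, comparisons)
```

Under these assumptions a single membership test already costs one comparison per element of the copied half, whereas a binary search over 100,000 words needs fewer than 20 comparisons.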

Tags: python recursion binary-search

【Solution 1】:

if word in wordlist[len(wordlist)/2:]

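makes Python search through half of your wordlist, which rather defeats the purpose of writing a binary search. In addition, you are not splitting the list into halves correctly. The strategy of binary search is to halve the search space at each step and then apply the same strategy only to the half that your word could be in. To know which half is the right one to search, it is essential that wordlist is sorted. Here is a sample implementation that keeps track of the number of calls needed to verify whether word is in wordlist.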

import random

numcalls = 0
def bs(wordlist, word):
    # increment numcalls
    global numcalls
    numcalls += 1
    print('wordlist',wordlist)

    # base cases
    if not wordlist:
        return False
    length = len(wordlist)
    if length == 1:
        return wordlist[0] == word

    # split the list in half
    mid = length // 2 # mid index
    leftlist = wordlist[:mid]
    rightlist = wordlist[mid:]
    print('leftlist',leftlist)
    print('rightlist',rightlist)
    print()

    # recursion
    if word < rightlist[0]:
        return bs(leftlist, word) # word can only be in left list
    return bs(rightlist, word) # word can only be in right list

alphabet = 'abcdefghijklmnopqrstuvwxyz'
wl = sorted(random.sample(alphabet, 10))
print(bs(wl, 'm'))
print(numcalls)

I included some print statements so you can see what is going on. Here are two sample outputs. First: word is in wordlist:

wordlist ['b', 'c', 'g', 'i', 'l', 'm', 'n', 'r', 's', 'v']
leftlist ['b', 'c', 'g', 'i', 'l']
rightlist ['m', 'n', 'r', 's', 'v']

wordlist ['m', 'n', 'r', 's', 'v']
leftlist ['m', 'n']
rightlist ['r', 's', 'v']

wordlist ['m', 'n']
leftlist ['m']
rightlist ['n']

wordlist ['m']
True
4

Second: word is not in wordlist:

wordlist ['a', 'c', 'd', 'e', 'g', 'l', 'o', 'q', 't', 'x']
leftlist ['a', 'c', 'd', 'e', 'g']
rightlist ['l', 'o', 'q', 't', 'x']

wordlist ['l', 'o', 'q', 't', 'x']
leftlist ['l', 'o']
rightlist ['q', 't', 'x']

wordlist ['l', 'o']
leftlist ['l']
rightlist ['o']

wordlist ['l']
False
4

Note that if you double the size of the wordlist, i.e. use

wl = sorted(random.sample(alphabet, 20))

numcalls will on average be only 1 higher than for a wordlist of length 10, because the wordlist has to be split in half one more time.
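
The growth described above can be checked without running the search itself. The following is a minimal sketch, assuming the same splitting rule as bs() (mid = length // 2, with the right half being the larger one); the function name worst_case_calls is hypothetical. It counts how many calls bs() makes when the search always descends into the larger half.

```python
def worst_case_calls(n):
    """Worst-case number of bs() calls for a list of length n.

    Assumes the split used by bs(): leftlist has n // 2 elements and
    rightlist has the remaining n - n // 2 (the larger half).
    """
    calls = 1  # the initial call
    while n > 1:
        n = n - n // 2  # descend into the larger (right) half
        calls += 1
    return calls


for size in (10, 20, 40, 100000):
    print(size, worst_case_calls(size))
```

Each doubling of the list adds exactly one call in the worst case, so even a 100,000-word list needs only 18 calls. The sample runs above report 4 rather than 5 because they happened to descend into a smaller half along the way.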

【Comments】:

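I copied your bs() function into my code and used it for the binary search, but it is still much slower than the library function. I don't know why.
@bean My code creates new lists on every function call. You can avoid that by adjusting the function to look only at certain indices of wordlist instead of creating leftlist and rightlist.
Yes, I did use the slice [len(wordlist)/2:] to look only at certain indices of the word list, but it still doesn't work.
@bean Slicing creates a new list.
Thanks. Could you tell me how to look only at certain indices of the word list? I can't think of a way to do it.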
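
One possible way to do what the last comments ask for, offered as a sketch rather than as the answerer's own code: keep the original list untouched and pass two indices, lo and hi, that mark the half-open range wordlist[lo:hi] still under consideration. The name bs_indices and the lo/hi parameters are assumptions introduced here; the splitting logic mirrors bs() above.

```python
def bs_indices(wordlist, word, lo=0, hi=None):
    """Recursive binary search over wordlist[lo:hi] without slicing.

    Precondition: wordlist is sorted. lo and hi are illustrative
    parameters; no new lists are created at any level.
    """
    if hi is None:
        hi = len(wordlist)

    # base cases: empty range, or a single candidate left
    if lo >= hi:
        return False
    if hi - lo == 1:
        return wordlist[lo] == word

    # split the index range in half instead of copying the list
    mid = (lo + hi) // 2
    if word < wordlist[mid]:
        return bs_indices(wordlist, word, lo, mid)  # left range [lo, mid)
    return bs_indices(wordlist, word, mid, hi)      # right range [mid, hi)


wl = ['b', 'c', 'g', 'i', 'l', 'm', 'n', 'r', 's', 'v']
print(bs_indices(wl, 'm'))
print(bs_indices(wl, 'h'))
```

Each call now does a constant amount of work, so the total cost is proportional to the number of halvings. bisect_left may still be faster in practice because CPython can use a C implementation of it, but the gap should no longer grow with the size of the list.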

【Solution 2】:

To simply check whether a word is in the word list (Python 2.7):

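import bisect

def bisect_fun(listfromfile, wordtosearch):
    # leftmost index at which wordtosearch could be inserted
    bi = bisect.bisect_left(listfromfile, wordtosearch)
    # bi can equal len(listfromfile); check it to avoid an IndexError
    if bi < len(listfromfile) and listfromfile[bi] == wordtosearch:
        return listfromfile[bi], bi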

【Comments】: