Recursive binary search for a string - C++

【Question title】: Recursive binary search for a string - C++
【Posted】: 2018-07-30 16:40:45
【Question】:

I am trying to implement the function findMatchesInDict, which checks whether a word matches any word in a pre-sorted dictionary. Here is my current implementation:

void findMatchesInDict(string word, int start, const string dict[], int end, string results[], int& totalResults)
{
    // initial start = 0 index
    // initial end = last index of dict array

    int middle = start + (end - start) / 2;
    if (end < start)
        return;

    if (word == dict[middle]) // if we found a match
        storeUniqueMatches(word, 0, results, totalResults); 
    else if (word < dict[middle])
        findMatchesInDict(word, start, dict, middle - 1, results, totalResults);
    else
        findMatchesInDict(word, middle + 1, dict, end, results, totalResults);
}

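The storeUniqueMatches function works correctly (it just stores the matching word in the results array, making sure that no duplicate words are stored).

The function above matches only some of the words in the dictionary and misses others.

Any ideas as to why this might not be working properly?

For reference, the following implementation works, but it is extremely inefficient and causes a stack overflow error.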

void findMatchesInDict(string word, int start, const string dict[], int end, string results[], int& totalResults)
{
    if (start > end) 
        return;
    if (word == dict[start]) // if we found a match
        storeUniqueMatches(word, 0, results, totalResults);

    findMatchesInDict(word, start + 1, dict, size, results, totalResults);
}

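【Comments】:

@FeiXiang Just edited the question to add that. It fails to match certain words but does work for others. My other algorithm works for all words in the dictionary.
Try using a debugger. Do you mean that the algorithm cannot find certain words when you give them to it? Are you sure the array is sorted lexicographically? Try to create a minimal reproducible example that we can use to reproduce the problem.
A binary search that works for some values but not for others - sounds familiar. In my experience, there is probably a bug, or rather an inaccuracy, in the algorithm: some edge case is not handled correctly, an off-by-one error or something similar. I would try to find a failing example for a dictionary that is not too large, then step through it in the debugger to see where it goes wrong. Luckily for you, even with 1000 dictionary entries a binary search should terminate after about 10 recursions - the power of ld (logarithmus dualis, the base-2 logarithm), you know? ;-)
By the way, did you make sure the dictionary is sorted correctly? Did you make sure the sort produces exactly the same order that the binary search assumes? Another experience of mine: it is hard to find a bug in your code if you are searching in the wrong lines... ;-)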

Tags: c++ recursion binary-search

【Answer 1】:

I still believe the OP made an off-by-one error.

I strongly suspect that

findMatchesInDict(word, start, dict, middle - 1, results, totalResults);

should be

findMatchesInDict(word, start, dict, middle, results, totalResults);

I made a small sample of my own. (For this, I redesigned the code because I had no luck with the OP's approach.)

#include <iostream>
#include <string>

size_t find(const std::string &word, const std::string dict[], size_t i0, size_t size)
{
  if (!size) return (size_t)-1; // bail out with invalid index
  const size_t i = i0 + size / 2;
  return word == dict[i]
    ? i
    : word < dict[i]
      ? find(word, dict, i0, i - i0)
      : find(word, dict, i + 1, i0 + size - (i + 1));
}

int main()
{
  const std::string dict[] = {
    "Ada", "BASIC", "C", "C++",
    "D", "Haskell", "INTERCAL", "Modula2",
    "Oberon", "Pascal", "Scala", "Scratch",
    "Vala"
  };
  const size_t sizeDict = sizeof dict / sizeof *dict;
  unsigned nErrors = 0;
  // brute force tests to find something that is in the dictionary
  for (size_t n = 1; n <= sizeDict; ++n) {
    for (size_t i = 0; i < n; ++i) {
      if (find(dict[i], dict, 0, n) >= n) {
        std::cerr << "ALERT! Unable to find entry " << i << " in " << n << " entries!\n";
        ++nErrors;
      }
    }
  }
  // brute force tests to find something that is not in the dictionary
  for (size_t n = 1; n <= sizeDict; ++n) {
    if (find("", dict, 0, n) < n) {
      std::cerr << "ALERT! Able to find entry '' in " << n << " entries!\n";
      ++nErrors;
    }
    for (size_t i = 0; i < n; ++i) {
      if (find(dict[i] + " + Assembler", dict, 0, n) < n) {
        std::cerr << "ALERT! Able to find entry '" << dict[i] << " + Assembler' in " << n << " entries!\n";
        ++nErrors;
      }
    }
  }
  // report
  if (!nErrors) std::cout << "All tests passed OK.\n";
  else std::cerr << nErrors << " tests failed!\n";
  // done
  return nErrors > 0;
}

Live Demo on coliru

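Most of this code is brute-force testing:

For every length from 1 up to the size of dict, every entry of dict within that length is searched for.
For every length from 1 up to the size of dict, the empty string (which sorts before every entry) and a modified copy of every entry are searched for. (The modification guarantees that the result sorts between the unmodified entry and its successor, or after the last entry.)

Output: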

All tests passed OK.

So far, so good.

Then I replaced

find(word, dict, i0, i - i0)

with

find(word, dict, i0, i - i0 > 0 ? i - i0 - 1 : 0)

which resembles what (in my opinion) is wrong in the OP's code.

Output:

ALERT! Unable to find entry 0 in 2 entries!
ALERT! Unable to find entry 0 in 3 entries!
ALERT! Unable to find entry 1 in 4 entries!
ALERT! Unable to find entry 1 in 5 entries!
ALERT! Unable to find entry 3 in 5 entries!
ALERT! Unable to find entry 0 in 6 entries!
ALERT! Unable to find entry 2 in 6 entries!
ALERT! Unable to find entry 4 in 6 entries!
ALERT! Unable to find entry 0 in 7 entries!
ALERT! Unable to find entry 2 in 7 entries!
ALERT! Unable to find entry 4 in 7 entries!
ALERT! Unable to find entry 0 in 8 entries!
ALERT! Unable to find entry 3 in 8 entries!
ALERT! Unable to find entry 5 in 8 entries!
ALERT! Unable to find entry 0 in 9 entries!
ALERT! Unable to find entry 3 in 9 entries!
ALERT! Unable to find entry 6 in 9 entries!
ALERT! Unable to find entry 1 in 10 entries!
ALERT! Unable to find entry 4 in 10 entries!
ALERT! Unable to find entry 7 in 10 entries!
ALERT! Unable to find entry 1 in 11 entries!
ALERT! Unable to find entry 4 in 11 entries!
ALERT! Unable to find entry 7 in 11 entries!
ALERT! Unable to find entry 9 in 11 entries!
ALERT! Unable to find entry 1 in 12 entries!
ALERT! Unable to find entry 3 in 12 entries!
ALERT! Unable to find entry 5 in 12 entries!
ALERT! Unable to find entry 8 in 12 entries!
ALERT! Unable to find entry 10 in 12 entries!
ALERT! Unable to find entry 1 in 13 entries!
ALERT! Unable to find entry 3 in 13 entries!
ALERT! Unable to find entry 5 in 13 entries!
ALERT! Unable to find entry 7 in 13 entries!
ALERT! Unable to find entry 9 in 13 entries!
ALERT! Unable to find entry 11 in 13 entries!
35 tests failed!

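Hmm. Actually, this does not prove anything about the OP's code.

However, it shows

that an "off by 1" can fundamentally break a binary search.
how to design brute-force tests to uncover such bugs.

So, this hopefully helps the OP to find the bug in the algorithm on their own (which is actually more valuable for them).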

【Comments】: