Longest common substring from more than two strings - C++ answers

【Question title】: Longest common substring from more than two strings - C++
【Posted】: 2025-11-26 19:05:02
【Question description】:

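I need to compute the longest common substrings from a set of file names in C++.

To be precise, I have a std::list of std::strings (or the Qt equivalent, which would also be fine):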

char const *x[] = {"FirstFileWord.xls", "SecondFileBlue.xls", "ThirdFileWhite.xls", "ForthFileGreen.xls"};
std::list<std::string> files(x, x + sizeof(x) / sizeof(*x));

I need to compute the n distinct longest common substrings of all the strings. In this case, for n=2, that would be for example

 "File" and ".xls"

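If I could compute the longest common substring, I could cut it out and run the algorithm again to get the second longest, so it essentially boils down to:

Is there a (reference?) implementation for computing the LCS of a std::list of std::strings?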

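This is not a good answer, but a dirty solution that I have: brute force over a QList of QUrls, taking only the part after the last "/". I would love to replace it with "proper" code.

(I have found http://www.icir.org/christian/libstree/ - it would help a lot, but I cannot get it to compile on my machine. Has anyone used it?)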

QString SubstringMatching::getMatchPattern(QList<QUrl> urls)
{
    QString a;

    int foundPosition = -1;
    int foundLength = -1;
    for (int i=urls.first().toString().lastIndexOf("/")+1; i<urls.first().toString().length(); i++)
    {
        bool hit=true;
        int xj;
        for (int j=0; j<urls.first().toString().length()-i+1; j++ ) // try to match from position i up to the end of the string :: test character at pos. (i+j)
        {
            if (!hit) break;

            QString firstString = urls.first().toString().right( urls.first().toString().length()-i ).left( j ); // this needs to match all k strings
            //qDebug() << "SEARCH " << firstString;

            for (int k=1; k<urls.length(); k++) // test all other strings, k = test string number
            {
                if (!hit) break;

                //qDebug() << " IN  " << urls.at(k).toString().right(urls.at(k).toString().length() - urls.at(k).toString().lastIndexOf("/")+1);
                //qDebug() << " RES " << urls.at(k).toString().indexOf(firstString, urls.at(k).toString().lastIndexOf("/")+1);
                if (urls.at(k).toString().indexOf(firstString, urls.at(k).toString().lastIndexOf("/")+1)<0) {
                    xj = j;
                    //qDebug() << "HIT LENGTH " << xj-1 << " : " << firstString;
                    hit = false;
                }
            }

        }
        if (hit) xj = urls.first().toString().length()-i+1; // hit up to the end of the string
        if ((xj-2)>foundLength) // have longer match than existing, j=1 is match length
        {
            foundPosition = i; // at the current position
            foundLength = xj-1;
            //qDebug() << "Found at " << i << " length " << foundLength;
        }
    }

    a = urls.first().toString().right( urls.first().toString().length()-foundPosition ).left( foundLength );
    //qDebug() << a;
    return a;
}

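【Comments】:

This might be useful. *.com/questions/2418504/…
I have clicked through hundreds of similar questions without finding an answer, including the one above and *.com/questions/10248728/…. The closest I got is homepage.virgin.net/cdm.henderson/site/cpp/lcs/index.htm, but that is for subsequences, not substrings.
This is a non-trivial problem. An exhaustive search is required. (If my first impression holds.)
There should be a nice solution using suffix trees, at least if I understand libstree's library usage example correctly - icir.org/christian/libstree/manual/c39.html. Unfortunately, it does not compile and is too complicated for my needs.
You should keep all the matching substrings in a vector. For each string, every character should be compared with the others. The string of minimum length can serve as the reference starting point, and helper functions such as int Find(const char* str, char ch) and void Sort(char* words[]) will be very useful.

Tags: c++ string sequence matching longest-substring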

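【Solution 1】:

If, as you say, suffix trees are too heavyweight or otherwise impractical, the following fairly simple brute-force approach might be adequate for your application.

I assume that distinct substrings should be non-overlapping and are picked from left to right.

Even with these assumptions, there need not be a unique set that comprises "the N distinct longest common substrings" of a set of strings. Whatever N is, there may be more than N distinct common substrings all of the same maximal length, and any choice of N from among them would be arbitrary. Accordingly, the solution finds the at-most N *sets* of longest distinct common substrings, in which all those of the same length form one set.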

The algorithm is as follows:

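Q is the target quota of lengths.
Strings is the problem set of strings.
Results is an initially empty multimap that maps a length to a set of strings, Results[l] being the set of results of length l.
N, initially 0, is the number of distinct lengths represented in Results.
If Q is 0 or Strings is empty, return Results.
Find any shortest member of Strings; keep a copy S of it and remove it from Strings. We proceed by comparing the substrings of S with the members of Strings, because all the common substrings of {Strings, S} must be substrings of S.
Iteratively generate all the substrings of S, longest first, using the obvious nested loop controlled by offset and length. For each substring ss of S:
- If ss is not a common substring of Strings, next.
- Iterate over Results[l] for l >= the length of ss, until the end of Results or until ss is found to be a substring of the examined result. In the latter case, ss is not distinct from a result already in hand, so next.
- ss is a common substring distinct from any already in hand. Iterate over Results[l] for l < the length of ss, deleting each result that is a substring of ss, because all of those are shorter than ss and not distinct from it. ss is now a common substring distinct from any already in hand, and all the others that remain in hand are distinct from ss.
- For l = the length of ss, check whether Results[l] exists, i.e. whether any results of the same length as ss are in hand. If not, call that the NewLength condition.
- Check also whether N == Q, i.e. whether we have already reached the target quota of distinct lengths. If NewLength obtains and also N == Q, call that the StickOrRaise condition.
- If StickOrRaise obtains, compare the length of ss with l = the length of the shortest results in hand. If ss is shorter than l, it is too short for our quota, so next. If ss is longer than l, all the shortest results in hand are to be ousted in favour of ss, so delete Results[l] and decrement N.
- Insert ss into Results, keyed by its length.
- If NewLength obtains, increment N.
- Abandon the inner iteration over the substrings of S that have the same offset as ss but are shorter, because none of them is distinct from ss.
- Advance the offset in S for the outer iteration by the length of ss, to the start of the next non-overlapping substring.
Return Results.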

Here is a program that implements the solution and demonstrates it with a list of strings:

#include <list>
#include <map>
#include <string>
#include <iostream>
#include <algorithm>

using namespace std;

// Get a non-const iterator to the shortest string in a list
list<string>::iterator shortest_of(list<string> & strings)
{
    auto where = strings.end();
    size_t min_len = size_t(-1);
    for (auto i = strings.begin(); i != strings.end(); ++i) {
        if (i->size() < min_len) {
            where = i;
            min_len = i->size();
        }
    }
    return where;
}

// Say whether a string is a common substring of a list of strings
bool 
is_common_substring_of(
    string const & candidate, list<string> const & strings)
{
    for (string const & s : strings) {
        if (s.find(candidate) == string::npos) {
            return false;
        }
    }
    return true;
}


/* Get a multimap whose keys are the at-most `quota` greatest 
    lengths of common substrings of the list of strings `strings`, each key 
    multi-mapped to the set of common substrings of that length.
*/
multimap<size_t,string> 
n_longest_common_substring_sets(list<string> & strings, unsigned quota)
{
    size_t nlengths = 0;
    multimap<size_t,string> results;
    if (quota == 0) {
        return results;
    }
    auto shortest_i = shortest_of(strings);
    if (shortest_i == strings.end()) {
        return results;
    }
    string shortest = *shortest_i;
    strings.erase(shortest_i);
    for ( size_t start = 0; start < shortest.size();) {
        size_t skip = 1;
        for (size_t len = shortest.size(); len > 0; --len) {
            string subs = shortest.substr(start,len);
            if (!is_common_substring_of(subs,strings)) {
                continue;
            }
            auto i = results.lower_bound(subs.size());
            for (   ;i != results.end() && 
                    i->second.find(subs) == string::npos; ++i) {}
            if (i != results.end()) {
                continue;
            }
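            for (i = results.begin();
                    i != results.end() && i->first < subs.size(); ) {
                if (subs.find(i->second) != string::npos) {
                    // Keep nlengths accurate when a length loses its last result
                    size_t erased_len = i->first;
                    i = results.erase(i);
                    if (results.count(erased_len) == 0) {
                        --nlengths;
                    }
                } else {
                    ++i;
                }
            }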
            auto hint = results.lower_bound(subs.size());
            bool new_len = hint == results.end() || hint->first != subs.size();
            if (new_len && nlengths == quota) {
                size_t min_len = results.begin()->first;
                if (min_len > subs.size()) {
                    continue;
                }
                results.erase(min_len);
                --nlengths;
            }
            nlengths += new_len;
            results.emplace_hint(hint,subs.size(),subs);
            len = 1;
            skip = subs.size();
        }
        start += skip;
    }
    return results; 
}

// Testing ...

int main()
{
    list<string> strings{
        "OfBitWordFirstFileWordZ.xls", 
        "SecondZWordBitWordOfFileBlue.xls", 
        "ThirdFileZBitWordWhiteOfWord.xls", 
        "WordFourthWordFileBitGreenZOf.xls"};

    auto results = n_longest_common_substring_sets(strings,4);
    for (auto const & val : results) {
        cout << "length: " << val.first 
        << ", substring: " << val.second << endl;
    }
    return 0;
}

Output:

length: 1, substring: Z
length: 2, substring: Of
length: 3, substring: Bit
length: 4, substring: .xls
length: 4, substring: File
length: 4, substring: Word

(Built with gcc 4.8.1)

【Comments】: