Better solution of finding a pattern of a string? Answers

【Question title】: Better solution of finding a pattern of a string?
【Posted】: 2013-03-06 02:29:08
【Question description】:

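I'm trying to find the best way to extract the pattern of a string and compare it with another one. For example, I have s1 = "red blue blue red red yellow" and s2 = "abbaac". These match because they have the same pattern.

My idea is to iterate over s1 and s2 and use a vector to record a count for each position (for s1 the count of the corresponding word, for s2 the count of the corresponding letter), then compare the two vectors.

This is really inefficient because I iterate over all of s1 and all of s2. If s1 = "red blue red red red yellow" and s2 = "abbaac", there is basically no point in iterating any further after the third word (red).

So, is there a better way to do this?

Code: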

#include "stdafx.h"
#include <iostream>
#include <string>
#include <array>
#include <sstream>
#include <vector>
#include <algorithm>
using namespace std;

vector<int> findPattern(string pattern){
    vector<int> counts;
    for (int i = 0; i < pattern.size(); ++i){
        counts.push_back(0);
        int counter = 0;
        for (int j = i + 1; j < pattern.size(); ++j){
            if (pattern[i] == pattern[j]){
                ++counter;              
            }   
            counts[i] = counter;    
        }
    }
    return counts;
}

vector<int> findPatternLong(string pattern){
    istringstream iss (pattern);
    string word;
    vector<string> v;
    while (iss >> word){
        v.push_back(word);
    }
    vector<int> counts2;
    for (int i = 0; i < v.size(); ++i){
        counts2.push_back(0);
        int counter = 0;
        for (int j = i + 1; j < v.size(); ++j){
            if (v[i] == v[j]){
                ++counter;
            }
            counts2[i] = counter;
        }
    }
    return counts2;
}

int main(int argc, char * argv[]){
    vector<int> v1 = findPattern("abbaac");
    vector<int> v2 = findPatternLong("red blue blue red red yellow");
    if (v1.size() == v2.size()){
        for (size_t i = 0; i < v1.size(); ++i){
            if (v1[i] != v2[i]){
                cout << "Unmatch" << endl;
                return 1; // non-zero exit status: the patterns differ
            }
        }
        cout << "match" << endl;
        return 0; // the patterns match
    } else
        cout << "Unmatch" << endl;
    return 1; // different lengths, so the patterns cannot match
}

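【Question comments】:

ideone.com/qU1Ahi

Tags: c++ vector pattern-matching iteration

【Solution 1】:

@Tony beat me to it with the same idea, but since I had already typed this up, here it goes :-)

First, don't worry too much about efficiency; focus on correctness. Premature optimization really is the root of all evil. Write test cases and make sure your code passes every one of them.

Second, I think I would start with a map/dictionary D and a loop in which I parse one element from each string (a word from s1, call it "w", and a character from your s2, say "c"), pick one of them as the key (say the character "c"), and check whether "c" already has an entry in the dictionary:

If we run out of elements on both sides at the same time, the strings match.
If we run out of elements on only one side, we know there is no match.
If "c" has no entry in D, store the current value: D[c] = w;
Otherwise, if "c" already has an entry, check that the entry matches the value found in the string: is D[c] == w? If not, we know there is no match.

Once that code works, you can start optimizing. In your example, perhaps we could use a simple array instead of a dictionary, since ASCII characters form a small finite set.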
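The steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the function name same_pattern is made up for the example, the table D is the "simple array" variant indexed by character code, and the check runs in one direction only (each letter must always stand for the same word), exactly as the steps describe. Two different letters may still map to the same word here.

```cpp
#include <array>
#include <sstream>
#include <string>

// Hypothetical helper (name chosen for this sketch): returns true when every
// character of `letters` consistently stands for one word of `words`.
bool same_pattern(const std::string& words, const std::string& letters)
{
    // D[c] holds the word first seen for character c; empty means "no entry".
    std::array<std::string, 256> D;
    std::istringstream iss(words);
    std::string w;
    for (std::string::size_type i = 0; i < letters.size(); ++i)
    {
        if (!(iss >> w))
            return false;               // ran out of words before letters
        std::string& entry = D[static_cast<unsigned char>(letters[i])];
        if (entry.empty())
            entry = w;                  // first time we see this letter
        else if (entry != w)
            return false;               // same letter, different word
    }
    return !(iss >> w);                 // both sides must run out together
}
```

Calling same_pattern("red blue blue red red yellow", "abbaac") returns true, while same_pattern("red blue red red red yellow", "abbaac") returns false as soon as the third word is read.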

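【Comments】:

【Solution 2】:

This is not the most efficient code, but it is close to the simplest:

#include <map>
#include <set>
#include <sstream>
#include <string>

// Wrapper added so the fragment compiles; the name `matches` is illustrative.
// Returns true when the words of s1 follow the letter pattern of s2.
bool matches(const std::string& s1, const std::string& s2)
{
    std::map<char, std::string> letter_to_word;
    std::set<std::string> words_seen;
    std::istringstream iss(s1);
    std::string word;
    for (std::string::size_type i = 0; i < s2.size(); ++i)
    {
        if (!(iss >> word))
            return false; // more letters than words
        // operator[] creates an empty entry the first time a letter is seen
        std::string& expected_word = letter_to_word[s2[i]];
        if (expected_word.empty())
        {
            // if different letters require different words...
            if (words_seen.find(word) != words_seen.end())
                return false; // multiple letters for same word
            words_seen.insert(word);

            expected_word = word; // first time we've seen letter, remember associated word
        }
        else if (expected_word != word)
            return false; // different word for same letter
    }
    return !(iss >> word); // check no surplus words
}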

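【Comments】:

Why, oh why does stackoverflow have so much trouble preserving code indentation? :-/
Could you explain these two lines a bit more? 1. std::string& mapping = words[s2[i]]; why does it need to be a reference? 2. return !(iss >> word); since you have already checked once, does this line make sure there are no more words than letters?
@HoKy22: thanks for pointing that out; strictly speaking, the question doesn't specify whether the letter pattern is meant to enforce differences as well as sameness (e.g. whether "ab" may match any two words, or only two different words), but I agree that on balance enforcing the differences is more intuitive. I'll add an edit above...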
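On the first question in the comments above: std::map::operator[] inserts a default-constructed value when the key is absent and returns a reference to the value stored in the map. Binding that reference lets a single lookup both test the entry and fill it in. A minimal sketch, assuming a made-up helper name remember that is not part of the answer:

```cpp
#include <map>
#include <string>

// Hypothetical helper (not from the answer): records `word` for `letter` the
// first time the letter is seen, and reports whether `word` agrees with
// whatever is stored for that letter.
bool remember(std::map<char, std::string>& letter_to_word,
              char letter, const std::string& word)
{
    // One lookup: inserts "" if `letter` is new, then refers to the element
    // that lives inside the map.
    std::string& expected = letter_to_word[letter];
    if (expected.empty())
    {
        expected = word;    // writes through the reference into the map
        return true;
    }
    return expected == word;
}
```

Had expected been a plain std::string copy, the assignment would change only the local copy and the map entry would stay empty.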

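【Solution 3】:

You don't need two vectors.

While processing the second string, compare the count of its first pattern element with the first entry of the stored vector. If they match, continue; otherwise stop. Repeat this for the remaining pattern elements of the second string.

You don't need to store the pattern counts of the second string.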
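One way to read this suggestion, sketched under assumptions: the helper names later_repeats and counts_match are invented for the example, and the counting scheme is the one from the question (how many later elements equal the current one). Only the counts for the letters are stored; each count for the words is compared as soon as it is computed.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Counts how many elements after position i are equal to seq[i].
template <class Seq>
int later_repeats(const Seq& seq, std::size_t i)
{
    int n = 0;
    for (std::size_t j = i + 1; j < seq.size(); ++j)
        if (seq[j] == seq[i])
            ++n;
    return n;
}

// Hypothetical helper: stores counts for `letters` only and stops at the
// first position where the counts for `sentence` disagree.
bool counts_match(const std::string& letters, const std::string& sentence)
{
    std::istringstream iss(sentence);
    std::vector<std::string> words;
    for (std::string w; iss >> w;)
        words.push_back(w);
    if (words.size() != letters.size())
        return false;

    std::vector<int> letter_counts(letters.size());
    for (std::size_t i = 0; i < letters.size(); ++i)
        letter_counts[i] = later_repeats(letters, i);

    for (std::size_t i = 0; i < words.size(); ++i)
        if (later_repeats(words, i) != letter_counts[i])
            return false;   // early exit, nothing stored for the words
    return true;
}
```

With letters = "abbaac" and sentence = "red blue red red red yellow", the loop stops at the very first word, because "red" has three later repeats while 'a' has two.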

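【Comments】:

【Solution 4】:

Edit

I just read that the question has the patterns inside strings, whereas this answer is about comparing collections of different types. I suppose the answer still holds a little water if the two input strings are converted first :)

I won't claim this is the most efficient solution, but I like how extensible it is.

First, there is the PatternResult class. It stores the result of a pattern: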

class PatternResult {
private:
    std::vector<int> result_;

public:
    PatternResult(const std::vector<int>& result) : result_(result) {
    };

    bool operator == (const PatternResult& rhs) const {
        if(result_.size() != rhs.result_.size()) 
            return false;
        else {
            for(std::vector<int>::size_type r(0);
                r < result_.size();
                ++r) {
                if(result_[r] != rhs.result_[r])
                    return false;
            };
            return true;
        };
    };
};  // eo class PatternResult

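It takes a vector of integers whose values represent the pattern. We overload == to compare two pattern results; equality means they have the same sequence, regardless of the source data.

Then we need a pattern counter that assigns those sequence numbers but accepts any type: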

template<class T>
class PatternCounter {
private:
    typedef std::vector<T> vec_type;
    typedef std::map<T, int> map_type;
    map_type found_;
    int counter_;
public:
    PatternCounter() : counter_(1) {
    };

    PatternResult count(const vec_type& input ){
        std::vector<int> ret;
        for(typename vec_type::const_iterator cit(input.begin()); // typename: dependent type
            cit != input.end();
            ++cit) {
            if(found_.find(*cit) != found_.end()) {
                ret.push_back(found_[*cit]);
            } else {
                found_[*cit] = counter_;
                ret.push_back(counter_);
                ++counter_;
            };
        };
        return PatternResult(ret);
    };
};

And we're done. Test code:

std::vector<std::string> inp1;
inp1.push_back("red");
inp1.push_back("blue");
inp1.push_back("blue");
inp1.push_back("red");
inp1.push_back("yellow");

std::vector<char> inp2;
inp2.push_back('a');
inp2.push_back('b');
inp2.push_back('b');
inp2.push_back('a');
inp2.push_back('c');

PatternCounter<std::string> counter1;
PatternCounter<char> counter2;

PatternResult res1(counter1.count(inp1));
PatternResult res2(counter2.count(inp2));

if(res1 == res2) {
        // pattern sequences are equal
};

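Note that this is quick and dirty; I'm sure it could be made more efficient.

【Comments】:

The idea of doing it with OOP is cool, but it does look rather long.

【Solution 5】:

Basically, you want to check whether the sequences follow the same order. You don't care what the items in the sequence actually are: first second first third is enough. Now, you could do this with a container that somehow maps strings to ints. However, you would be storing a copy of every string while ignoring the fact that you don't really care about the string values. For tiny test cases that doesn't matter, but for a large sequence of long words you would quickly be chewing up memory you don't need.

So let's use the fact that we don't care about the string values or about storing them. If that is the case, we can use a hash function to turn our strings into simple size_t values, with a fairly strong guarantee that they will be unique. However, hashes are not sequential, and we need to retrieve a sequence number from the hash value. The simplest way to record their sequence is to map each hash to the size of the map at the time it is inserted, which also makes lookup easy. The last piece of the puzzle is checking that the hashes occur in the same sequence.

I have also assumed that you may want to compare not only a sentence with a word, but perhaps two words or two sentences. Here is a quick C++11 example that basically does the above and doesn't keep anything in memory unless it has to.

Of course, this could still be optimized further, for example by doing things in parallel.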

#include <iostream>
#include <vector>
#include <string>
#include <map>
#include <sstream>
/*
s1 = "red blue blue red red yellow"
s2 = "abbaac"
This would match because they have the same pattern.
*/
typedef std::map<size_t,size_t> hash_map;
typedef std::vector<std::string> wordlist;

size_t ordered_symbol( hash_map &h, std::string const& word )
{
    std::hash<std::string> hash_fn;
    size_t hash = hash_fn(word);
    if(h.find(hash)==h.end())
    {
        size_t const sequence = h.size();
        h[hash] = sequence;
        return sequence;
    }
    return h[hash];
}

wordlist create_wordlist( std::string const& str )
{
    if(str.find_first_of(' ') != std::string::npos)
    {
        wordlist w1;
        std::stringstream sstr(str);
        std::string s;
        while(sstr>>s)
            w1.push_back(s);
        return w1;        
    }
    wordlist w2;
    for(auto i : str)
    {
        std::string s;
        s.append(1,i);
        w2.push_back(s);
    }
    return w2;
}

bool pattern_matches( std::string const& s1, std::string const& s2 )
{
    wordlist const w1 = create_wordlist(s1);
    wordlist const w2 = create_wordlist(s2);
    if(w1.size()!=w2.size())
        return false;
    hash_map h1,h2;
    for( size_t i = 0; i!=w1.size(); ++i)
        if(ordered_symbol(h1,w1[i])!=ordered_symbol(h2,w2[i]))
            return false;
    return true;
}

void test( std::string const& s1, std::string const& s2 )
{
    std::cout<<"["<<s1<<"] "
             <<(pattern_matches(s1,s2)? "<==>" : "<=!=>")
             <<"["<<s2<<"]\n";    
}

int main()
{
    test("red blue blue red red yellow","abbaac");
    test("red blue blue red red yellow","first second second first first third");
    test("abbaac","12211g");
    test("abbaac","red blue blue red red yellow");
    test("abbgac","red blue blue red red yellow");
    return 0;
}

//Output:
//[red blue blue red red yellow] <==>[abbaac]
//[red blue blue red red yellow] <==>[first second second first first third]
//[abbaac] <==>[12211g]
//[abbaac] <==>[red blue blue red red yellow]
//[abbgac] <=!=>[red blue blue red red yellow]

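Edit: here is a non C++11 version, which should work with VS2010. However, since C++03 has no string hash function in the standard library, that example uses a hash function taken from Stack Overflow. If you have access to the Boost libraries, a better hash function to use is this one.

【Comments】:

I just compiled your program, and it looks like there is a fatal error on this line in Visual Studio 2010: for(auto i : str)
Works for me on Visual Studio 2012 and on ideone: ideone.com/T2D4xT. I didn't see that you were using VS 2010. VS 2010 doesn't fully support C++11 (range-based for in particular), so yes, it won't work there. It does work on VS 2012 and on recent versions of the Clang and GCC compilers.
Just added a link to a C++03 example to the answer.
"a fairly strong guarantee that they will be unique" - um, no. Have you seen how weak Microsoft's std::hash<std::string> is? For example, hash("abcdefghijklmnopqrstuvwxyz") == hash("aXcdefghijklmnopqrstuvwxyz") == hash("abXdefghijklmnopqrstuvwxyz"), because it incorporates at most 10 evenly spaced characters. Writing programs that don't work reliably is ugly. Given many strings, even a cryptographic-strength 32-bit hash will collide: the chance grows linearly with the number of words already hashed, and your stated reason for hashing is "for a large sequence". To save memory, just refer to the input strings by ptr,len.
Well, as I mentioned, you can use boost::hash or any other function that maps strings to integers. Heck, you could use a map from the string to the size of the map; that would guarantee uniqueness too, but that's not the point. You're missing the forest for the trees.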
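The "map from string to the size of the map" mentioned in the last comment could look roughly like this. It is a sketch, not the answer author's code: the name ordered_symbol_exact is invented here, modelled on ordered_symbol from the answer. Keying on the word itself removes the possibility of hash collisions, at the cost of storing one copy of each distinct word, which is the memory use the answer set out to avoid.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Hypothetical collision-free variant of ordered_symbol: the key is the word
// itself rather than its hash.
std::size_t ordered_symbol_exact(std::map<std::string, std::size_t>& seen,
                                 std::string const& word)
{
    std::map<std::string, std::size_t>::iterator it = seen.find(word);
    if (it != seen.end())
        return it->second;                      // word already numbered
    std::size_t const sequence = seen.size();   // next unused sequence number
    seen[word] = sequence;
    return sequence;
}
```

Feeding it "red", "blue", "red", "yellow" in turn yields 0, 1, 0, 2, the same numbering the hashed version produces when no collision occurs.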