Why does my finite state machine take so long to execute?

【Question title】: Why does my finite state machine take so long to execute?
【Posted】: 2010-03-19 02:00:16
【Question】:

I'm working on a state machine that is supposed to extract function calls of the form

/* I am a comment */
//I am a comment
pref("this.is.a.string.which\"can have QUOTES\"", 123456);

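from a file, where the extracted data would be pref("this.is.a.string.which\"can have QUOTES\"", 123456);. Currently, processing a 41 kb file takes close to a minute and a half. Is there something I'm seriously misunderstanding about this finite state machine?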

#include <algorithm>
#include <string>
#include <vector>
#include <boost/algorithm/string.hpp>
std::vector<std::string> Foo()
{
    std::string fileData;
    //Fill filedata with the contents of a file
    std::vector<std::string> results;
    std::string::iterator begin = fileData.begin();
    std::string::iterator end = fileData.end();
    std::string::iterator stateZeroFoundLocation = fileData.begin();
    std::size_t state = 0;
    for(; begin < end; begin++)
    {
        switch (state)
        {
        case 0:
            if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {
                stateZeroFoundLocation = begin;
                begin += 4;
                state = 2;
            } else if (*begin == '/')
                state = 1;
            break;
        case 1:
            state = 0;
            switch (*begin)
            {
            case '*':
                begin = boost::find_first(boost::make_iterator_range(begin, end), "*/").end();
                break;
            case '/':
                begin = std::find(begin, end, '\n');
            }
            break;
        case 2:
            if (*begin == '"')
                state = 3;
            break;
        case 3:
            switch(*begin)
            {
            case '\\':
                state = 4;
                break;
            case '"':
                state = 5;
            }
            break;
        case 4:
            state = 3;
            break;
        case 5:
            if (*begin == ',')
                state = 6;
            break;
        case 6:
            if (*begin != ' ')
                state = 7;
            break;
        case 7:
            switch(*begin)
            {
            case '"':
                state = 8;
                break;
            default:
                state = 10;
                break;
            }
            break;
        case 8:
            switch(*begin)
            {
            case '\\':
                state = 9;
                break;
            case '"':
                state = 10;
            }
            break;
        case 9:
            state = 8;
            break;
        case 10:
            if (*begin == ')')
                state = 11;
            break;
        case 11:
            if (*begin == ';')
                state = 12;
            break;
        case 12:
            state = 0;
            results.push_back(std::string(stateZeroFoundLocation, begin));
        };
    }
    return results;
}

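Billy3

EDIT: Well, this is one of the strangest things I've ever seen. I just rebuilt the project and it runs at a reasonable speed again. Odd.

【Question comments】:

Tags: c++ performance finite-automata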

【Answer 1】:

Unless your 41 kb file is mostly comments or prefs, it will spend most of its time in state 0. For every character it sees in state 0, you make at least two function calls.

if (boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

You can speed this up by first testing whether the current character is 'p':

if (*begin == 'p' && boost::starts_with(boost::make_iterator_range(begin, end), "pref(")) {

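If the character isn't 'p', no function calls are needed at all. In particular, no iterator range is constructed, which is probably where the time is going.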
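The same idea can be sketched without Boost. This is only an illustration, not the asker's code: `startsWithLiteral` and `countPrefStarts` are hypothetical names, and the helper assumes a non-empty literal. It compares the first character before doing any further work, so almost every position in state 0 is rejected by a single comparison.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <string>

// Hypothetical helper (not a Boost function): true if [pos, end) begins
// with `literal`. Assumes `literal` is non-empty.
inline bool startsWithLiteral(std::string::const_iterator pos,
                              std::string::const_iterator end,
                              const char* literal)
{
    if (pos == end || *pos != literal[0])
        return false;                        // cheap rejection: one comparison
    const std::size_t len = std::strlen(literal);
    if (static_cast<std::size_t>(end - pos) < len)
        return false;                        // not enough input left to match
    return std::equal(literal, literal + len, pos);
}

// Stand-in for the state 0 test: counts the positions where "pref(" begins.
inline std::size_t countPrefStarts(const std::string& text)
{
    std::size_t count = 0;
    for (auto it = text.begin(); it != text.end(); ++it)
        if (startsWithLiteral(it, text.end(), "pref("))
            ++count;
    return count;
}
```

Whether this makes a measurable difference depends on the compiler and build settings, so it is worth timing both variants in an optimized build.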

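【Comments】:

【Answer 2】:

I don't know whether this is part of the problem, but you have a typo in case 0: "perf" is misspelled as "pref".

【Comments】:

Actually, it was my question that had the misspelling. The state machine produces correct output, it just takes forever :( (Thanks, +1)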

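【Answer 3】:

Well, it's hard to tell just by looking at it... but my guess is that the find algorithms are responsible. Why are you searching inside an FSM? By definition you are supposed to feed it one character at a time... so add more states. Also try making results a list rather than a vector. There is a lot of copying going on with

vector<string>

But mainly: profile it!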
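To make "add more states" concrete, here is a minimal sketch of the comment-skipping part written so that every transition consumes exactly one character and nothing searches ahead. It is not the asker's code: the `State` names and the `stripComments` function are made up for illustration, and it only removes comments while leaving string literals untouched.

```cpp
#include <string>

// Hypothetical state names for a comment-stripping FSM.
enum class State { Code, Slash, LineComment, BlockComment, BlockStar, String, StringEscape };

// Removes // and /* */ comments one character at a time.
// Text inside double-quoted strings is copied through unchanged.
std::string stripComments(const std::string& input)
{
    std::string out;
    out.reserve(input.size());
    State state = State::Code;
    for (char c : input) {
        switch (state) {
        case State::Code:
            if (c == '/') { state = State::Slash; break; }
            if (c == '"') state = State::String;
            out += c;
            break;
        case State::Slash:                       // saw one '/', not yet emitted
            if (c == '/') state = State::LineComment;
            else if (c == '*') state = State::BlockComment;
            else {
                out += '/';                      // it was an ordinary slash
                out += c;
                state = (c == '"') ? State::String : State::Code;
            }
            break;
        case State::LineComment:
            if (c == '\n') { out += c; state = State::Code; }
            break;
        case State::BlockComment:
            if (c == '*') state = State::BlockStar;
            break;
        case State::BlockStar:                   // saw '*' inside a block comment
            if (c == '/') state = State::Code;
            else if (c != '*') state = State::BlockComment;
            break;
        case State::String:
            out += c;
            if (c == '\\') state = State::StringEscape;
            else if (c == '"') state = State::Code;
            break;
        case State::StringEscape:                // character after a backslash
            out += c;
            state = State::String;
            break;
        }
    }
    if (state == State::Slash) out += '/';       // input ended on a lone slash
    return out;
}
```

Named states cost nothing at run time compared with the numeric ones in the question, and they make each transition easier to check by eye.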

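【Comments】:

Somehow I don't think it's the find functions, because everything else in my program uses Boost's string algorithms without excessive run times. And since every find call only moves begin forward, I don't see how replacing them with more states would speed things up.
I don't think it's the vector either, because I get maybe 50 results from this file, and that many results shouldn't take minutes. Adding results.reserve(2000); at the top of the function doesn't help either.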
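The claim about the vector is easy to check by measurement rather than by argument. The sketch below is an assumption-laden stand-in, not the asker's program: `collectResults` fabricates short strings shaped like the expected matches, and `elapsedMicroseconds` is a hypothetical helper around std::chrono::steady_clock. A real profiler gives more detail, but a timer like this is often enough to rule a suspect in or out.

```cpp
#include <chrono>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the extractor: builds `count` short result strings.
std::vector<std::string> collectResults(std::size_t count)
{
    std::vector<std::string> results;
    for (std::size_t i = 0; i < count; ++i)
        results.push_back("pref(\"key." + std::to_string(i) + "\", 123456);");
    return results;
}

// Runs `work` once and returns the elapsed wall-clock time in microseconds.
template <typename Work>
long long elapsedMicroseconds(Work work)
{
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
}
```

For example, `elapsedMicroseconds([] { collectResults(50); })` times the vector work in isolation; wrapping the real function the same way would show whether the parsing loop or something outside it accounts for the minute and a half.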

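【Answer 4】:

A finite state machine is a workable solution, but for text processing you are better off with a highly optimized finite state machine generator. In this case, that means a regular expression. Here it is as a Perl regex: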

# first clean the comments
$source =~ s|//.*$||mg;     # replace "// till end of line" with nothing, on every line
$source =~ s|/\*.*?\*/||sg; # replace "/* any text until */" with nothing, everywhere
                            # depending on your data, you may need a few other
                            # rules here to avoid blanking data, you could replace
                            # the comments with a unique identifier, and then
                            # expand any identifiers that the regex below returns

# then find your data
while ($source =~ /pref\(\s*"(.+?)",\s*(\d+)\s*\);/g) {
   # matches your function signature and moves along source
   # do something with the captured groups, in this case $1 and $2
}

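Since most regex libraries are Perl-compatible, translating the syntax shouldn't be hard. If your search gets more complicated, a parser would be in order.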
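As a rough idea of what such a translation could look like in C++, here is a sketch using std::regex from the standard library (C++11 and later). It is an assumption that this suits the asker's toolchain, and `findPrefs` is a hypothetical name. The pattern differs from the Perl one in two ways: it matches `pref`, as in the sample data, and it accepts escaped quotes inside the string. It does not strip comments first.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Returns (name, value) pairs for each pref("name", value); call in `source`.
// The name is returned as written in the file, with its backslashes intact.
std::vector<std::pair<std::string, std::string>> findPrefs(const std::string& source)
{
    // "((?:[^"\\]|\\.)*)" captures a quoted string that may contain
    // backslash escapes such as \" ; (\d+) captures the numeric argument.
    static const std::regex pattern(
        R"re(pref\(\s*"((?:[^"\\]|\\.)*)"\s*,\s*(\d+)\s*\)\s*;)re");

    std::vector<std::pair<std::string, std::string>> found;
    const std::sregex_iterator last;
    for (std::sregex_iterator it(source.begin(), source.end(), pattern); it != last; ++it)
        found.emplace_back((*it)[1].str(), (*it)[2].str());
    return found;
}
```

A pref call that appears inside a comment would still be reported by this sketch, so comments would need to be removed beforehand if that matters for the data.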

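【Comments】:

Adding a regex library isn't an option for a case as simple as this. It is the only such case in the whole program, and it's hard to justify 200 kb+ of regex code for something a single FSM can do.
While size is certainly a concern, making sure your state machine handles all the edge cases (data containing things that look like comments, spacing differences, multi-line calls...) can be hard. Since you don't need many of the advanced regex features, I'd imagine there are libraries that are small enough. Even if there aren't, it may be easier to model your solution as a regex and then translate it to C++.
"Even if there aren't, it may be easier to model your solution as a regex and then translate it to C++."
Sounds good, glad you got it working. Any idea what was causing your compile error?
I disagree. I certainly wouldn't recommend regular expressions to anyone who wants to parse something. Sure, it may work at first, but as the requirements grow you end up with an unwieldy behemoth that nobody has a chance of understanding... and then people ask for nesting and you are in a world of hurt.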

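【Answer 5】:

If you are doing parsing, why not use a parser library?

The one I usually have in mind is Boost.Spirit.Qi.

You can express your grammar with EBNF-like expressions, which certainly makes it easier to maintain.
It is a header-only library, so you don't have to bring a whole separate binary into the mix.

While I can appreciate the minimalist approach, I'm afraid that writing a finite state machine yourself is not such a wise idea. It works well for a toy example, but as the requirements grow you will end up with one monstrous switch, and understanding what is going on will get more and more complicated.

And please don't tell me you know it won't evolve: I don't believe in oracles ;)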

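【Comments】:

As I told Eric Storm, I can't justify pulling in a library as massive as Spirit for a one-off function that performs this one task. If I had other requirements I would consider it. It may be header-only, but it still pulls in a lot of boilerplate code.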