Boost Spirit lexer states cross pollinate

【Question title】: Boost Spirit lexer states cross pollinate
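【Posted】: 2014-12-30 12:59:15
【Question】:

I am trying to use lexer states to do context-specific parsing, but it seems that the different lexer states cross-pollinate. Here is a very basic example: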

#include <boost/config/warning_disable.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix_operator.hpp>
#include <boost/spirit/include/phoenix_container.hpp>

#include <iostream>
#include <string>

using namespace boost::spirit;

template <typename Lexer>
struct strip_comments_tokens : lex::lexer<Lexer>
{
    strip_comments_tokens() 
      : strip_comments_tokens::base_type(lex::match_flags::match_default)
    {
        ccomment = "\\/\\*";
        endcomment = ".*\\*\\/";
        hello = "hello";

        this->self.add
            (ccomment)
            (hello);

        this->self("COMMENT").add
            (endcomment);
    }

    lex::token_def<> ccomment, endcomment;
    lex::token_def<std::string> hello;
};

template <typename Iterator>
struct strip_comments_grammar : qi::grammar<Iterator>
{
    template <typename TokenDef>
    strip_comments_grammar(TokenDef const& tok)
      : strip_comments_grammar::base_type(start)
    {
        start =  *(   tok.ccomment 
                      >>  qi::in_state("COMMENT") 
                      [
                          tok.endcomment 
                      ]
              |   tok.hello [ std::cout << _1 ]
        );
    }

    qi::rule<Iterator> start;
};


int main(int argc, char* argv[])
{
    typedef std::string::iterator base_iterator_type;

    typedef 
        lex::lexertl::lexer<lex::lexertl::token<base_iterator_type> > 
    lexer_type;

    typedef strip_comments_tokens<lexer_type>::iterator_type iterator_type;

    strip_comments_tokens<lexer_type> strip_comments;           // Our lexer
    strip_comments_grammar<iterator_type> g (strip_comments);   // Our parser 

    std::string str("hello/*hello*/hello");
    base_iterator_type first = str.begin();

    bool r = lex::tokenize_and_parse(first, str.end(), strip_comments, g);

    return 0;
}

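I would expect the input

"hello/*hello*/hello"

to be tokenized as hello ccomment endcomment hello. What actually happens is that the input is tokenized as hello ccomment hello, so the grammar stops working. If I change the input to

"hello/*anything else*/hello"

everything works as expected.

Any ideas?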

【Comments】:

Cross-pollinated from the [spirit-general] mailing list :)

Tags: c++ boost boost-spirit boost-spirit-qi boost-spirit-lex

【Answer 1】:

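You never modify the lexer's state, so it always stays in the "INITIAL" state.

Setting the lexer state should be done in the lexer stage (in my experience, and after a lot of experimentation, there is no reliable way to get feedback from the parser stage).

So you need to upgrade to an actor_lexer and attach semantic actions to the token_defs added to the lexer tables: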

typedef 
    lex::lexertl::actor_lexer<lex::lexertl::token<base_iterator_type> > 
lexer_type;

and

this->self += 
     ccomment [ lex::_state = "COMMENT" ]
   | hello;

this->self("COMMENT") += 
    endcomment [ lex::_state = "INITIAL" ];
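
To make the effect of those state-switching actions concrete, here is a minimal plain-C++ sketch. It is not Boost.Spirit code and does not model qi::in_state: the names Tok, tokenize and switch_state are hypothetical, and the COMMENT rule simply scans to the first "*/". It only shows how a lexer that never leaves INITIAL sees the "hello" inside the comment as an ordinary hello token, while one that switches state from its own actions does not.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token kinds for this sketch (not Boost.Spirit types).
enum class Tok { Hello, CommentOpen, CommentEnd };

// Tokenize `in`. With switch_state == true, matching "/*" moves the lexer
// into a COMMENT state (roughly what `lex::_state = "COMMENT"` asks for).
// With switch_state == false the lexer stays in INITIAL for the whole input.
inline std::vector<Tok> tokenize(const std::string& in, bool switch_state) {
    enum class State { Initial, Comment } state = State::Initial;
    std::vector<Tok> out;
    std::size_t i = 0;
    while (i < in.size()) {
        if (state == State::Initial) {
            if (in.compare(i, 2, "/*") == 0) {
                out.push_back(Tok::CommentOpen);
                i += 2;
                if (switch_state) state = State::Comment;
            } else if (in.compare(i, 5, "hello") == 0) {
                out.push_back(Tok::Hello);
                i += 5;
            } else {
                break;  // no INITIAL-state rule matches here: stop lexing
            }
        } else {
            // COMMENT state: only the end-of-comment rule is active.
            const std::size_t end = in.find("*/", i);
            if (end == std::string::npos) break;  // unterminated comment
            out.push_back(Tok::CommentEnd);
            i = end + 2;
            state = State::Initial;  // back to INITIAL after "*/"
        }
    }
    return out;
}

// Usage: the question's input, with and without state switching.
inline void demo() {
    // Stuck in INITIAL: hello, ccomment, hello (then "*/" matches nothing).
    assert(tokenize("hello/*hello*/hello", false).size() == 3);
    // Switching in the lexer: hello, ccomment, endcomment, hello.
    assert(tokenize("hello/*hello*/hello", true).size() == 4);
}
```

Keeping the state change inside the lexer means the parser never has to feed information back into tokenization, which is the point the answer makes about the lexer stage.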

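That said, I think it is much easier to skip the tokens altogether. If you really want to know how to use lexer states for skipping, see:

Boost.Spirit SQL grammar/lexer failure

I'd suggest the Simplify And Profit approach, using lex::_pass = lex::pass_flags::pass_ignore:

Here's my take:

Live On Coliru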

#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/phoenix.hpp>
#include <boost/spirit/include/qi.hpp> // for the parser expression *strip_comments.hello

#include <iostream> // std::cout
#include <string>   // std::string

namespace lex = boost::spirit::lex;
namespace phx = boost::phoenix;

template <typename Lexer>
struct strip_comments_tokens : lex::lexer<Lexer> {
    strip_comments_tokens() 
      : strip_comments_tokens::base_type(lex::match_flags::match_default)
    {
        ccomment   = "\\/\\*.*\\*\\/";
        hello      = "hello"; // why not "."?

        this->self += 
             ccomment [ lex::_pass = lex::pass_flags::pass_ignore ]
  // IDEA: | lex::token_def<char>(".") // to just accept anything
           | hello
           ;
    }

    lex::token_def<lex::omit>   ccomment;
    lex::token_def<std::string> hello;
};

int main() {
    typedef std::string::const_iterator base_iterator_type;
    typedef lex::lexertl::actor_lexer<
                lex::lexertl::token<base_iterator_type/*, boost::mpl::vector<char, std::string>, boost::mpl::false_*/>
            > lexer_type;

    strip_comments_tokens<lexer_type> strip_comments;         // Our lexer

    std::string const str("hello/*hello*/hello");
    std::string stripped;

    base_iterator_type first = str.begin();
    bool r = lex::tokenize_and_parse(first, str.end(), strip_comments, *strip_comments.hello, stripped);

    if (r)
        std::cout << "\nStripped: '" << stripped << "'\n";
    else
        std::cout << "Failed: '" << std::string(first, str.end()) << "'\n";
}

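【Comments】:

My intent was to change the lexer state from the outside, because my parsing is context-specific and the lexer does not always know how to interpret the input stream. I believe that is why qi::in_state exists. At the moment I see no option other than getting rid of the lexer entirely and putting all the lexing into the grammar, but that solution is too cumbersome.
@AntonAutushka Hmm. I've just updated my answer with relevant links (especially the comments on the first linked answer). As for "cumbersome": my strong experience is that using a lexer with Boost Spirit makes everything more cumbersome. My guidance: be very sure that you /need/ it, and why.
Your code perfectly fixes the bug in my ugly little example, but it is not what I need. I need in_state to work :) To be more realistic, consider these two: "x = /b/g" and "x = a/b/g". The first is a JavaScript regular expression, the second an ordinary arithmetic expression, and you cannot tell them apart at the lexer level. Hence my situation.
That is just a well-known scanner edge case in that grammar (and similar languages). I would let the lexer not care about it. You can get a lot of the performance benefit from tokenizing alone (at the grammar level, decide that it is a regex if the expression starts with /). If you have not found performance to be a problem, I would certainly consider not lexing at all.
If you want to parse the full ECMAScript language, I would say: 1. don't roll your own; 2. don't pretend to be "agile" by using Spirit. Just use ANTLR, flex, CoCo/C++, ..., preferably with an existing grammar definition.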
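
The "x = /b/g" versus "x = a/b/g" case from the comments can be sketched in plain C++. This is a deliberately simplified illustration, not Boost.Spirit code and not a real ECMAScript rule: SlashKind and classify_first_slash are hypothetical names, and the heuristic ignores keywords, comments and string literals. It only shows the idea that the meaning of '/' depends on syntactic position (whether an operand precedes it), which is information a grammar has and a context-free token pattern does not.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical classification of a '/' character for this sketch.
enum class SlashKind { RegexStart, Division, None };

// Classify the first '/' in `src`. If the last non-space character before
// it ends an operand (identifier, number, ')' or ']'), treat it as division;
// otherwise an expression is starting, so treat it as a regex literal.
inline SlashKind classify_first_slash(const std::string& src) {
    bool operand_before = false;
    for (char c : src) {
        const unsigned char uc = static_cast<unsigned char>(c);
        if (c == '/') {
            return operand_before ? SlashKind::Division
                                  : SlashKind::RegexStart;
        }
        if (std::isspace(uc)) continue;  // whitespace does not change context
        operand_before = std::isalnum(uc) || c == '_' || c == ')' || c == ']';
    }
    return SlashKind::None;  // no '/' in the input
}

// Usage: the two inputs from the comment thread.
inline void demo_slash() {
    assert(classify_first_slash("x = /b/g") == SlashKind::RegexStart);
    assert(classify_first_slash("x = a/b/g") == SlashKind::Division);
}
```

In a scannerless Qi grammar the same decision would fall out of rule structure (a primary expression may start with a regex literal, a binary operator may not), so no lexer state is needed for it.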