如何使用缩进作为带有 bison 和 flex 的块分隔符答案

【问题标题】：How to use indentation as block delimiters with bison and flex如何使用缩进作为带有 bison 和 flex 的块分隔符
【发布时间】：2010-11-27 15:29:08
【问题描述】：

我知道如何在 bison + flex 中实现缩进作为块分隔符。就像在 python 中一样。我正在编写自己的编程语言（主要是为了好玩，但我打算将它与游戏引擎一起使用），我会尝试想出一些特别的东西来最大限度地减少样板并最大限度地提高开发速度。

我已经用 C 编写了一个编译器（实际上是一个 `langToy' 到 Nasm 的翻译器），但是失败了。由于某种原因，它只能处理整个源文件中的一个字符串（好吧，我已经醒了超过 48 小时 - 所以......你知道，大脑崩溃了）。

我不知道大括号和/或开始 -> 结束是否更容易实现（我这样做没有问题）或者只是我的大脑被锁定了。

提前致谢！

更新： 好的，我不知道如何使用 flex 来实现。我在将多个 DEDENT 返回到解析器时遇到问题。 Flex/Bison 对我来说相对较新。

更新 2： 这是我到目前为止提出的 flex 文件；它不太明白：

%x t
%option noyywrap

%{
  int lineno = 0, ntab = 0, ltab = 0, dedent = 0;
%}

%%

<*>\n  { ntab = 0; BEGIN(t); }
<t>\t  { ++ntab; }
<t>.   { int i; /* my compiler complains not c99 if i use for( int i=0... */
         if( ntab > ltab )
           printf("> indent >\n");
         else if( ntab < ltab )
           for( i = 0; i < ltab - ntab; i++ )
             printf("< dedent <\n");
         else
           printf("=        =\n");

         ltab = ntab; ntab = 0;
         BEGIN(INITIAL);
         /* move to next rule */
         REJECT;}
.    /* ignore everything else for now */

%%

main()
{
  yyin = fopen( "test", "r" );
  yylex();
}

您可以尝试使用它，也许您会看到我缺少的东西。在 Haxe (return t_dedent(num);) 中返回多个 dents 会很容易。

此代码并不总是正确匹配缩进/缩进。

更新 3：我认为我会放弃对 flex 的希望并按照自己的方式做，如果有人知道如何在 flex 中做到这一点，我很乐意听到。 p>

【问题讨论】：

标签： compiler-construction bison flex-lexer

【解决方案1】：

Chris 的回答对一个可用的解决方案大有帮助，非常感谢！不幸的是，它缺少一些我需要的更重要的方面：

同时进行多个缩进（未缩进）。考虑以下代码应该在调用 baz 后发出两个外凹：
```
def foo():
  if bar:
    baz()
```
当到达文件末尾并且仍处于某个缩进级别时发出缩进。
不同大小的缩进级别。 Chris 当前的代码仅适用于 1 个空格的缩进。

根据 Chris 的代码，我想出了一个解决方案，它适用于我迄今为止遇到的所有情况。我在 github 上创建了一个模板项目，用于使用 flex（和 bison）解析基于缩进的文本：https://github.com/lucasb-eyer/flex-bison-indentation。这是一个完整的（基于 CMake 的）项目，它还跟踪当前标记的行位置和列范围。

以防万一链接由于某种原因而中断，这里是词法分析器的核心：

#include <stack>

int g_current_line_indent = 0;
std::stack<size_t> g_indent_levels;
int g_is_fake_outdent_symbol = 0;

static const unsigned int TAB_WIDTH = 2;

#define YY_USER_INIT { \
    g_indent_levels.push(0); \
    BEGIN(initial); \
}
#include "parser.hh"

%}

%x initial
%x indent
%s normal

%%
    int indent_caller = normal;

 /* Everything runs in the <normal> mode and enters the <indent> mode
    when a newline symbol is encountered.
    There is no newline symbol before the first line, so we need to go
    into the <indent> mode by hand there.
 */
<initial>.  { set_yycolumn(yycolumn-1); indent_caller = normal; yyless(0); BEGIN(indent); }
<initial>\n { indent_caller = normal; yyless(0); BEGIN(indent); }    

<indent>" "     { g_current_line_indent++; }
<indent>\t      { g_current_line_indent = (g_current_line_indent + TAB_WIDTH) & ~(TAB_WIDTH-1); }
<indent>\n      { g_current_line_indent = 0; /* ignoring blank line */ }
<indent><<EOF>> {
                    // When encountering the end of file, we want to emit an
                    // outdent for all indents currently left.
                    if(g_indent_levels.top() != 0) {
                        g_indent_levels.pop();

                        // See the same code below (<indent>.) for a rationale.
                        if(g_current_line_indent != g_indent_levels.top()) {
                            unput('\n');
                            for(size_t i = 0 ; i < g_indent_levels.top() ; ++i) {
                                unput(' ');
                            }
                        } else {
                            BEGIN(indent_caller);
                        }

                        return TOK_OUTDENT;
                    } else {
                        yyterminate();
                    }
                }

<indent>.       {
                    if(!g_is_fake_outdent_symbol) {
                        unput(*yytext);
                    }
                    g_is_fake_outdent_symbol = 0;
                    // -2: -1 for putting it back and -1 for ending at the last space.
                    set_yycolumn(yycolumn-1);

                    // Indentation level has increased. It can only ever
                    // increase by one level at a time. Remember how many
                    // spaces this level has and emit an indentation token.
                    if(g_current_line_indent > g_indent_levels.top()) {
                        g_indent_levels.push(g_current_line_indent);
                        BEGIN(indent_caller);
                        return TOK_INDENT;
                    } else if(g_current_line_indent < g_indent_levels.top()) {
                        // Outdenting is the most difficult, as we might need to
                        // outdent multiple times at once, but flex doesn't allow
                        // emitting multiple tokens at once! So we fake this by
                        // 'unput'ting fake lines which will give us the next
                        // outdent.
                        g_indent_levels.pop();

                        if(g_current_line_indent != g_indent_levels.top()) {
                            // Unput the rest of the current line, including the newline.
                            // We want to keep it untouched.
                            for(size_t i = 0 ; i < g_current_line_indent ; ++i) {
                                unput(' ');
                            }
                            unput('\n');
                            // Now, insert a fake character indented just so
                            // that we get a correct outdent the next time.
                            unput('.');
                            // Though we need to remember that it's a fake one
                            // so we can ignore the symbol.
                            g_is_fake_outdent_symbol = 1;
                            for(size_t i = 0 ; i < g_indent_levels.top() ; ++i) {
                                unput(' ');
                            }
                            unput('\n');
                        } else {
                            BEGIN(indent_caller);
                        }

                        return TOK_OUTDENT;
                    } else {
                        // No change in indentation, not much to do here...
                        BEGIN(indent_caller);
                    }
                }

<normal>\n    { g_current_line_indent = 0; indent_caller = YY_START; BEGIN(indent); }

【讨论】：

我的代码为每个的缩进空间生成一个 INDENT/UNINDENT。因此，对于具有 2 个空格缩进的示例，它将在第一行之后生成两个 INDENT 标记，在第二行之后生成另一个 2，最后生成 4 个 UNINDENT。所以你需要让你的解析器“忽略”额外的冗余 INDENT/UNINDENT 对。如果您想正确捕捉尾随减少的缩进，那么在词法分析器中折叠它们是很困难的，但如果您不关心这一点，您可以使用一堆缩进级别而不是单个计数器。

【解决方案2】：

您需要做的是让 flex 计算每行开头的空白数量，并插入适当数量的 INDENT/UNINDENT 标记以供解析器用于对事物进行分组。一个问题是你想对制表符和空格做些什么——你只是想让它们与固定的制表位等效，还是你想要求缩进保持一致（所以如果一行以制表符开头，下一行使用空格，表示错误，这可能有点困难）。

假设您想要固定的 8 列制表位，您可以使用类似

%{
/* globals to track current indentation */
int current_line_indent = 0;   /* indentation of the current line */
int indent_level = 0;          /* indentation level passed to the parser */
%}

%x indent /* start state for parsing the indentation */
%s normal /* normal start state for everything else */

%%
<indent>" "      { current_line_indent++; }
<indent>"\t"     { current_line_indent = (current_line_indent + 8) & ~7; }
<indent>"\n"     { current_line_indent = 0; /*ignoring blank line */ }
<indent>.        {
                   unput(*yytext);
                   if (current_line_indent > indent_level) {
                       indent_level++;
                       return INDENT;
                   } else if (current_line_indent < indent_level) {
                       indent_level--;
                       return UNINDENT;
                   } else {
                       BEGIN normal;
                   }
                 }

<normal>"\n"     { current_line_indent = 0; BEGIN indent; }
... other flex rules ...

您必须确保以缩进模式开始解析（以获取第一行的缩进）。

【讨论】：

好像你明白了，但我希望制表位计为 2 个空格。所以我猜该行应该是 current_line_indent = (current_line_indent + 2) & ~1;
是的——当你看到一个制表符时，你需要将 current_line_indent 加到下一个制表位。

【解决方案3】：

您需要一个看起来与此类似的规则（假设您使用制表符作为缩进）：

\t: {return TABDENT; }

坦率地说，我总是发现大括号（或开始/结束）更容易编写和阅读，无论是作为人类还是作为词法分析器/解析器编写器。

【讨论】：

好吧，似乎用特殊的块开始和块结束符号编写词法分析器至少更容易。在我的本地化键盘上写 { 和 } 不更容易 ;D

【解决方案4】：

大括号（等等）只有在您使用去除所有空格的标记器（仅用于分隔标记）时才会更简单。有关 python 标记化的一些想法，请参阅this page（“编译器如何解析缩进？”部分）。

如果您在解析之前没有进行标记化，那么可能还有其他工作要做，这取决于您构建解析器的方式。

【讨论】：

感谢有用的链接，我会破解一下，看看这次能不能成功。