ANTLR4 Lexing C++11 Raw String

【Question title】: ANTLR4 Lexing C++11 Raw String
【Posted】: 2016-03-11 00:59:29
【Question description】:

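All,

I have been trying to create a C++ grammar based on standard draft N4567, which is the latest version I could find. I believe the grammar is complete, but I need to test it. One problem I am trying to solve is getting the lexer to recognize raw strings as defined in the standard. I have implemented a possible solution using actions and semantic predicates, and I need help determining whether it actually works. I have read the ANTLR4 reference on the interaction between actions and predicates, but I cannot determine whether my solution is valid. A stripped-down grammar is included below. Any thoughts would be greatly appreciated. I have tried to include my reasoning as comments in the sample.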

grammar SampleRaw;

@lexer::members {
    string d_char_seq = "";
}

string_literal
        : ENCODING_PREFIX? '\"' S_CHAR* '\"'
        | ENCODING_PREFIX? 'R' Raw_String
        ;

ENCODING_PREFIX             //  one of
        : 'u8'
        | [uUL]
        ;

S_CHAR          /* any member of the source character set except the
                   double_quote ", backslash \, or NEW_LINE character
                 */
        : ~[\"\\\n\r]
        | ESCAPE_SEQUENCE
        | UNIV_CHAR_NAME
        ;

fragment ESCAPE_SEQUENCE
        : SIMPLE_ESCAPE_SEQ
        | OCT_ESCAPE_SEQ
        | HEX_ESCAPE_SEQ
        ;
fragment SIMPLE_ESCAPE_SEQ  // one of
        : '\\' '\''
        | '\\' '\"'
        | '\\' '?'
        | '\\' '\\'
        | '\\' 'a'
        | '\\' 'b'
        | '\\' 'f'
        | '\\' 'n'
        | '\\' 'r'
        | '\\' 't'
        | '\\' 'v'
        ;
fragment OCT_ESCAPE_SEQ
        : '\\' [0-3] ( OCT_DIGIT OCT_DIGIT? )?
        | '\\' [4-7] ( OCT_DIGIT )?
        ;
fragment HEX_ESCAPE_SEQ
        : '\\' 'x' HEX_DIGIT+
        ;
fragment UNIV_CHAR_NAME
        : '\\' 'u' HEX_QUAD
        | '\\' 'U' HEX_QUAD HEX_QUAD
        ;
fragment HEX_QUAD
        : HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
        ;
fragment HEX_DIGIT
        : [a-fA-F0-9]
        ;
fragment OCT_DIGIT
        : [0-7]
        ;
/*
Raw_String
        : '\"' D_CHAR* '(' R_CHAR* ')' D_CHAR* '\"'
        ;
 */

Raw_String
        : ( /* CASE when D_CHAR is empty
               ACTION in D_CHAR_SEQ attempts to reset variable d_char_seq
               if it is empty, so handle it statically
             */
            '\"' 
                '('
                    ( ~[)]       // Anything but )
                    | [)] ~[\"]  // ) Actually OK, can't be followed by "
                                 //  - )" - these are the terminating chars
                    )* 
                ')' 
            '\"'
          | '\"'
                D_CHAR_SEQ  /* Will the ACTION in D_CHAR_SEQ be an issue for
                               the Semantic Predicates Below????
                             */
                    '('
                        ( ~[)]  // Anything but )
                        | [)] D_CHAR_SEQ { ( getText() !=  d_char_seq ) }?
                                /* ) Actually OK, can't be followed by a D_CHAR_SEQ match
                                   IF D_CHAR_SEQs match, turn OFF the Alternative
                                 */
                        | [)] D_CHAR_SEQ { ( getText() ==  d_char_seq ) }? ~[\"]
                                /* ) Actually OK, must be followed by a D_CHAR_SEQ match
                                     IF D_CHAR_SEQs match, turn ON the Alternative
                                     Can't match the final " , but
                                     WE HAVE MATCHED OUR TERMINATING CHARS
                                 */
                        )*
                    ')'
                D_CHAR_SEQ /* No need to check here,
                              Matching Terminating CHARS is only way to get out 
                              of loop above
                            */
            '\"'
          )
          { d_char_seq = ""; } // Reset Variable
        ;
/*
fragment R_CHAR
                // any member of the source character set, except a right
                // parenthesis ) followed by the initial D_CHAR*
                // (which may be empty) followed by a double quote ".
                // 
        : ~[)]
        ;
 */

fragment D_CHAR
                /* any member of the basic source character set except
                   space, the left parenthesis (, the right parenthesis ),
                   the backslash \, and the control characters representing
                    horizontal tab, vertical tab, form feed, and newline.
                 */
        : ~[ )(\\\t\v\f\n\r]
        ;
fragment D_CHAR_SEQ
        : D_CHAR+ { d_char_seq = ( d_char_seq == "" ) ? getText() : d_char_seq ; }
        ;

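【Comments】:

I hope you're just tinkering. A full C++ parser is actually a lot of tricky work: see stackoverflow.com/questions/243383/…. Did you know that ANTLR3 already has a partial C++ grammar available?
Yes, you're right. A full C++ parser is a lot of work, and I may never finish it. However, a gearhead/geek like me needs something to do in my downtime.

Tags: c++ c++11 antlr antlr4 lexer

【Solution 1】: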

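I managed to solve this myself; any comments or possible improvements would be greatly appreciated. It would also be nice if this could be done without an ACTION.

One drawback is that the \" and the D_CHAR_SEQ are part of the Raw_String text passed to the parser. The parser can strip them out, but it would be nice if the lexer did it.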
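As a rough illustration of the parser-side clean-up, here is a minimal Java sketch. The class `RawStringText` and its methods `delimiter` and `body` are hypothetical helpers, not part of ANTLR. It assumes the token text has exactly the shape R"delim(body)delim" that the Raw_String rule below matches, and that the delimiter itself contains no double quote.

```java
// Hypothetical helper (not part of ANTLR): splits the text of a Raw_String
// token, assumed to look like R"delim(body)delim", into delimiter and body.
final class RawStringText {
    private RawStringText() {}

    /** Returns the d-char-sequence between the opening quote and the first '('. */
    static String delimiter(String tokenText) {
        int quote = tokenText.indexOf('"');
        int open = tokenText.indexOf('(', quote + 1);
        if (quote < 0 || open < 0) {
            throw new IllegalArgumentException("not a raw string: " + tokenText);
        }
        return tokenText.substring(quote + 1, open);
    }

    /** Returns the characters between the first '(' and the closing )delim". */
    static String body(String tokenText) {
        String closing = ")" + delimiter(tokenText) + "\"";
        int open = tokenText.indexOf('(');
        int close = tokenText.length() - closing.length();
        if (!tokenText.endsWith(closing) || close < open + 1) {
            throw new IllegalArgumentException("unterminated raw string: " + tokenText);
        }
        return tokenText.substring(open + 1, close);
    }
}
```

A parser listener or visitor could call `RawStringText.body(token.getText())` on each Raw_String token. Computing the closing position from the delimiter length keeps any ')' characters inside the body intact.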

grammar SampleRaw;

Reg_String
    : '\"' S_CHAR* '\"'
    ;
fragment S_CHAR
        /* any member of the source character set except the
           double_quote ", backslash \, or NEW_LINE character
         */
    : ~[\n\r\"\\]
    | ESCAPE_SEQUENCE
    | UNIV_CHAR_NAME
    ;
fragment ESCAPE_SEQUENCE
    : SIMPLE_ESCAPE_SEQ
    | OCT_ESCAPE_SEQ
    | HEX_ESCAPE_SEQ
    ;
fragment SIMPLE_ESCAPE_SEQ  // one of
    : '\\' '\''
    | '\\' '\"'
    | '\\' '?'
    | '\\' '\\'
    | '\\' 'a'
    | '\\' 'b'
    | '\\' 'f'
    | '\\' 'n'
    | '\\' 'r'
    | '\\' 't'
    | '\\' 'v'
    ;
fragment OCT_ESCAPE_SEQ
    : '\\' [0-3] ( OCT_DIGIT OCT_DIGIT? )?
    | '\\' [4-7] ( OCT_DIGIT )?
    ;
fragment OCT_DIGIT
    : [0-7]
    ;
fragment HEX_ESCAPE_SEQ
    : '\\' 'x' HEX_DIGIT+
    ;
fragment HEX_DIGIT
    : [a-fA-F0-9]
    ;
fragment UNIV_CHAR_NAME
    : '\\' 'u' HEX_QUAD
    | '\\' 'U' HEX_QUAD HEX_QUAD
    ;
fragment HEX_QUAD
    : HEX_DIGIT HEX_DIGIT HEX_DIGIT HEX_DIGIT
    ;

Raw_String
    : 'R'
      '\"'              // Match Opening Double Quote
      ( /* Handle Empty D_CHAR_SEQ without Predicates
           This should also work
           '(' .*? ')'
         */
        '(' ( ~')' | ')'+ ~'\"' )* (')'+)

      | D_CHAR_SEQ
            /*  // Limit D_CHAR_SEQ to 16 characters
               { ( ( getText().length() - ( getText().indexOf("\"") + 1 ) ) <= 16 ) }?
            */
        '('
        /* From Spec :
           Any member of the source character set, except
           a right parenthesis ) followed by the initial D_CHAR_SEQUENCE
           ( which may be empty ) followed by a double quote ".

         - The following loop consumes characters until it matches the
           terminating sequence of characters for the RAW STRING
         - The options are mutually exclusive, so Only one will
           ever execute in each loop pass
         - Each Option will execute at least once.  The first option needs to
           match the ')' character even if the D_CHAR_SEQ is empty. The second
           option needs to match the closing \" to fall out of the loop. Each
           option will only consume at most 1 character
         */
        (   //  Consume everything but the Double Quote
          ~'\"'
        |   //  If text Does Not End with closing Delimiter, consume the Double Quote
          '\"'
          {
               !getText().endsWith(
                    ")"
                  + getText().substring( getText().indexOf( "\"" ) + 1
                                       , getText().indexOf( "(" )
                                       )
                  + '\"'
                )
          }?
        )*
      )
      '\"'              // Match Closing Double Quote

      /*
      // Strip Away R"D_CHAR_SEQ(...)D_CHAR_SEQ"
      //  Send D_CHAR_SEQ <TAB> ... to Parser
      {
        setText( getText().substring( getText().indexOf("\"") + 1
                                    , getText().indexOf("(")
                                    )
               + "\t"
               + getText().substring( getText().indexOf("(") + 1
                                    , getText().lastIndexOf(")")
                                    )
               );
      }
       */
    ;
fragment D_CHAR_SEQ     // Should be limited to 16 characters
    : D_CHAR+
    ;
fragment D_CHAR
        /*  Any member of the basic source character set except
            space, the left parenthesis (, the right parenthesis ),
            the backslash \, and the control characters representing
            horizontal tab, vertical tab, form feed, and newline.
         */
    : '\u0021'..'\u0023'
    | '\u0025'..'\u0027'
    | '\u002a'..'\u003f'
    | '\u0041'..'\u005b'
    | '\u005d'..'\u005f'
    | '\u0061'..'\u007e'
    ;
ENCODING_PREFIX         //  one of
    : 'u8'
    | [uUL]
    ;
WhiteSpace
    : [ \u0000-\u0020\u007f]+ -> skip
    ;
start
    : string_literal* EOF
    ;
string_literal
    : ENCODING_PREFIX? Reg_String
    | ENCODING_PREFIX? Raw_String
    ;

【Discussion】: