Implement word boundary states in flex/lex (parser-generator)

【Question title】: Implement word boundary states in flex/lex (parser-generator)
【Posted】: 2009-01-02 14:57:36
【Question】:

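I want to be able to tell whether a pattern match occurs after a word character or after a non-word character. In other words, I want to emulate the \b word-boundary regex anchor at the start of a pattern, which flex/lex does not support.

Here is my attempt (it does not work as intended):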

%{
#include <stdio.h>
%}

%x inword
%x nonword

%%
[a-zA-Z]    { BEGIN inword; yymore(); }
[^a-zA-Z]   { BEGIN nonword; yymore(); }

<inword>a { printf("'a' in word\n"); }
<nonword>a { printf("'a' not in word\n"); }

%%

Input:

a
ba
a

Expected output:

'a' not in word
'a' in word
'a' not in word

Actual output:

a
'a' in word
'a' in word

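I'm doing this because I want to build something like the dialectizer, and I've always wanted to learn how to use a real lexer. Sometimes the patterns I want to replace need to be fragments of words, and sometimes they need to be whole words only.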

【Question comments】:

Tags: parsing lex lexical-analysis

【Solution 1】:

This does what I wanted:

%{
#include <stdio.h>
%}

WC      [A-Za-z']
NW      [^A-Za-z']

%start      INW NIW

{WC}  { BEGIN INW; REJECT; }
{NW}  { BEGIN NIW; REJECT; }

<INW>a { printf("'a' in word\n"); }
<NIW>a { printf("'a' not in word\n"); }

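This way I can do the equivalent of \B or \b at the beginning or end of any pattern. To match at the end of a pattern, you can use trailing context: a/{WC} or a/{NW}.

I wanted to set the state without consuming any characters. The trick is to use REJECT instead of yymore(), which I guess I hadn't fully understood.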
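
To make the idea concrete outside of flex, here is a rough sketch in plain C of the same bookkeeping: a flag records whether the previous character was a word character, each 'a' is classified from that flag, and only then is the flag updated from the current character. This is roughly what BEGIN followed by REJECT achieves in the rules above. The names `is_word_char` and `classify_a` are made up for this illustration, and the sketch assumes the start of input counts as a non-word position (the flex scanner instead starts in INITIAL, where neither `<INW>a` nor `<NIW>a` is active).

```c
#include <stdbool.h>
#include <stddef.h>

/* Same character class as {WC} in the answer: ASCII letters and the apostrophe. */
static bool is_word_char(char c)
{
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '\'';
}

/*
 * Hypothetical helper (not part of flex): for every 'a' in `input`, append
 * one code to `out`: 'W' if the previous character was a word character,
 * 'N' otherwise. Assumption: the start of input counts as non-word.
 * `out` is always NUL-terminated when out_size > 0; extra matches are
 * dropped if the buffer is full. Returns the number of codes written.
 */
size_t classify_a(const char *input, char *out, size_t out_size)
{
    bool in_word = false;   /* plays the role of the INW / NIW start condition */
    size_t n = 0;

    if (out_size == 0) {
        return 0;
    }

    for (const char *p = input; *p != '\0'; ++p) {
        /* Classify first, using the state left behind by the previous character. */
        if (*p == 'a' && n + 1 < out_size) {
            out[n++] = in_word ? 'W' : 'N';
        }
        /* Then update the state from the current character. */
        in_word = is_word_char(*p);
    }

    out[n] = '\0';
    return n;
}
```

Called on the question's input, `classify_a("a\nba\na\n", buf, sizeof buf)` fills `buf` with "NWN", which lines up with the expected output listed in the question.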

【Comments】:

I think you forgot the %% after the %start directive?

【Solution 2】:

%%
[a-zA-Z]+a[a-zA-Z]* {printf("a in word: %s\n", yytext);}
a[a-zA-Z]+ {printf("a in word: %s\n", yytext);}
a {printf("a not in word\n");}
. ;

Test:

user@cody /tmp $ ./a.out <<EOF
> a
> ba
> ab
> a
> EOF
a not in word

a in word: ba

a in word: ab

a not in word
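
The rules above report an 'a' as "not in word" only when no letter sits directly on either side of it. As a rough illustration of that condition in plain C (not flex), the sketch below checks both neighbours of each 'a'. The function name `count_standalone_a` is made up for this example, and it assumes the default "C" locale, so that `isalpha` accepts the same set as [a-zA-Z], and that the start and end of the string count as non-letters.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical helper: counts occurrences of 'a' that have no letter
 * directly before or after them. Assumption: the start and end of the
 * string count as non-letters.
 */
size_t count_standalone_a(const char *s)
{
    size_t count = 0;

    for (size_t i = 0; s[i] != '\0'; ++i) {
        if (s[i] != 'a') {
            continue;
        }
        /* Look behind: is there a letter immediately before this 'a'? */
        bool letter_before = i > 0 && isalpha((unsigned char)s[i - 1]);
        /* Look ahead: s[i + 1] is at worst the NUL terminator, which is not a letter. */
        bool letter_after = isalpha((unsigned char)s[i + 1]) != 0;

        if (!letter_before && !letter_after) {
            ++count;
        }
    }
    return count;
}
```

For the test input shown above ("a", "ba", "ab", "a" on separate lines) this returns 2, matching the two "a not in word" lines in the output.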

【Comments】: